These slides are at http www macs hw

  • Slides: 22
Download presentation
These slides are at: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/ Advanced Database Systems F

These slides are at: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/ Advanced Database Systems F 24 DS 2 / F 29 AT 2 Data Quality and Data Cleaning 2

These slides are at: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/ Acknowledgements I adapted this

These slides are at: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/ Acknowledgements I adapted this material from various sources, most notably: • A ppt presentation called `Data Quality and Data Cleaning: An Overview’ by Tamrapani Dasu and Theodore Johnson, at AT & T Labs • A paper called `Data Cleaning: Problems and Current Approaches’, by Erhard Rahm and Hong Hai Do, University of Leipzig, Germany. My thanks to these researchers for making their materials freely available online.

These slides are at: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/ What Keeps DBs Dirty

These slides are at: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/ What Keeps DBs Dirty A good DBMS will have built in tools for: Consistency in data types Consistency in field values Constraints and checks that deal with Null values, Outliers, Duplication. Automatic timestamps Powerful query language (makes retrieval logic errors less likely) … so, why are you refused a loan, have mail delivered to the wrong address, and get charged too much for your mobile calls?

These slides are at: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/ … all this: •

These slides are at: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/ … all this: • Consistency constraints are often not applied, or are applied! – suppose height is not allowed to go over 2 metres in a school student DB – My postcode problem • The data are just too numerous, complex and illunderstood. `Cleaning it’ would cost too much! • Undetectable problems: incorrect values, missing entries • Metadata not maintained properly

These slides are at: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/ Single Source vs Multiple

These slides are at: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/ Single Source vs Multiple Source Schema Level v Instance Level One useful way to categorize problems, independent of how we did so in the last lecture, is according to whether the problems are the sort we can get if we have just one source of data, or whether the problem arises directly from trying to combine data from multiple sources Problems can also be schema level or instance level

These slides are at: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/ Single Source / Schema

These slides are at: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/ Single Source / Schema level examples Scope Problem Unclean Notes attribute Illegal values Do. B=30. 02. 88 Values out of range record Violated attribute dependencies Car-owner = No, make = Toyota Make should clearly have a Null value here. Record type Source Uniqueness violations Name= Jo Smith, NUS no. = 3067 Name= Ed Brown, NUS no. = 2124 NUS no. s should be unique Referential integrity violation Name= D Corne, Office = EM G. 92 Where is G. 92 ?

These slides are at: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/ Single Source / Instance

These slides are at: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/ Single Source / Instance level examples Scope Problem Unclean Notes attribute missing values, mis-spellings, abbreviations, Misfields, Embedded vals Top speed = 0 mph, Dummy entries – values unavailable at entry time, human error record Violated attribute dependencies City = Edinburgh. Postcode = EX 6 Record type Word transposition, Name= Jo Smith, Name = Carr, Jim Name= J. Smith, Name = Joe Smith Name = Jo Smith, Do. B = 17/12/62 Name = Jo Smith, Do. B = 17/11/62 Source Wrong references Duplicates, contradictions Title = Dark Side of the Moan Full. Name = J. Smith Colour = 160 mph Phone = “Dan Jones 0207 308653” Name= D Corne, Office = EM G. 46 exists, but is not my office.

These slides are at: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/ Multiple Source Problems/ Instance

These slides are at: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/ Multiple Source Problems/ Instance and Schema level examples The Smiths buy books and music online from company A: Customer ID Name Street City Sex 102 Luke Smith 5 Chewie Rd Dawlish, Devon 0 175 Leia Smith Chewie St, 5 Dawlish 1 They also buy books and music online from company B: Client ID Last. Name Other names Phone Gender 23 Smith Luke Michael 378988 Male 35 Smith Leia S. +44(0)1626 378988 F

These slides are at: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/ When Companies A and

These slides are at: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/ When Companies A and B merge, various problems arise when they merge their DBs Combining customer fields and client fields – are they really the same things? How to ensure that Company A’s customer 37 and Company B’s client 37 get separate entries in the new DB. Are Luke Smith and Luke Michael Smith the same person? ? Do Luke and Leia live at the same address? Etc … A forced `fast resolution’ to these problems will usually lead to errors in the new `integrated’ DB

These slides are at: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/ A Special but Common

These slides are at: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/ A Special but Common type of Problem: Semantic Complexity (SC) is the state of play where different users of a DB have different conceptions of what the data represent. E. g. Local Police DB keep record of all crimes in an area, where the key is the victim’s name. When someone who was a victim moves to a different area, they remove all records relating to that person. The local council use this DB to produce a report of the total amount of crime every month. Why does it give figures that are too low?

These slides are at: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/ Semantic Complexity: Missing/Default Values

These slides are at: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/ Semantic Complexity: Missing/Default Values One source of semantic complexity is the different meanings that missing values can have. E. g. Suppose the histogram of value types in mobile phone no. field is:

These slides are at: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/ What does NULL mean?

These slides are at: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/ What does NULL mean? A. This record is of someone who does not have a mobile phone? B. This record is of someone who has a mobile phone, but chose not to supply the number? C. This record is of someone who has a mobile phone, but who forgot to supply the number, or it was hard to decipher and recorded as NULL? Maybe some are of type A and some are of type B and some are of type C. For some applications/analyses, we may wish to know the breakdown into types. What about the All zero and All nine entries? Precisely the same can be said of them. Or, perhaps the protocols for recording the entries indicated NULL for type A, 0000000 for type B and 9999999 for type C. The above relate to a quite simple form of semantic complexity – but what if someone uses this DB to estimate the proportion of people who have never had a mobile phone?

These slides are at: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/ Data Cleaning: Phases in

These slides are at: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/ Data Cleaning: Phases in DC: Analysis: to detect errors and inconsistencies in the DB needs detailed analysis, involving both manual inspection and automated analysis programs. This reveals where (most of) the problems are. Defining transformation and mapping rules: Having found the problems, this next phase is concerned with defining the way you are going to automate solutions to clean the data

These slides are at: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/ Data Cleaning: phases continued

These slides are at: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/ Data Cleaning: phases continued Verification: In this phase we test and evaluate the transformation plans we made in stage 2; without this, we may end up making the data dirtier rather than cleaner. Transformation: Do the transformation, now that you’re sure it will be done correctly. Backflow of cleaned data: Do what we can to ensure that cleaned data percolates to various repositories that may still harbour errors.

These slides are at: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/ Phases in DC: Data

These slides are at: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/ Phases in DC: Data Analysis Data Profiling: examine the instances to see how the attributes vary. E. g. Automatically generate a histogram of values for that attribute. How does the histogram help us in finding problems in this case?

These slides are at: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/ What problems does this

These slides are at: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/ What problems does this analysis alert us to?

These slides are at: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/ Phases in DC: Data

These slides are at: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/ Phases in DC: Data Mining is simply about more advanced forms of data analysis. We talk about that next week.

These slides are at: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/ Phases in DC: Defining

These slides are at: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/ Phases in DC: Defining Data Transformation Rules As a result of the analysis phase, you will find various problems that translate to a list of actions, such as: – Remove all entries for J. Smith (duplicates of John Smith) – Find entries with `bule’ in colour field and change these to `blue’. – Output a list of all records where the Phone number field does not match the pattern (NNNNNN) (further steps required to then cleanse these data) – Find all entries where the Name field contains a potential Do. B string, and the Do. B field is NULL, and then repair these entries. – Etc …

These slides are at: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/ Phases in DC: Verification

These slides are at: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/ Phases in DC: Verification This speaks for itself! Data transformation is the main step that actually changes the data itself – so you need to be sure you will do it correctly. So, test and examine the transformation plans very carefully. It is easy to mess the data up even more if you have a faulty transformation plan. – I have a very thick C++ book where it says strict in all the places where it should say struct

These slides are at: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/ Phases in DC: Transformation

These slides are at: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/ Phases in DC: Transformation Go ahead and do it. For large DBs, this task is supported by a variety of tools (as also is data analysis, often in the same tool). The list is growing. E. g. DATACLEANSER is a specialist tool for identifying and eliminating duplicates. TRILLIUM focuses on cleaning name/address data. Such tools use a huge built-in library of rules for dealing with the common problems. Alternatively or additionally you can write your own code for specialised bits of cleaning (and then verify it!).

These slides are at: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/ Phases in DC: Backflow

These slides are at: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/ Phases in DC: Backflow Once the `master’ source of data – perhaps a newly integrated DB, is `cleaned’, there is the opportunity to fix errors that may have spread beyond the DB before it was cleaned. This will be a very different and varied process in every case, and the results of the first Analysis stage should start to provide clues about what could be done here. Examples of such backflow can vary between: • Refunding 1 customer 12 p because he was mischarged for postage owing to a faulty postcode entry • Removing £ 1, 000, 000’s worth of a brand of olive oil from supermarket shelves across Europe, since a DB (and hence the label) did not correctly indicate that it contains something dangerous to those with nut allergies.

These slides are at: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/ What this lecture was

These slides are at: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/ What this lecture was about • Why DBs are almost always not `clean’ • A single source/multi-source and instance level/schema level classification of errors • Semantic Complexity • Five Phases in a corporate Data Cleaning process