These slides are at http www macs hw

  • Slides: 29
Download presentation
These slides are at: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/ Advanced Database Systems F

These slides are at: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/ Advanced Database Systems F 24 DS 2 / F 29 AT 2 Data Quality and Data Cleaning 1 David Corne, room EM G. 39, x 3410, dwcorne@macs. hw. ac. uk / any questions, feel free to contact me

These slides are at: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/ Acknowledgements I adapted this

These slides are at: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/ Acknowledgements I adapted this material from various sources, most notably: • A ppt presentation called `Data Quality and Data Cleaning: An Overview’ by Tamrapani Dasu and Theodore Johnson, at AT & T Labs • A paper called `Data Cleaning: Problems and Current Approaches’, by Erhard Rahm and Hong Hai Do, University of Leipzig, Germany. My thanks to these researchers for making their materials freely available online. David Corne, room EM G. 39, x 3410, dwcorne@macs. hw. ac. uk / any questions, feel free to contact me

These slides are at: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/ On Data Quality Suppose

These slides are at: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/ On Data Quality Suppose you have a database sitting in front of you, and I ask ``Is it a good quality database? ’’ What is your answer? What does quality depend on? Note: this is about the data themselves, not the system in use to access it. David Corne, room EM G. 39, x 3410, dwcorne@macs. hw. ac. uk / any questions, feel free to contact me

These slides are at: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/ A Conventional Definition of

These slides are at: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/ A Conventional Definition of Data Quality Good quality data are: Accurate, Complete, Unique, Up-to-date, and Consistent ; meaning … David Corne, room EM G. 39, x 3410, dwcorne@macs. hw. ac. uk / any questions, feel free to contact me

These slides are at: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/ A Conventional Definition of

These slides are at: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/ A Conventional Definition of Data Quality, continued … Accurate: This refers to how the data were recorded in the first place. What might be the inaccurately recorded datum in the following table? Barratt Burns Carter Davies John Robert Laura Michael 22 24 20 12 Maths CS Physics CS BSc MSc BSc Male Female Male David Corne, room EM G. 39, x 3410, dwcorne@macs. hw. ac. uk / any questions, feel free to contact me

These slides are at: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/ A Conventional Definition of

These slides are at: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/ A Conventional Definition of Data Quality, continued … Complete: This refers to whether or not the database really contains everything it is supposed to contain. E. g. a patient’s medical records should contain references to all medication prescribed to date for that patient. The BBC Licensing DB should contain an entry for every address in the country. Does it? David Corne, room EM G. 39, x 3410, dwcorne@macs. hw. ac. uk / any questions, feel free to contact me

These slides are at: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/ A Conventional Definition of

These slides are at: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/ A Conventional Definition of Data Quality, continued … Unique: Every separate datum appears only once. How many `Data Quality errors’ can you find in the following table, and what types are they? Surname Firstname Do. B Driving test passed: Smith J. Jack Jock 17/12/85 17/12/95 17/12/05 17/12/2005 David Corne, room EM G. 39, x 3410, dwcorne@macs. hw. ac. uk / any questions, feel free to contact me

These slides are at: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/ A Conventional Definition of

These slides are at: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/ A Conventional Definition of Data Quality, continued … Up-to-date: The data are kept up to date. The post office has just changed my postcode from EX 6 8 RA to EX 6 8 NU. Why does this make it difficult for me to get a sensible quote for home insurance or car insurance? Can you think of a DB where it doesn’t matter whether or not the data are kept up to date? ? David Corne, room EM G. 39, x 3410, dwcorne@macs. hw. ac. uk / any questions, feel free to contact me

These slides are at: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/ A Conventional Definition of

These slides are at: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/ A Conventional Definition of Data Quality, continued … Consistent: The data contains no logical errors or impossibilities. It makes sense in and of itself. Why is the following mini DB inconsistent? Date 23 rd Nov 24 th Nov 25 th Nov Sales £ 25, 609 £ 26, 202 £ 28, 936 Returns £ 1, 003 £ 1, 601 £ 1, 178 Net income £ 24, 506 £ 24, 601 £ 25, 758 David Corne, room EM G. 39, x 3410, dwcorne@macs. hw. ac. uk / any questions, feel free to contact me

These slides are at: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/ The Several Problems with

These slides are at: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/ The Several Problems with the Conventional (or any? ) Definition Fine, we can define Data Quality (DQ) in a way that makes it clear what kinds of issues we are thinking about. But is the definition any use beyond that? Can it be used: • to Measure DQ, so that we can say whether or not one DB is better than another, or so we can gauge improvements in DQ over time. • on a wide range of different DBs, of widely different purposes, sizes, etc. David Corne, room EM G. 39, x 3410, dwcorne@macs. hw. ac. uk / any questions, feel free to contact me

These slides are at: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/ No: Four problems There

These slides are at: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/ No: Four problems There at least four specific problems with this definition if we are going to use it for any other purpose than just discussing DQ issues: Measurability, rigidity, context independence, unhelpfulness David Corne, room EM G. 39, x 3410, dwcorne@macs. hw. ac. uk / any questions, feel free to contact me

These slides are at: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/ DQ Def problems: measurability

These slides are at: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/ DQ Def problems: measurability It is not clear how we might measure quality according to some of these items. Completeness: How will we know? ? Uniqueness: It is hard to tell whether two entries are similar, or duplicates! Up-to-date-ness: How do we know? Consistent: consistency errors can be very hard to find, especially in a very large DB David Corne, room EM G. 39, x 3410, dwcorne@macs. hw. ac. uk / any questions, feel free to contact me

These slides are at: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/ DQ Def problems: rigidity

These slides are at: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/ DQ Def problems: rigidity The conventional definition is rigid and static. It is easy to think of two DBs that are high quality according to this definition, but are nevertheless of entirely different quality. E. g. one may contain much more potentially relevant information. One may have better metadata. Etc… David Corne, room EM G. 39, x 3410, dwcorne@macs. hw. ac. uk / any questions, feel free to contact me

These slides are at: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/ DQ Def problems: context

These slides are at: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/ DQ Def problems: context independence The conventional definition does not take into account the context of the DB. Different things may be important for quality in different circumstances. E. g. a DB for public access must be highly accessible and interpretable; in a DB of experimental results, accuracy is more important than consistency (why? ), etc. David Corne, room EM G. 39, x 3410, dwcorne@macs. hw. ac. uk / any questions, feel free to contact me

These slides are at: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/ DQ Def problems: unhelpfulness/vagueness

These slides are at: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/ DQ Def problems: unhelpfulness/vagueness Partly as a result of the fact that we need different DQ measures of different DBs, depending on context, the conventional definition alone is vague and unhelpful when it comes to using it to provide clues for how to improve data quality. How do we improve accuracy? Depends on how the data are recorded. How do we improve consistency? Depends immensely on the kind of data we have, etc … David Corne, room EM G. 39, x 3410, dwcorne@macs. hw. ac. uk / any questions, feel free to contact me

These slides are at: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/ Towards modern definitions of

These slides are at: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/ Towards modern definitions of DQ An ideal definition of data quality: – Is backed up by a good understanding of how and why errors occur. – Relates to how the data are used – Is helpful in leading to improvements in processes – Is helpful in leading to usable measures/metrics David Corne, room EM G. 39, x 3410, dwcorne@macs. hw. ac. uk / any questions, feel free to contact me

These slides are at: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/ The Data Quality Continuum

These slides are at: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/ The Data Quality Continuum It’s rare that a datum is entered once into a DB and then left alone. Usually, a datum has a long and varied life, into which errors can arise at each and every stage. The continuum is: – – – Data gathering Data delivery Data storage Data integration Data retrieval Data analysis So, if we want to monitor DQ, we need to monitor it at each of these stages David Corne, room EM G. 39, x 3410, dwcorne@macs. hw. ac. uk / any questions, feel free to contact me

These slides are at: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/ DQ Continuum: Example This

These slides are at: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/ DQ Continuum: Example This is an example I am familiar with, helping to illustrate the DQ continuum. The International Seismological Centre (ISC) is in Thatcham, in Berkshire. It’s a charity funded by various governments. Their role is to be the repository for recording all earthquake events on the planet. David Corne, room EM G. 39, x 3410, dwcorne@macs. hw. ac. uk / any questions, feel free to contact me

These slides are at: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/ DQ Continuum: ISC example:

These slides are at: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/ DQ Continuum: ISC example: gathering Data gathering centres ISC David Corne, room EM G. 39, x 3410, dwcorne@macs. hw. ac. uk / any questions, feel free to contact me

These slides are at: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/ DQ Continuum: ISC example:

These slides are at: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/ DQ Continuum: ISC example: delivery Raw seismograph data from local collection points to DG centres. Email or ftp to ISC; some centres provide raw data, some provide interpreted data (e. g. maybe won’t send some data if they believe it in error in the first place) David Corne, room EM G. 39, x 3410, dwcorne@macs. hw. ac. uk / any questions, feel free to contact me

These slides are at: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/ DQ Continuum: ISC example:

These slides are at: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/ DQ Continuum: ISC example: integration The ISC’s role is actually to figure out where and when the Earth tremors were (there are hundreds per month) based on reverse engineering from seismograph readings. They integrate the raw data and attempt to do this, largely by hand brain, and record their findings in archival CDs David Corne, room EM G. 39, x 3410, dwcorne@macs. hw. ac. uk / any questions, feel free to contact me

These slides are at: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/ DQ Continuum: ISC example:

These slides are at: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/ DQ Continuum: ISC example: retrieval/analysis You can get a CD from ISC anytime, for the earth tremor activity on any particular day. I’m not sure whether you can get the raw data from them. Naturally, you can analyse the data and see if you can find inconsistencies or errors. David Corne, room EM G. 39, x 3410, dwcorne@macs. hw. ac. uk / any questions, feel free to contact me

These slides are at: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/ The ISC DQ Continuum

These slides are at: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/ The ISC DQ Continuum Where might there occur errors, of: Accuracy? Completeness? Uniqueness? Timeliness? Consisency? What else is important in this case? David Corne, room EM G. 39, x 3410, dwcorne@macs. hw. ac. uk / any questions, feel free to contact me

These slides are at: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/ Where DQ problems occur

These slides are at: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/ Where DQ problems occur (gathering) • Manual data entry (how can we improve this? ) • Lack of uniform standards format and content. • Duplicates arising from parallel entry • Approximations, alternatives, entries altered in order to cope with s/w and/or h/w constraints. • Measurement errors. David Corne, room EM G. 39, x 3410, dwcorne@macs. hw. ac. uk / any questions, feel free to contact me

These slides are at: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/ Where DQ problems occur

These slides are at: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/ Where DQ problems occur (delivery) • Multiple hops from source to DB – problems can happen anywhere • Inappropriate pre-processing (e. g. removing some `small’ seismograph readings before sending on to ISC; rounding up or down, when the destination needs more accurate data). • Transmission problems: buffer overflows, checks (did all files arrive, and all correctly? ) David Corne, room EM G. 39, x 3410, dwcorne@macs. hw. ac. uk / any questions, feel free to contact me

These slides are at: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/ Where DQ problems occur

These slides are at: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/ Where DQ problems occur (storage) • Poor, out of date or inappropriate metadata • Missing timestamps • conversion to storage format (e. g. to excel files, to higher/lower precision David Corne, room EM G. 39, x 3410, dwcorne@macs. hw. ac. uk / any questions, feel free to contact me

These slides are at: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/ Where DQ problems occur

These slides are at: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/ Where DQ problems occur (integration) This is the business of combining datasets – e. g. from different parts of a company, from (previously) different companies following an acquisition; from different government agencies, etc. • Different keys, different fields, different formats • Different definitions (`customer’, `income’, …) • Sociological factors: reluctance to share! David Corne, room EM G. 39, x 3410, dwcorne@macs. hw. ac. uk / any questions, feel free to contact me

These slides are at: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/ Where DQ problems occur

These slides are at: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/ Where DQ problems occur (retrieval/analysis) The problem here is usually the quality of DBs that store the retrieved data, or the use of the retrieved data in general. Problems arise because: • The source DB is not properly understood! • Straightforward mistakes in the queries that retrieve the relevant data. E. g. A database of genes contains entries that indicate whether or not each gene has a known or suspected link with cancer. A retrieval/analysis task leads to publishing a list of genes that are not relevant to cancer. What is the problem here? David Corne, room EM G. 39, x 3410, dwcorne@macs. hw. ac. uk / any questions, feel free to contact me

These slides are at: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/ What this lecture was

These slides are at: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/ What this lecture was about • • A conventional definition of DQ Problems with that definition The DQ Continuum Where errors might occur along that continuum David Corne, room EM G. 39, x 3410, dwcorne@macs. hw. ac. uk / any questions, feel free to contact me