Towards Certain Fixes with Editing Rules and Master

  • Slides: 27
Download presentation
Towards Certain Fixes with Editing Rules and Master Data Wenfei Fan Shuai Ma Nan

Towards Certain Fixes with Editing Rules and Master Data Wenfei Fan Shuai Ma Nan Tang Wenyuan Yu University of Edinburgh Jianzhong Li Harbin Institute of Technology

What is wrong with our data? 81 million National Insurance numbers but only 60

What is wrong with our data? 81 million National Insurance numbers but only 60 million eligible citizens • In a 500, 000 customer database, 120, 000 customer records become invalid within 12 months 500, 000 dead people • Data error rates in industry: 1% - 30% (Redman, 1998) Pentagon asked 200+ retain active dead officers to re-enlist Medicare cards Real-life data is often dirty

Dirty data is costly ü Poor data costs US businesses $611 billion annually ü

Dirty data is costly ü Poor data costs US businesses $611 billion annually ü Erroneously priced data in retail databases costs US 2000 In US, 98, 000 customers $2. 5 billion each year deaths ü 1/3 of systemeach development projectscaused were forced year were by to delay or 2001 cancel due to poor data quality ü errors in medical data 30%-80% of the development time and budget for data 1998 warehousing are for data cleaning These highlight the need for data cleaning

Integrity constraints ü A variety of integrity constraints were developed to capture inconsistencies: –

Integrity constraints ü A variety of integrity constraints were developed to capture inconsistencies: – Functional dependencies (FDs) – Inclusion dependencies (INDs) – Conditional functional dependencies (CFDs) – Denial constraints – … FN LN AC phn Bob Brady 020 079172485 type 2 str city zip item Edi EH 7 4 AH CD 501 Elm St. Edi 1. [AC=020] →[city=Ldn] 2. [AC=131] →[city=Edi] These constraints help us determine whether data is dirty or not, however… 4

Limitation of previous method This does not fix the error t[AC], and worse still,

Limitation of previous method This does not fix the error t[AC], and worse still, messes up the correct attribute t[city] 1. [AC=020] →[city=Ldn] 2. [AC=131] →[city=Edi] FN LN AC phn type str city zip item 020 Bob Brady 020079172485 2 2 501 Elm Elm. St. Ldn Edi EH 74 AH CD CD t. Bob Edi Bob Brady 020 131 079172485 2 501 Elm St. Edi EH 7 4 AH CD Data cleaning methods based on integrity constraints only capture 5 inconsistencies

The quest for a new data cleaning approach ü The previous methods do not

The quest for a new data cleaning approach ü The previous methods do not guarantee that we have certain fixes – 100% correct fix. They do not work when repairing critical data Seemingly minor errors mean life or death! ü In fact we want a data cleaning method to guarantee the following: – Every update guarantees to fix an error, although we may not fix all the errors; – Repairing process does not introduce new error. We need certain fixes when cleaning critical data 6

Outline ü A approach to computing certain fixes – – Data monitoring Master data

Outline ü A approach to computing certain fixes – – Data monitoring Master data Editing rules Certain regions ü Fundamental problems ü Heuristic algorithms for computing certain regions ü Experimental study 7

How do we achieve certain fixes? …… t …… Data Monitoring far less costly

How do we achieve certain fixes? …… t …… Data Monitoring far less costly to correct a tuple at the point of data entry than fixing it afterward. 8

How do we achieve certain fixes? …… Master Data FN Data Monitoring Master data

How do we achieve certain fixes? …… Master Data FN Data Monitoring Master data is a single repository of high-quality data that provides various applications with a synchronized, consistent view of its core business entities. Mphn LN AC Hphn str city zip DOB gender Robert Brady 131 6884563 079172485 501 Elm Row Edi Mark …… t EH 7 4 AH 11/11/55 M Smiths 020 6884563 075568485 20 Baker St. Ldn NW 1 6 XE 25/12/67 M Master relation Dm 9

How do we achieve certain fixes? …… t Master Data …… Data Monitoring Editing

How do we achieve certain fixes? …… t Master Data …… Data Monitoring Editing Rules Σ Editing rules are a class of new data quality rules, which tell us how to fix data. 10

Editing Rules 1 – home phone 2 – mobile phone certain type=2 t 1

Editing Rules 1 – home phone 2 – mobile phone certain type=2 t 1 FN LN AC phn type str city zip item Robert Bob Brady 131 020 079172485 2 501 Elm Row St. 501 Edi EH 7 4 AH CD Input relation R FN LN AC Hphn Mphn str city zip DOB s 1 Robert Brady 131 6884563 079172485 501 Elm Row Edi s 2 Smiths 020 6884563 075568485 20 Baker St. Ldn NW 1 6 XE 25/12/67 Mark EH 7 4 AH 11/11/55 gender M • φ1: ((zip, zip) → (AC, str, city), tp 1 = ( )) M Master relation Dm • φ4: ((phn, Mphn) → (FN, LN), tp 2[type] = (2)) Applying editing rules don’t introduce new errors 11

Editing rules vs. integrity constraints ü Dynamic semantics – Editing rules tell us which

Editing rules vs. integrity constraints ü Dynamic semantics – Editing rules tell us which attributes to change and how to change them – Integrity constraints have static semantics. ü Information from master data – Editing rules are defined on two relation (master relation and input relation). – Some integrity constraints (e. g. FDs, CFDs) are usually defined on a single relation. ü Certain attributes – Editing rules rely on certain attributes – Integrity constraints don’t. Editing rules are quite different from integrity constraints 12

Regions certain FN LN AC phn type str city zip item Bob Brady 020

Regions certain FN LN AC phn type str city zip item Bob Brady 020 079172485 2 501 Elm Row St. 501 Edi EH 7 4 AH CD ü A region is a pair (Z, Tc), – (Z, Tc) = (AC, phn, Mphn type) FN LN • ZAC Hphn str city zip DOB Robert Brady 131= 6884563 Elm Row any Edi value, EH 7 4 AH • Tc {(0800, 079172485 _, 1)} /*501{(≠ 0800, =111/11/55 )}*/ Mark gender M Not 25/12/67 satisfying Smiths 020 6884563 075568485 20 Baker St. Ldn NW 1 6 XE M (Z, Tc) type ≠ 1 Not satisfying (Z, Tc) • φ1: ((zip, zip) → (AC, str, city), tp 1 = ( )) Satisfying (Z, Tc) t[Z] is not certain FN LN AC phn type Bob Brady 020 079172485 2 Rob Brady 131 688456 1 Robert Brady 131 6884563 1 str zip item 501 Elm St. Edi EH 7 4 AH CD 501 Elm St. Edi EH 7 4 AH BOOK Null city Ldn null CD × × √ Tuple t satisfying a region (Z, Tc): t[Z] is certain AND t[Z] match Tc 13

Fundamental problems - Unique fixes ü φ1: ((zip, zip) → (str, city), tp 1

Fundamental problems - Unique fixes ü φ1: ((zip, zip) → (str, city), tp 1 = ( )) ü When φ2: When (((phn, Hphn), (AC, AC)) (city), tp 2[type, AC] = 0800)) t[zip, t[AC, phn, phn] zip, is city] certain, is→certain, there exists multiple a (1, unique fixes fix Mary Brady 020 6884563 1 501 Elm Elm. Row St. 501 Edi Ldn EH 7 4 AH CD Mary Brady 020 6884563 1 501 Elm Elm. Row St. Edi EH 7 4 AH CD certain t FN LN AC Phn Type Str city zip item Mary Brady 020 6884563 1 501 Elm St. Edi EH 7 4 AH CD Input relation R FN LN Robert Brady Mark AC Hphn Mphn str city 131 6884563 079172485 501 Elm Row Edi Smiths 020 6884563 075568485 20 Baker St. zip DOB gender EH 7 4 AH 11/11/55 M Ldn NW 1 6 XE 25/12/67 M Master relation Dmdon’t introduce conflicts We must ensure that editing rules 14

Consistency problem ü Input : rules Σ, master relation Dm, input relaton R, region

Consistency problem ü Input : rules Σ, master relation Dm, input relaton R, region (Z, Tc) ü Output: – True, if each tuple statisfying (Z, Tc) has a unique fix; – False, otherwise. co. NP-complete Coverage problem is intractable

Unique fixes are not enough • φ1: ((zip, zip) → (AC, city, str), tp

Unique fixes are not enough • φ1: ((zip, zip) → (AC, city, str), tp 1 = ( )) • φ2: ((phn, Mphn) → (FN, LN), tp 2[type] = (2)) Is t[FN, AC] LN, =item] • φ3: (((phn, Hphn), (AC, AC) → (str, city, zip), tp 3[type, (1, correct? 0800)) t FN LN AC Robot Brady 131 certain phn type str city 688456 1 501 Elm Row St. 501 Edi certain zip item EH 7 4 AH BOOK Input relation R FN s LN Robert Brady AC Hphn Mphn str 131 6884563 079172485 501 Elm Row city Edi zip DOB EH 7 4 AH 11/11/55 gender M Master relation Dm Region. Not (Z, all Tc), where Z =be (AC, phn, type, Tc = {[_, _, _, _]} errors could fixed even if it zip), is consistent 16

Certain region ü ((zip, We say (Z, city, Tc) str), is a tp 1

Certain region ü ((zip, We say (Z, city, Tc) str), is a tp 1 certain • φ1: zip) that → (AC, = ( ))region for (Σ, Dm), if for any tuple • φ2: ((phn, Mphn) (Z, → (FN, t satisfying Tc), LN), tp 2[type] = (2)) • φ3: (((phn, Hphn), (AC, AC) → a(str, city, zip), tp 3[type, = (1, attributes in – Not only tuple t has unique fix, but also: AC] all the 0800)) tuple t could be correctly fixed. We call this “certain fix” t certain FN LN AC phn type str city zip item Robot Robert Brady 131 079172485 2 501 Elm Row St. 501 Edi EH 7 4 AH BOOK zip DOB Input relation R FN LN Robert Brady AC Hphn Mphn str city 131 6884563 079172485 501 Elm Row Edi EH 7 4 AH 11/11/55 gender M Master relation Dm (Z, fixes: Tc) , all where Z=(phn, type, zip, item) and correct Certain the attributes in t are guaranteed Tc[phn, type, zip]={[079172485, 2, ”EH 7 4 AH”]} 17

Coverage problem ü Input : rules Σ, master relation Dm, input relation R, region

Coverage problem ü Input : rules Σ, master relation Dm, input relation R, region (Z, Tc) ü Output: – True, if each tuple satisfying (Z, Tc) has a certain fix; – False, otherwise. ü co. NP-complete Coverage problem is intractable 18

How do we achieve certain fixes? t is clean now …… …… t We

How do we achieve certain fixes? t is clean now …… …… t We want find certain region (Z, Tc) with minimum |Z| : to reduce the If t satisfies (Z, Tc), on assuring the we can fix all otherusers’ efforts Data certa correctness of t[Z] in re attributes gion Monitoring Masterin t. Data Editing Rules Σ Computing Candidate Certain Regions regio k certain ns Users We compute a set of Users choose certain one (Z, regions Tc), and assure the correctness of t[Z] for users to choose Computing candidate certain regions becomes the central problem 19

Challenges of computing certain regions Problems Compute the minimum Z that (Z, Tc) Complexity

Challenges of computing certain regions Problems Compute the minimum Z that (Z, Tc) Complexity is a certain region, and Tc ≠ Φ. Consistency co. NP-complete Coverage co. NP-complete Z-minimum NP-complete Approximation-Hard Z-valadating NP-complete Z-counting #P-complete Computing optimal certain regions is challenging 20

Heuristic algorithm for computing certain regions AC=020 zip=EH 9 str=20 Baker St zip=NW 1

Heuristic algorithm for computing certain regions AC=020 zip=EH 9 str=20 Baker St zip=NW 1 str=501 Elm Row city=Ldn Adopt heuristic for zip=EH 8 AC=131 algorithm enumerating cliques city=Edi Computing Reduced Compatible Graph G Finding k maximal cliques in G Generating a certain region for each maximal clique We can guarantee to find a non-empty set of certain regions 21

Experimental Study – Data sets ü HOSP (Hospital Compare) data is publicly available from

Experimental Study – Data sets ü HOSP (Hospital Compare) data is publicly available from U. S. Department of Health & Human Services. – There are 37 editing rules designed for HOSP. ü DBLP data is from the DBLP Bibliography. – There are 16 editing rules designed for DBLP. ü TPC-H data is from the TPC-H dbgen generator. – There are 55 editing rules designed for TPC-H. ü RAND data was randomly generated for scalability test. Both real life and synthetic data were used to evaluate our algorithm

Tuple Level Recall recalltuple = # of corrected tuples / # of error tuples

Tuple Level Recall recalltuple = # of corrected tuples / # of error tuples Varying |Dm| More informative master data is, the more tuples can be fixed

Attribute Level F-Measure F-measure = 2(recallattr · precisionattr)/(recallattr + precisionattr) We compared our approach

Attribute Level F-Measure F-measure = 2(recallattr · precisionattr)/(recallattr + precisionattr) We compared our approach with Inc. Rep – an incremental algorithm for data repairing Varying using CFDs. noise rate Our approach generally out performs in F-Measure

Scalability Varying #Varying of Varying maximal |Dm|Σ| |cliques Our algorithm scales well with large

Scalability Varying #Varying of Varying maximal |Dm|Σ| |cliques Our algorithm scales well with large |Dm|, k and |Σ|

Conclusion In the context of previous approachs, this one is to find certain fixes

Conclusion In the context of previous approachs, this one is to find certain fixes and …… guarantee the correctness of repairing. Master Data Editing Rules Σ t …… Data certa in re gion Monitoring Fundamental problems and their complexity and approximation bounds Computing User ions g e r in a t r Candidate k ce Editing rules Certain Regions. A graph-based heuristic algorithm A first step towards certain fixes with editing rules and master data 26

Future Work …… …… t Master Data Editing Rules Σ Data certa in re

Future Work …… …… t Master Data Editing Rules Σ Data certa in re Monitoring ion data? Cleaning collectiongof Heuristic algorithm for consistency? Computing Candidate Certain Regions regio k certain Discovering editing rules? Naturally much more to be done ns User 27