April 1 st ICDE 2014 Floris Geerts University

  • Slides: 41
Download presentation
April, 1 st ICDE 2014 Floris Geerts (University of Antwerp) Giansalvatore Mecca, Donatello Santoro

April, 1 st ICDE 2014 Floris Geerts (University of Antwerp) Giansalvatore Mecca, Donatello Santoro (Università della Basilicata) Paolo Papotti (Qatar Computing Research Institute)

Overview ‣ Motivations and Goals ‣ Semantics ‣ Experimental Results Mapping and Cleaning –

Overview ‣ Motivations and Goals ‣ Semantics ‣ Experimental Results Mapping and Cleaning – F. Geerts, G. Mecca, P. Papotti, D. Santoro 2 April, 1 2014

Overview ‣ Motivations and Goals ‣ Semantics ‣ Experimental Results Mapping and Cleaning –

Overview ‣ Motivations and Goals ‣ Semantics ‣ Experimental Results Mapping and Cleaning – F. Geerts, G. Mecca, P. Papotti, D. Santoro 3 April, 1 2014

A Mapping and Cleaning Task Source 1 Source 2 STRONGL Y INTERREL ATE D

A Mapping and Cleaning Task Source 1 Source 2 STRONGL Y INTERREL ATE D PROBLEM S 4 Target Schema Mapping System Source 3 Mapping and Cleaning – F. Geerts, G. Mecca, P. Papotti, D. Santoro Data Cleaning Tools April, 1 2014

5 A Motivating Example SOURCE #1 C ( ONFIDENCE 0. 7) Medical. Treatments SS

5 A Motivating Example SOURCE #1 C ( ONFIDENCE 0. 7) Medical. Treatments SS N Name Phone W. t 1 124 Smith SOURCE #2 Str City Insur Treat t 2 123 Name Phone Str t 3 123 Med SOURCE #3 LA Phone Ph. Co Str. City nf CC# t 4 222 L. Lennon 1221876 0. 9 Null NY 781658 t 5 222 L. Lennon 0000000 0. 0 Fry SF 784659 Date Treatments Eye 12/01/201 surg. 3 (CONFIDENCE 1. 0) SSN Salary Master Data tm Name City W. Smith 324 -0000 Pico Blv. Insuranc Treat. e Customers SSN Surgeries SSN TARGET 324 Pico Lapa 03/11/201 LA Med Blvd r. 3 (C 3455 ONFIDENCE 0. 5) Patients SSN Date SSN Name Phone Street City 222 F. Lennon 122 -1876 Sky Dr. SF Insuranc e Treat. Date t 6 222 10 k Abx Dental 10/01/11 t 7 222 25 k Abx Cholest. 08/12/12 Mapping and Cleaning – F. Geerts, G. Mecca, P. Papotti, D. Santoro April, 1 2014

6 A Motivating Example SOURCE #1 C ( ONFIDENCE 0. 7) Medical. Treatments SS

6 A Motivating Example SOURCE #1 C ( ONFIDENCE 0. 7) Medical. Treatments SS N Name Phone W. t 1 124 Smith SOURCE #2 Str City Insur Treat 324 Pico Lapa 03/11/201 LA Med Blvd r. 3 (C 3455 ONFIDENCE 0. 5) Step 1: To exchange data from source to target Patients SSN t 2 123 Name Phone Str t 3 Insuranc Treat. e 123 Med SOURCE #3 LA Phone Ph. Co Str. City nf CC# t 4 222 L. Lennon 1221876 0. 9 Null NY 781658 t 5 222 L. Lennon 0000000 0. 0 Fry SF 784659 Date Treatments Eye 12/01/201 surg. 3 (CONFIDENCE 1. 0) SSN Salary Master Data tm Name City W. Smith 324 -0000 Pico Blv. TARGET Customers SSN Surgeries SSN Date SSN Name Phone Street City 222 F. Lennon 122 -1876 Sky Dr. SF Insuranc e Treat. Date t 6 222 10 k Abx Dental 10/01/11 t 7 222 25 k Abx Cholest. 08/12/12 Mapping and Cleaning – F. Geerts, G. Mecca, P. Papotti, D. Santoro April, 1 2014

7 A Motivating Example SOURCE #1 C ( ONFIDENCE 0. 7) ST-TGD [Popa et

7 A Motivating Example SOURCE #1 C ( ONFIDENCE 0. 7) ST-TGD [Popa et al. , VLDB’ 02 Medical. Treatments SS N Name Phone W. t 1 124 Smith SOURCE #2 t 2 Str Schema. Date Mappings City Insur Treat trasformation can be expressed Customers as a set of source to target Ph. Co Patients SSN Name Phone nf tuple generating dependencies Phone Str City 122 t 4 222 L. Lennon 0. 9 (st-tgds) 1876 324 -0000 Pico Blv. LA 324 Pico Lapa 03/11/201 LA Med Blvd r. 3 (C 3455 ONFIDENCE 0. 5) SSN Name 123 W. Smith t 5 222 L. Lennon Surgeries SSN t 3 Insuranc Treat. e 123 Med SOURCE #3 0000000 Str. City CC# Null NY 781658 0. 0 Fry SF 784659 Date Treatments Eye 12/01/201 surg. 3 (CONFIDENCE 1. 0) SSN Salary Master Data tm TARGET SSN Name Phone Street City 222 F. Lennon 122 -1876 Sky Dr. SF Insuranc e Treat. Date t 6 222 10 k Abx Dental 10/01/11 t 7 222 25 k Abx Cholest. 08/12/12 Mapping and Cleaning – F. Geerts, G. Mecca, P. Papotti, D. Santoro April, 1 2014

8 A Motivating Example SOURCE #1 C ( ONFIDENCE 0. 7) ST-TGD Medical. Treatments

8 A Motivating Example SOURCE #1 C ( ONFIDENCE 0. 7) ST-TGD Medical. Treatments SS N Name t 1 124 W. Smith Phone 3243455 Str Pico Blvd City Insur Treat LA Med Date TARGET Lapa 03/11/201 r. 3 Customers SSN Name Ph. Co Str. City nf CC# 1221876 0. 9 Null NY 781658 222 L. Lennon 0000000 0. 0 Fry SF 784659 124 W. Smith 324 Pic 0. 7 LA 3455 o Treatments t 4 222 L. Lennon Source-to-Target TGD t 5 Med. Treat(ssn, n, p, s, c, i, t, d) → ∃Y 3, Y 4 : Cust(ssn, n, p, 0. 7, s, c, t 8 Y 3), Treat(ssn, Y 4, i, t, d) Phone SSN Salary Null Insuranc e Treat. Date t 6 222 10 k Abx Dental 10/01/11 t 7 222 25 k Abx Cholest. 08/12/12 t 9 124 Null Med Lapar. 03/11/2013 Mapping and Cleaning – F. Geerts, G. Mecca, P. Papotti, D. Santoro April, 1 2014

9 A Motivating Example SOURCE #2 (CONFIDENCE 0. 5) ST-TGD Patients SSN t 2

9 A Motivating Example SOURCE #2 (CONFIDENCE 0. 5) ST-TGD Patients SSN t 2 123 Name Phone Str W. Smith 324 -0000 Pico Blv. Pre-Solution for the TGDs City TARGET LA Customers Surgeries Insuranc SSN Treat. e t 3 123 Med Eye surg. SSN Date 12/01/201 3 Name t 4 222 L. Lennon Source-to-Target TGD t 5 Pat(ssn, n, p, s, c), Surg(ssn, i, t, d) → ∃Y 3, Y 4 : Cust(ssn, n, p, 0. 5, s, c, t 8 Y 3), t 10 Treat(ssn, Y 4, i, t, d) 222 L. Lennon Phone Ph. Co Str. City nf CC# 1221876 0. 9 Null NY 781658 0000000 0. 0 Fry SF 784659 324 Pic 0. 7 LA Null 3455 o Treatments 324 Pic 123 W. Smith 0. 5 LA Null Insuranc o SSN Salary 0000 Treat. Date e 124 W. Smith t 6 222 10 k Abx Dental 10/01/11 t 7 222 25 k Abx Cholest. 08/12/12 t 9 124 Null Med Lapar. 03/11/2013 t 11 123 Mapping and Cleaning – F. Geerts, G. Mecca, P. Papotti, D. Santoro Null Med Eye surg. 12/01/2013 April, 1 2014

10 A Motivating Example SOURCE #1 C ( ONFIDENCE 0. 7) Medical. Treatments SS

10 A Motivating Example SOURCE #1 C ( ONFIDENCE 0. 7) Medical. Treatments SS N Name Phone W. t 1 124 Smith SOURCE #2 Str City Insur Treat t 2 123 Step 2: To ensure Data Quality SSN Name Phone Str W. Smith 324 -0000 Pico Blv. Insuranc Treat. e 123 Med SOURCE #3 Name City LA Phone Ph. Co Str. City nf CC# t 4 222 L. Lennon 1221876 0. 9 Null NY 781658 t 5 222 L. Lennon 0000000 0. 0 Fry SF 784659 Date Treatments Eye 12/01/201 surg. 3 (CONFIDENCE 1. 0) SSN Salary Master Data tm Customers SSN Surgeries t 3 TARGET 324 Pico Lapa 03/11/201 LA Med Blvd r. 3 (C 3455 ONFIDENCE 0. 5) Patients SSN Date SSN Name Phone Street City 222 F. Lennon 122 -1876 Sky Dr. SF Insuranc e Treat. Date t 6 222 10 k Abx Dental 10/01/11 t 7 222 25 k Abx Cholest. 08/12/12 Mapping and Cleaning – F. Geerts, G. Mecca, P. Papotti, D. Santoro April, 1 2014

11 A Motivating Example Functional Dependencies ST-TGD FD fd 1. Cust: SSN→ Name, Phone,

11 A Motivating Example Functional Dependencies ST-TGD FD fd 1. Cust: SSN→ Name, Phone, Str, City, CC# fd 2. Cust: Name, Str, City → SSN Name fd 3. Treat: SSN → Salary Inclusion Dependencies id 4. Treat[SSN] ⊆ Customers[SSN] Conditional Functional Dependencies cfd 5. Treat: Insur[‘Abx’] → Tr[‘Dental’] ID CFD ER Customers Phone Ph. Co Str. City nf CC# t 4 222 L. Lennon 1221876 0. 9 Null NY 781658 t 5 222 L. Lennon 0000000 0. 0 Fry SF 784659 324 Pic 0. 7 LA Null 3455 o Treatments 324 Pic t 10 123 W. Smith LA Null Insuranc 0. 5 o SSN Salary 0000 Treat. Date e t 8 124 W. Smith cfd 6. IF Treat: Insur[‘Abx’] THEN Cust: City[‘SF’] Editing Rules t 6 222 10 k Abx Dental 10/01/11 t 7 222 25 k Abx Cholest. 08/12/12 t 9 124 Null Med Lapar. 03/11/2013 er 7. IF Cust. SSN = MD. SSN, Cust. Phone = MD. Phone → TAKE Name, Street from MD t 11 123 Null Med Eye surg. 12/01/2013 Master Data Mapping and Cleaning – F. Geerts, G. Mecca, P. Papotti, D. Santoro tm SSN Name 222 F. Lennon Phone Street 2014 122 -1876 April, Sky 1 Dr.

12 A Motivating Example Functional Dependencies ST-TGD FD fd 1. Cust: SSN→ Name, Phone,

12 A Motivating Example Functional Dependencies ST-TGD FD fd 1. Cust: SSN→ Name, Phone, Str, City, CC# fd 2. Cust: Name, Str, City → SSN Name fd 3. Treat: SSN → Salary Inclusion Dependencies t 4 222 L. Lennon VIOLATIONS id 4. Treat[SSN] ⊆ Customers[SSN] Conditional Functional Dependencies cfd 5. Treat: Insur[‘Abx’] → Tr[‘Dental’] t 5 222 L. Lennon ID CFD ER Customers Phone Ph. Co Str. City nf CC# 1221876 0. 9 Null NY 781658 0000000 0. 0 Fry SF 784659 324 Pic 0. 7 LA Null 3455 o Treatments 324 Pic t 10 123 W. Smith LA Null Insuranc 0. 5 o SSN Salary 0000 Treat. Date e t 8 124 W. Smith cfd 6. IF Treat: Insur[‘Abx’] THEN Cust: City[‘SF’] Editing Rules t 6 222 10 k Abx Dental 10/01/11 t 7 222 25 k Abx Cholest. 08/12/12 t 9 124 Null Med Lapar. 03/11/2013 er 7. IF Cust. SSN = MD. SSN, Cust. Phone = MD. Phone → TAKE Name, Street from MD t 11 123 Null Med Eye surg. 12/01/2013 Master Data Mapping and Cleaning – F. Geerts, G. Mecca, P. Papotti, D. Santoro tm SSN Name 222 F. Lennon Phone Street 2014 122 -1876 April, Sky 1 Dr.

13 A Motivating Example SOURCE #1 C ( ONFIDENCE 0. 7) Medical. Treatments SS

13 A Motivating Example SOURCE #1 C ( ONFIDENCE 0. 7) Medical. Treatments SS N Name Phone W. t 1 124 Smith SOURCE #2 Str City Insur Treat t 2 123 t 3 Name Phone Ph. Co Str. City nf CC# Previous Semantics? Name Phone Str City W. Smith 324 -0000 Pico Blv. LA Insuranc Treat. e 123 Med SOURCE #3 t 4 222 L. Lennon 1221876 0. 9 Null NY 781658 t 5 222 L. Lennon 0000000 0. 0 Fry SF 784659 Date Treatments Eye 12/01/201 surg. 3 (CONFIDENCE 1. 0) SSN Salary Master Data tm Customers SSN Surgeries SSN TARGET 324 Pico Lapa 03/11/201 LA Med Blvd r. 3 (C 3455 ONFIDENCE 0. 5) Patients SSN Date SSN Name Phone Street City 222 F. Lennon 122 -1876 Sky Dr. SF Insuranc e Treat. Date t 6 222 10 k Abx Dental 10/01/11 t 7 222 25 k Abx Cholest. 08/12/12 Mapping and Cleaning – F. Geerts, G. Mecca, P. Papotti, D. Santoro April, 1 2014

14 Data Exchange [Fagin et al. , TCS ’ 05] ST-TGD • Elegant semantics

14 Data Exchange [Fagin et al. , TCS ’ 05] ST-TGD • Elegant semantics • Scalable algorithms ID FD CFD ER fd 1. Cust: SSN→ Name, Phone, Str, City, CC# Customers SSN t 4 222 Name Phone L. 122 -1876 Lennon L. t 5 222 000 -0000 Lennon t 8 124 W. Smith 324 -3455 Soft Violation Ph. Co Str. City CC# nf Null NY 0. 0 78465 Fry SF 9 0. 7 Pico LA Mapping and Cleaning – F. Geerts, G. Mecca, P. Papotti, D. Santoro W. 78165 8 0. 9 Hard Violation Null April, 1 2014

15 Data Repairing • FD ID CFD ER Hard ERViolation AC TIO Customers N!

15 Data Repairing • FD ID CFD ER Hard ERViolation AC TIO Customers N! INT TGD Many approaches and techniques SSN Name Phone [Bohannon SIGMOD ’ 05] [Cong VLDB ’ 07] 122 t 4 222 L. Lennon [Kolahi ICDT ’ 09] [Fan VLDB ’ 10] 1876 [Beskales VLDB ’ 10] • • t 5 222 L. Lennon No support for mapping • No way to handle our example 0000000 Ph. Co Str. City nf CC# 0. 9 Null NY 781658 0. 0 Fry SF 784659 324 Pic 0. 7 LA Null 3455 o Treatments 324 Pic t 10 123 W. Smith LA Null Insuranc 0. 5 o SSN Salary 0000 Treat. Date e t 8 124 W. Smith Main-memory implementation only! t 6 222 10 k Abx Dental 10/01/11 t 7 222 25 k Abx Cholest. 08/12/12 t 9 124 Null Med Lapar. 03/11/2013 t 11 123 Null Med Eye surg. 12/01/2013 Master Data Mapping and Cleaning – F. Geerts, G. Mecca, P. Papotti, D. Santoro tm SSN Name 222 F. Lennon Phone Street 2014 122 -1876 April, Sky 1 Dr.

16 Pipeline Pre. Solution for STTGDs Source 1 Source 2 Cleaned Target Data Repairin

16 Pipeline Pre. Solution for STTGDs Source 1 Source 2 Cleaned Target Data Repairin g Data Exchang e ✔ Mappings ✔ ✗ Cleaning Rules ✔ Cleaning ✔ ✗ Mappings Rules • Negative Result: There exist scenarios such that pipeline doesn’t return solutions • Even when it works, its quality is usually poor Mapping and Cleaning – F. Geerts, G. Mecca, P. Papotti, D. Santoro April, 1 2014

17 Pipeline TARGET SOURCE #1 C ( ONFIDENCE 0. 7) Customers Medical. Treatments SS

17 Pipeline TARGET SOURCE #1 C ( ONFIDENCE 0. 7) Customers Medical. Treatments SS N Name Phone W. t 1 124 Smith SOURCE #2 Str City Insur Treat 324 Pico Lapa LA Med Blvd r. (C 3455 ONFIDENCE 0. 5) Patients SSN t 2 123 Name Phone Str W. Smith 324 -0000 Pico Blv. SSN Name Phone Ph. Co Str. City nf CC# t 4 222 L. Lennon 1221876 0. 9 Null NY 781658 t 5 222 L. Lennon 0000000 0. 0 Fry SF 784659 City LA Surgeries Insuranc SSN Treat. e t 3 123 Med Eye surg. Treatments Date 12/01/201 3 SSN Salary Insuranc e Treat. Date t 6 222 10 k Abx Dental 10/01/11 t 7 222 25 k Abx Cholest. 08/12/12 Mapping and Cleaning – F. Geerts, G. Mecca, P. Papotti, D. Santoro April, 1 2014

18 Pipeline TARGET SOURCE #1 C ( ONFIDENCE 0. 7) Customers Medical. Treatments SS

18 Pipeline TARGET SOURCE #1 C ( ONFIDENCE 0. 7) Customers Medical. Treatments SS N Name Phone W. t 1 124 Smith SOURCE #2 Str City Insur Treat 324 Pico Lapa LA Med Blvd r. (C 3455 ONFIDENCE 0. 5) Patients SSN t 2 123 Name Phone Str W. Smith 324 -0000 Pico Blv. t 3 123 Insuranc Treat. e Med Eye surg. Name Ph. Co Str. City nf CC# 1221876 0. 9 Null NY 781658 t 5 222 L. Lennon 0000000 0. 0 Fry SF 784659 t 8 124 W. Smith 3243455 0. 7 Pic LA o LA Null Pre. Solution for Pic TGDs 324 - Date 12/01/201 3 Phone t 4 222 L. Lennon t 10 123 W. Smith Surgeries SSN City SSN Salary 0. 5 0000 o Treatments LA Null Insuranc e Treat. Date t 6 222 10 k Abx Dental 10/01/11 t 7 222 25 k Abx Cholest. 08/12/12 t 9 124 Null Med Lapar. 03/11/2013 t 11 123 Null Med Mapping and Cleaning – F. Geerts, G. Mecca, P. Papotti, D. Santoro Eye surg. 12/01/2013 April, 1 2014

19 Pipeline TARGET SOURCE #1 C ( ONFIDENCE 0. 7) Customers Medical. Treatments SS

19 Pipeline TARGET SOURCE #1 C ( ONFIDENCE 0. 7) Customers Medical. Treatments SS N Name Phone W. t 1 124 Smith SOURCE #2 Str City Insur Treat 324 Pico Lapa LA Med Blvd r. (C 3455 ONFIDENCE 0. 5) Patients SSN t 2 123 Name Phone Str W. Smith 324 -0000 Pico Blv. t 3 123 Insuranc Treat. e Med Eye surg. LA Name Ph. Co Str. City nf CC# 1221876 0. 9 Null NY 781658 12 L. Lennon t 5 222 3 0000000 0. 0 Fry SF 784659 3243455 Name, Str, t 8 124 W. Smith fd 2. Cust: SSN Date 12/01/201 3 Phone t 4 222 L. Lennon t 10 123 W. Smith Surgeries SSN City SSN Salary 0. 7 City Pic LA o → 324 Pic 0. 5 LA 0000 o Treatments Null Insuranc e Treat. Date t 6 222 10 k Abx Dental 10/01/11 t 7 222 25 k Abx Cholest. 08/12/12 t 9 124 Null Med Lapar. 03/11/2013 t 11 123 Null Med Mapping and Cleaning – F. Geerts, G. Mecca, P. Papotti, D. Santoro Eye surg. 12/01/2013 April, 1 2014

20 Pipeline TARGET SOURCE #1 C ( ONFIDENCE 0. 7) Customers Medical. Treatments SS

20 Pipeline TARGET SOURCE #1 C ( ONFIDENCE 0. 7) Customers Medical. Treatments SS N Name Phone W. t 1 124 Smith SOURCE #2 Str City Insur Treat 324 Pico Lapa LA Med Blvd r. (C 3455 ONFIDENCE 0. 5) Patients SSN t 2 123 Name Phone Str W. Smith 324 -0000 Pico Blv. t 3 123 Insuranc Treat. e Med Eye surg. LA Name Phone Ph. Co Str. City nf CC# t 4 222 L. Lennon 1221876 0. 9 Null NY 781658 12 L. Lennon t 5 222 3 0000000 0. 0 Fry SF 784659 t 8 124 W. Smith 3243455 0. 7 Pic LA o Null 324 Pic 0. 5 LA Null 0000 o 324 -Treatments Pic t 12 124 W. Smith 0. 7 LA Null 3455 o Insuranc SSN Salary Treat. Date e t 10 123 W. Smith Surgeries SSN City SSN Date 12/01/201 3 t 6 222 10 k Abx Dental 10/01/11 t 7 222 25 k Abx Cholest. 08/12/12 t 9 124 Null Med Lapar. 03/11/2013 t 11 123 Null Med Mapping and Cleaning – F. Geerts, G. Mecca, P. Papotti, D. Santoro Eye surg. 12/01/2013 April, 1 2014

21 Contributions A Uniform Framework for Type 1 Type 2 Type 3 MD Schema

21 Contributions A Uniform Framework for Type 1 Type 2 Type 3 MD Schema Mapping Scenarios ST-TGD ID FD MD Data Repairing Scenarios FD CFD ER Mapping and Cleaning Scenarios TGD With a fast and general-purpose. FD CFD chase engine Mapping and Cleaning – F. Geerts, G. Mecca, P. Papotti, D. Santoro ID ER April, 1 2014

Overview ‣ Motivations and Goals ‣ Semantics ‣ Experimental Results Mapping and Cleaning –

Overview ‣ Motivations and Goals ‣ Semantics ‣ Experimental Results Mapping and Cleaning – F. Geerts, G. Mecca, P. Papotti, D. Santoro 22 April, 1 2014

23 Llunatic Data Repairing [Geerts et al. , VLDB ‘ 13] • An extension

23 Llunatic Data Repairing [Geerts et al. , VLDB ‘ 13] • An extension of the data-repairing framework • Let’s see a quick summary… 1. Partial Order 2. Cell Groups 3. LLUNs 4. Upgrades Mapping and Cleaning – F. Geerts, G. Mecca, P. Papotti, D. Santoro April, 1 2014

PR 24 EFE VAL RRED UE • Llunatic Data Repairing [Geerts et al. ,

PR 24 EFE VAL RRED UE • Llunatic Data Repairing [Geerts et al. , VLDB ‘ 13] Customers The Partial Order Π • Elegant way to model preference rules SSN Name Phone Ph. Co Str. City nf CC# t 4 222 L. Lennon 1221876 0. 9 Null NY 781658 t 5 222 L. Lennon 0000000 0. 0 Fry SF 784659 324 Pic 0. 7 LA Null 3455 o Treatments 324 Pic t 10 123 W. Smith LA Null Insuranc 0. 5 o SSN Salary 0000 Treat. Date e t 8 124 W. Smith Standard preference rules Ordering attribute No order t 6 222 10 k Abx Dental 10/01/11 t 7 222 25 k Abx Cholest. 08/12/12 t 9 124 Null Med Lapar. 03/11/2013 t 11 123 Null Med Eye surg. 12/01/2013 Master Data tm Mapping and Cleaning – F. Geerts, G. Mecca, P. Papotti, D. Santoro SSN Name Phone Street 222 F. Lennon 122 -1876 Sky April, 1 2014

 • • 25 Llunatic Data Repairing [Geerts et al. , VLDB ‘ 13]

• • 25 Llunatic Data Repairing [Geerts et al. , VLDB ‘ 13] Customers The Partial Order Π • Elegant way to model preference rules LLUNs • • a new class of symbols placeholders used to mark conflicts SSN Name Phone Ph. Co Str. City nf CC# t 4 222 L. Lennon 1221876 0. 9 L 0 Null NY 781658 t 5 222 L. Lennon 0000000 0. 0 Fry SF 784659 324 Pic 0. 7 LA Null 3455 o Treatments 324 Pic t 10 123 W. Smith LA Null Insuranc 0. 5 o SSN Salary 0000 Treat. Date e t 8 124 W. Smith t 6 222 10 k Abx Dental 10/01/11 t 7 222 25 k Abx Cholest. 08/12/12 t 9 124 Null Med Lapar. 03/11/2013 t 11 123 Null Med Eye surg. 12/01/2013 Master Data tm Mapping and Cleaning – F. Geerts, G. Mecca, P. Papotti, D. Santoro SSN Name Phone Street 222 F. Lennon 122 -1876 Sky April, 1 2014

 • • • 26 Llunatic Data Repairing [Geerts et al. , VLDB ‘

• • • 26 Llunatic Data Repairing [Geerts et al. , VLDB ‘ 13] Customers The Partial Order Π • Elegant way to model preference rules LLUNs • a new class of symbols Name Phone Ph. Co Str. City nf CC# t 4 222 L. Lennon 122122 -1876 0. 9 Sk NY 781658 Null y t 5 222 L. Lennon 0000000 0. 0 Fry SF 784659 324 Pic 0. 7 LA Null 3455 o Treatments 324 Pic t 10 123 W. Smith LA Null Insuranc 0. 5 o SSN Salary 0000 Treat. Date e t 8 124 W. Smith t 6 222 10 k Abx Dental 10/01/11 t 7 222 25 k Abx Cholest. 08/12/12 Represent the set of changes t 9 124 Null Med Lapar. 03/11/2013 t 11 123 Null Med Eye surg. 12/01/2013 Master Data Cell Groups • SSN g 1 = <122→ {t 4. phn, t 5. phn} > g 2 = <Sky→ {t 4. str, t 5. str} by {tm. strauth} > tm Mapping and Cleaning – F. Geerts, G. Mecca, P. Papotti, D. Santoro SSN Name Phone Street 222 F. Lennon 122 -1876 Sky April, 1 2014

27 Upgrades Upgrade: an improvement over J, since it contains better value wrt Π

27 Upgrades Upgrade: an improvement over J, since it contains better value wrt Π Cardinality Minimal SS N Name CC# L. Lennon L 0 L. Update t 3 222 Lennon 1 L 0 t 2 222 g 1 <L 0→ {t 4. cc, t 5. cc}> SS N Name CC# SS N L. Lennon 111 t 2 222 L. Update t 3 222 Lennon 2555 t 2 L 1 g 2 <L 1→ {t 4. ssn}> Forward Upgrades Name CC# SS N L. Lennon 555 t 2 777 L. Update t 3 222 Lennon 3555 g 3 <555→ {t 4. cc, t 5. cc}> Name CC# SS N L. Lennon 111 t 2 222 L. Update t 3 222 Lennon 4555 g 4 <777→ {t 4. ssn}> Name CC# L. Lennon 333 L. t 3 222 333 Update Lennon 5 g 5 <333→ {t 4. cc, t 5. cc}> Backward J SS N Name Not an upgrade CC# t 2 222 L. Lennon 111 t 3 222 L. Lennon 555 e 1. Cust(ssn, n, ph, c , cc ) , Cust(ssn, n’, ph’, cc’) → cc = cc’ Mapping and Cleaning – F. Geerts, G. Mecca, P. Papotti, D. Santoro April, 1 2014

Upgrades SS N Name CC# t 2 L 2 L 2 t 3 L

Upgrades SS N Name CC# t 2 L 2 L 2 t 3 L 2 L 2 over generalizatio n Update 6 g 6 <L 2→ {allcells}> SS N Name CC# L. Lennon L 0 L. Update t 3 222 Lennon 1 L 0 t 2 222 g 1 <L 0→ {t 4. cc, t 5. cc}> SS N Name CC# L. Lennon 111 L. Update t 3 222 Lennon 2555 t 2 L 1 g 2 <L 1→ {t 4. ssn}> Forward Minimal Solutions 28 Backward J SS N Name CC# t 2 222 L. Lennon 111 t 3 222 L. Lennon 555 e 1. Cust(ssn, n, ph, c , cc ) , Cust(ssn, n’, ph’, cc’) → cc = cc’ Mapping and Cleaning – F. Geerts, G. Mecca, P. Papotti, D. Santoro April, 1 2014

29 + ST-TGDs User Inputs = Non trivial extension! Mapping and Cleaning – F.

29 + ST-TGDs User Inputs = Non trivial extension! Mapping and Cleaning – F. Geerts, G. Mecca, P. Papotti, D. Santoro April, 1 2014

30 Mapping and Cleaning Scenario • • M&C Scenario M={S, Sa, T, Σt, Σe,

30 Mapping and Cleaning Scenario • • M&C Scenario M={S, Sa, T, Σt, Σe, Π, U SER } S: source schema, S : authoritative source a tables T: target schema, Σt: TGDs, Σe: EGDs Π: the partial order specification USER: a partial function to abstract user interaction Solution: Given M, an instance I of S, and an instance J of T, a solution is an instance J’ such that: • it is a repair, i. e. , “I and J’ satisfy Σt ∪ Σ e” Mapping and Cleaning – F. Geerts, G. Mecca, P. Papotti, D. Santoro April, 1 2014

31 How to handle TGDs TARGET SOURCE #1 C ( ONFIDENCE 0. 7) Customers

31 How to handle TGDs TARGET SOURCE #1 C ( ONFIDENCE 0. 7) Customers Medical. Treatments SS N Name Phone Str City Insur Treat SSN Name KEY INTUITI Phone ON Ph. Co Str. City CC# nf W. 324 Pico Lapa t 1 124 LA Med Smith n, 3455 r. m 1: Med. Treat(ssn, p, s, c, i, Blvd t, d) → t 4 222 L. Lennon 1221876 0. 9 Null NY 781658 ∃Y 3, Y 4 : Cust(ssn, n, p, 0. 7, s, c, Y 3), Treat(ssn, Y 4, i, t, d) t 5 t 8 222 124 L. W. Lennon Smith 0003240000 3455 0. 0 0. 7 Pic Fry SF LA 784659 Null o • We model it in terms of cell groups and updates Treatments SSN Salary Insuranc e Treat. Date Dental 10/01/11 t 6 222 10 k Abx t 7 t 9 222 124 25 k Null Abx Med Cholest. 08/12/12 Lapar. 03/11/2013 g 1 = <124→ {t 8. ssnnew, t 9. ssnnew} by {t 1. ssn}> g 2 = <W. Smith→. . . {t 8. namenew} we do not disrupt key – fkey by {t 1. name}> equality in the following Mapping and Cleaning – F. Geerts, G. Mecca, P. Papotti, D. Santoro new cells April, 1 2014

32 User Inputs • • In the presence of inconsistencies user inputs are crucial.

32 User Inputs • • In the presence of inconsistencies user inputs are crucial. User may • • change the value of a cell group refuse a cell group We model user interaction using a partial function over cell groups Mapping and Cleaning – F. Geerts, G. Mecca, P. Papotti, D. Santoro SS N 55 5 t 2 L 1 Name Phon e L. Lennon 123 g 1 <123→ {t 4. ph, L. t 5. ph}> t 3 222 SS N t 2 222 Lennon 000 Name Phon e L. Lennon 123 <L 1→L. {t 4. ssn}> t 3 g 2222 123 Lennon April, 1 2014

Non trivial extension • • 33 Data cleaning semantics has some nice properties •

Non trivial extension • • 33 Data cleaning semantics has some nice properties • • scenario C always has a solution for <I, J> the chase always terminates (it never fails) Adding TGDs and User Inputs • • concept of upgrade change significantly requires to completely rework upgrades Mapping and Cleaning – F. Geerts, G. Mecca, P. Papotti, D. Santoro April, 1 2014

34 Upgrades • Must take into account many issues • some target cells are

34 Upgrades • Must take into account many issues • some target cells are “better” than others • source cells may be authoritative • compare instances with different new values • compare instances with different number of tuples • some cells may be changed by users Mapping and Cleaning – F. Geerts, G. Mecca, P. Papotti, D. Santoro April, 1 2014

35 A Few Results • Conservative extension of the data exchange • Every (core)

35 A Few Results • Conservative extension of the data exchange • Every (core) solution of a data exchange scenario corresponds to a (minimal) solution of its associated mapping scenario, and vice versa • Given a MC scenario, if Σ is a set of weaklyt acyclic tgds, then the chase terminates • in essence we may re-use termination conditions for data exchange Mapping and Cleaning – F. Geerts, G. Mecca, P. Papotti, D. Santoro April, 1 2014

Overview ‣ Motivations and Goals ‣ Semantics ‣ Experimental Results Mapping and Cleaning –

Overview ‣ Motivations and Goals ‣ Semantics ‣ Experimental Results Mapping and Cleaning – F. Geerts, G. Mecca, P. Papotti, D. Santoro 36 April, 1 2014

37 Chase Tree Different orders of application J give different results e 0, b

37 Chase Tree Different orders of application J give different results e 0, b 1 R 1 Chase algorithm for chasing egds and tgd e 0, f e 0, b 2 e 1, b 1 e 1, f R 2 R 3 R 1 0 e 1, b 1 R 4 e 1, f e 1, b 2 R 5 R 6 the e 0 -e 1 sequence 1 e 0, b 1 e 1, b 2 R 1 2 e 0, f e 0, b 2 R 1 R 1 3 4 5 the e 1 -e 0 sequence Mapping and Cleaning – F. Geerts, G. Mecca, P. Papotti, D. Santoro April, 1 2014

38 Scalability Techniques • Chase implementation based on equivalence classes • Delta Databases •

38 Scalability Techniques • Chase implementation based on equivalence classes • Delta Databases • a representation system for chase trees • Cost managers • pluggable strategies to prune the chase tree Mapping and Cleaning – F. Geerts, G. Mecca, P. Papotti, D. Santoro April, 1 2014

39 Scalability LLUNATIC-FR-S 1 LLUNATIC-FR-S 5 LLUNATIC-FR-S 10 LLUNATIC-FR-S 50 DOCTORS-MC sec. 5000 2500

39 Scalability LLUNATIC-FR-S 1 LLUNATIC-FR-S 5 LLUNATIC-FR-S 10 LLUNATIC-FR-S 50 DOCTORS-MC sec. 5000 2500 0 100 K 400 K Mapping and Cleaning – F. Geerts, G. Mecca, P. Papotti, D. Santoro 700 K 1000 K April, 1 2014

40 Quality of Repairs LLUNATIC-FR-S 1 PIPELINE HOSPITAL-MC NORM max. rep-rate(Rep, DBexp) 60% 30%

40 Quality of Repairs LLUNATIC-FR-S 1 PIPELINE HOSPITAL-MC NORM max. rep-rate(Rep, DBexp) 60% 30% 0% -30% 0, 01 0, 025 k, 6%-10% 0, 03 0, 04 0, 05 0, 01 0, 02 6%-10% 0, 03 0, 04 10 k, Mapping and Cleaning – F. Geerts, G. Mecca, P. Papotti, D. Santoro 0, 05 0, 01 0, 02 25 k, 0, 03 0, 04 6%-10% 0, 05 April, 1 2014

l l a s ’ t a h T ! s k l o

l l a s ’ t a h T ! s k l o F Mapping and Cleaning – F. Geerts, G. Mecca, P. Papotti, D. Santoro April, 1 2014