Generic Entity Resolution: Identifying Real-World Entities in Large Data Sets
Hector Garcia-Molina
Stanford University
Work with: Omar Benjelloun, Qi Su, Jennifer Widom, Tyson Condie, Johnson Gong, Nicolas Pombourcq, David Menestrina, Steven Whang

Entity Resolution
e1: [N: a, A: b, CC#: c, Ph: e]
e2: [N: a, Exp: d, Ph: e]

Applications
• comparison shopping
• mailing lists
• classified ads
• customer files
• counter-terrorism
e1: [N: a, A: b, CC#: c, Ph: e]
e2: [N: a, Exp: d, Ph: e]

Outline
• Why is ER challenging?
• How is ER done?
• Some ER work at Stanford
• Confidences

Challenges (1)
• No keys!
• Value matching
  – “Kaddafi”, “Qaddafi”, “Kaddaffi”, ...
• Record matching
  [Nm: Tom, Ad: 123 Main St, Ph: (650) 555-1212, Ph: (650) 777-7777]
  [Nm: Thomas, Ad: 132 Main St, Ph: (650) 555-1212]
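
Spelling variants like the ones on this slide are often handled with approximate string comparison. The following is a minimal sketch, assuming Python's standard `difflib` and a similarity threshold of 0.8. The helper name `similar` and the threshold are illustrative choices, not something the slides specify.

```python
from difflib import SequenceMatcher

def similar(a: str, b: str, threshold: float = 0.8) -> bool:
    """Return True if the case-insensitive similarity ratio reaches the threshold.

    The 0.8 default is an assumption chosen for illustration only.
    """
    ratio = SequenceMatcher(None, a.lower(), b.lower()).ratio()
    return ratio >= threshold

names = ["Kaddafi", "Qaddafi", "Kaddaffi"]
# Compare the first spelling against each of the others.
all_similar = all(similar(names[0], other) for other in names[1:])
print(all_similar)
```

A real system would tune the threshold, and possibly the comparison function itself, per attribute.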

Challenges (2)
• Merging records
  [Nm: Tom, Ad: 123 Main St, Ph: (650) 555-1212, Ph: (650) 777-7777]
  [Nm: Thomas, Ad: 132 Main St, Ph: (650) 555-1212, Zp: 94305]
  [Nm: Tom, Nm: Thomas, Ad: 123 Main St, Ph: (650) 555-1212, Ph: (650) 777-7777, Zp: 94305]

Challenges (3)
• Chaining
  [Nm: Tom, Ad: 123 Main, BD: Jan 1, 85, Wk: IBM, Oc: lawyer]
  [Nm: Thomas, Ad: 123 Maim, Oc: lawyer]
  [Nm: Tom, Wk: IBM, Oc: laywer, Sal: 500K]
  [Nm: Tom, Ad: 123 Main, BD: Jan 1, 85, Wk: IBM, Oc: lawyer, Sal: 500K]

Challenges (4)
• Un-merging
  [Nm: Tom, Ad: 123 Main, BD: Jan 1, 85, Wk: IBM, Oc: lawyer, Sal: 500K]
  Too young to make 500K at IBM!!

Challenges (5)
• Confidences in data
  (0.8) Nm: Tom (0.9) Ad: 123 Main St (1.0) Ph: (650) 555-1212 (0.6) Ph: (650) 777-7777 (0.8)
• In value matching, match rules, merge: conf = ?

Taxonomy
• Pairwise snaps vs. clustering
• De-duplication vs. fidelity enhancement
• Schema differences
• Relationships
• Exact vs. approximate
• Generic vs. application specific
• Confidences

Schema Differences
[Name: Tom, Address: 123 Main St, Ph: (650) 555-1212, Ph: (650) 777-7777]
[FirstName: Tom, StreetName: Main St, StreetNumber: 123, Tel: (650) 777-7777]

Pair-Wise Snaps vs. Clustering
[Figure: diagram of records r1 to r10 and s7 to s10]

De-Duplication vs. Fidelity Enhancement
[Figure: diagram with elements labeled B, R, S, S, N]

Relationships
[Figure: records r1, r2, r5, r7 connected by edges labeled father, brother, business]

Using Relationships
[Figure: authors a1 to a5 linked to papers p1, p2, p5, p7; some authors marked "same??"]

Exact vs Approximate ER
[Figure: products divided into cameras, CDs, books, ...; ER applied to each category gives resolved cameras, resolved CDs, resolved books]

Exact vs Approximate ER
[Figure: terrorists sorted by age; the record (Widom, 30) is matched against ages 25-35]

Generic vs Application Specific
• Match function M(r, s)
• Merge function <r, s> => t
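
In the generic approach, the match and merge functions are black boxes supplied by the application. A minimal sketch of one such pair follows, using the Tom/Thomas records from the earlier slides. The specific rules (match on a shared phone number, merge by keeping every value) are assumptions for illustration. In particular, the union merge keeps both addresses, unlike the merged record shown on the "Merging records" slide.

```python
from typing import Dict, Set

Record = Dict[str, Set[str]]

def match(r: Record, s: Record) -> bool:
    """Hypothetical rule: two records match if they share a phone number."""
    return bool(r.get("Ph", set()) & s.get("Ph", set()))

def merge(r: Record, s: Record) -> Record:
    """Hypothetical rule: keep every value of every attribute (set union)."""
    return {k: r.get(k, set()) | s.get(k, set()) for k in r.keys() | s.keys()}

tom: Record = {
    "Nm": {"Tom"},
    "Ad": {"123 Main St"},
    "Ph": {"(650) 555-1212", "(650) 777-7777"},
}
thomas: Record = {
    "Nm": {"Thomas"},
    "Ad": {"132 Main St"},
    "Ph": {"(650) 555-1212"},
    "Zp": {"94305"},
}

# Merge only when the match function says the records refer to the same entity.
merged = merge(tom, thomas) if match(tom, thomas) else None
```

An ER algorithm written against `match` and `merge` never inspects the records itself, which is what makes it generic.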

Taxonomy
• Pairwise snaps vs. clustering
• De-duplication vs. fidelity enhancement
• Schema differences
• Relationships
• Exact vs. approximate
• Generic vs. application specific
• Confidences

Outline
• Why is ER challenging?
• How is ER done?
• Some ER work at Stanford
• Confidences

Taxonomy
• Pairwise snaps vs. clustering
• De-duplication vs. fidelity enhancement
• Schema differences: No
• Relationships: No
• Exact vs. approximate
• Generic vs. application specific
• Confidences ... later on

Model
r1: [Nm: Tom, Ad: 123 Main, BD: Jan 1, 85, Wk: IBM]
r2: [Nm: Thomas, Ad: 123 Maim, Oc: lawyer]
M(r1, r2)
r4: <r1, r2> = [Nm: Tom, Ad: 123 Main, BD: Jan 1, 85, Wk: IBM, Oc: lawyer]
r3: [Nm: Tom, Wk: IBM, Oc: laywer, Sal: 500K]
M(r4, r3)
<r4, r3> = [Nm: Tom, Ad: 123 Main, BD: Jan 1, 85, Wk: IBM, Oc: lawyer, Sal: 500K]

Correct Answer
ER(R) = all derivable records ... minus “dominated” records
[Figure: records r1 to r6 and s7 to s10]

Question
• What is the best sequence of match and merge calls that gives us the right answer?

Brute Force Algorithm
• Input R:
  – r1 = [a: 1, b: 2]
  – r2 = [a: 1, c: 4, e: 5]
  – r3 = [b: 2, c: 4, f: 6]
  – r4 = [a: 7, e: 5, f: 6]

Brute Force Algorithm
• Input R:
  – r1 = [a: 1, b: 2]
  – r2 = [a: 1, c: 4, e: 5]
  – r3 = [b: 2, c: 4, f: 6]
  – r4 = [a: 7, e: 5, f: 6]
• Match all pairs:
  – r12 = [a: 1, b: 2, c: 4, e: 5]

Brute Force Algorithm
• Match all pairs:
  – r1 = [a: 1, b: 2]
  – r2 = [a: 1, c: 4, e: 5]
  – r3 = [b: 2, c: 4, f: 6]
  – r4 = [a: 7, e: 5, f: 6]
  – r12 = [a: 1, b: 2, c: 4, e: 5]
• Repeat:
  – r123 = [a: 1, b: 2, c: 4, e: 5, f: 6]
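
The brute-force procedure can be sketched as a fixed-point loop: match all pairs, add the merged records, and repeat until nothing new appears. The slides do not state the match rule behind this example, so the code assumes one that reproduces r12 and r123: two records match if they agree on `a`, or on both `b` and `c`. The union merge and the containment test for "dominated" records are likewise assumptions.

```python
from itertools import combinations

def rec(**attrs):
    """Build a record as a frozenset of (attribute, value) pairs."""
    return frozenset(attrs.items())

def values(r, attr):
    """All values a record holds for one attribute."""
    return {v for a, v in r if a == attr}

def shares(r, s, attr):
    """True if the two records have a common value for the attribute."""
    return bool(values(r, attr) & values(s, attr))

def match(r, s):
    """Assumed rule: same `a` value, or same `b` value and same `c` value."""
    return shares(r, s, "a") or (shares(r, s, "b") and shares(r, s, "c"))

def merge(r, s):
    """Assumed rule: union of all attribute-value pairs."""
    return r | s

def brute_force_er(records):
    """Match all pairs, add the merges, repeat until nothing new appears."""
    derived = set(records)
    while True:
        new = {merge(r, s) for r, s in combinations(derived, 2) if match(r, s)}
        if new <= derived:
            break
        derived |= new
    # Drop dominated records: those strictly contained in another record.
    return {r for r in derived if not any(r < s for s in derived)}

r1 = rec(a=1, b=2)
r2 = rec(a=1, c=4, e=5)
r3 = rec(b=2, c=4, f=6)
r4 = rec(a=7, e=5, f=6)

result = brute_force_er([r1, r2, r3, r4])
```

Under these assumptions the result contains r123 and the untouched r4. Every round re-compares every pair, which is the cost the two questions on the following slides set out to reduce.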

Question #1
Can we delete r1, r2?

Question #2
Can we avoid comparisons?

ICAR Properties
• Idempotence:
  – M(r1, r1) = true; <r1, r1> = r1
• Commutativity:
  – M(r1, r2) = M(r2, r1)
  – <r1, r2> = <r2, r1>
• Associativity:
  – <r1, <r2, r3>> = <<r1, r2>, r3>

More Properties
• Representativity
  – If <r1, r2> = r3, then for any r4 such that M(r1, r4) is true we also have M(r3, r4) = true.
[Figure: records r1, r2, r3, r4]
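
The four properties can be checked empirically on a sample of records. The helper below is a hypothetical sketch (the name `check_icar` and the toy match and merge functions are not from the slides). Passing on a finite sample is evidence, not proof, that a given pair of functions satisfies ICAR.

```python
from itertools import product

def check_icar(sample, match, merge):
    """Check the four ICAR properties on every combination drawn from `sample`."""
    pairs = list(product(sample, repeat=2))
    triples = list(product(sample, repeat=3))
    return {
        "idempotence": all(match(r, r) and merge(r, r) == r for r in sample),
        "commutativity": all(
            match(r, s) == match(s, r) and merge(r, s) == merge(s, r)
            for r, s in pairs
        ),
        # Checked on all triples; fine here because the toy merge below
        # is defined for any pair, matching or not.
        "associativity": all(
            merge(r, merge(s, t)) == merge(merge(r, s), t) for r, s, t in triples
        ),
        # If r and s merge, anything that matches r must also match the merge.
        "representativity": all(
            match(merge(r, s), t)
            for r, s, t in triples
            if match(r, s) and match(r, t)
        ),
    }

# Hypothetical functions: a record is a set of values; two records match
# when they share a value, and merging takes the union.
def share_value(r, s):
    return not r.isdisjoint(s)

def union(r, s):
    return r | s

sample = [frozenset({"a1", "b2"}), frozenset({"a1", "c4"}), frozenset({"f6"})]
report = check_icar(sample, share_value, union)
```

Union-style merges satisfy representativity because a merged record never loses a value that caused an earlier match.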

ICAR Properties: Efficiency
• Commutativity
• Idempotence
• Associativity
• Representativity
• Can discard records
• ER result independent of processing order

Swoosh Algorithms
• Record Swoosh
  – Merges records as soon as they match
  – Optimal in terms of record comparisons
• Feature Swoosh
  – Remembers values seen for each feature
  – Avoids redundant value comparisons
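
The "merge as soon as they match" idea of Record Swoosh can be sketched as follows. This is a simplified illustration rather than the authors' implementation. The record encoding and the match rule (agree on `a`, or on both `b` and `c`) are assumptions carried over from the brute-force example, and the names `todo` and `done` are hypothetical.

```python
def rec(**attrs):
    """Build a record as a frozenset of (attribute, value) pairs."""
    return frozenset(attrs.items())

def shares(r, s, attr):
    """True if the two records have a common value for the attribute."""
    vals = lambda x: {v for a, v in x if a == attr}
    return bool(vals(r) & vals(s))

def match(r, s):
    """Assumed rule: same `a` value, or same `b` value and same `c` value."""
    return shares(r, s, "a") or (shares(r, s, "b") and shares(r, s, "c"))

def merge(r, s):
    """Assumed rule: union of all attribute-value pairs."""
    return r | s

def r_swoosh(records, match, merge):
    """Merge a record with the first match found, then reprocess the merge.

    `todo` holds records still to be processed; `done` holds records that
    have already been compared with each other and do not match.
    """
    todo, done, comparisons = list(records), [], 0
    while todo:
        r = todo.pop(0)
        partner = None
        for s in done:
            comparisons += 1
            if match(r, s):
                partner = s
                break
        if partner is None:
            done.append(r)
        else:
            # Both source records are discarded; only the merge lives on.
            done.remove(partner)
            todo.append(merge(r, partner))
    return done, comparisons

r1, r2 = rec(a=1, b=2), rec(a=1, c=4, e=5)
r3, r4 = rec(b=2, c=4, f=6), rec(a=7, e=5, f=6)
done, comparisons = r_swoosh([r1, r2, r3, r4], match, merge)
```

Discarding `r` and `partner` after a merge is only safe when the ICAR properties hold, which is why the slides introduce them first.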

Swoosh Performance
[Figure: Swoosh performance results]

If ICAR Properties Do Not Hold?
r12: [Joe Sr., 123 Main, Ph: 123, DL: X]
r23: [Joe Jr., 123 Main, Ph: 123, DL: Y]
r1: [Joe Sr., 123 Main, DL: X]
r3: [Joe Jr., 123 Main, DL: Y]
r2: [Joe, 123 Main, Ph: 123]

If ICAR Properties Do Not Hold?
r12: [Joe Sr., 123 Main, Ph: 123, DL: X]
r23: [Joe Jr., 123 Main, Ph: 123, DL: Y]
r1: [Joe Sr., 123 Main, DL: X]
r3: [Joe Jr., 123 Main, DL: Y]
r2: [Joe, 123 Main, Ph: 123]
Full Answer: ER(R) = {r12, r23, r1, r2, r3}
Minus Dominated: ER(R) = {r12, r23}

If ICAR Properties Do Not Hold?
r12: [Joe Sr., 123 Main, Ph: 123, DL: X]
r23: [Joe Jr., 123 Main, Ph: 123, DL: Y]
r1: [Joe Sr., 123 Main, DL: X]
r3: [Joe Jr., 123 Main, DL: Y]
r2: [Joe, 123 Main, Ph: 123]
Full Answer: ER(R) = {r12, r23, r1, r2, r3}
Minus Dominated: ER(R) = {r12, r23}
R-Swoosh Yields: ER(R) = {r12, r3} or {r1, r23}
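
The order dependence on this slide can be reproduced with a small experiment. The slides do not give the match and merge rules for the Joe records, so the ones below are assumptions: a bare "Joe" is compatible with "Joe Sr." or "Joe Jr.", the two suffixed names conflict, and a merge keeps the more specific name. The `r_swoosh` function is a simplified sketch of Record Swoosh, not the authors' code.

```python
from typing import NamedTuple, Optional

class Person(NamedTuple):
    name: str
    addr: str
    ph: Optional[str] = None
    dl: Optional[str] = None

def compatible(x, y):
    """Optional values are compatible if either is missing or they are equal."""
    return x is None or y is None or x == y

def match(r, s):
    """Assumed rule: one name is a prefix of the other, same address, no conflicts."""
    names_ok = r.name.startswith(s.name) or s.name.startswith(r.name)
    return (names_ok and r.addr == s.addr
            and compatible(r.ph, s.ph) and compatible(r.dl, s.dl))

def merge(r, s):
    """Assumed rule: keep the longer name and fill in missing fields."""
    return Person(max(r.name, s.name, key=len), r.addr, r.ph or s.ph, r.dl or s.dl)

def r_swoosh(records):
    """Merge each record with the first match found, then reprocess the merge."""
    todo, done = list(records), []
    while todo:
        r = todo.pop(0)
        partner = next((s for s in done if match(r, s)), None)
        if partner is None:
            done.append(r)
        else:
            done.remove(partner)
            todo.append(merge(r, partner))
    return set(done)

r1 = Person("Joe Sr.", "123 Main", dl="X")
r2 = Person("Joe", "123 Main", ph="123")
r3 = Person("Joe Jr.", "123 Main", dl="Y")
r12 = Person("Joe Sr.", "123 Main", ph="123", dl="X")
r23 = Person("Joe Jr.", "123 Main", ph="123", dl="Y")

forward = r_swoosh([r1, r2, r3])   # r2 is absorbed into r1 first
backward = r_swoosh([r3, r2, r1])  # r2 is absorbed into r3 first
```

With these rules representativity fails: r2 matches r3, but the merged record r12 does not. Once r2 has been discarded in favor of r12, the merge r23 can no longer be derived, so the two processing orders end with different results.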

Swoosh Without ICAR Properties
[Figure: results for Swoosh without ICAR properties]

Distributed Swoosh
[Figure: processors P1, P2, P3 and records r1, r2, r3, r4, r5, r6, ...]

Distributed Swoosh
[Figure: records r1, r2, r3, r4, r5, r6, ... distributed across processors P1, P2, P3, with some records assigned to more than one processor]

DSwoosh Performance
[Figure: DSwoosh performance results]

Outline
• Why is ER challenging?
• How is ER done?
• Some ER work at Stanford
• Confidences

Conclusion
• ER is an old and important problem
• Our approach: generic
• Confidences
  – challenging
  – two ways to tame:
    • thresholds
    • packages

Thanks.