Generic Entity Resolution: Identifying Real-World Entities in Large Data Sets
- Slides: 44
Generic Entity Resolution: Identifying Real-World Entities in Large Data Sets
Hector Garcia-Molina, Stanford University
Work with: Omar Benjelloun, Qi Su, Jennifer Widom, Tyson Condie, Johnson Gong, Nicolas Pombourcq, David Menestrina, Steven Whang
Entity Resolution
e1: [N: a, A: b, CC#: c, Ph: e]
e2: [N: a, Exp: d, Ph: e]
Applications
• comparison shopping
• mailing lists
• classified ads
• customer files
• counter-terrorism
e1: [N: a, A: b, CC#: c, Ph: e]
e2: [N: a, Exp: d, Ph: e]
Outline
• Why is ER challenging?
• How is ER done?
• Some ER work at Stanford
• Confidences
Challenges (1)
• No keys!
• Value matching
– “Kaddafi”, “Qaddafi”, “Kaddaffi”, ...
• Record matching
– [Nm: Tom, Ad: 123 Main St, Ph: (650) 555-1212, Ph: (650) 777-7777]
– [Nm: Thomas, Ad: 132 Main St, Ph: (650) 555-1212]
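The spelling variants on this slide call for approximate rather than exact string comparison. A minimal sketch using Python's standard `difflib`; the helper name `similar_value` and the 0.75 threshold are illustrative assumptions, not something the slides specify:

```python
from difflib import SequenceMatcher

def similar_value(a, b, threshold=0.75):
    """Return True if two strings are close enough to count as the same value.

    The threshold is an arbitrary illustrative choice, not a value from the talk.
    """
    ratio = SequenceMatcher(None, a.lower(), b.lower()).ratio()
    return ratio >= threshold

# The three spellings from the slide, compared pairwise.
variants = ["Kaddafi", "Qaddafi", "Kaddaffi"]
all_similar = all(similar_value(a, b) for a in variants for b in variants)
```

A real system would likely tune the threshold per attribute, or swap in a phonetic or edit-distance measure.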
Challenges (2)
• Merging records
– [Nm: Tom, Ad: 123 Main St, Ph: (650) 555-1212, Ph: (650) 777-7777]
– [Nm: Thomas, Ad: 132 Main St, Ph: (650) 555-1212, Zp: 94305]
– merged: [Nm: Tom, Nm: Thomas, Ad: 123 Main St, Ph: (650) 555-1212, Ph: (650) 777-7777, Zp: 94305]
Challenges (3)
• Chaining
– [Nm: Tom, Ad: 123 Main, BD: Jan 1, 85, Wk: IBM, Oc: lawyer]
– [Nm: Thomas, Ad: 123 Maim, Oc: lawyer]
– [Nm: Tom, Wk: IBM, Oc: laywer, Sal: 500K]
– merged: [Nm: Tom, Ad: 123 Main, BD: Jan 1, 85, Wk: IBM, Oc: lawyer, Sal: 500K]
Challenges (4)
• Un-merging
– [Nm: Tom, Ad: 123 Main, BD: Jan 1, 85, Wk: IBM, Oc: lawyer, Sal: 500K]
– too young to make 500K at IBM!!
Challenges (5)
• Confidences in data
– (0.8) Nm: Tom (0.9) Ad: 123 Main St (1.0) Ph: (650) 555-1212 (0.6) Ph: (650) 777-7777 (0.8)
• In value matching, match rules, merge: conf = ?
Taxonomy
• Pairwise snaps vs. clustering
• De-duplication vs. fidelity enhancement
• Schema differences
• Relationships
• Exact vs. approximate
• Generic vs. application specific
• Confidences
Schema Differences
– [Name: Tom, Address: 123 Main St, Ph: (650) 555-1212, Ph: (650) 777-7777]
– [FirstName: Tom, StreetName: Main St, StreetNumber: 123, Tel: (650) 777-7777]
Pair-Wise Snaps vs. Clustering
(Figure: diagram with records labeled r1 to r10 and s7 to s10.)
De-Duplication vs. Fidelity Enhancement
(Figure: diagram with sets labeled B, R, S, and N.)
Relationships
(Figure: records r1, r2, r5, and r7 linked by relationships labeled father, brother, and business.)
Using Relationships
(Figure: authors a1 to a5 linked to papers p1, p2, p5, and p7; are two of the authors the same?)
Exact vs Approximate ER
(Figure: products are split into cameras, CDs, books, ...; ER runs on each group, giving resolved cameras, resolved CDs, resolved books.)
Exact vs Approximate ER
(Figure: terrorists sorted by age; a record for Widom, age 30, is matched only against ages 25-35.)
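The age example amounts to comparing a record only against a window of a sorted list instead of against every record. A minimal sketch, assuming records are `(name, age)` tuples already sorted by age and a window of plus or minus 5 years as on the slide; the function name `candidates_by_age` and the sample data are hypothetical:

```python
from bisect import bisect_left, bisect_right

def candidates_by_age(records, age, window=5):
    """Return the slice of `records` whose age lies in [age - window, age + window].

    `records` must be a list of (name, age) tuples sorted by age.
    """
    ages = [a for _, a in records]
    lo = bisect_left(ages, age - window)
    hi = bisect_right(ages, age + window)
    return records[lo:hi]

# Hypothetical data: only ages 25 to 35 are compared against a 30-year-old.
people = sorted([("A", 22), ("B", 25), ("C", 31), ("D", 35), ("E", 40)],
                key=lambda p: p[1])
shortlist = candidates_by_age(people, 30)
```

The result is approximate in the sense that a true match whose recorded age falls outside the window is never compared.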
Generic vs Application Specific
• Match function M(r, s)
• Merge function <r, s> => t
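In the generic setting, the match and merge functions are black boxes supplied by the application. A minimal sketch of one possible pair, assuming records are dictionaries from attribute name to a set of values; the rules used here (match on a shared phone number, merge by keeping every value) are assumptions for illustration, not the functions used in the talk:

```python
def match(r, s):
    """M(r, s): in this sketch, records match if they share a phone number."""
    return bool(r.get("Ph", set()) & s.get("Ph", set()))

def merge(r, s):
    """<r, s> => t: in this sketch, t keeps every value of every attribute."""
    return {attr: r.get(attr, set()) | s.get(attr, set())
            for attr in r.keys() | s.keys()}

tom = {"Nm": {"Tom"}, "Ad": {"123 Main St"},
       "Ph": {"(650) 555-1212", "(650) 777-7777"}}
thomas = {"Nm": {"Thomas"}, "Ad": {"132 Main St"},
          "Ph": {"(650) 555-1212"}}

t = merge(tom, thomas) if match(tom, thomas) else None
```

A union-style merge keeps both addresses; an application-specific merge could instead pick one value per attribute.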
Taxonomy
• Pairwise snaps vs. clustering
• De-duplication vs. fidelity enhancement
• Schema differences
• Relationships
• Exact vs. approximate
• Generic vs. application specific
• Confidences
Outline
• Why is ER challenging?
• How is ER done?
• Some ER work at Stanford
• Confidences
Taxonomy
• Pairwise snaps vs. clustering
• De-duplication vs. fidelity enhancement
• Schema differences: no
• Relationships: no
• Exact vs. approximate
• Generic vs. application specific
• Confidences: later on
Model
– r1: [Nm: Tom, Ad: 123 Main, BD: Jan 1, 85, Wk: IBM]
– r2: [Nm: Thomas, Ad: 123 Maim, Oc: lawyer]
– M(r1, r2), so r4 = <r1, r2>: [Nm: Tom, Ad: 123 Main, BD: Jan 1, 85, Wk: IBM, Oc: lawyer]
– r3: [Nm: Tom, Wk: IBM, Oc: laywer, Sal: 500K]
– M(r4, r3), so <r4, r3>: [Nm: Tom, Ad: 123 Main, BD: Jan 1, 85, Wk: IBM, Oc: lawyer, Sal: 500K]
Correct Answer
• ER(R) = all derivable records ...
• ... minus “dominated” records
(Figure: records r1 to r6 and s7 to s10.)
Question
• What is the best sequence of match and merge calls that gives us the right answer?
Brute Force Algorithm
• Input R:
– r1 = [a: 1, b: 2]
– r2 = [a: 1, c: 4, e: 5]
– r3 = [b: 2, c: 4, f: 6]
– r4 = [a: 7, e: 5, f: 6]
Brute Force Algorithm
• Input R:
– r1 = [a: 1, b: 2]
– r2 = [a: 1, c: 4, e: 5]
– r3 = [b: 2, c: 4, f: 6]
– r4 = [a: 7, e: 5, f: 6]
• Match all pairs:
– r12 = [a: 1, b: 2, c: 4, e: 5]
Brute Force Algorithm
• Match all pairs:
– r1 = [a: 1, b: 2]
– r2 = [a: 1, c: 4, e: 5]
– r3 = [b: 2, c: 4, f: 6]
– r4 = [a: 7, e: 5, f: 6]
– r12 = [a: 1, b: 2, c: 4, e: 5]
• Repeat:
– r123 = [a: 1, b: 2, c: 4, e: 5, f: 6]
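The brute-force procedure can be sketched as a fixed-point loop: match all pairs, add the new merged records, and repeat until nothing new appears. The slides do not state the match rule behind this example, so the sketch assumes one that reproduces it (records match if they agree on `a`, or on both `b` and `c`). It also assumes merge is a union of attribute-value pairs and that a record is "dominated" when it is a strict subset of another. All function names are hypothetical.

```python
from itertools import combinations

def rec(**kv):
    """Build a record as a frozenset of (attribute, value) pairs."""
    return frozenset(kv.items())

def values(record, attr):
    return {v for (k, v) in record if k == attr}

def match(r, s):
    # Assumed rule: agree on a, or agree on both b and c.
    share = lambda attr: bool(values(r, attr) & values(s, attr))
    return share("a") or (share("b") and share("c"))

def merge(r, s):
    # Assumed rule: keep all attribute-value pairs of both records.
    return r | s

def brute_force_er(records):
    derived = set(records)
    while True:
        # Match all pairs and collect merged records not seen before.
        new = {merge(r, s) for r, s in combinations(derived, 2) if match(r, s)}
        new -= derived
        if not new:
            break
        derived |= new
    # Drop dominated records (strict subsets of another derived record).
    return {r for r in derived if not any(r < other for other in derived)}

r1, r2 = rec(a=1, b=2), rec(a=1, c=4, e=5)
r3, r4 = rec(b=2, c=4, f=6), rec(a=7, e=5, f=6)
result = brute_force_er([r1, r2, r3, r4])
```

Under these assumptions the loop derives r12 in the first round and r123 in the second, and the final answer keeps r123 and r4.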
Question #1
• Can we delete r1, r2?
Question #2
• Can we avoid comparisons?
ICAR Properties
• Idempotence:
– M(r1, r1) = true; <r1, r1> = r1
• Commutativity:
– M(r1, r2) = M(r2, r1)
– <r1, r2> = <r2, r1>
• Associativity:
– <r1, <r2, r3>> = <<r1, r2>, r3>
More Properties
• Representativity
– If <r1, r2> = r3, then for any r4 such that M(r1, r4) is true we also have M(r3, r4) = true.
(Figure: r1 and r2 merge into r3; r4 matches r1 and therefore r3.)
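The four properties can be spot-checked on a finite sample of records. A minimal sketch, assuming records are frozensets of (attribute, value) pairs and reusing a hypothetical match rule (agree on `a`, or on both `b` and `c`) with a union merge. The checker only reports violations found in the sample; it does not prove that the properties hold in general.

```python
from itertools import product

def rec(**kv):
    return frozenset(kv.items())

def values(record, attr):
    return {v for (k, v) in record if k == attr}

def match(r, s):
    # Assumed rule, for illustration only.
    share = lambda attr: bool(values(r, attr) & values(s, attr))
    return share("a") or (share("b") and share("c"))

def merge(r, s):
    return r | s

def violated_properties(records, match, merge):
    """Return the names of ICAR properties violated on the given sample."""
    bad = set()
    for r in records:
        if not match(r, r) or merge(r, r) != r:
            bad.add("idempotence")
    for r, s in product(records, repeat=2):
        if match(r, s) != match(s, r):
            bad.add("commutativity")
        elif match(r, s) and merge(r, s) != merge(s, r):
            bad.add("commutativity")
    for r, s, t in product(records, repeat=3):
        if match(r, s) and match(s, t):
            rs, st = merge(r, s), merge(s, t)
            # Only compare when both nested merges are allowed.
            if match(r, st) and match(rs, t) and merge(r, st) != merge(rs, t):
                bad.add("associativity")
        # Representativity: <r, s> must match whatever r matches.
        if match(r, s) and match(r, t) and not match(merge(r, s), t):
            bad.add("representativity")
    return bad

sample = [rec(a=1, b=2), rec(a=1, c=4, e=5), rec(b=2, c=4, f=6), rec(a=7, e=5, f=6)]
violations = violated_properties(sample, match, merge)
```

With a union merge and a match rule based on shared values, no violations are expected on this sample.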
ICAR Properties => Efficiency
• ICAR properties
– Commutativity
– Idempotence
– Associativity
– Representativity
• Efficiency
– Can discard records
– ER result independent of processing order
Swoosh Algorithms
• Record Swoosh
– Merges records as soon as they match
– Optimal in terms of record comparisons
• Feature Swoosh
– Remembers values seen for each feature
– Avoids redundant value comparisons
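The "merge as soon as they match" idea behind Record Swoosh can be sketched as follows. The slides give no pseudocode, so the loop structure is an assumption: a pending queue and a set of already-resolved, pairwise non-matching records. The match and merge rules are also hypothetical (agree on `a`, or on both `b` and `c`; merge by union). The sketch relies on the ICAR properties to justify discarding the two source records after a merge.

```python
from collections import deque

def rec(**kv):
    return frozenset(kv.items())

def values(record, attr):
    return {v for (k, v) in record if k == attr}

def match(r, s):
    # Assumed rule, for illustration only.
    share = lambda attr: bool(values(r, attr) & values(s, attr))
    return share("a") or (share("b") and share("c"))

def merge(r, s):
    return r | s

def r_swoosh(records, match, merge):
    pending = deque(records)   # records still to be processed
    resolved = []              # records known not to match each other
    while pending:
        r = pending.popleft()
        buddy = next((x for x in resolved if match(r, x)), None)
        if buddy is None:
            resolved.append(r)
        else:
            # Merge immediately; both source records are discarded.
            resolved.remove(buddy)
            pending.append(merge(r, buddy))
    return resolved

r1, r2 = rec(a=1, b=2), rec(a=1, c=4, e=5)
r3, r4 = rec(b=2, c=4, f=6), rec(a=7, e=5, f=6)
result = set(r_swoosh([r1, r2, r3, r4], match, merge))
```

Each merge shrinks the total number of records by one, so the loop terminates. The sketch does not implement the per-feature bookkeeping of Feature Swoosh.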
Swoosh Performance
If ICAR Properties Do Not Hold?
– r1: [Joe Sr., 123 Main, DL: X]
– r2: [Joe, 123 Main, Ph: 123]
– r3: [Joe Jr., 123 Main, DL: Y]
– r12: [Joe Sr., 123 Main, Ph: 123, DL: X]
– r23: [Joe Jr., 123 Main, Ph: 123, DL: Y]
If ICAR Properties Do Not Hold?
– r1: [Joe Sr., 123 Main, DL: X]
– r2: [Joe, 123 Main, Ph: 123]
– r3: [Joe Jr., 123 Main, DL: Y]
– r12: [Joe Sr., 123 Main, Ph: 123, DL: X]
– r23: [Joe Jr., 123 Main, Ph: 123, DL: Y]
• Full Answer: ER(R) = {r12, r23, r1, r2, r3}
• Minus Dominated: ER(R) = {r12, r23}
If ICAR Properties Do Not Hold?
– r1: [Joe Sr., 123 Main, DL: X]
– r2: [Joe, 123 Main, Ph: 123]
– r3: [Joe Jr., 123 Main, DL: Y]
– r12: [Joe Sr., 123 Main, Ph: 123, DL: X]
– r23: [Joe Jr., 123 Main, Ph: 123, DL: Y]
• Full Answer: ER(R) = {r12, r23, r1, r2, r3}
• Minus Dominated: ER(R) = {r12, r23}
• R-Swoosh Yields: ER(R) = {r12, r3} or {r1, r23}
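The order dependence on this slide can be reproduced with a small experiment. The sketch below assumes a hypothetical match rule (same address, and one name is a prefix of the other, so "Joe" is compatible with both "Joe Sr." and "Joe Jr." but those two are not compatible with each other) and a merge that keeps the longer name. Under these assumptions representativity fails: r2 matches r3, but r12 = <r1, r2> does not. The `r_swoosh` loop is an assumed merge-as-soon-as-matched procedure, not pseudocode from the slides.

```python
from collections import deque

def match(r, s):
    # Assumed rule: same address and compatible names.
    name_r, addr_r, _ = r
    name_s, addr_s, _ = s
    compatible = name_r.startswith(name_s) or name_s.startswith(name_r)
    return addr_r == addr_s and compatible

def merge(r, s):
    # Assumed rule: keep the more specific (longer) name and all other fields.
    return (max(r[0], s[0], key=len), r[1], r[2] | s[2])

def r_swoosh(records):
    pending, resolved = deque(records), []
    while pending:
        r = pending.popleft()
        buddy = next((x for x in resolved if match(r, x)), None)
        if buddy is None:
            resolved.append(r)
        else:
            resolved.remove(buddy)
            pending.append(merge(r, buddy))
    return resolved

r1 = ("Joe Sr.", "123 Main", frozenset({("DL", "X")}))
r2 = ("Joe", "123 Main", frozenset({("Ph", "123")}))
r3 = ("Joe Jr.", "123 Main", frozenset({("DL", "Y")}))
r12, r23 = merge(r1, r2), merge(r2, r3)

forward = set(r_swoosh([r1, r2, r3]))    # r2 is absorbed into r1 first
backward = set(r_swoosh([r3, r2, r1]))   # r2 is absorbed into r3 first
```

Processing r1 first gives {r12, r3}, and processing r3 first gives {r1, r23}. Neither equals the full answer {r12, r23}, because r2 is discarded after its first merge.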
Swoosh Without ICAR Properties
Distributed Swoosh
(Figure: processors P1, P2, P3 and records r1, r2, r3, r4, r5, r6, ...)
Distributed Swoosh
(Figure: records distributed with overlap across processors P1, P2, and P3; P3 holds r2, r3, r5, r6, ...)
DSwoosh Performance
Outline
• Why is ER challenging?
• How is ER done?
• Some ER work at Stanford
• Confidences
Conclusion
• ER is an old and important problem
• Our approach: generic
• Confidences
– challenging
– two ways to tame: thresholds, packages
Thanks.