Record Linkage A Database Approach Divesh Srivastava ATT

Record Linkage: A Database Approach Divesh Srivastava AT&T Labs-Research http: //www. research. att. com/~divesh/

Outline n Motivation n Data quality, applications n Similarity measures n Efficient algorithms for approximate join 10/27/2020 2

Data Quality: Status n Pervasive problem in large databases n n Inconsistency with reality: 2% of records obsolete in customer files in 1 month (deaths, name changes, etc) [DWI 02] Pricing anomalies : UA tickets selling for $5, 1 GB of memory selling for $19. 99 at amazon. com n Massive financial impact n n $611 B/year loss in US due to poor customer data [DWI 02] $2. 5 B/year loss due to incorrect prices in retail DBs [E 00] n Commercial tools: specialized, rule-based, programmatic 10/27/2020 3

How are Such Problems Created? n Human factors n n Incorrect data entry Ambiguity during data transformations n Application factors n n Erroneous applications populating databases Faulty database design (constraints not enforced) n Obsolence n 10/27/2020 Real-world is dynamic 4

Application: Merging Lists n Application: merge address lists (customer lists, company lists) to avoid redundancy n Current status: “standardize”, different values treated as distinct for analysis n Lot of heterogeneity n Need approximate joins n Relevant technologies n n 10/27/2020 Approximate joins Clustering/partitioning 5

Application: Homeland Security n Application: correlate airline passenger data with homeland security data for no-fly lists n Current status: “match” on name, deny boarding n Use more match attributes n Obtain more information n Relevant technologies n n 10/27/2020 Schema mappings Approximate joins 6

Record Linkage: Tip of the Iceberg n An approximate join of R 1 and R 2 is n A subset of the cartesian product of R 1 and R 2 n “Matching” specified attributes of R 1 and R 2 n Labeled with a similarity score > t > 0 Record Linkage Missing values Time series anomalies Integrity violations n Clustering/partitioning of R: operates on the approximate join of R with itself. 10/27/2020 7

Outline n Motivation n Similarity measures n n Edit distance TF-IDF cosine similarity n Efficient algorithms for approximate join 10/27/2020 8

Similarity Measures: Classification Token based Edit Based Soundex, Levenshtein/edit distance Jaro/Jaro-Winkler TF-IDF cosine similarity Jaccard Coefficient Probabilistic models FMS Hybrid 10/27/2020 9

Edit Distance n Character operations: I (insert), D (delete), R (replace) n Unit cost operations n Given two strings s, t, edit(s, t): n n n Minimum cost sequence of operations to transform s to t Suitable for common typing mistakes Example: edit(Error, Eror) = 1, edit(great, grate) = 2 n Folklore dynamic programming algorithm to compute edit(s, t) n n 10/27/2020 Computation and decision problem: O(|s|. |t|) NP-complete for several variants (e. g. , weighted) 10

Edit Distance: Issues n Problematic for specific domains n n AT&T Corporation vs AT&T IBM Corporation vs AT&T Corporation n Potential solution: edit distance with affine gaps n Allow sequences of mismatched characters (gaps) in the alignment of two strings cost(g) = S+E*L, S is cost of opening a gap, E (< S) is cost of extending the gap, L is length of a gap Similar dynamic programming algorithm 10/27/2020 11 n n

Token Based Measures n Varying semantics of “token” n n Words: “AT&T Corporation” → “AT&T”, “Corporation” Q-grams: “AT&”, “T&T”, “&T_”, “T_C”, “_Co”, “Cor”, “orp”, “rpo”, “por”, “ora”, “rat”, “ati”, “tio”, “ion” n Assess similarity by manipulating sets of tokens n n 10/27/2020 Given sets of tokens S, T, Jaccard(S, T) = |S T|/|S T| TF-IDF cosine similarity next 12

TF-IDF Cosine Similarity n Token frequency (TF), inverse database frequency (IDF) n n n Borrowed from traditional IR approaches TF-IDF of a “token” in a “record”: log (1+tf) * log (1+ N/Nt) Intuitively, a rare token in the database is important, a frequent token in a record is important n Cosine similarity n n n 10/27/2020 Record → sparse weighted vector of high dimensionality, normalized to have L 2 length = 1 Let a, b be two records; Sa, Sb the sets of their tokens; W(t, Sa), W(t, Sb) the weights of token t in Sa and Sb Cosine(a, b) = ∑ W(t, Sa)*W(t, Sb) 13

TF-IDF Cosine Similarity n Suitable to assess similarity n n n Low weight for “Corporation”, high weights for “AT&T”, “IBM” Cosine(“AT&T”, “AT&T Corporation”) will be high Cosine(“AT&T Corporation”, “IBM Corporation”) will be low n Via q-grams, can capture small typing mistakes n n 10/27/2020 “jaccard” vs “jacard” → {jac, acc, cca, car, ard} vs {jac, aca, car, ard} Common tokens “jac”, “car”, “ard” would be enough to result in high value of Cosine(“jaccard”, “jacard”) 14

Historical Timeline Jaccard coefficient Levenshtein/edit TF-IDF cosine distance similarity KL Divergence Soundex encoding 1901 1918 10/27/2020 Fellegi Sunter 1951 1965 1969 Jaro 1983 -9 FMS Winkler 1999 2003 15

Outline n Motivation n Similarity measures n Efficient algorithms for approximate join n 10/27/2020 Use traditional join methods Extend traditional join methods Commercial systems 16

Approximate Joins: Baseline + Goal n An approximate join of R 1(A 1, …, An) and R 2(B 1, …, Bm) is n n n A subset of the cartesian product of R 1 and R 2 “Matching” specified attributes Ai 1, . . . , Aik with Bi 1, …, Bik Labeled with a similarity score > t > 0 n Naïve method: for each record pair, compute similarity score n I/O and CPU intensive, not scalable to millions of records n Goal: reduce O(n 2) cost to O(n*w), where w << n n n 10/27/2020 Reduce number of pairs on which similarity is computed Take advantage of efficient relational join methods 17

Historical Timelines Index NL Join Sort-Merge Join Big. Match Band Join Merge/ Purge Fast. Map 1977 1991 1995 Probe count Union/find for clustering Spatial join 1997 1998 1991 10/27/2020 Dimension hierarchies 2002 Q-gram set join 1995 1998 SSJoin String. Map WHIRL Approx. string edit distance Multi-relational approx joins 2001 2003 2004 2006 Probe Fuzzy match cluster Cleaning in similarity SQL Server Q-gram SPIDER IDF join 2003 2004 2005 2006 18

Sorted Neighborhood Method [HS 95] n Goal: bring matching records close to each other in linear list n Background: duplicate elimination [BD 83], band join [DNS 91] n Methodology: domain-specific, arbitrary similarity n n Compute discriminating key per record, sort records Slide fixed size window through sorted list, match in window Use OPS 5 rules (equational theory) to determine match Multiple passes with small windows, based on distinct keys n Lesson: multiple “cheap” passes faster than an “expensive” one 10/27/2020 19

Sorted Neighborhood Method [HS 95] n Goal: bring matching records close to each other in linear list r 1 n Example: r 2 yes r 3 ID Name SS DOB ZIP r 1 Smith, John 123 -45 1960/08/24 07932 r 2 Smyth, Jon 123 -45 1961/08/24 07932 r 3 Smith, John 312 -54 1995/07/25 98301 r 4 Smith, J. 723 -45 1960/08/24 98346 r 5 Smith, J. 456 -78 1975/12/11 98346 10/27/2020 ZIP. Name[1. . 3] r 4 r 5 no 20

Sorted Neighborhood Method [HS 95] n Goal: bring matching records close to each other in linear list r 1 n Example: r 2 yes r 3 ID Name SS DOB ZIP r 1 Smith, John 123 -45 1960/08/24 07932 r 2 Smyth, Jon 123 -45 1961/08/24 07932 r 3 Smith, John 312 -54 1995/07/25 98301 r 4 Smith, J. 723 -45 1960/08/24 98346 r 5 Smith, J. 456 -78 1975/12/11 98346 ZIP. Name[1. . 3] r 4 r 5 r 1 DOB. Name[1. . 3] r 4 no yes r 2 r 5 n Blocking is a special case 10/27/2020 r 3 21

Big. Match [Y 02] n Goal: block/index matching records, based on multiple keys n Background: indexed nested loop join [BE 77] n Methodology: domain-specific, Jaro-Winkler similarity n n Store smaller table (100 M) in main memory (4 GB) Create indexes for each set of grouping/blocking criteria Scan larger table (4 B), repeatedly probe smaller table Avoids multiple matches of the same pair n Lesson: traditional join technique can speed up approximate join 10/27/2020 22

Big. Match [Y 02] n Goal: block/index matching records, based on multiple keys n Example: record from outer table Smith, John 10/27/2020 inner table SS. Name[1. . 2] yes no 123 -45 1960/08/24 98346 ID Name SS DOB ZIP r 1 Smith, John 123 -45 1960/08/24 07932 r 2 Smyth, Jon 123 -45 1961/08/24 07932 r 3 Smith, John 312 -54 1995/07/25 98301 r 4 Smith, J. 723 -45 1960/08/24 98346 r 5 Smith, J. 456 -78 1975/12/11 98346 23

Big. Match [Y 02] n Goal: block/index matching records, based on multiple keys n Example: record from outer table Smith, John inner table SS. Name[1. . 2] yes no 123 -45 1960/08/24 ZIP. Name[1. . 3] 98346 yes no ID Name SS DOB ZIP r 1 Smith, John 123 -45 1960/08/24 07932 r 2 Smyth, Jon 123 -45 1961/08/24 07932 r 3 Smith, John 312 -54 1995/07/25 98301 r 4 Smith, J. 723 -45 1960/08/24 98346 r 5 Smith, J. 456 -78 1975/12/11 98346 n Avoids multiple matches of the same pair 10/27/2020 24

Historical Timelines Index NL Join Sort-Merge Join Big. Match Band Join Merge/ Purge Fast. Map 1977 1991 1995 Probe count Union/find for clustering Spatial join 1997 1998 1991 10/27/2020 Dimension hierarchies 2002 Q-gram set join 1995 1998 SSJoin String. Map WHIRL Approx. string edit distance Multi-relational approx joins 2001 2003 2004 2006 Probe Fuzzy match cluster Cleaning in similarity SQL Server Q-gram SPIDER IDF join 2003 2004 2005 2006 25

Q-gram Set Join [GIJ+01] n Goal: compute thresholded edit distance join on string attributes n Background: combinatorial pattern matching [JU 91] n Methodology: domain-independent, edit distance similarity n n Extract set of all overlapping q-grams Q(s) from string s ED(s 1, s 2) ≤ d |Q(s 1) Q(s 2)| max(|s 1|, |s 2|) - (d-1)*q - 1 Cheap filters (length, count, position) to prune non-matches Pure SQL solution: cost-based join methods n Lesson: reduce approximate join to aggregated set intersection 10/27/2020 26

Q-gram Set Join [GIJ+01] n Goal: compute thresholded edit distance join on string attributes n Example: ID Name r 1 Srivastava r 2 Shrivastava r 3 Shrivastav 10/27/2020 27

Q-gram Set Join [GIJ+01] n Goal: compute thresholded edit distance join on string attributes n Example: ID Name 3 -grams r 1 Srivastava ##s, #sr, sri, riv, iva, vas, ast, sta, tav, ava, va$, a$$ r 2 Shrivastava ##s, #sh, shr, hri, riv, iva, vas, ast, sta, tav, ava, va$, a$$ r 3 Shrivastav n n 10/27/2020 ED(s 1, s 2) ≤ d |Q(s 1) Q(s 2)| max(|s 1|, |s 2|) - (d-1)*q - 1 ED(r 1, r 2) = 1, |Q(r 1) Q(r 2)| = 10 28

Q-gram Set Join [GIJ+01] n Goal: compute thresholded edit distance join on string attributes n Example: ID Name r 1 Srivastava r 2 Shrivastava r 3 Shrivastav n n 10/27/2020 3 -grams ##s, #sr, sri, riv, iva, vas, ast, sta, tav, ava, va$, a$$ ##s, #sh, shr, hri, riv, iva, vas, ast, sta, tav, av$, v$$ ED(s 1, s 2) ≤ d |Q(s 1) Q(s 2)| max(|s 1|, |s 2|) - (d-1)*q - 1 ED(r 1, r 2) = 2, |Q(r 1) Q(r 2)| = 7 29

Q-gram Set Join [GIJ+01] n Goal: compute thresholded edit distance join on string attributes n Example: ID Name r 1 Srivastava r 2 Shrivastava r 3 Shrivastav 10/27/2020 Q ID Qg r 1 ##s r 3 ##s r 1 #sr r 3 #sh r 1 sri r 3 shr r 1 riv r 3 hri r 1 iva r 3 riv r 1 vas r 3 iva r 1 ast r 3 vas r 1 sta r 3 ast r 1 tav r 3 sta r 1 ava r 3 tav r 1 va$ r 3 av$ r 1 a$$ r 3 v$$ 30

Q-gram Set Join [GIJ+01] n Goal: compute thresholded edit distance join on string attributes Q n Example: ID Name r 1 Srivastava r 2 Shrivastava r 3 Shrivastav 10/27/2020 SELECT Q 1. ID, Q 2. ID FROM Q AS Q 1, Q AS Q 2 WHERE Q 1. Qg = Q 2. Qg GROUP BY Q 1. ID, Q 2. ID HAVING COUNT(*) > T ID Qg r 1 ##s r 3 ##s r 1 #sr r 3 #sh r 1 sri r 3 shr r 1 riv r 3 hri r 1 iva r 3 riv r 1 vas r 3 iva r 1 ast r 3 vas r 1 sta r 3 ast r 1 tav r 3 sta r 1 ava r 3 tav r 1 va$ r 3 av$ r 1 a$$ r 3 v$$ 31

Fuzzy Match Similarity [CGGM 03] n Goal: identify K “closest” reference records in on-line setting n Background: IDF weighted cosine similarity, WHIRL [C 98] n Methodology: domain-independent, IDF+ED similarity n n Similarity metric based on IDF weighted token edit distance Approximate similarity metric using Jaccard on q-gram sets Small error tolerant index table, sharing of minhash q-grams Optimistic short circuiting exploits large token IDF weights n Lesson: IDF weighting useful to capture erroneous tokens 10/27/2020 32

Fuzzy Match Similarity [CGGM 03] n Goal: identify K “closest” reference records in on-line setting reference table n Example: ID best ED match input record Beoing Corporation 10/27/2020 Seattle WA Org. Name City State ZIP r 1 Boeing Company Seattle WA 98004 r 2 Bon Corporation Seattle WA 98014 r 3 Companions Seattle WA 98024 98004 33

Fuzzy Match Similarity [CGGM 03] n Goal: identify K “closest” reference records in on-line setting n Example: reference table best FMS match input record Beoing Corporation 10/27/2020 Seattle WA ID Org. Name City State ZIP r 1 Boeing Company Seattle WA 98004 r 2 Bon Corporation Seattle WA 98014 r 3 Companions Seattle WA 98024 98004 34

Fuzzy Match Similarity [CGGM 03] n Goal: identify K “closest” reference records in on-line setting reference table n Example: ID input record Beoing Corporation Seattle WA 98004 [eoi, ing] [orp, ati] [sea, ttl] [wa] [980, 004] all minhash q-grams 10/27/2020 Org. Name City State ZIP r 1 Boeing Company Seattle WA 98004 r 2 Bon Corporation Seattle WA 98014 r 3 Companions Seattle WA 98024 ETI table Qg MHC Col Freq TIDList ing 2 1 1 {r 1} orp 1 1 1 {r 2} sea 1 2 3 {r 1, r 2, r 3} 004 2 4 1 {r 1} 35

Fuzzy Match Similarity [CGGM 03] n Goal: identify K “closest” reference records in on-line setting reference table n Example: ID input record Beoing Corporation Seattle WA 98004 [eoi, ing] [orp, ati] [sea, ttl] [wa] [980, 004] optimistic short circuiting 10/27/2020 Org. Name City State ZIP r 1 Boeing Company Seattle WA 98004 r 2 Bon Corporation Seattle WA 98014 r 3 Companions Seattle WA 98024 ETI table Qg MHC Col Freq TIDList ing 2 1 1 {r 1} orp 1 1 1 {r 2} sea 1 2 3 {r 1, r 2, r 3} 004 2 4 1 {r 1} 36

Historical Timelines Index NL Join Sort-Merge Join Big. Match Band Join Merge/ Purge Fast. Map 1977 1991 1995 Probe count Union/find for clustering Spatial join 1997 1998 1991 10/27/2020 Dimension hierarchies 2002 Q-gram set join 1995 1998 SSJoin String. Map WHIRL Approx. string edit distance Multi-relational approx joins 2001 2003 2004 2006 Probe Fuzzy match cluster Cleaning in similarity SQL Server Q-gram SPIDER IDF join 2003 2004 2005 2006 37

Probe-Cluster: Set Joins [SK 04] n Goal: generic algorithm for set join based on similarity predicate n Background: IR and probe count using inverted index [TF 95] n Methodology: domain-independent, weighted set similarity n n Map a string to a set of elements (words, q-grams, etc. ) Build inverted lists on individual set elements Optimization: process skewed lists in increasing size order Optimization: sort lists in decreasing order of record sizes n Lesson: IR query optimizations useful for approximate joins 10/27/2020 38

Probe-Cluster: Set Joins [SK 04] n Goal: generic algorithm for set join based on similarity predicate Inverted index n Example: ID SVA r 1 {##s, #sr, sri, riv, iva, vas, ast, sta, tav, ava, va$, a$$} r 2 {##s, #sh, shr, hri, riv, iva, vas, ast, sta, tav, ava, va$, a$$} r 3 {##s, #sh, shr, hri, riv, iva, vas, ast, sta, tav, av$, v$$} 10/27/2020 SE IDs ##s r 1, r 2, r 3 #sr r 1 #sh r 2, r 3 sri r 1 shr r 2, r 3 hri r 2, r 3 riv r 1, r 2, r 3 … … tav r 1, r 2, r 3 ava r 1, r 2 … … v$$ r 3 39

Probe-Cluster: Set Joins [SK 04] n Goal: generic algorithm for set join based on similarity predicate Inverted index n Example: ID SVA r 1 {##s, #sr, sri, riv, iva, vas, ast, sta, tav, ava, va$, a$$} r 2 {##s, #sh, shr, hri, riv, iva, vas, ast, sta, tav, ava, va$, a$$} r 3 {##s, #sh, shr, hri, riv, iva, vas, ast, sta, tav, av$, v$$} n Sort lists in decreasing order of record sizes 10/27/2020 SE IDs ##s r 2, r 1, r 3 #sr r 1 #sh r 2, r 3 sri r 1 shr r 2, r 3 hri r 2, r 3 riv r 2, r 1, r 3 … … tav r 2, r 1, r 3 ava r 2, r 1 … … v$$ r 3 40

Probe-Cluster: Set Joins [SK 04] n Goal: generic algorithm for set join based on similarity predicate Inverted index n Example: ID SVA r 1 {##s, #sr, sri, riv, iva, vas, ast, sta, tav, ava, va$, a$$} r 2 {##s, #sh, shr, hri, riv, iva, vas, ast, sta, tav, ava, va$, a$$} r 3 {##s, #sh, shr, hri, riv, iva, vas, ast, sta, tav, av$, v$$} n Process skewed lists in increasing size order 10/27/2020 SE IDs ##s r 2, r 1, r 3 #sr r 1 #sh r 2, r 3 sri r 1 shr r 2, r 3 hri r 2, r 3 riv r 2, r 1, r 3 … … tav r 2, r 1, r 3 ava r 2, r 1 … … v$$ r 3 41

Probe-Cluster: Set Joins [SK 04] n Goal: generic algorithm for set join based on similarity predicate Inverted index n Example: ID SVA r 1 {##s, #sr, sri, riv, iva, vas, ast, sta, tav, ava, va$, a$$} r 2 {##s, #sh, shr, hri, riv, iva, vas, ast, sta, tav, ava, va$, a$$} r 3 {##s, #sh, shr, hri, riv, iva, vas, ast, sta, tav, av$, v$$} n Process skewed lists in increasing size order 10/27/2020 SE IDs ##s r 2, r 1, r 3 #sr r 1 #sh r 2, r 3 sri r 1 shr r 2, r 3 hri r 2, r 3 riv r 2, r 1, r 3 … … tav r 2, r 1, r 3 ava r 2, r 1 … … v$$ r 3 42

Probe-Cluster: Set Joins [SK 04] n Goal: generic algorithm for set join based on similarity predicate Inverted index n Example: ID SVA r 1 {##s, #sr, sri, riv, iva, vas, ast, sta, tav, ava, va$, a$$} r 2 {##s, #sh, shr, hri, riv, iva, vas, ast, sta, tav, ava, va$, a$$} r 3 {##s, #sh, shr, hri, riv, iva, vas, ast, sta, tav, av$, v$$} n Process skewed lists in increasing size order 10/27/2020 SE IDs ##s r 2, r 1, r 3 #sr r 1 #sh r 2, r 3 sri r 1 shr r 2, r 3 hri r 2, r 3 riv r 2, r 1, r 3 … … tav r 2, r 1, r 3 ava r 2, r 1 … … v$$ r 3 43

SSJoin: Relational Operator [CGK 06] n Goal: generic algorithm for set join based on similarity predicate n Background: Probe-Cluster, dimension hierarchies, q-gram join n Methodology: domain-independent, weighted set similarity n n Compare strings based on sets associated with each string Problem: Overlap(s 1, s 2) ≥ threshold Optimization: high set overlap → overlap of ordered subsets SQL implementation using equijoins, cost-based plans n Lesson: Generic algorithms can be supported in DBMS 10/27/2020 44

SSJoin: Relational Operator [CGK 06] n Goal: generic algorithm for set join based on similarity predicate Q n Example: ID Name r 1 Srivastava r 4 Srivastav 10/27/2020 SELECT Q 1. ID, Q 2. ID FROM Q AS Q 1, Q AS Q 2 WHERE Q 1. Qg = Q 2. Qg GROUP BY Q 1. ID, Q 2. ID HAVING COUNT(*) > 8 ID Qg r 1 ##s r 4 ##s r 1 #sr r 4 #sr r 1 sri r 4 sri r 1 riv r 4 riv r 1 iva r 4 iva r 1 vas r 4 vas r 1 ast r 4 ast r 1 sta r 4 sta r 1 tav r 4 tav r 1 ava r 4 av$ r 1 va$ r 4 v$$ r 1 a$$ 45

SSJoin: Relational Operator [CGK 06] n Goal: generic algorithm for set join based on similarity predicate Q n Example: ID Name r 1 Srivastava r 4 Srivastav SELECT Q 1. ID, Q 2. ID FROM Q AS Q 1, Q AS Q 2 WHERE Q 1. Qg = Q 2. Qg GROUP BY Q 1. ID, Q 2. ID HAVING COUNT(*) > 8 ID Qg r 1 tav r 4 ##s r 1 ava r 4 #sr r 1 va$ r 4 sri r 1 a$$ r 4 riv r 4 iva r 4 vas r 4 ast r 4 sta r 4 tav r 4 av$ r 4 v$$ n Optimization: use any 4 q-grams of r 1 with all of r 4 10/27/2020 46

SSJoin: Relational Operator [CGK 06] n Goal: generic algorithm for set join based on similarity predicate Q n Example: ID Name r 1 Srivastava r 4 Srivastav SELECT Q 1. ID, Q 2. ID FROM Q AS Q 1, Q AS Q 2 WHERE Q 1. Qg = Q 2. Qg GROUP BY Q 1. ID, Q 2. ID HAVING COUNT(*) > 8 n Optimization: use any 3 q-grams of r 4 10/27/2020 ID Qg r 1 ##s r 4 sri r 1 #sr r 4 av$ r 1 sri r 4 v$$ r 1 riv r 1 iva r 1 vas r 1 ast r 1 sta r 1 tav r 1 ava r 1 va$ r 1 a$$ 47

SSJoin: Relational Operator [CGK 06] n Goal: generic algorithm for set join based on similarity predicate Q n Example: ID Name r 1 Srivastava r 4 Srivastav SELECT Q 1. ID, Q 2. ID FROM Q AS Q 1, Q AS Q 2 WHERE Q 1. Qg = Q 2. Qg GROUP BY Q 1. ID, Q 2. ID HAVING COUNT(*) > 8 ID Qg r 1 iva r 4 iva r 1 ast r 4 ast r 1 ava r 4 av$ r 1 a$$ n Optimization: use ordered 4 q-grams of r 1 and 3 q-grams of r 4 n Suggested ordering: based on decreasing IDF weights 10/27/2020 48

Historical Timelines Index NL Join Sort-Merge Join Big. Match Band Join Merge/ Purge Fast. Map 1977 1991 1995 Probe count Union/find for clustering Spatial join 1997 1998 1991 10/27/2020 Dimension hierarchies 2002 Q-gram set join 1995 1998 SSJoin String. Map WHIRL Approx. string edit distance Multi-relational approx joins 2001 2003 2004 2006 Probe Fuzzy match cluster Cleaning in similarity SQL Server Q-gram SPIDER IDF join 2003 2004 2005 2006 49

Commercial Systems: Comparisons Commercial System Record Linkage Methodology Distance Metrics Supported Domain-Specific Matching Additional Data Quality Support SQL Server Integration Services 2005 Fuzzy Lookup; Fuzzy Grouping; uses Error Tolerant Index customized, domainindependent: edit distance; number, order, freq. of tokens unknown Oracle. BI Warehouse Builder 10 g. R 2 “Paris” match-merge rules; deterministic and probabilistic matching Jaro-Winkler; double metaphone name & address parse; match; standardize: 3 rd party vendors data profiling; data rules; data auditors IBM’s Entity Analytic Solutions, Quality. Stage probabilistic matching (information content); multi-pass blocking; rules-based merging wide variety of fuzzy matching functions name recognition; identity resolution; relationship resolution: EAS data profiling; standardization; trends and anomalies; 10/27/2020 50

Open Problem: Benchmarking n Issue: many algorithms and similarity measures, no benchmarks n Background n Comparing quality of different similarity measures [CRF 03] n Goal: develop standard benchmarks (queries, data generation) 10/27/2020 51

Conclusions n Record linkage is critical when data quality is poor n n Similarity metrics Efficient sub-quadratic approximate join algorithms n Wealth of challenging technical problems n n 10/27/2020 Sophisticated similarity metrics, massive data sets Important to work with real datasets 52

References n n n [ACG 02] Rohit Ananthakrishna, Surajit Chaudhuri, Venkatesh Ganti: Eliminating Fuzzy Duplicates in Data Warehouses. VLDB 2002: 586 -597 [BD 83] Dina Bitton, David J. De. Witt: Duplicate Record Elimination in Large Data Files. ACM Trans. Database Syst. 8(2): 255 -265 (1983) [BE 77] Mike W. Blasgen, Kapali P. Eswaran: Storage and Access in Relational Data Bases. IBM Systems Journal 16(4): 362 -377 (1977) [BG 04] Indrajit Bhattacharya, Lise Getoor: Iterative record linkage for cleaning and integration. DMKD 2004: 11 -18 [C 98] William W. Cohen: Integration of Heterogeneous Databases Without Common Domains Using Queries Based on Textual Similarity. SIGMOD Conference 1998: 201 -212 [C 00] William W. Cohen: Data integration using similarity joins and a word-based information representation language. ACM Trans. Inf. Syst. 18(3): 288 -321 (2000) [CCZ 02] Peter Christen, Tim Churches, Xi Zhu: Probabilistic name and address cleaning and standardization. Australasian Data Mining Workshop 2002. [CGGM 04] Surajit Chaudhuri, Kris Ganjam, Venkatesh Ganti, Rajeev Motwani: Robust and Efficient Fuzzy Match for Online Data Cleaning. SIGMOD Conference 2003: 313 -324 [CGG+05] Surajit Chaudhuri, Kris Ganjam, Venkatesh Ganti, Rahul Kapoor, Vivek R. Narasayya, Theo Vassilakis: Data cleaning in microsoft SQL server 2005. SIGMOD Conference 2005: 918 -920 [CGK 06] Surajit Chaudhuri, Venkatesh Ganti, Raghav Kaushik: A primitive operator for similarity joins in data cleaning. ICDE 2006. [CGM 05] Surajit Chaudhuri, Venkatesh Ganti, Rajeev Motwani: Robust Identification of Fuzzy Duplicates. ICDE 2005: 865 -876 [CRF 03] William W. Cohen, Pradeep Ravikumar, Stephen E. Fienberg: A Comparison of String Distance Metrics for Name-Matching Tasks. IIWeb 2003: 73 -78 10/27/2020 53

References n n n [DJ 03] Tamraparni Dasu, Theodore Johnson: Exploratory Data Mining and Data Cleaning John Wiley 2003 [DNS 91] David J. De. Witt, Jeffrey F. Naughton, Donovan A. Schneider: An Evaluation of Non-Equijoin Algorithms. VLDB 1991: 443 -452 [DWI 02] Data Warehousing Institute report 2002 [E 00] Larry English: Plain English on Data Quality: Information Quality Management: The Next Frontier. DM Review Magazine: April 2000. http: //www. dmreview. com/article_sub. cfm? article. Id=2073 [FL 95] Christos Faloutsos, King-Ip Lin: Fast. Map: A Fast Algorithm for Indexing, Data-Mining and Visualization of Traditional and Multimedia Datasets. SIGMOD Conference 1995: 163 -174 [FS 69] I. Fellegi, A. Sunter: A theory of record linkage. Journal of the American Statistical Association, Vol 64. No 328, 1969 [G 98] D. Gusfield: Algorithms on strings, trees and sequences. Cambridge university press 1998 [GFS+01] Helena Galhardas, Daniela Florescu, Dennis Shasha, Eric Simon, Cristian-Augustin Saita: Declarative Data Cleaning: Language, Model, and Algorithms. VLDB 2001: 371 -380 [GIJ+01] Luis Gravano, Panagiotis G. Ipeirotis, H. V. Jagadish, Nick Koudas, S. Muthukrishnan, Divesh Srivastava: Approximate String Joins in a Database (Almost) for Free. VLDB 2001: 491 -500 [GIKS 03] Luis Gravano, Panagiotis G. Ipeirotis, Nick Koudas, Divesh Srivastava: Text joins in an RDBMS for web data integration. WWW 2003: 90 -101 [GKMS 04] S. Guha, N. Koudas, A. Marathe, D. Srivastava : Merging the results of approximate match operations. VLDB 2004. [GKR 98] David Gibson, Jon M. Kleinberg, Prabhakar Raghavan: Clustering Categorical Data: An Approach Based on Dynamical Systems. VLDB 1998: 311 -322 10/27/2020 54

References n n n [HS 95] Mauricio A. Hernández, Salvatore J. Stolfo: The Merge/Purge Problem for Large Databases. SIGMOD Conference 1995: 127 -138 [HS 98] Gísli R. Hjaltason, Hanan Samet: Incremental Distance Join Algorithms for Spatial Databases. SIGMOD Conference 1998: 237 -248 [J 89] M. A. Jaro: Advances in record linkage methodology as applied to matching the 1985 census of Tampa, Florida. Journal of the American Statistical Association 84: 414 -420. [JLM 03] Liang Jin, Chen Li, Sharad Mehrotra: Efficient Record Linkage in Large Data Sets. DASFAA 2003 [JU 91] Petteri Jokinen, Esko Ukkonen: Two Algorithms for Approximate String Matching in Static Texts. MFCS 1991: 240 -248 [KL 51] S. Kullback, R. Liebler : On information and sufficiency. The annals of mathematical statistics 22(1): 79 -86. 1959. [KMC 05] Dmitri V. Kalashnikov, Sharad Mehrotra, Zhaoqi Chen: Exploiting Relationships for Domain. Independent Data Cleaning. SDM 2005 [KMS 04] Nick Koudas, Amit Marathe, Divesh Srivastava: Flexible String Matching Against Large Databases in Practice. VLDB 2004: 1078 -1086 [KMS 05] Nick Koudas, Amit Marathe, Divesh Srivastava: SPIDER: flexible matching in databases. SIGMOD Conference 2005: 876 -878 [LLL 00] Mong-Li Lee, Tok Wang Ling, Wai Lup Low: Intelli. Clean: a knowledge-based intelligent data cleaner. KDD 2000: 290 -294 [ME 96] Alvaro E. Monge, Charles Elkan: The Field Matching Problem: Algorithms and Applications. KDD 1996: 267 -270 10/27/2020 55

References n n n n n [ME 97] Alvaro E. Monge, Charles Elkan: An Efficient Domain-Independent Algorithm for Detecting Approximately Duplicate Database Records. DMKD 1997 [RY 97] E. Ristad, P. Yianilos : Learning string edit distance. IEEE Pattern analysis and machine intelligence 1998. [S 83] Gerry Salton : Introduction to modern information retrieval. Mc. Graw Hill 1987. [SK 04] Sunita Sarawagi, Alok Kirpal: Efficient set joins on similarity predicates. SIGMOD Conference 2004: 743 -754 [TF 95] Howard R. Turtle, James Flood: Query Evaluation: Strategies and Optimizations. Inf. Process. Manage. 31(6): 831 -850 (1995) [TKF 01] S. Tejada, C. Knoblock, S. Minton : Learning object identification rules for information integration. Information Systems, Vol 26, No 8, 607 -633, 2001. [W 94] William E. Winkler: Advanced methods for record linkage. Proceedings of the section on survey research methods, American Statistical Association 1994: 467 -472 [W 99] William E. Winkler: The state of record linkage and current research problems. IRS publication R 99/04 (http: //www. census. gov/srd/www/byname. html) [Y 02] William E. Yancey: Big. Match: A program for extracting probable matches from a large file for record linkage. RRC 2002 -01. Statistical Research Division, U. S. Bureau of the Census. 10/27/2020 56