Record Linkage: A 10-Year Retrospective
Chen Li and Sharad Mehrotra
UC Irvine

Efficient Record Linkage in Large Data Sets
Liang Jin, Chen Li, Sharad Mehrotra
University of California, Irvine
DASFAA, Kyoto, Japan, March 2003

How was the paper written?
Two faculty working on different areas, plus a 1st-year Ph.D. student

Chen’s Story: 2001 …

Data Integration Problems?
Talking to medical doctors…

Example

Table R
Name            SSN            Addr
Jack Lemmon     430-871-8294   Maple St
Harrison Ford   292-918-2913   Culver Blvd
Tom Hanks       234-762-1234   Main St
…               …              …

Table S
Name            SSN            Addr
Ton Hanks       234-162-1234   Main Street
Kevin Spacey    928-184-2813   Frost Blvd
Jack Lemon      430-817-8294   Maple Street
…               …              …

Q: Find records from different datasets that could be the same entity
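
One common way to quantify "could be the same entity" for the values above is edit distance, which a later slide names as an example similarity function. The sketch below is a plain dynamic-programming Levenshtein distance applied to values from Tables R and S; the function name is an illustrative choice, not something taken from the paper.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions, and substitutions that turn a into b."""
    prev = list(range(len(b) + 1))          # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        curr = [i]                          # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


# Values taken from the example tables
print(edit_distance("Tom Hanks", "Ton Hanks"))        # 1
print(edit_distance("Jack Lemmon", "Jack Lemon"))     # 1
print(edit_distance("430-871-8294", "430-817-8294"))  # 2
print(edit_distance("Main St", "Main Street"))        # 4
```

The swapped digits in the SSN count as two substitutions here, because plain Levenshtein distance has no transposition operation.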

Sharad’s research

Liang’s story
1st-year Ph.D. student at UC Irvine

Challenges
  • How to define good similarity functions?
  • How to do matching efficiently?

Nested-loop?
  • Not desirable for large data sets
  • 5 hours for 30K strings!
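
For reference, a nested-loop join can be sketched as below: every string on one side is compared with every string on the other. The names and the threshold are illustrative assumptions, and the slide does not describe the setup behind the 5-hour figure. For n strings on each side the loop performs n² distance computations, which for n = 30,000 is 9 × 10^8.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via row-by-row dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def nested_loop_join(r, s, k):
    """Return every pair (x, y) with edit distance <= k.
    Cost: len(r) * len(s) distance computations."""
    return [(x, y) for x in r for y in s if edit_distance(x, y) <= k]


r_names = ["Jack Lemmon", "Harrison Ford", "Tom Hanks"]
s_names = ["Ton Hanks", "Kevin Spacey", "Jack Lemon"]
print(nested_loop_join(r_names, s_names, k=1))
# [('Jack Lemmon', 'Jack Lemon'), ('Tom Hanks', 'Ton Hanks')]
```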

Our 2-step approach
  • Step 1: map strings (in a metric space) to objects in a Euclidean space
  • Step 2: do a similarity join in the Euclidean space
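
The two steps can be illustrated with a minimal sketch. It assumes a simple pivot-based embedding (each coordinate is the edit distance to a chosen pivot string), which is swapped in here for illustration and is not necessarily the mapping used in the paper. The pivot strings, function names, and the brute-force scan over vectors are also assumptions; a real system would run a multidimensional join algorithm in step 2. Candidates found in the Euclidean space are verified with the original distance.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via row-by-row dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def embed(s, pivots):
    """Step 1: map a string to a point; coordinate i is the
    edit distance from s to pivot i."""
    return [edit_distance(s, p) for p in pivots]


def squared_euclidean(u, v):
    return sum((x - y) ** 2 for x, y in zip(u, v))


def two_step_join(r, s, k, pivots):
    # By the triangle inequality, each coordinate of two embedded strings
    # differs by at most their edit distance, so pairs within edit
    # distance k lie within squared Euclidean distance len(pivots) * k * k.
    radius_sq = len(pivots) * k * k
    r_pts = [(x, embed(x, pivots)) for x in r]
    s_pts = [(y, embed(y, pivots)) for y in s]
    # Step 2: similarity join in the Euclidean space (brute force here).
    candidates = [(x, y) for x, px in r_pts for y, py in s_pts
                  if squared_euclidean(px, py) <= radius_sq]
    # Verification: discard candidates that are not true matches.
    return [(x, y) for x, y in candidates if edit_distance(x, y) <= k]


r_names = ["Jack Lemmon", "Harrison Ford", "Tom Hanks"]
s_names = ["Ton Hanks", "Kevin Spacey", "Jack Lemon"]
print(two_step_join(r_names, s_names, k=1, pivots=["Jack", "Hanks"]))
# [('Jack Lemmon', 'Jack Lemon'), ('Tom Hanks', 'Ton Hanks')]
```

With this particular embedding the Euclidean filter never drops a true match, so the output equals that of a nested-loop join; mappings that only approximately preserve distances trade some recall for speed.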

Advantages
  • Applicable to many metric similarity functions
      - E.g.: Edit distance
  • Open to existing algorithms
      - Mapping techniques
      - Join techniques

Step 1
Map strings into a high-dimensional Euclidean space
[Figure: Metric Space, Euclidean Space]

Can it preserve distances?
  • Use data set 1 (54K names) as an example
  • k=2, d=20
      - Use k’=5.2 to differentiate similar and dissimilar pairs.

Multi-attribute linkage
  • Example: title + name + year
  • Different attributes have different similarity functions and thresholds
  • Consider merge rules in disjunctive format:
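
A merge rule in disjunctive form can be read as an OR of conjunctions, where each conjunction requires several attributes to be within their own thresholds. The sketch below shows one possible encoding; the rule, the thresholds, the year-distance function, and the sample records are all hypothetical and do not come from the paper.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via row-by-row dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def year_gap(a: int, b: int) -> int:
    """Hypothetical distance function for the year attribute."""
    return abs(a - b)


# Hypothetical rule: (title within 4 edits AND same year)
#                 OR (name within 1 edit AND year within 1)
RULE = [
    [("title", edit_distance, 4), ("year", year_gap, 0)],
    [("name", edit_distance, 1), ("year", year_gap, 1)],
]


def should_merge(rec_a, rec_b, rule):
    """True if at least one conjunction has every attribute within threshold."""
    return any(
        all(dist(rec_a[attr], rec_b[attr]) <= limit for attr, dist, limit in conj)
        for conj in rule
    )


# Hypothetical records for illustration only
a = {"title": "Efficient Record Linkage in Large Data Sets", "name": "Liang Jin", "year": 2003}
b = {"title": "Efficient Record Linkage in Large Datasets", "name": "L. Jin", "year": 2003}
c = {"title": "Instant Search", "name": "Chen Li", "year": 2008}

print(should_merge(a, b, RULE))  # True: first conjunction holds
print(should_merge(a, c, RULE))  # False: no conjunction holds
```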

Secret of the paper …

Work since then …
  • Chen: efficiency
  • Sharad: quality

Chen’s Work on Efficiency
  • Gram-based algorithms
      - Indexing
      - Selection algorithms
      - Join algorithms
      - Variable-length grams
      - Selectivity estimation
  • Trie-based algorithms
      - Instant search
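
To give a flavor of the gram-based idea, here is a minimal sketch of approximate string selection with a q-gram inverted index and a count filter. It relies on a standard bound: if two strings are within edit distance k, they share at least max(|s|, |t|) - q + 1 - k·q q-grams, counted as bags. The function names and the choice q = 2 are assumptions for illustration; this is not the API of any particular package.

```python
from collections import Counter, defaultdict


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via row-by-row dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def grams(s, q=2):
    """All overlapping substrings of length q (no padding)."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def build_index(strings, q=2):
    """Inverted index: q-gram -> ids of the strings containing it."""
    index = defaultdict(set)
    for sid, s in enumerate(strings):
        for g in grams(s, q):
            index[g].add(sid)
    return index


def search(query, strings, index, k, q=2):
    """Return the strings within edit distance k of query."""
    q_bag = Counter(grams(query, q))
    if len(query) - q + 1 - k * q >= 1:
        # Every match must share at least one gram, so the index suffices.
        cand_ids = set()
        for g in q_bag:
            cand_ids |= index.get(g, set())
    else:
        # The bound cannot prune anything: fall back to a full scan.
        cand_ids = set(range(len(strings)))
    results = []
    for sid in sorted(cand_ids):
        s = strings[sid]
        bound = max(len(query), len(s)) - q + 1 - k * q
        shared = sum((q_bag & Counter(grams(s, q))).values())
        # Count filter first, then verify with the real distance.
        if shared >= bound and edit_distance(query, s) <= k:
            results.append(s)
    return results


names = ["Ton Hanks", "Kevin Spacey", "Jack Lemon"]
idx = build_index(names)
print(search("Tom Hanks", names, idx, k=1))    # ['Ton Hanks']
print(search("Jack Lemmon", names, idx, k=1))  # ['Jack Lemon']
```

The index narrows the candidates to strings that share at least one gram with the query, and the count filter discards most of those before the more expensive edit-distance check runs.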

The Flamingo Package
http://flamingo.ics.uci.edu/

Follow-up work in the community
  • Significant amount of work on approximate string queries
      - Selection
      - Join

Make an impact?

UCI People Search

Psearch (2008): 2 stories

Fuzzy search

Location-based search
www.omniplaces.com

Research commercialization

Lesson learned: Hands-on experiences important!