Record Linkage: A 10-Year Retrospective. Chen Li and Sharad Mehrotra, UC Irvine
Efficient Record Linkage in Large Data Sets. Liang Jin, Chen Li, Sharad Mehrotra, University of California, Irvine. DASFAA, Kyoto, Japan, March 2003
How was the paper written? Two faculty working on different areas, plus a 1st-year Ph.D. student
Chen’s Story: 2001 …
Data Integration Problems? Talking to medical doctors…
Example
Table R
Name            SSN            Addr
Jack Lemmon     430-871-8294   Maple St
Harrison Ford   292-918-2913   Culver Blvd
Tom Hanks       234-762-1234   Main St
…               …              …
Table S
Name            SSN            Addr
Ton Hanks       234-162-1234   Main Street
Kevin Spacey    928-184-2813   Frost Blvd
Jack Lemon      430-817-8294   Maple Street
…               …              …
Q: Find records from different datasets that could be the same entity
Sharad’s research
Liang’s story: 1st-year Ph.D. student at UC Irvine
Challenges
- How to define good similarity functions?
- How to do matching efficiently?
Nested-loop?
- Not desirable for large data sets
- 5 hours for 30K strings!
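To make the cost concrete, here is a minimal sketch of a nested-loop similarity join under edit distance. The function names and the threshold `k=1` are illustrative assumptions, and the names come from the example tables; this is not the code that produced the 5-hour figure.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def nested_loop_join(r, s, k):
    """Return every pair (x, y) with edit_distance(x, y) <= k.

    Compares all |r| * |s| pairs, which is why it does not scale.
    """
    return [(x, y) for x in r for y in s if edit_distance(x, y) <= k]


names_r = ["Jack Lemmon", "Harrison Ford", "Tom Hanks"]
names_s = ["Ton Hanks", "Kevin Spacey", "Jack Lemon"]
matches = nested_loop_join(names_r, names_s, k=1)
```

Every pair is compared, so doubling both inputs quadruples the number of edit-distance computations.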
Our 2-step approach
- Step 1: map strings (in a metric space) to objects in a Euclidean space
- Step 2: do a similarity join in the Euclidean space
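The two steps can be sketched as follows. Note the substitution: this sketch uses a simple pivot-based embedding (each coordinate is the edit distance to a chosen pivot string), which is an assumption made for illustration and not the mapping used in the paper. By the triangle inequality each coordinate differs by at most the edit distance between the two strings, so with d pivots a Euclidean threshold of k' = k·√d produces no false dismissals. The pivot choice and all function names are hypothetical, and the join over vectors is a plain loop where a real system would use a multidimensional join algorithm.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def embed(strings, pivots):
    """Step 1 (assumed pivot embedding): one integer coordinate per pivot."""
    return [[edit_distance(s, p) for p in pivots] for s in strings]


def squared_euclidean(u, v):
    """Squared Euclidean distance; integers keep the comparison exact."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def two_step_join(r, s, k, pivots):
    """Step 2: filter pairs in the Euclidean space, then verify survivors."""
    limit = k * k * len(pivots)  # (k')^2 with k' = k * sqrt(d)
    vec_r, vec_s = embed(r, pivots), embed(s, pivots)
    results = []
    for x, vx in zip(r, vec_r):
        for y, vy in zip(s, vec_s):
            # Cheap vector test first; the costly edit distance only if it passes.
            if squared_euclidean(vx, vy) <= limit and edit_distance(x, y) <= k:
                results.append((x, y))
    return results


names_r = ["Jack Lemmon", "Harrison Ford", "Tom Hanks"]
names_s = ["Ton Hanks", "Kevin Spacey", "Jack Lemon"]
pivots = ["Jack Lemmon", "Kevin Spacey"]  # hypothetical pivot choice
matches = two_step_join(names_r, names_s, k=1, pivots=pivots)
```

The final edit-distance check removes false positives that pass the Euclidean filter, so the result matches an exhaustive comparison.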
Advantages
- Applicable to many metric similarity functions
  - E.g., edit distance
- Open to existing algorithms
  - Mapping techniques
  - Join techniques
Step 1: Map strings into a high-dimensional Euclidean space
[Figure: Metric Space → Euclidean Space]
Can it preserve distances?
- Use data set 1 (54K names) as an example
- k=2, d=20
  - Use k’=5.2 to differentiate similar and dissimilar pairs.
Multi-attribute linkage
- Example: title + name + year
- Different attributes have different similarity functions and thresholds
- Consider merge rules in disjunctive format:
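The rule itself did not survive on the slide, so the following is only a sketch of what a disjunctive merge rule could look like, assuming each disjunct is a conjunction of per-attribute threshold tests. The rules, thresholds, distance functions, and records below are all hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def year_gap(a: str, b: str) -> int:
    """Absolute difference between two years given as strings."""
    return abs(int(a) - int(b))


# Hypothetical rules: each inner list is a conjunction of
# (attribute, distance function, threshold) tests; the outer list is a disjunction.
RULES = [
    [("title", edit_distance, 3), ("year", year_gap, 0)],
    [("name", edit_distance, 1), ("year", year_gap, 1)],
]


def should_merge(rec_a, rec_b, rules=RULES):
    """True if at least one rule has all of its attribute tests satisfied."""
    return any(
        all(dist(rec_a[attr], rec_b[attr]) <= limit for attr, dist, limit in rule)
        for rule in rules
    )


a = {"title": "Efficient Record Linkage", "name": "Tom Hanks", "year": "2003"}
b = {"title": "Efficent Record Linkage", "name": "T. Hanks", "year": "2003"}
c = {"title": "A Different Title Entirely", "name": "Ton Hanks", "year": "2004"}
d = {"title": "A Different Title Entirely", "name": "Kevin Spacey", "year": "2003"}
```

Here `a` and `b` merge through the title rule, `a` and `c` through the name rule, and `a` and `d` satisfy neither.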
Secret of the paper …
Work since then …
- Chen: efficiency
- Sharad: quality
Chen’s Work on Efficiency
- Gram-based algorithms
  - Indexing
  - Selection algorithms
  - Join algorithms
  - Variable-length grams
  - Selectivity estimation
- Trie-based algorithms
  - Instant search
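As one illustration of the gram-based line of work, the sketch below builds an inverted index on 2-grams and answers an approximate selection query with the well-known count filter: a string within edit distance k of the query must share at least |query| − q + 1 − k·q grams with it, because one edit can destroy at most q grams. This is a generic sketch with hypothetical function names, not the Flamingo package's API, and it covers only indexing and selection.

```python
from collections import Counter, defaultdict


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def qgrams(s: str, q: int = 2) -> Counter:
    """Bag of overlapping q-grams of s (no padding)."""
    return Counter(s[i:i + q] for i in range(len(s) - q + 1))


def build_index(strings, q: int = 2):
    """Inverted index: gram -> list of (string id, occurrences in that string)."""
    index = defaultdict(list)
    for sid, s in enumerate(strings):
        for gram, count in qgrams(s, q).items():
            index[gram].append((sid, count))
    return index


def search(query, strings, index, k, q: int = 2):
    """Return the strings within edit distance k of query, sorted."""
    threshold = len(query) - q + 1 - k * q
    if threshold <= 0:
        candidates = range(len(strings))  # the filter cannot prune anything
    else:
        shared = Counter()
        for gram, qcount in qgrams(query, q).items():
            for sid, count in index.get(gram, ()):
                shared[sid] += min(qcount, count)  # bag intersection size
        candidates = [sid for sid, c in shared.items() if c >= threshold]
    # Verify the surviving candidates with the exact distance.
    return sorted(strings[sid] for sid in candidates
                  if edit_distance(query, strings[sid]) <= k)


names = ["Jack Lemmon", "Harrison Ford", "Tom Hanks", "Kevin Spacey"]
index = build_index(names)
hits = search("Ton Hanks", names, index, k=1)
```

Only strings that share enough grams with the query reach the edit-distance check, which is where the savings over a full scan come from.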
The Flamingo Package: http://flamingo.ics.uci.edu/
Follow-up work in the community
- Significant amount of work on approximate string queries
  - Selection
  - Join
Make an impact?
UCI People Search
Psearch (2008): 2 stories
Fuzzy search
Location-based search: www.omniplaces.com
Research commercialization
Lesson learned: Hands-on experiences important!