Entity Resolution at Scale: The PSig algorithm. Huon Wilson

- Slides: 35
Entity Resolution at Scale: The PSig algorithm
    Huon Wilson | StellarGraph
    www.data61.csiro.au
Outline
    What is Entity Resolution?
    What is PSig?
    PSig on Apache Spark
    2 | Entity Resolution at Scale | Huon Wilson
What is Entity Resolution?
Matching Records
    Image: https://thenounproject.com/search/?q=Database&i=1276526
Deduplicating Records
    Image: https://thenounproject.com/search/?q=Database&i=1276526
Irony
    Entity Resolution, Record Linkage, Duplicate Detection, Reference Reconciliation, Deduplication, Household Matching
    Source: www.cs.umd.edu/~getoor/Tutorials/ER_KDD2013.pdf
Dataset
    Issues illustrated: Missing Values, Corruptions/Typos, Abbreviations
    Name             Address                                   Phone
    Trần, Minh       96 Cambridge Street SILVERDALE NSW 2752   (02) 4798 1225
    Taylor, Mary     66 Ageston Rd MUTDAPILLY Queensland 4307  45514614
    Taylor, Michael  66 Settlement Road 3854 Australia         (missing)
    Tran, Minh       (missing)                                 +61247981225
    Taylor, M        66 Ageston Road MUTAPILY QLD 4307         (missing)
Naive: Quadratic
    Name             Address                                   Phone
    Trần, Minh       66 Cambridge Street SILVERDALE NSW 2752   (02) 4798 1225
    Taylor, Mary     66 Ageston Rd MUTDAPILLY Queensland 4307  45514614
    Taylor, Michael  66 Settlement Road 3854 Australia         (missing)
    Tran, Minh       (missing)                                 +61247981225
    Taylor, M        66 Ageston Road MUTAPILY QLD 4307         (missing)
    Pairwise comparison scores shown: 0.1, 0.2, 0.6, 0.2
    (n - 1) + (n - 2) + ⋯ + 2 + 1 = O(n²)
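    The quadratic cost above can be sketched in a few lines of Python. This is only an illustration: the talk does not say which scoring function it uses, so `similarity` below is a stand-in built on the standard library's `difflib.SequenceMatcher`, and `naive_pairs` is a hypothetical helper name.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Names from the example table on the slide.
names = ["Trần, Minh", "Taylor, Mary", "Taylor, Michael", "Tran, Minh", "Taylor, M"]


def similarity(a: str, b: str) -> float:
    """Placeholder similarity in [0, 1]; an assumption, not the talk's scorer."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def naive_pairs(records):
    """Score every unordered pair of records: n(n - 1)/2 comparisons."""
    return [
        (i, j, similarity(records[i], records[j]))
        for i, j in combinations(range(len(records)), 2)
    ]


scored = naive_pairs(names)
n = len(names)
print(len(scored), n * (n - 1) // 2)  # 10 10
```

    Five records already need 10 comparisons; the count grows with the square of the number of records, which is what the following slides set out to avoid.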
What is PSig?
PSig - Overview
    Original Data -> Microblocking -> Scorable Links -> Scoring/Verification -> Linked Record Graph
    Y. Zhang, K. S. Ng, et al., "Scalable Entity Resolution Using Probabilistic Signatures on Parallel Databases," arxiv.org/abs/1712.09691, Dec. 2017.
Microblocking - Candidate Signatures
    ID  Name             Address
    1   Trần, Minh       66 Cambridge Street SILVERDALE NSW 2752
    2   Taylor, Mary     66 Ageston Rd MUTDAPILLY Queensland 4307
    3   Taylor, Michael  66 Heyfield Road 3854 Australia
    4   Tran, Minh       (missing)
    5   Taylor, M        66 Ageston Road MUTAPILY QLD 4307
    Signature types: Initials + Address, Last name + Address
    Taylor at 66 Ageston: MT 66, MT Ageston, Taylor 66, Taylor Ageston
    Taylor at 66 Heyfield: MT 66, MT Heyfield, Taylor 66, Taylor Heyfield
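    A minimal sketch of how candidate signatures like the ones on this slide could be generated. The function name `candidate_signatures` is hypothetical, and the choice of the first two address tokens (street number and street name) is an assumption read off the slide's examples, not a rule stated in the talk or the paper.

```python
def candidate_signatures(name: str, address: str) -> set:
    """Build 'initials + address token' and 'last name + address token' signatures.

    Assumes names are written 'Last, First' and that only the first two
    address tokens are used (an assumption for illustration).
    """
    last, _, first = name.partition(",")
    last, first = last.strip(), first.strip()
    initials = (first[:1] + last[:1]).upper()
    signatures = set()
    for token in address.split()[:2]:
        signatures.add(f"{initials} {token}")
        signatures.add(f"{last} {token}")
    return signatures


for sig in sorted(candidate_signatures("Taylor, Michael", "66 Heyfield Road 3854 Australia")):
    print(sig)
```

    A record with no address (ID 4 here) produces no signatures of these two types, so other signature types would be needed to link it.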
Microblocking - Reverse Index
    ID  Name             Address
    1   Trần, Minh       66 Cambridge Street SILVERDALE NSW 2752
    2   Taylor, Mary     66 Ageston Rd MUTDAPILLY Queensland 4307
    3   Taylor, Michael  66 Heyfield Road 3854 Australia
    4   Tran, Minh       (missing)
    5   Taylor, M        66 Ageston Road MUTAPILY QLD 4307
    Signatures: MT 66, MT Ageston, ...
Microblocking - Signatures
    ID  Name             Address
    1   Trần, Minh       66 Cambridge Street SILVERDALE NSW 2752
    2   Taylor, Mary     66 Ageston Rd MUTDAPILLY Queensland 4307
    3   Taylor, Michael  66 Heyfield Road 3854 Australia
    4   Tran, Minh       (missing)
    5   Taylor, M        66 Ageston Road MUTAPILY QLD 4307
    Signature        IDs
    MT 66            { 1, 2, 3, 5 }
    MT Ageston       { 2, 5 }
    Taylor 66        { 2, 3, 5 }
    Taylor Ageston   { 2, 5 }
    ...
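    The signature-to-IDs table is a reverse (inverted) index, which might be built as sketched below. The per-record signature sets for IDs 1 and 5 are inferred from the pattern on the earlier slide, and dropping signatures held by only one record is an assumption made here because such a signature cannot link anything; `reverse_index` is a hypothetical name.

```python
from collections import defaultdict

# Candidate signatures per record. IDs 1 and 5 are inferred by the same
# pattern as the slide's examples (an assumption); ID 4 has no address.
signatures_by_record = {
    1: {"MT 66", "MT Cambridge", "Trần 66", "Trần Cambridge"},
    2: {"MT 66", "MT Ageston", "Taylor 66", "Taylor Ageston"},
    3: {"MT 66", "MT Heyfield", "Taylor 66", "Taylor Heyfield"},
    4: set(),
    5: {"MT 66", "MT Ageston", "Taylor 66", "Taylor Ageston"},
}


def reverse_index(sigs_by_record):
    """Map each signature to the set of record IDs that produced it."""
    index = defaultdict(set)
    for record_id, sigs in sigs_by_record.items():
        for sig in sigs:
            index[sig].add(record_id)
    # A signature held by a single record cannot produce a link, so drop it.
    return {sig: ids for sig, ids in index.items() if len(ids) >= 2}


index = reverse_index(signatures_by_record)
for sig in sorted(index):
    print(sig, sorted(index[sig]))
```

    With these inputs the surviving entries are the four rows shown on the slide.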
Scoring/Verification - Comparisons
    ID  Name
    1   Trần, Minh
    2   Taylor, Mary
    3   Taylor, Michael
    4   Tran, Minh
    5   Taylor, M
    Signature        IDs
    MT Ageston       { 2, 5 }
    Taylor 66        { 2, 3, 5 }
    Taylor Ageston   { 2, 5 }
    ...
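    Only records that share a signature need to be compared. A small sketch of turning the signature blocks on this slide into a deduplicated set of candidate pairs (`candidate_pairs` is a hypothetical helper name):

```python
from itertools import combinations

# Signature blocks as listed on the slide.
blocks = {
    "MT Ageston": {2, 5},
    "Taylor 66": {2, 3, 5},
    "Taylor Ageston": {2, 5},
}


def candidate_pairs(blocks):
    """Collect each unordered pair of IDs that co-occur in at least one block."""
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


pairs = candidate_pairs(blocks)
print(sorted(pairs))  # [(2, 3), (2, 5), (3, 5)]
```

    For these three blocks that is 3 comparisons rather than the 10 needed to compare all five records with each other, and the pair (2, 5) is scored once even though it appears in every block.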
Scoring/Verification - Linked Records Graph
    Signature        IDs
    MT Ageston       { 2, 5 }
    Taylor 66        { 2, 3, 5 }
    Taylor Ageston   { 2, 5 }
    ...
    Tran...          { 1, 4 }
    ...
    Graph nodes: 1 Trần, Minh; 2 Taylor, Mary; 3 Taylor, Michael; 4 Tran, Minh; 5 Taylor, M
    Edge scores shown: 0.9, 0.3, 0.4, 0.6
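    One possible way to use a linked record graph is to keep only the edges above a score threshold and read off connected components as entities. The sketch below is an assumption on several counts: the assignment of the four scores to specific edges is a reading of the slide layout, the threshold of 0.5 is made up for illustration, and the talk does not say that PSig clusters the graph this way.

```python
# Scored links; which score belongs to which edge is an assumed reading
# of the slide, not something the transcript states.
scored_links = {(2, 5): 0.9, (2, 3): 0.3, (3, 5): 0.4, (1, 4): 0.6}
THRESHOLD = 0.5  # hypothetical cut-off


def entity_clusters(record_ids, links, threshold):
    """Group records connected by links scoring at least `threshold` (union-find)."""
    parent = {r: r for r in record_ids}

    def find(x):
        # Follow parent pointers to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), score in links.items():
        if score >= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for r in record_ids:
        clusters.setdefault(find(r), []).append(r)
    return sorted(sorted(c) for c in clusters.values())


clusters = entity_clusters([1, 2, 3, 4, 5], scored_links, THRESHOLD)
print(clusters)  # [[1, 4], [2, 5], [3]]
```

    Under these assumptions the two Minh records and the two "Taylor, M..." records at 66 Ageston end up together, while Taylor, Michael stays separate.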
PSig on Apache Spark
DataFrames > RDDs (for PSig)
    Records  RDD (s)  DataFrame (s)  Speed-up
    10M      668      164            4.1
    50M      27900+   394            71+
    70 cores, 455 G memory
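    The speed-up column is the RDD time divided by the DataFrame time. A quick check of the arithmetic, using only the numbers in the table (the "+" on the 50M row marks the RDD time as a lower bound, so its speed-up is a lower bound too):

```python
# (RDD seconds, DataFrame seconds) per dataset size, from the table.
timings = {"10M": (668, 164), "50M": (27900, 394)}

speedups = {size: rdd / df for size, (rdd, df) in timings.items()}

for size, factor in speedups.items():
    print(f"{size}: {factor:.1f}x")
```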
DataFrames > RDDs
    Table representation: more compact
    Catalyst query optimiser: generates specialized code
Small DataFrames are Deceiving
    Algorithm   Partitions     Time (s)
    RDD         Default (12)   17
    DataFrame   Default (200)  80
    DataFrame   12             20
    10K records, 12 cores, 4 G memory
    spark.default.parallelism=12
    spark.sql.shuffle.partitions=12
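    The two settings on this slide can be passed to `spark-submit` with its `--conf key=value` option. The sketch below only builds the argument list and shows how few records each partition holds at 10K records; the helper names are hypothetical, and tying the partition count to the core count is an assumption drawn from the slide's 12-core example rather than a general rule.

```python
def small_job_conf(cores: int) -> dict:
    """Spark settings from the slide, with both partition counts set to `cores`."""
    return {
        "spark.default.parallelism": str(cores),
        "spark.sql.shuffle.partitions": str(cores),
    }


def spark_submit_args(conf: dict) -> list:
    """Render settings as repeated `--conf key=value` arguments for spark-submit."""
    args = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args


args = spark_submit_args(small_job_conf(12))
print(" ".join(args))

# Average records per shuffle partition for the 10K-record run.
records_per_partition = {p: 10_000 / p for p in (12, 200)}
print(records_per_partition)
```

    At the default of 200 shuffle partitions each one holds about 50 records, which is one plausible reason the untuned DataFrame run is slower than the RDD run at this size.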
Profile: 10K
    github.com/jvm-profiling-tools/async-profiler
Profile: 10K
    github.com/jvm-profiling-tools/async-profiler
Profile: 1M
    github.com/jvm-profiling-tools/async-profiler
Partitioning
    Executor 1, Executor 2
    Records shown: 2 Taylor, Mary; 5 Taylor, M; 3 Taylor, Michael; 1 Trần, Minh; 1 Trần, Minh; 5 Taylor, M; 4 Tran, Minh
    (records 1 and 5 appear twice in the diagram)
Wrapping up
Summary
    Entity Resolution: matching records
    PSig: microblocking, scoring
    Spark: DataFrames, profile, partition
Future Work
    Incremental & dynamic PSig
    Better analytics for Linked Record Graph
    Deep learning for signatures and scoring
    Graph information for graph datasets
StellarGraph
    Scalable network graph analytics. github.com/stellargraph
    Booth here! stellargraph.io/careers
    Kevin Jung: Bitcoin Ransomware Detection with Scalable Graph Machine Learning (Tomorrow 10:50)
    Pantelis Elinas: Practical Geometric Deep Learning in Python (Tomorrow 3:40)
THANK YOU
    Graph Analytics Engineering
    Huon Wilson, Senior Engineer
    e Huon.Wilson@data61.csiro.au
    w stellargraph.io
    www.data61.csiro.au
PSig F1
    Y. Zhang, K. S. Ng, et al., "Scalable Entity Resolution Using Probabilistic Signatures on Parallel Databases," arxiv.org/abs/1712.09691, Dec. 2017.
PSig Scaling
    400M records, 470 cores, 3 T RAM: 28 minutes
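    To put the scaling figure in context, the sketch below works out how many comparisons the naive all-pairs approach would need for 400M records and the average record throughput implied by a 28-minute run. It uses only the numbers on this slide; the variable names are illustrative.

```python
records = 400_000_000
cores = 470
runtime_seconds = 28 * 60

# All-pairs comparison count: n(n - 1)/2.
naive_comparisons = records * (records - 1) // 2

# Average throughput of the reported run, overall and per core.
records_per_second = records / runtime_seconds
records_per_second_per_core = records_per_second / cores

print(f"naive comparisons: {naive_comparisons:.2e}")
print(f"records/s: {records_per_second:,.0f}")
print(f"records/s/core: {records_per_second_per_core:,.0f}")
```

    The naive approach would need roughly 8 × 10^16 comparisons at this size, which is why restricting comparisons to records that share a signature matters.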
Artificial Intelligence: Australia’s Ethics Framework
    consult.industry.gov.au
Profile: 10K DataFrame
Profile: 10K RDD
EdgePartition2D