Entity Resolution at Scale
The PSig algorithm
Huon Wilson | StellarGraph
www.data61.csiro.au

Outline
  • What is Entity Resolution?
  • What is PSig?
  • PSig on Apache Spark

What is Entity Resolution?

Matching Records
  Image: https://thenounproject.com/search/?q=Database&i=1276526

Deduplicating Records
  Image: https://thenounproject.com/search/?q=Database&i=1276526

Irony: one problem, many names
  • Entity Resolution
  • Record Linkage
  • Duplicate Detection
  • Deduplication
  • Reference Reconciliation
  • Household Matching
  Source: www.cs.umd.edu/~getoor/Tutorials/ER_KDD2013.pdf

Dataset
  Name              Address                                     Phone
  Trần, Minh        66 Cambridge Street SILVERDALE NSW 2752     (02) 4798 1225
  Taylor, Mary      66 Ageston Rd MUTDAPILLY Queensland 4307    45514614
  Taylor, Michael   66 Settlement Road 3854 Australia           -
  Tran, Minh        -                                           +61247981225
  Taylor, M         66 Ageston Road MUTAPILY QLD 4307           -
  Problems illustrated: missing values, corruptions/typos, abbreviations

Naive: Quadratic
  Name              Address                                     Phone
  Trần, Minh        66 Cambridge Street SILVERDALE NSW 2752     (02) 4798 1225
  Taylor, Mary      66 Ageston Rd MUTDAPILLY Queensland 4307    45514614
  Taylor, Michael   66 Settlement Road 3854 Australia           -
  Tran, Minh        -                                           +61247981225
  Taylor, M         66 Ageston Road MUTAPILY QLD 4307           -
  Every pair of records is scored (scores shown: 0.1, 0.2, 0.6, 0.2).
  Comparisons: (n - 1) + (n - 2) + ⋯ + 2 + 1 = n(n - 1)/2 = O(n²)
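The quadratic growth on this slide can be checked with a short sketch. It enumerates every unordered pair of the five example records and compares the count with the closed form n(n - 1)/2. The function names are illustrative only and are not from the talk.

```python
from itertools import combinations

# The five example records from the slides (names only, for brevity).
records = {
    1: "Trần, Minh",
    2: "Taylor, Mary",
    3: "Taylor, Michael",
    4: "Tran, Minh",
    5: "Taylor, M",
}


def naive_pairs(ids):
    """Every unordered pair of record IDs: the naive all-pairs comparison."""
    return list(combinations(sorted(ids), 2))


def pair_count(n):
    """Closed form of (n - 1) + (n - 2) + ... + 2 + 1."""
    return n * (n - 1) // 2


pairs = naive_pairs(records)
print(len(pairs), pair_count(len(records)))
print(pair_count(10_000_000))
```

Five records need 10 comparisons; 10 million records would need 49,999,995,000,000, which is why the naive approach does not scale.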

What is PSig?

PSig - Overview
  Original Data → Microblocking → Scorable Links → Scoring/Verification → Linked Record Graph
  Y. Zhang, K. S. Ng, et al., “Scalable Entity Resolution Using Probabilistic Signatures on Parallel Databases,” arxiv.org/abs/1712.09691, Dec. 2017.

Microblocking - Candidate Signatures
  ID   Name              Address
  1    Trần, Minh        66 Cambridge Street SILVERDALE NSW 2752
  2    Taylor, Mary      66 Ageston Rd MUTDAPILLY Queensland 4307
  3    Taylor, Michael   66 Heyfield Road 3854 Australia
  4    Tran, Minh        -
  5    Taylor, M         66 Ageston Road MUTAPILY QLD 4307

  Candidate signatures pair a name feature with an address token.
  Taylor record at 66 Ageston:
    Initials + Address:    MT 66, MT Ageston
    Last name + Address:   Taylor 66, Taylor Ageston
  Taylor record at 66 Heyfield Road:
    Initials + Address:    MT 66, MT Heyfield
    Last name + Address:   Taylor 66, Taylor Heyfield
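A minimal sketch of how such candidate signatures could be generated follows. It rests on assumptions that are not in the slides: names are written "Last, Given", only the first two address tokens (house number and street name) are used, and the function name `candidate_signatures` is hypothetical. The talk does not state how PSig itself tokenises fields.

```python
def candidate_signatures(name, address):
    """Pair a name feature (initials or last name) with an address token.

    Assumptions for illustration only: names look like "Last, Given" and
    just the first two address tokens are used.
    """
    if not name or not address:
        return set()  # a missing field yields no signature of this kind
    last, _, given = (part.strip() for part in name.partition(","))
    initials = (given[:1] + last[:1]).upper()  # "Taylor, Mary" -> "MT"
    tokens = address.split()[:2]               # e.g. ["66", "Ageston"]
    return {f"{key} {token}" for key in (initials, last) for token in tokens}


mary = candidate_signatures("Taylor, Mary", "66 Ageston Rd MUTDAPILLY Queensland 4307")
michael = candidate_signatures("Taylor, Michael", "66 Heyfield Road 3854 Australia")
print(sorted(mary))
print(sorted(michael))
```

Under these assumptions the two Taylor records reproduce the signatures listed on the slide, and a record with no address (such as record 4) produces none of this kind.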

Microblocking - Reverse Index
  (Same ID / Name / Address table as on the Candidate Signatures slide.)
  Each candidate signature becomes a key in the reverse index:
    MT 66
    MT Ageston
    . . .

Microblocking - Signatures
  (Same ID / Name / Address table as on the Candidate Signatures slide.)
  Signature        IDs
  MT 66            { 1, 2, 3, 5 }
  MT Ageston       { 2, 5 }
  Taylor 66        { 2, 3, 5 }
  Taylor Ageston   { 2, 5 }
  . . .
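The reverse index on these slides maps each signature to the IDs of the records that produced it. A small sketch, assuming the per-record signature sets below (reconstructed from the signature table on the slide; only the signatures visible there are listed, and the helper name is hypothetical):

```python
from collections import defaultdict

# Signatures per record ID, reconstructed from the slide's signature table.
# Only signatures that are visible on the slides are included.
signatures_by_record = {
    1: {"MT 66"},
    2: {"MT 66", "MT Ageston", "Taylor 66", "Taylor Ageston"},
    3: {"MT 66", "MT Heyfield", "Taylor 66", "Taylor Heyfield"},
    5: {"MT 66", "MT Ageston", "Taylor 66", "Taylor Ageston"},
}


def build_reverse_index(signatures_by_record):
    """Invert record -> signatures into signature -> set of record IDs."""
    index = defaultdict(set)
    for record_id, signatures in signatures_by_record.items():
        for signature in signatures:
            index[signature].add(record_id)
    return dict(index)


reverse_index = build_reverse_index(signatures_by_record)
for signature in ("MT 66", "MT Ageston", "Taylor 66", "Taylor Ageston"):
    print(signature, sorted(reverse_index[signature]))
```

Each entry of the index is a small block of records that share a signature, which is what the scoring step works on.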

Scoring/Verification - Comparisons
  ID   Name
  1    Trần, Minh
  2    Taylor, Mary
  3    Taylor, Michael
  4    Tran, Minh
  5    Taylor, M

  Signature        IDs
  MT Ageston       { 2, 5 }
  Taylor 66        { 2, 3, 5 }
  Taylor Ageston   { 2, 5 }
  . . .
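Comparisons are only made between records that share a signature. A sketch of that step, using the three signature blocks shown on this slide (the function name is illustrative): pairs are generated within each block and deduplicated, so a pair that shares several signatures is only compared once.

```python
from itertools import combinations

# Signature blocks as shown on the slide.
blocks = {
    "MT Ageston": {2, 5},
    "Taylor 66": {2, 3, 5},
    "Taylor Ageston": {2, 5},
}


def candidate_pairs(blocks):
    """Unordered record pairs that co-occur in at least one block."""
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


pairs = candidate_pairs(blocks)
print(sorted(pairs))
```

For these blocks only three pairs need scoring, (2, 3), (2, 5) and (3, 5), rather than the ten pairs of the naive approach over five records.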

Scoring/Verification - Linked Records Graph
  Signature        IDs
  MT Ageston       { 2, 5 }
  Taylor 66        { 2, 3, 5 }
  Taylor Ageston   { 2, 5 }
  . . .
  Tran...          { 1, 4 }
  . . .

  Scored links:
    2 Taylor, Mary and 5 Taylor, M: score = 0.9
    3 Taylor, Michael and records 2, 5: score = 0.3, score = 0.4
    1 Trần, Minh and 4 Tran, Minh: score = 0.6
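One way to turn scored links into groups of matching records is to keep links above a threshold and take connected components. This is a sketch, not something the slides describe: the 0.5 threshold is hypothetical, the assignment of 0.3 and 0.4 to the two links of record 3 is assumed, and grouping by connected components is a swapped-in illustration rather than PSig's stated method.

```python
def link_records(scored_pairs, threshold):
    """Group records connected by links scoring at least `threshold`."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b, score in scored_pairs:
        root_a, root_b = find(a), find(b)  # also registers both records
        if score >= threshold and root_a != root_b:
            parent[root_b] = root_a

    groups = {}
    for record_id in parent:
        groups.setdefault(find(record_id), set()).add(record_id)
    return sorted(sorted(group) for group in groups.values())


# Scores from the slide; which of 0.3 / 0.4 belongs to which link is assumed.
scored_links = [(2, 5, 0.9), (2, 3, 0.3), (3, 5, 0.4), (1, 4, 0.6)]
groups = link_records(scored_links, threshold=0.5)
print(groups)
```

With this threshold, records 2 and 5 form one entity, records 1 and 4 another, and record 3 stays on its own.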

PSig on Apache Spark

DataFrames > RDDs (for PSig)
  Records   RDD (s)   DataFrame (s)   Speed-up
  10M       668       164             4.1
  50M       27900+    394             71+
  Hardware: 70 cores, 455G memory
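The speed-up column is the RDD time divided by the DataFrame time. A short check of that arithmetic, treating the "27900+" entry as a lower bound (so the second speed-up is also a lower bound):

```python
# (records, RDD seconds, DataFrame seconds) from the table above.
timings = [
    ("10M", 668, 164),
    ("50M", 27900, 394),  # slide shows "27900+"; treated as a lower bound
]


def speed_up(rdd_seconds, dataframe_seconds):
    """How many times faster the DataFrame run is than the RDD run."""
    return rdd_seconds / dataframe_seconds


for records, rdd_s, df_s in timings:
    print(records, round(speed_up(rdd_s, df_s), 1))
```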

DataFrames > RDDs
  • Table representation
  • More compact
  • Catalyst query optimiser
  • Generates specialized code

Small DataFrames are Deceiving
  Algorithm   Partitions      Time (s)
  RDD         Default (12)    17
  DataFrame   Default (200)   80
  DataFrame   12              20
  10K records; 12 cores, 4G memory
  Settings: spark.default.parallelism=12, spark.sql.shuffle.partitions=12
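The two settings on this slide can be passed to a job with `spark-submit --conf key=value`. The sketch below only assembles such a command line; the script name `psig_job.py` and the helper function are hypothetical, and the value 12 is the one used on the slide for a 12-core machine, not a general recommendation.

```python
def spark_submit_args(app, conf):
    """Build a spark-submit argument list with one --conf flag per setting."""
    args = ["spark-submit"]
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args + [app]


# Settings from the slide; the application name is a placeholder.
small_data_conf = {
    "spark.default.parallelism": 12,
    "spark.sql.shuffle.partitions": 12,
}
command = spark_submit_args("psig_job.py", small_data_conf)
print(" ".join(command))
```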

Profile: 10K
  [Profile image]
  Profiler: github.com/jvm-profiling-tools/async-profiler

Profile: 10K
  [Profile image]
  Profiler: github.com/jvm-profiling-tools/async-profiler

Profile: 1M
  [Profile image]
  Profiler: github.com/jvm-profiling-tools/async-profiler

Partitioning
  [Diagram: linked records distributed across Executor 1 and Executor 2.
   Records shown: 2 Taylor, Mary; 5 Taylor, M; 3 Taylor, Michael; 1 Trần, Minh; 4 Tran, Minh.
   Records 1 Trần, Minh and 5 Taylor, M appear twice.]

Wrapping up

Summary
  • Entity Resolution: matching records
  • PSig: microblocking, scoring
  • Spark: DataFrames, profile, partition

Future Work
  • Incremental & dynamic PSig
  • Better analytics for Linked Record Graph
  • Deep learning for signatures and scoring
  • Graph information for graph datasets

StellarGraph
  Scalable network graph analytics.
  github.com/stellargraph
  Booth here!
  stellargraph.io/careers
  Related talks:
  • Kevin Jung, Bitcoin Ransomware Detection with Scalable Graph Machine Learning (Tomorrow 10:50)
  • Pantelis Elinas, Practical Geometric Deep Learning in Python (Tomorrow 3:40)

THANK YOU
  Graph Analytics Engineering
  Huon Wilson, Senior Engineer
  e  Huon.Wilson@data61.csiro.au
  w  stellargraph.io
  www.data61.csiro.au

PSig F1
  [Figure]
  Y. Zhang, K. S. Ng, et al., “Scalable Entity Resolution Using Probabilistic Signatures on Parallel Databases,” arxiv.org/abs/1712.09691, Dec. 2017.

PSig Scaling
  400M records, 470 cores, 3T RAM: 28 minutes
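To put that figure in perspective, the throughput it implies can be worked out directly from the numbers on the slide. This is plain arithmetic; the function name is illustrative and an even split of work across cores is an assumption.

```python
def throughput(records, minutes, cores):
    """Records per second overall and per core, assuming an even split."""
    per_second = records / (minutes * 60)
    return per_second, per_second / cores


per_second, per_core_second = throughput(400_000_000, 28, 470)
print(round(per_second), round(per_core_second))
```

That is roughly 238,000 records per second across the cluster, or about 500 records per second per core.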

Artificial Intelligence: Australia’s Ethics Framework
  consult.industry.gov.au

Profile: 10K DataFrame
  [Profile image]

Profile: 10K RDD
  [Profile image]

EdgePartition2D
