Self-tuning in Graph-Based Reference Disambiguation Rabia Nuray-Turan Dmitri V. Kalashnikov Sharad Mehrotra Computer Science Department University of California, Irvine DASFAA 2007, Bangkok, Thailand
Overview • Intro to Data Cleaning – Entity resolution • RelDC Framework – Past work • Adapting to data – The new part – Reduction to an optimization problem – Linear programming • Experiments
Data Cleaning • Analysis on bad data leads to wrong conclusions
Example of the problem: CiteSeer top-k • CiteSeer: the top-k most cited authors • Suspicious entries – Let's go to the DBLP website, which stores bibliographic entries of many CS authors – Let's check two people: “A. Gupta” and “L. Zhang”
Two Most Common Entity-Resolution Challenges • Fuzzy lookup – reference disambiguation – match references to objects – list of all objects is given • Fuzzy grouping – group together object representations that correspond to the same object
Standard Approach to Entity Resolution
Overview • Intro to Data Cleaning ➢ RelDC Framework – Past work • Adapting to data – The new part – Reduction to an optimization problem – Linear programming • Experiments
RelDC Framework
RelDC Framework • Past work – SDM ’05, TODS ’06 • Domain-independent framework – Views the dataset as an Entity-Relationship Graph – Analyzes paths in this graph • Solid theoretic foundation – Optimization problem • Scales to large datasets • Robust under uncertainty • High disambiguation quality • No self-tuning – This paper solves this challenge
Entity-Relationship Graph • Choice node – For uncertain references – To encode options/possibilities yr1, …, yrN • Among options yr1, …, yrN – Pick the most strongly connected one – CAP principle – Analyze paths in G that exist between xr and yrj, for all j – Use a model to measure connection strength • “Connection strength” model – c(u, v), for nodes u and v in G – how strongly u and v are connected in G – RandomWalk-based – Fixed – Based on intuition! – This paper, instead, learns such a model from data
Overview • Intro to Data Cleaning • RelDC Framework – Past work ➢ Adapting to data – The new part – Reduction to an optimization problem – Linear programming • Experiments
Adaptive Solution • Classify the paths found in the graph into a finite set of path types ST = {T1, T2, …, TN} • If paths p1 and p2 are of the same type, they are treated as identical • The connection between nodes u and v can then be represented by a path-type count vector Tuv = {c1, c2, …, cN} • If each path type Ti is associated with a weight wi, the connection strength will be: c(u, v) = c1·w1 + c2·w2 + … + cN·wN
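The following is a minimal sketch of how a path-type count vector and a weight vector could be combined, assuming the connection strength is the weighted sum of the counts. The function names, the option labels, and all numbers are hypothetical and chosen only for illustration; they do not come from the slides.

```python
from typing import Dict, Sequence


def connection_strength(counts: Sequence[int], weights: Sequence[float]) -> float:
    """Weighted sum of path-type counts: c1*w1 + c2*w2 + ... + cN*wN."""
    if len(counts) != len(weights):
        raise ValueError("counts and weights must have the same length")
    return sum(c * w for c, w in zip(counts, weights))


def pick_option(option_counts: Dict[str, Sequence[int]],
                weights: Sequence[float]) -> str:
    """Return the option whose count vector gives the largest connection strength."""
    return max(option_counts,
               key=lambda y: connection_strength(option_counts[y], weights))


# Hypothetical weights for three path types and two candidate options.
weights = [1.0, 0.5, 0.0]
options = {"yr1": [2, 0, 3], "yr2": [0, 1, 5]}
best = pick_option(options, weights)
```

Here `pick_option` mirrors the idea of choosing the most strongly connected option: with these illustrative numbers, the third path type has weight 0, so the many type-3 paths to `yr2` do not help it.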
Problems to Answer • How will we classify the paths? • How will we associate each path type with a weight?
Classifying Paths • Path Type Model (PTM): – Views each path as a sequence of edges <e1, e2, e3, …, en> – Each edge ei has a type Ei associated with it – Thus, each path p can be associated with a string <E1, E2, E3, …, En> – Different strings correspond to different path types – Associate a weight with each string • Different models are also possible
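As a rough illustration of the Path Type Model, the sketch below maps each path (a sequence of edge identifiers) to the tuple of its edge types and counts how many paths fall into each type. The edge identifiers, the type labels, and the helper names are assumptions made for this example only.

```python
from collections import Counter
from typing import Dict, Iterable, Sequence, Tuple


def path_type(path_edges: Sequence[str], edge_type: Dict[str, str]) -> Tuple[str, ...]:
    """Map a path <e1, ..., en> to its type string <E1, ..., En>."""
    return tuple(edge_type[e] for e in path_edges)


def count_path_types(paths: Iterable[Sequence[str]],
                     edge_type: Dict[str, str]) -> Counter:
    """Count the paths per path type; paths of the same type are treated as identical."""
    return Counter(path_type(p, edge_type) for p in paths)


# Hypothetical edges and edge types.
edge_type = {"e1": "writes", "e2": "writes", "e3": "affiliated", "e4": "cites"}
paths = [["e1", "e3"], ["e2", "e3"], ["e1", "e4", "e2"]]
counts = count_path_types(paths, edge_type)
```

The first two paths use different edges but share the type string `("writes", "affiliated")`, so they land in the same bucket and would receive the same weight.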
Learning Path Weights: Optimization Problem • CAP principle states that: – the right option will be better connected • Linear programming • Learn the path-type weights w
Final Solution • The value of c(xr, yrj) − c(xr, yrl) should be maximized for all r, l ≠ j • Then final solution:
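The sketch below is a deliberately simplified stand-in, not the formulation from the slides. It assumes the connection strength is a weighted sum of path-type counts, that the objective is the plain sum of the differences c(xr, yrj) − c(xr, yrl) over the training references, and that the only constraints are 0 ≤ wi ≤ 1. Under those assumptions the linear program separates per weight, so no solver is needed. All names and numbers are hypothetical.

```python
from typing import List, Sequence, Tuple

# One training reference: (counts for the right option, counts for each wrong option).
Example = Tuple[Sequence[int], Sequence[Sequence[int]]]


def objective_coefficients(training: Sequence[Example]) -> List[int]:
    """Coefficient of each wi in the summed differences over all (right, wrong) pairs."""
    n = len(training[0][0])
    coef = [0] * n
    for right, wrongs in training:
        for wrong in wrongs:
            for i in range(n):
                coef[i] += right[i] - wrong[i]
    return coef


def box_optimal_weights(coef: Sequence[int], free_value: float = 0.5) -> List[float]:
    """Maximize sum(coef[i] * w[i]) subject only to 0 <= w[i] <= 1.

    A positive coefficient pushes wi to 1, a negative one to 0; a zero
    coefficient leaves wi unconstrained, so free_value is an arbitrary pick.
    """
    return [1.0 if a > 0 else 0.0 if a < 0 else free_value for a in coef]


# Hypothetical training data with four path types.
training = [
    ([2, 1, 0, 0], [[0, 1, 1, 2]]),
    ([1, 0, 0, 1], [[0, 0, 2, 1], [1, 0, 1, 2]]),
]
coef = objective_coefficients(training)
weights = box_optimal_weights(coef)
```

With additional constraints (for example, slack variables or per-reference inequalities) the problem no longer separates this way and a general LP solver would be required.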
Example - Graph • P1 = e1-e3-e1 • P2 = e1-e3 • P3 = e1-e2-e3 • P4 = e1-e2-e3
Example - Solution • w1 = 1 • w3 = w4 = 0 • w2 can be anything between 0 and 1
Overview • Intro to Data Cleaning • RelDC Framework – Past work • Adapting to data – The new part – Reduction to an optimization problem – Linear programming ➢ Experiments
Experimental Setup • RealMov: – movies (12K) – people (22K): actors, directors, producers – studios (1K): producing, distributing • SynPub datasets: – many datasets of five different types – emulation of RealPub – publications (5K) – authors (1K) – organizations (25K) – departments (125K) – ground truth is known • Parameters – When looking for L-short simple paths, L = 5 – L is the path-length limit
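To make the "L-short simple paths" parameter concrete, here is a small sketch that enumerates all simple paths (no repeated node) of at most L edges between two nodes using an iterative depth-first search. The adjacency-list representation, the function name, and the toy graph are assumptions for illustration; the slides do not specify how the paths are actually discovered.

```python
from typing import Dict, Iterator, List, Sequence


def simple_paths(graph: Dict[str, Sequence[str]], source: str, target: str,
                 max_len: int) -> Iterator[List[str]]:
    """Yield every simple path from source to target with at most max_len edges."""
    stack = [(source, [source])]
    while stack:
        node, path = stack.pop()
        if node == target:
            yield path
            continue
        if len(path) - 1 == max_len:
            # The path-length limit L is reached without hitting the target.
            continue
        for nxt in graph.get(node, ()):
            if nxt not in path:  # keep the path simple
                stack.append((nxt, path + [nxt]))


# Hypothetical undirected toy graph, stored as adjacency lists.
graph = {
    "x": ["a", "b"],
    "a": ["x", "y", "b"],
    "b": ["x", "a", "y"],
    "y": ["a", "b"],
}
short_paths = sorted(simple_paths(graph, "x", "y", 2))
all_paths = list(simple_paths(graph, "x", "y", 5))
```

Raising the limit from 2 to 5 adds the two three-edge detours through both intermediate nodes; on large graphs the number of paths grows quickly with L, which is why a limit is needed.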
Experimental Results on Movies • Parameters: – Fraction: fraction of uncertain references in the dataset – Each reference has 2 choices
Experimental Results on Movies - II • Number of options based on PMF distribution
Experimental Results on SynPub • Hybrid Model: • RandomWalk, PTM, and the Hybrid Model have the same accuracy • Is RandomWalk the optimum model for the Publications domain?
Effect of Random Relationships in the Publications Domain
Summary • Main contribution – An adaptive solution for connection strength – Model learns the weights of different path types • Ongoing work – Using different models to learn the importance of paths in the connection strength – Using standard machine learning techniques, such as decision trees, for learning – Different ways to classify paths
Contact Information • RelDC project – www.ics.uci.edu/~dvk/RelDC – www.itr-rescue.org (RESCUE) • Rabia Nuray-Turan (contact author) – www.ics.uci.edu/~rnuray • Dmitri V. Kalashnikov – www.ics.uci.edu/~dvk • Sharad Mehrotra – www.ics.uci.edu/~sharad
Thank you!