An Efficient Trie-based Method for Approximate Entity Extraction
An Efficient Trie-based Method for Approximate Entity Extraction with Edit-Distance Constraints
Dong Deng (Tsinghua, China), Guoliang Li (Tsinghua, China), Jianhua Feng (Tsinghua, China)
Outline
• Motivation
• Problem Formulation
• Trie-based Framework
• Trie-based Algorithms
• Optimizing Partition Scheme
• Experiment
• Conclusion
9/19/2021 Taste @ ICDE 2012
Named Entity Recognition
• Dictionary-based Entity Extraction
Dictionary of Entities: Isaac Newton, Sigmund Freud, English, Austrian, physicist, mathematician, astronomer, philosopher, alchemist, theologian, psychiatrist, economist, historian, sociologist, ...
Documents:
1. Sir Isaac Newton was an English physicist, mathematician, astronomer, natural philosopher, alchemist, and theologian and one of the most influential men in human history. His Philosophiæ Naturalis Principia Mathematica, published in 1687, is by itself considered to be among the most influential books in the history of science, laying the groundwork for most of classical mechanics.
2. Sigmund Freud was an Austrian psychiatrist who founded the psychoanalytic school of psychology. Freud is best known for his theories of the unconscious mind and the defense mechanism of repression and for creating the clinical practice of psychoanalysis for curing psychopathology through dialogue between a patient and a psychoanalyst.
Automatically add the links
• Wikipedia: http://en.wikipedia.org/wiki/Levenshtein_distance
Real-world Data is Rather Dirty!
DBLP Complete Search
• Typo in “author”: Argyrios Zymnis / Argyris Zymnis
• Typo in “title”: relaxed / related
Approximate Entity Extraction
• Approximate dictionary-based entity extraction finds all substrings from the document that approximately match the predefined entities.
• For example:
Dictionary of Entities: Isaac Newton, Sigmund Freud, physicist, astronomer, alchemist, theologian, economist, sociologist, ...
Document: Sigmund Freund was an Austrian psychiatrest who founded the psychoanalytic school of psychology. Freud is best known for his theories of the unconscious mind and the defense mechanism of repression and for creating the clinical practice of psychoanalysis for curing psychopathology through dialogue between a patient and a psychoanalayst.
Edit Distance
• ED(r, s): the minimum number of single-character edit operations (insertion, deletion, substitution) needed to transform r into s.
• For example: ED(marios, maras) = 2 (substitute a -> i, insert o).
Recurrence:
$D_{i,0} = i$, $D_{0,j} = j$,
$D_{i,j} = \min\{D_{i-1,j} + 1,\ D_{i,j-1} + 1,\ D_{i-1,j-1} + t_{i,j}\}$,
where $t_{i,j} = 0$ if $a_i = b_j$ and $t_{i,j} = 1$ if $a_i \neq b_j$.
The bottom-right entry of the matrix D gives ED(r, s).
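The recurrence above can be turned into a short dynamic program. The following is a minimal sketch, not the authors' code; the function name `edit_distance` is chosen here for illustration, and only two rows of the matrix are kept at a time.

```python
def edit_distance(r: str, s: str) -> int:
    """Edit distance computed with the recurrence D[i][j] from the slide."""
    n = len(s)
    prev = list(range(n + 1))              # row 0: D[0][j] = j
    for i in range(1, len(r) + 1):
        cur = [i] + [0] * n                # column 0: D[i][0] = i
        for j in range(1, n + 1):
            t = 0 if r[i - 1] == s[j - 1] else 1
            cur[j] = min(prev[j] + 1,      # D[i-1][j] + 1
                         cur[j - 1] + 1,   # D[i][j-1] + 1
                         prev[j - 1] + t)  # D[i-1][j-1] + t
        prev = cur
    return prev[n]


print(edit_distance("marios", "maras"))  # the slide's example, ED = 2
```

Keeping two rows instead of the full matrix is a design choice of this sketch; it is enough when only the distance is needed and not the edit script.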
Problem Formulation
• Approximate Entity Extraction: Given a dictionary of entities E = {e_1, e_2, ..., e_n}, a document D, and a predefined edit-distance threshold τ, approximate entity extraction finds all “similar” pairs <s, e_i> such that ED(s, e_i) ≤ τ, where s is a substring of D and e_i ∈ E.
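To make the problem statement concrete, here is a brute-force sketch that checks every substring of a plausible length against every entity. It only illustrates the definition; it is not the trie-based method of the talk, and the names `extract_naive` and `edit_distance` are introduced here for illustration.

```python
def edit_distance(r, s):
    """Standard edit-distance dynamic program (two rows)."""
    prev = list(range(len(s) + 1))
    for i, a in enumerate(r, 1):
        cur = [i]
        for j, b in enumerate(s, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (a != b)))
        prev = cur
    return prev[-1]


def extract_naive(entities, document, tau):
    """Return all (start, substring, entity) with ED(substring, entity) <= tau."""
    results = []
    for e in entities:
        # A similar substring can differ in length from e by at most tau.
        lo, hi = max(1, len(e) - tau), len(e) + tau
        for start in range(len(document)):
            for length in range(lo, hi + 1):
                if start + length > len(document):
                    break
                s = document[start:start + length]
                if edit_distance(s, e) <= tau:
                    results.append((start, s, e))
    return results


for hit in extract_naive(["freud"], "sigmund freund", 1):
    print(hit)
```

The cost of this approach grows with the number of entities times the number of candidate substrings, which is the overhead the filtering techniques in the rest of the talk aim to avoid.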
State-of-the-art Methods
Method | Inverted Index | Filtering Condition
NGPP | Prefix of 1-variant family | A substring matches the prefix of a 1-variant of a partition
Faerie | q-grams | Overlap must be larger than a threshold
• Shortcomings:
  • Large index size
  • Need to tune parameters
  • Inefficient for large thresholds
Observation
• Set τ = 2
substring (from the document): voncouver
entity: vankatesh
Split the entity into τ + 1 = 3 segments.
Observation
• Set τ = 2
substring: voncouver
entity: vankatesh
None of the three entity segments appears in the substring, so each segment needs >= 1 edit operation, for a total of >= τ + 1 = 3 edit operations: NOT SIMILAR.
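The observation amounts to a pigeonhole filter: if a substring is within τ edits of an entity, at least one of the entity's τ + 1 segments must appear in it unchanged. A minimal sketch follows; how leftover characters are distributed when the length is not divisible by τ + 1 is an assumption of this sketch (the last segments get the extra characters), and the function names are hypothetical.

```python
def even_partition(entity, tau):
    """Split entity into tau + 1 segments of (nearly) equal length."""
    k = tau + 1
    base, extra = divmod(len(entity), k)
    segments, pos = [], 0
    for i in range(k):
        # Assumption: the last `extra` segments are one character longer.
        length = base + (1 if i >= k - extra else 0)
        segments.append(entity[pos:pos + length])
        pos += length
    return segments


def may_be_similar(substring, entity, tau):
    """Filter only: False means ED > tau for sure, True means 'verify'."""
    return any(seg in substring for seg in even_partition(entity, tau))


print(even_partition("vankatesh", 2))               # ['van', 'kat', 'esh']
print(may_be_similar("voncouver", "vankatesh", 2))  # False
```

A `True` result is not a match; it only means the pair survives the filter and still needs an edit-distance check.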
Trie-based Framework
• Step 1: Partition entities into segments using the even partition scheme.
Trie-based Framework
• Step 2: Index the segments using a trie structure.
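Steps 1 and 2 can be sketched as follows: each entity is cut into τ + 1 segments and every segment is inserted into a trie whose nodes remember which entity and which segment position end there. This is a simplified illustration, not the authors' data structure; `TrieNode`, `build_segment_trie`, and `matches_at` are hypothetical names, and entity ids are simply positions in the input list.

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # char -> TrieNode
        self.entries = []    # (entity_id, segment_index) of segments ending here


def even_partition(entity, tau):
    """Split entity into tau + 1 near-equal segments (remainder goes last)."""
    k = tau + 1
    base, extra = divmod(len(entity), k)
    segments, pos = [], 0
    for i in range(k):
        length = base + (1 if i >= k - extra else 0)
        segments.append(entity[pos:pos + length])
        pos += length
    return segments


def build_segment_trie(entities, tau):
    root = TrieNode()
    for eid, entity in enumerate(entities):
        for idx, seg in enumerate(even_partition(entity, tau)):
            node = root
            for ch in seg:
                node = node.children.setdefault(ch, TrieNode())
            node.entries.append((eid, idx))
    return root


def matches_at(root, document, pos):
    """All indexed segments that occur in document starting at pos,
    as (end_position, entity_id, segment_index) triples."""
    node, found = root, []
    for i in range(pos, len(document)):
        node = node.children.get(document[i])
        if node is None:
            break
        found.extend((i + 1, eid, idx) for eid, idx in node.entries)
    return found


entities = ["surajit chaudri", "caushit chaudui", "caushit chakrab"]
trie = build_segment_trie(entities, tau=2)
print(matches_at(trie, "kaushit chekrabarti", 5))
```

In the example, all three entities share the middle segment "it ch", so a single trie path reports three candidates at once.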
Trie-based Framework
• Step 3: From the document, identify substrings that are similar to an entity in the dictionary using the trie structure.
• Baseline: Trie-search Method
Trie-search Method
• 3.1 Enumerate all valid substrings, i.e. substrings whose length Len satisfies Lmin - τ ≤ Len ≤ Lmax + τ. Here Lmin - τ = 9 - 2 = 7 and Lmax + τ = 15 + 2 = 17.
Document: kaushit chekrabarti, surajit chaudhuri, vankatesh ganti,
Len = 7, 8, ..., 17
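The enumeration in step 3.1 can be sketched as a generator over all start positions and all admissible lengths. The name `valid_substrings` and its parameters are illustrative only.

```python
def valid_substrings(document, l_min, l_max, tau):
    """Yield (start, substring) for every substring whose length lies in
    [l_min - tau, l_max + tau], where l_min and l_max are the shortest and
    longest entity lengths in the dictionary."""
    lo = max(1, l_min - tau)
    hi = l_max + tau
    for length in range(lo, hi + 1):
        for start in range(len(document) - length + 1):
            yield start, document[start:start + length]


doc = "kaushit chekrabarti, surajit chaudhuri, vankatesh ganti,"
subs = list(valid_substrings(doc, l_min=9, l_max=15, tau=2))
print(len(subs), subs[0])
```

The number of substrings grows with both the document length and the width of the length window, which is why the later slides avoid enumerating them explicitly.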
Trie-search Method
• 3.2 Look up each suffix of every substring in the trie structure to check whether it can reach a leaf.
chaudhuri: pruned
urajit chaud: candidate of entities 3, 4, 5
Trie-search Method
• 3.3 Verify the candidate pairs:
ED(surajit chaudri, urajit chaud) = 3
ED(caushit chaudui, urajit chaud) = 7
ED(caushit chakrab, surajit chaud) = 10
Trie-search Method
• Shortcoming: huge number of candidate pairs:
<surajit_chaudhuri, surajit_chaudri>
<surajit_chaudhur, surajit_chaudri>
...
Trie-based Algorithm
• Step 3: From the document, find the matched segments using the trie structure.
• To avoid duplicate computation, we do not enumerate all valid substrings.
• Trie-based Algorithm
Trie-based Algorithm
• 3.1 Search: search for matched segments.
Document: kaushit chekrabarti, surajit chaudhuri, vankatesh ganti,
Trie-based Algorithm
• 3.2 Extend the matched segments to find similar pairs.
D: kaushit chekrabarti, surajit chaudhuri, vankatesh ganti,
e3: surajit chaudri
e4: caushit chaudui
e5: caushit chakrab
[Figure, three animation steps: edit-distance matrices for extending a matched segment to the left and right against the document text "kaushit_chekr...". Annotations: "All larger than two, terminate!"; "Not larger than two, extend the right part!"; "All larger than two, prune!"; "Same as the left part of e4. Share prefix!"; "Not larger than two, get candidates!"]
Trie-based Algorithm
• 3.2 Extend the matched segments to find similar pairs.
Matched segment: it_ch
Left extensions (edit distance): ush (2), aush (1), kaush (1)
Right extensions (edit distance): ekra (2), ekrab (1), ekraba (2)
Trie-based Algorithm
• 3.2 Extend the matched segments to find similar pairs.
We get two result pairs:
<caushit chakrab, aushit_chekrab>
<caushit chakrab, kaushit_chekrab>
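The extension step can be sketched as follows: once a segment of the entity matches the document exactly, the entity's left part is compared with document text ending at the match, the right part with text starting after it, and pairs whose distances sum to at most τ are reported. This is a simplified illustration; it recomputes each distance from scratch and omits the early termination and prefix sharing described in the talk. All names are hypothetical, and the example uses a space where the slides print an underscore.

```python
def edit_distance(r, s):
    prev = list(range(len(s) + 1))
    for i, a in enumerate(r, 1):
        cur = [i]
        for j, b in enumerate(s, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (a != b)))
        prev = cur
    return prev[-1]


def extend_match(entity, seg_start, seg_end, document, d_start, d_end, tau):
    """entity[seg_start:seg_end] equals document[d_start:d_end].
    Return (substring, left_distance + right_distance) for all extensions
    whose distances sum to at most tau."""
    left, right = entity[:seg_start], entity[seg_end:]
    lefts, rights = {}, {}
    for n in range(max(0, len(left) - tau), len(left) + tau + 1):
        if n > d_start:
            break
        d = edit_distance(left, document[d_start - n:d_start])
        if d <= tau:
            lefts[n] = d
    for n in range(max(0, len(right) - tau), len(right) + tau + 1):
        if d_end + n > len(document):
            break
        d = edit_distance(right, document[d_end:d_end + n])
        if d <= tau:
            rights[n] = d
    return [(document[d_start - ln:d_end + rn], ld + rd)
            for ln, ld in lefts.items()
            for rn, rd in rights.items()
            if ld + rd <= tau]


doc = "kaushit chekrabarti"
print(extend_match("caushit chakrab", 5, 10, doc, 5, 10, tau=2))
```

With the slide's example the sketch reports the two substrings "aushit chekrab" and "kaushit chekrab", each with a summed distance of 2; the sum is an upper bound on the edit distance of the full pair.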
Trie-based Algorithm
• Two-level Trie
Observation
• Even Partition: vanateshe -> van | ate | she
Extend 5 times. C = 5
Document: an efficient filter for approximate membershep checking. kaushit chekrabarti, surajit chaudhuri, vankatesh ganti, dong xin. vancouver, canada. sigmod 2008.
Observation
• Uneven Partition: vanateshe -> vana | tes | he
Extend 2 times. C = 2
Document: an efficient filter for approximate membershep checking. kaushit chekrabarti, surajit chaudhuri, vankatesh ganti, dong xin. vancouver, canada. sigmod 2008.
Optimization Objective
Entity: c_1 c_2 c_3 c_4 ... c_{m-2} c_{m-1} c_m
Segments: g_1, g_2, ..., g_τ, g_{τ+1}
Appear times in the document: W_{g_1}, W_{g_2}, ..., W_{g_τ}, W_{g_{τ+1}}
Optimization objective: C = M[τ+1][m].
M[i][j]: the minimum total appear times of the i segments generated from c_1 c_2 ... c_{j-1} c_j.
Dynamic Programming
If the start position of the last partition is p, then: [equation not preserved in the slide text]
Iteratively, we have: [equation not preserved in the slide text]
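The recurrence itself did not survive in the slide text. One reading that is consistent with the definition of M[i][j] is M[i][j] = min over p of (M[i-1][p-1] + W of the segment c_p ... c_j), and the sketch below implements that reading. It is an assumption, not the authors' formula; it counts plain occurrences for W and ignores any further constraints the paper may impose on segments. It also assumes the entity has at least τ + 1 characters.

```python
def count_occurrences(segment, document):
    """Number of (possibly overlapping) occurrences of segment in document."""
    return sum(document.startswith(segment, i)
               for i in range(len(document) - len(segment) + 1))


def best_partition(entity, document, tau):
    """Cut entity into tau + 1 non-empty segments minimising the total
    number of occurrences of the segments in document."""
    m, k = len(entity), tau + 1
    INF = float("inf")
    # M[i][j]: minimum total weight when entity[:j] is cut into i segments.
    M = [[INF] * (m + 1) for _ in range(k + 1)]
    cut = [[0] * (m + 1) for _ in range(k + 1)]
    M[0][0] = 0
    for i in range(1, k + 1):
        for j in range(i, m + 1):
            for p in range(i, j + 1):        # last segment is c_p ... c_j
                cost = M[i - 1][p - 1] + count_occurrences(entity[p - 1:j], document)
                if cost < M[i][j]:
                    M[i][j], cut[i][j] = cost, p - 1
    segments, j = [], m
    for i in range(k, 0, -1):                # walk the cut positions back
        start = cut[i][j]
        segments.append(entity[start:j])
        j = start
    return M[k][m], segments[::-1]


print(best_partition("abcd", "ab ab cd", tau=1))  # toy input, not from the slides
```

On the toy input the cut "abc" | "d" wins because "abc" never occurs in the document and "d" occurs once.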
Determining W_g
• Build a suffix trie over the document.
For a segment g: the number of leaf nodes in the subtrie below g = the number of occurrences of g in the document = W_g.
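A minimal sketch of this idea: every suffix of the document is inserted into a trie, and each node counts how many suffixes pass through it, which equals the number of leaves below it when every suffix ends in its own leaf. The uncompressed trie takes quadratic space, so this is only suitable for short texts; the names `build_suffix_trie` and `weight` are illustrative.

```python
def build_suffix_trie(document):
    """Uncompressed suffix trie; node['count'] = number of suffixes that
    pass through the node, i.e. occurrences of the path string."""
    root = {"count": 0, "children": {}}
    for start in range(len(document)):
        node = root
        for ch in document[start:]:
            node = node["children"].setdefault(ch, {"count": 0, "children": {}})
            node["count"] += 1
    return root


def weight(trie, segment):
    """W_g: number of occurrences of segment g in the indexed document."""
    node = trie
    for ch in segment:
        node = node["children"].get(ch)
        if node is None:
            return 0
    return node["count"]


doc = ("an efficient filter for approximate membershep checking. "
       "kaushit chekrabarti, surajit chaudhuri, vankatesh ganti, dong xin. "
       "vancouver, canada. sigmod 2008.")
trie = build_suffix_trie(doc)
print([weight(trie, g) for g in ("van", "ate", "she")])
```

For the even partition van | ate | she of the slides, the three weights add up to 5, matching the C = 5 shown for that partition.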
Pruning
• Even vs. Dict+Doc Method
  • Even involves a larger candidate set
  • Dict+Doc counts the indexing time
• Make indexing more efficient
  • Using segment length to do pruning
  • Using even-partition weight as an upper bound
  • Adding extra pointers on the suffix trie
Experiment Setup
• Three real data sets
• Existing algorithms
  • NGPP (downloaded from its homepage)
  • Faerie (we implemented)
Search-Extension vs. Sort-Extension
Partition: Even vs. Dict+Doc
Comparison with State-of-the-art Methods: Faerie vs. NGPP
Scalability with Dictionary Sizes
Conclusion
• Approximate entity extraction
• A trie-based framework
• A trie-based algorithm
• Selecting a high-quality partition scheme
THANKS! Q&A
http://dbgroup.cs.tsinghua.edu.cn/dd/projects/taste/