Efficient Approximate Search on String Collections Part I
- Slides: 50
Marios Hadjieleftheriou, Chen Li
DBLP Author Search
http://www.informatik.uni-trier.de/~ley/db/indices/a-tree/index.html
Try their names (good luck!)
- UCSD: Yannis Papakonstantinou
- Case Western: Meral Ozsoyoglu
- AT&T Research: Marios Hadjieleftheriou
Better system? http://dblp.ics.uci.edu/authors/
People Search at UC Irvine: http://psearch.ics.uci.edu/
Web Search
- Actual queries gathered by Google: http://www.google.com/jobs/britney.html
- Errors in queries
- Errors in data
- Bring query and meaningful results closer together
Data Cleaning
- R: informix, microsoft, …
- S: infromix, …, mcrosoft, …
Problem Formulation
- Find strings similar to a given string: dist(Q, D) <= δ
- Example: find strings similar to “hadjeleftheriou”
- Performance is important!
  - 10 ms per query: 100 queries per second (QPS)
  - 5 ms per query: 200 QPS
Outline
- Motivation
- Preliminaries
- Trie-based approach
- Gram-based algorithms
- Sketch-based algorithms
- Compression
- Selectivity estimation
- Transformations/Synonyms
- Conclusion
Part II
Next… Preliminaries
Similarity Functions
- Similar to:
  - a domain-specific function
  - returns a similarity value between two strings
- Examples:
  - Edit distance
  - Hamming distance
  - Jaccard similarity
  - Soundex
  - TF/IDF, BM25, DICE
- See [KSS06] for an excellent survey
Edit Distance
- A widely used metric to define string similarity
- ed(s1, s2) = minimum # of operations (insertion, deletion, substitution) to change s1 to s2
- Example: s1 = Tom Hanks, s2 = Ton Hank, ed(s1, s2) = 2
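The definition above can be illustrated with the textbook dynamic-programming recurrence. This is a minimal sketch, not code from the tutorial; the function name `edit_distance` is an assumption made for illustration.

```python
def edit_distance(s1: str, s2: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning s1 into s2."""
    # prev[j] is the distance between the already processed prefix of s1 and s2[:j].
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        curr = [i]  # distance from s1[:i] to the empty string
        for j, c2 in enumerate(s2, start=1):
            curr.append(min(
                prev[j] + 1,               # delete c1
                curr[j - 1] + 1,           # insert c2
                prev[j - 1] + (c1 != c2),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


print(edit_distance("Tom Hanks", "Ton Hank"))  # 2, as in the slide's example
```

The table costs O(|s1| * |s2|) time per pair, which is why the following slides filter candidates before verifying them.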
Next… Gram-based algorithms
- List-merging algorithms [LLL08]
- Variable-length grams (VGRAM) [LWY07, YWL08]
“q-grams” of strings
- Example string: universal
- 2-grams: un, ni, iv, ve, er, rs, sa, al
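A q-gram decomposition is a sliding window of width q. The sketch below assumes plain grams without start or end padding; `qgrams` is a hypothetical helper name.

```python
def qgrams(s: str, q: int) -> list[str]:
    """Return the overlapping substrings of length q, in order of position."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


print(qgrams("universal", 2))
```

A string of length n yields n - q + 1 grams, so "universal" has 8 grams for q = 2.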
Edit operation’s effect on grams
- Example string: universal
- Fixed length: q
- k operations could affect k * q grams
- If ed(s1, s2) <= k, then their # of common grams >= (|s1| - q + 1) - k * q
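The bound can be checked on a small pair of strings. This sketch assumes grams are compared as multisets; the helper names and the misspelled string "univarsal" are illustrative choices, not taken from the slides.

```python
from collections import Counter


def qgrams(s: str, q: int) -> list[str]:
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def min_common_grams(length: int, q: int, k: int) -> int:
    """Count-filter bound: (|s1| - q + 1) - k * q."""
    return (length - q + 1) - k * q


def common_grams(s1: str, s2: str, q: int) -> int:
    """Size of the multiset intersection of the two gram sets."""
    shared = Counter(qgrams(s1, q)) & Counter(qgrams(s2, q))
    return sum(shared.values())


s1, s2 = "universal", "univarsal"  # one substitution apart, so k = 1
print(min_common_grams(len(s1), 2, 1), common_grams(s1, s2, 2))
```

When the bound is zero or negative (short strings, large q or k), the count filter cannot prune anything and every string remains a candidate.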
q-gram inverted lists
Strings (id: string): 0: rich, 1: stick, 2: stich, 3: stuck, 4: static
2-gram inverted lists (gram: ids):
- at: 4
- ch: 0, 2
- ck: 1, 3
- ic: 0, 1, 2, 4
- ri: 0
- st: 1, 2, 3, 4
- ta: 4
- ti: 1, 2, 4
- tu: 3
- uc: 3
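The lists above can be rebuilt from the five strings. This is a minimal in-memory sketch; `build_index` is a hypothetical name, and a real index would typically be compressed or disk-resident.

```python
from collections import defaultdict


def build_index(strings: list[str], q: int) -> dict[str, list[int]]:
    """Map each q-gram to the ascending list of ids of strings containing it."""
    index = defaultdict(list)
    for sid, s in enumerate(strings):
        # A set keeps each id at most once per list, even if a gram repeats in s.
        for gram in {s[i:i + q] for i in range(len(s) - q + 1)}:
            index[gram].append(sid)
    return dict(index)


index = build_index(["rich", "stick", "stich", "stuck", "static"], 2)
for gram in sorted(index):
    print(gram, index[gram])
```

Because ids are assigned in scan order, every list comes out sorted, which the merging algorithms on the following slides rely on.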
Searching using inverted lists
- Query: “shtick”, ed(shtick, ?) ≤ 1
- 2-grams of the query: sh, ht, ti, ic, ck
- # of common grams >= 3
- Strings (id: string): 0: rich, 1: stick, 2: stich, 3: stuck, 4: static
- Lists matching the query grams: ti: 1, 2, 4; ic: 0, 1, 2, 4; ck: 1, 3
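The search on this slide can be sketched end to end: count shared grams per string id, keep ids that reach the threshold, then verify the survivors with the real edit distance. All function names are assumptions; the counting deliberately over-counts repeated query grams so the filter cannot drop a true answer.

```python
from collections import Counter, defaultdict


def qgrams(s, q):
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def edit_distance(s1, s2):
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        curr = [i]
        for j, c2 in enumerate(s2, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (c1 != c2)))
        prev = curr
    return prev[-1]


def search(strings, query, q, k):
    """Return the strings within edit distance k of query, using a count filter."""
    index = defaultdict(set)
    for sid, s in enumerate(strings):
        for gram in qgrams(s, q):
            index[gram].add(sid)

    threshold = (len(query) - q + 1) - k * q
    if threshold <= 0:
        candidates = range(len(strings))  # the filter has no pruning power
    else:
        counts = Counter()
        for gram in qgrams(query, q):  # one vote per query gram occurrence
            counts.update(index.get(gram, ()))
        candidates = [sid for sid, c in counts.items() if c >= threshold]

    # Verification removes the false positives that pass the filter.
    return sorted(strings[sid] for sid in candidates
                  if edit_distance(query, strings[sid]) <= k)


print(search(["rich", "stick", "stich", "stuck", "static"], "shtick", 2, 1))
```

For the slide's query only id 1 (stick) reaches 3 shared grams, so a single edit-distance computation is needed.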
T-occurrence Problem
- Merge the inverted lists, each sorted in ascending order
- Find elements whose occurrences ≥ T
Example
- T = 4
- [Figure: inverted lists holding the ids 1, 3, 5, 7, 10, 13, 15; id 13 occurs 4 times]
- Result: 13
List-Merging Algorithms
- HeapMerger, MergeOpt [SK04]
- ScanCount, MergeSkip, DivideSkip [LLL08, BK02]
Heap-based Algorithm
- Push to a min-heap
- Count # of occurrences of each element using a heap
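A heap-based merge can be sketched with the standard library, whose `heapq.merge` performs a k-way merge driven by a min-heap. The five lists are an assumed arrangement of the ids from the T-occurrence example (the figure's layout did not survive), and `heap_merge` is a hypothetical name.

```python
import heapq
from itertools import groupby


def heap_merge(lists, T):
    """Return ids occurring in at least T of the sorted lists."""
    merged = heapq.merge(*lists)  # ascending stream of all ids
    # Equal ids are adjacent in the merged stream, so count each run.
    return [sid for sid, run in groupby(merged) if sum(1 for _ in run) >= T]


# Assumed lists: 13 appears in four of them, every other id in at most two.
lists = [[1, 3, 5, 10, 13], [10, 13, 15], [5, 7, 13], [13], [15]]
print(heap_merge(lists, 4))
```

Every element of every list passes through the heap, which is the cost the skipping algorithms on the later slides try to avoid.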
MergeOpt Algorithm [SK04]
- Long lists: T-1 (binary search)
- Short lists
Example of MergeOpt
- Count threshold T ≥ 4
- Long lists: 3
- Short lists: 2
- [Figure: the lists of the T-occurrence example, holding the ids 1, 3, 5, 7, 10, 13, 15]
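The long/short split can be sketched as follows. An id that occurs T times must appear in at least one short list, because only T-1 lists are set aside as long, so candidates come from the short lists and the long lists are only probed by binary search. The Counter-based merge of the short lists, the function names, and the list layout are assumptions made for illustration, not the exact procedure of [SK04].

```python
from bisect import bisect_left
from collections import Counter


def contains(sorted_list, x):
    """Binary search for x in an ascending list."""
    i = bisect_left(sorted_list, x)
    return i < len(sorted_list) and sorted_list[i] == x


def merge_opt(lists, T):
    """Return ids occurring in at least T of the sorted, duplicate-free lists."""
    by_length = sorted(lists, key=len, reverse=True)
    long_lists, short_lists = by_length[:T - 1], by_length[T - 1:]
    # Every answer must show up at least once among the short lists.
    counts = Counter(x for lst in short_lists for x in lst)
    result = []
    for x in sorted(counts):
        total = counts[x] + sum(contains(lst, x) for lst in long_lists)
        if total >= T:
            result.append(x)
    return result


# Assumed layout: with T = 4 the three longest lists are long, the two singletons short.
lists = [[1, 3, 5, 10, 13], [10, 13, 15], [5, 7, 13], [13], [15]]
print(merge_opt(lists, 4))
```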
ScanCount
- Keep an array with the # of occurrences of each string id
- Scan the lists; increment the counter of each id by 1
- Count threshold T ≥ 4
- [Figure: the lists of the T-occurrence example; only the counter of id 13 reaches 4]
- Result: 13
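ScanCount needs only an array of counters. A minimal sketch, assuming string ids are dense integers below `num_ids` (a parameter introduced here) and using an assumed arrangement of the example's ids into five lists:

```python
def scan_count(lists, T, num_ids):
    """Return ids occurring in at least T lists, using one counter per string id."""
    counts = [0] * num_ids
    result = []
    for lst in lists:
        for sid in lst:
            counts[sid] += 1
            if counts[sid] == T:  # report each id once, when it first reaches T
                result.append(sid)
    return sorted(result)


lists = [[1, 3, 5, 10, 13], [10, 13, 15], [5, 7, 13], [13], [15]]
print(scan_count(lists, 4, num_ids=16))
```

The lists do not even have to be sorted for this to work; the price is a counter array as large as the string collection.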
List-Merging Algorithms
- HeapMerger, MergeOpt [SK04]
- ScanCount, MergeSkip, DivideSkip [LLL08, BK02]
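MergeSkip algorithm [BK02, LLL08]
- Keep the current element of each list in a min-heap
- Pop T-1 elements from the heap
- Jump in the popped lists to the first element greater than or equal to the new top of the heap
Example of MergeSkip
- Count threshold T ≥ 4
- [Figure: min-heap over the list frontiers, with jumps skipping over ids that cannot reach the threshold]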
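The skipping idea can be sketched with a heap of list frontiers and binary search for the jumps. This follows the description on the slides as I read it; the function name, the cursor array, and the example lists are assumptions rather than the authors' implementation.

```python
import heapq
from bisect import bisect_left


def merge_skip(lists, T):
    """Return ids occurring in at least T of the sorted, duplicate-free lists."""
    pos = [0] * len(lists)  # cursor into each list
    heap = [(lst[0], i) for i, lst in enumerate(lists) if lst]
    heapq.heapify(heap)
    result = []
    while len(heap) >= T:  # fewer than T live lists can never reach T
        top = heap[0][0]
        popped = []
        while heap and heap[0][0] == top:
            popped.append(heapq.heappop(heap)[1])
        if len(popped) >= T:
            result.append(top)
            for i in popped:  # step past the reported id
                pos[i] += 1
                if pos[i] < len(lists[i]):
                    heapq.heappush(heap, (lists[i][pos[i]], i))
        else:
            # Pop until T-1 lists are out; ids below the new top occur in
            # at most those T-1 lists, so they can be skipped safely.
            while len(popped) < T - 1:
                popped.append(heapq.heappop(heap)[1])
            target = heap[0][0]
            for i in popped:  # jump to the first element >= target
                pos[i] = bisect_left(lists[i], target, pos[i])
                if pos[i] < len(lists[i]):
                    heapq.heappush(heap, (lists[i][pos[i]], i))
    return result


lists = [[1, 3, 5, 10, 13], [10, 13, 15], [5, 7, 13], [13], [15]]
print(merge_skip(lists, 4))
```

On these assumed lists the first round pops 1, 5 and 10, then jumps all three lists directly to 13 without visiting 3 or 7.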
DivideSkip Algorithm [LLL08]
- Long lists: binary search
- Short lists: MergeSkip
How many lists are treated as long lists?
Length Filtering
- ed(s, t) ≤ 2
- s: length 10
- t: length 19
- The pair can be pruned by length only!
Positional Filtering
- ed(s, t) ≤ 2
- s contains the gram (ab, 1)
- t contains the gram (ab, 12)
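Both filters reduce to a comparison against k. The sketch below uses hypothetical helper names: an edit operation changes the length by at most one, and it shifts the position of a surviving gram by at most one.

```python
def passes_length_filter(len_s: int, len_t: int, k: int) -> bool:
    """Strings within edit distance k differ in length by at most k."""
    return abs(len_s - len_t) <= k


def positions_compatible(pos_s: int, pos_t: int, k: int) -> bool:
    """Equal grams can only correspond if their positions differ by at most k."""
    return abs(pos_s - pos_t) <= k


# The two slide examples, both with threshold k = 2.
print(passes_length_filter(10, 19, 2))  # lengths 10 and 19
print(positions_compatible(1, 12, 2))   # (ab, 1) versus (ab, 12)
```

Both checks fail here, so the pair is discarded without computing the edit distance, and the shared gram ab does not count toward the common-gram threshold.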
A filter tree
- Combine filters with list-merging algorithms [LLL08]
Next… Variable-length grams (VGRAM) [LWY07, YWL08]
2-grams -> 3-grams?
- Query: “shtick”, ed(shtick, ?) ≤ 1
- 3-grams of the query: sht, hti, tic, ick
- # of common grams >= 1
- Strings (id: string): 0: rich, 1: stick, 2: stich, 3: stuck, 4: static
- 3-gram inverted lists (gram: ids): ati: 4; ich: 0, 2; ick: 1; ric: 0; sta: 4; sti: 1, 2; stu: 3; tat: 4; tic: 1, 2, 4; tuc: 3; uck: 3
Observation 1: dilemma of choosing “q”
- Increasing “q” causes:
  - Longer grams
  - Shorter lists
  - Smaller # of common grams of similar strings
- [Figure: the 2-gram inverted lists of rich, stick, stich, stuck, static]
Observation 2: skew distributions of gram frequencies
- DBLP: 276,699 article titles
- Popular 5-grams: ation (>114K times), tions, ystem, catio
VGRAM: Main idea
- Grams with variable lengths (between qmin and qmax)
  - zebra: ze(123)
  - corrasion: co(5213), cor(859), corr(171)
- Advantages
  - Reduce index size
  - Reduce running time
  - Adoptable by many algorithms
Challenges
- Generating variable-length grams?
- Constructing a high-quality gram dictionary?
- Relationship between string similarity and their gram-set similarity?
- Adopting VGRAM in existing algorithms?
Challenge 1: String -> Variable-length grams?
- Fixed-length 2-grams: universal
- Variable-length grams: universal
- [2, 4]-gram dictionary: ni, ivr, sal, uni, vers
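One way to read the decomposition, following my understanding of [LWY07]: at each position take the longest dictionary gram starting there, fall back to the qmin-gram when none matches, and drop any gram that lies entirely inside a gram already produced. The sketch below is an assumption-laden illustration; `vgrams` is a hypothetical name and a plain set stands in for the dictionary.

```python
def vgrams(s, dictionary, qmin, qmax):
    """Return (position, gram) pairs of a variable-length gram decomposition."""
    grams = []
    covered_end = 0  # exclusive end of the furthest-reaching gram emitted so far
    for p in range(len(s) - qmin + 1):
        gram = s[p:p + qmin]  # fallback when no dictionary gram starts at p
        for q in range(min(qmax, len(s) - p), qmin - 1, -1):
            if s[p:p + q] in dictionary:  # longest match wins
                gram = s[p:p + q]
                break
        end = p + len(gram)
        if end > covered_end:  # otherwise it is subsumed by an earlier gram
            grams.append((p, gram))
            covered_end = end
    return grams


dictionary = {"ni", "ivr", "sal", "uni", "vers"}
print(vgrams("universal", dictionary, qmin=2, qmax=4))
```

With the slide's dictionary, "universal" yields four grams instead of the eight fixed-length 2-grams. Storing the dictionary as a trie lets the longest-match lookup walk the string once instead of testing each length separately.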
Representing gram dictionary as a trie
- Dictionary grams: ni, ivr, sal, uni, vers
Challenge 2: Constructing a gram dictionary
- qmin = 2, qmax = 4
- Frequency-based [LWY07]
- Cost-based [YWL08]
Challenge 3: Edit operation’s effect on grams
- Example string: universal
- Fixed length: q
- k operations could affect k * q grams
Deletion affects variable-length grams
- Deletion at position i
- Affected: grams overlapping positions i-qmax+1 to i+qmax-1
- Not affected: grams outside that range
Main idea
- For a string, for each position, compute the number of grams that could be destroyed by an operation at this position
- Compute the number of grams possibly destroyed by k operations
- Store these numbers (for all data strings) as part of the index
- Example: vector of s = <2, 4, 6, 8, 9>; with 2 edit operations, at most 4 grams can be affected
- Use this number to do count filtering
Summary of VGRAM index
Challenge 4: adopting VGRAM
- Easily adoptable by many algorithms
- Basic interfaces:
  - String s -> grams
  - Strings s1, s2 such that ed(s1, s2) <= k -> min # of their common grams
Lower bound on # of common grams
- Example string: universal
- Fixed length (q): if ed(s1, s2) <= k, then their # of common grams >= (|s1| - q + 1) - k * q
- Variable lengths: # of grams of s1 - NAG(s1, k)
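The variable-length bound only needs the stored vector. In this sketch the vector <2, 4, 6, 8, 9> comes from the earlier slide, while the gram count of 9 and the name `vgram_lower_bound` are assumptions made for illustration.

```python
def vgram_lower_bound(num_grams: int, nag: list[int], k: int) -> int:
    """# of grams of s1 minus NAG(s1, k); nag[k - 1] covers k edit operations."""
    if k == 0:
        return num_grams  # no edits, so every gram must be shared
    affected = nag[min(k, len(nag)) - 1]  # beyond the vector, reuse its last entry
    return num_grams - affected


nag = [2, 4, 6, 8, 9]  # vector of s from the earlier slide
print(vgram_lower_bound(9, nag, 2))  # assumes s has 9 grams
```

As with fixed-length grams, a bound of zero or less means the count filter cannot prune for that string and threshold.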
Example: algorithm using inverted lists
- Query: “shtick”, ed(shtick, ?) ≤ 1
- Strings (id: string): 0: rich, 1: stick, 2: stich, 3: stuck, 4: static
- 2-grams of the query: sh, ht, ti, ic, ck; lower bound = 3
- [2, 4]-grams of the query: sh, ht, tick; lower bound = 1
- [Figure: inverted lists for both gram sets, including the lists of ck, ic, ich, and tick]
End of part I
- Motivation
- Preliminaries
- Trie-based approach
- Gram-based algorithms
- Sketch-based algorithms
- Compression
- Selectivity estimation
- Transformations/Synonyms
- Conclusion
Part II