Efficient Approximate Search on String Collections, Part I
Marios Hadjieleftheriou, Chen Li

DBLP Author Search
http://www.informatik.uni-trier.de/~ley/db/indices/a-tree/index.html

Try their names (good luck!)
- UCSD: Yannis Papakonstantinou
- Case Western: Meral Ozsoyoglu
- AT&T Research: Marios Hadjieleftheriou
http://www.informatik.uni-trier.de/~ley/db/indices/a-tree/index.html

Better system?
http://dblp.ics.uci.edu/authors/

People Search at UC Irvine
http://psearch.ics.uci.edu/

Web Search
Actual queries gathered by Google: http://www.google.com/jobs/britney.html
- Errors in queries
- Errors in data
- Bring query and meaningful results closer together

Data Cleaning
- R: informix, microsoft, …
- S: infromix, …, mcrosoft, …

Problem Formulation
Find strings similar to a given string: dist(Q, D) <= δ
Example: find strings similar to “hadjeleftheriou”
Performance is important!
- 10 ms: 100 queries per second (QPS)
- 5 ms: 200 QPS

Outline
- Motivation
- Preliminaries
- Trie-based approach
- Gram-based algorithms
- Sketch-based algorithms
- Compression
- Selectivity estimation
- Transformations/Synonyms
- Conclusion
Part II

Next… Preliminaries

Similarity Functions
- Similar to:
  - a domain-specific function
  - returns a similarity value between two strings
- Examples:
  - Edit distance
  - Hamming distance
  - Jaccard similarity
  - Soundex
  - TF/IDF, BM25, DICE
- See [KSS06] for an excellent survey

Edit Distance
- A widely used metric to define string similarity
- ed(s1, s2) = minimum # of operations (insertion, deletion, substitution) to change s1 to s2
- Example: s1: Tom Hanks, s2: Ton Hank, ed(s1, s2) = 2
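The definition above can be checked with a small dynamic-programming routine. This is a minimal sketch of the standard row-by-row computation; the function name `edit_distance` is chosen here for illustration and does not come from the slides.

```python
def edit_distance(s1: str, s2: str) -> int:
    """Minimum number of insertions, deletions, and substitutions
    needed to change s1 into s2 (unit cost per operation)."""
    # prev[j] holds the distance between the processed prefix of s1 and s2[:j].
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        curr = [i]  # distance from s1[:i] to the empty string
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else 1
            curr.append(min(prev[j] + 1,          # delete c1
                            curr[j - 1] + 1,      # insert c2
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


print(edit_distance("Tom Hanks", "Ton Hank"))  # 2
```

The slide's example needs one substitution (m to n) and one deletion (the final s), which is the value the routine returns.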

Next… Gram-based algorithms
- List-merging algorithms [LLL08]
- Variable-length grams (VGRAM) [LWY07, YWL08]

“q-grams” of strings
Example string: universal, decomposed into 2-grams
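As a small illustration of the decomposition, the following sketch slides a window of length q over the string. The helper name `qgrams` is an assumption made here; the slides only show the picture.

```python
def qgrams(s: str, q: int) -> list[str]:
    """All overlapping substrings of length q, in order of position."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


print(qgrams("universal", 2))
# ['un', 'ni', 'iv', 've', 'er', 'rs', 'sa', 'al']
```

A string of length n yields n - q + 1 grams, here 9 - 2 + 1 = 8.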

Edit operation’s effect on grams
- Example string: universal
- Fixed length q: k operations could affect k * q grams
- If ed(s1, s2) <= k, then their # of common grams >= (|s1| - q + 1) - k * q
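The bound is simple arithmetic, so a one-line helper is enough to evaluate it. The name `count_lower_bound` is hypothetical; the formula is the one stated on the slide.

```python
def count_lower_bound(len_s1: int, q: int, k: int) -> int:
    """Lower bound on the number of common q-grams of s1 and any s2
    with ed(s1, s2) <= k: (|s1| - q + 1) - k * q.
    A value <= 0 means the bound cannot prune anything."""
    return (len_s1 - q + 1) - k * q


# "shtick" has length 6; with k = 1:
print(count_lower_bound(6, 2, 1))  # 3
print(count_lower_bound(6, 3, 1))  # 1
```

The two values match the thresholds used for the “shtick” query with 2-grams and with 3-grams elsewhere in the deck.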

q-gram inverted lists
Strings:
id  string
0   rich
1   stick
2   stich
3   stuck
4   static
2-gram inverted lists (string ids):
at: 4
ch: 0, 2
ck: 1, 3
ic: 0, 1, 2, 4
ri: 0
st: 1, 2, 3, 4
ta: 4
ti: 1, 2, 4
tu: 3
uc: 3
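One way to build such lists is sketched below. The function name `build_index` is an assumption, and so is the choice to record each string id at most once per gram (the five example strings contain no repeated 2-gram, so the slide does not settle this).

```python
from collections import defaultdict


def build_index(strings: list[str], q: int) -> dict[str, list[int]]:
    """Map each q-gram to the ascending list of ids of strings containing it."""
    index = defaultdict(list)
    for sid, s in enumerate(strings):
        # A set collapses repeated grams inside one string (assumption).
        for gram in {s[i:i + q] for i in range(len(s) - q + 1)}:
            index[gram].append(sid)  # ids arrive in ascending order
    return dict(index)


index = build_index(["rich", "stick", "stich", "stuck", "static"], 2)
for gram in sorted(index):
    print(gram, index[gram])
```

Because the strings are visited in id order, every list comes out sorted, which the list-merging algorithms rely on.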

Searching using inverted lists
- Query: “shtick”, ED(shtick, ?) ≤ 1
- Query 2-grams: sh, ht, ti, ic, ck
- # of common grams >= 3
- Strings: 0 rich, 1 stick, 2 stich, 3 stuck, 4 static
- Lists hit by the query grams: ti: 1, 2, 4; ic: 0, 1, 2, 4; ck: 1, 3 (sh and ht have no list)
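The query can be traced end to end with a short sketch: count how often each string id appears in the lists of the query grams, keep the ids that reach the threshold, and verify the survivors with the edit distance. All function names are hypothetical. Indexing every gram occurrence (so counts may overestimate the number of common grams) is a simplification made here; it can only admit extra candidates, which the verification step removes.

```python
from collections import Counter, defaultdict


def qgrams(s, q):
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def edit_distance(s1, s2):
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        curr = [i]
        for j, c2 in enumerate(s2, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (c1 != c2)))
        prev = curr
    return prev[-1]


def search(strings, query, q, k):
    """Strings within edit distance k of query, using count filtering."""
    index = defaultdict(list)
    for sid, s in enumerate(strings):
        for gram in qgrams(s, q):
            index[gram].append(sid)

    threshold = (len(query) - q + 1) - k * q
    if threshold <= 0:
        candidates = range(len(strings))  # bound is useless: check everything
    else:
        counts = Counter()
        for gram in qgrams(query, q):
            counts.update(index.get(gram, ()))
        candidates = sorted(sid for sid, c in counts.items() if c >= threshold)

    # Count filtering yields candidates only; verify each one.
    return [strings[sid] for sid in candidates
            if edit_distance(query, strings[sid]) <= k]


print(search(["rich", "stick", "stich", "stuck", "static"], "shtick", 2, 1))
# ['stick']
```

Only id 1 (stick) appears in three of the query's lists (ti, ic, ck), so it is the single candidate and it passes verification.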

T-occurrence Problem
- Merge inverted lists sorted in ascending order
- Find elements whose occurrences ≥ T

Example
- T = 4
- List elements shown in the figure: 1 10 5 3 13 7 5 15 13 13 15 10 13
- Result: 13

List-Merging Algorithms
- HeapMerger, MergeOpt [SK04]
- ScanCount, MergeSkip, DivideSkip [LLL08, BK02]

Heap-based Algorithm
- Push the lists' elements to a min-heap
- Count # of occurrences of each element using a heap
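A minimal sketch of the heap-based merge follows. It relies on Python's `heapq.merge`, which performs a heap-based k-way merge of sorted inputs, and counts runs of equal ids in the merged stream. The five lists are an illustrative arrangement that is consistent with the numbers in the T = 4 example; the exact lists in the slide's figure are not recoverable from the text, so treat them as an assumption.

```python
import heapq


def heap_merge(lists: list[list[int]], T: int) -> list[int]:
    """Ids occurring in at least T of the sorted lists (heap-based merge)."""
    result = []
    current, count = None, 0
    for sid in heapq.merge(*lists):  # ascending stream over all lists
        if sid == current:
            count += 1
        else:
            if current is not None and count >= T:
                result.append(current)
            current, count = sid, 1
    if current is not None and count >= T:
        result.append(current)
    return result


# Illustrative lists (assumed arrangement).
lists = [[1, 3, 5, 10, 13], [10, 13, 15], [5, 7, 13], [13], [15]]
print(heap_merge(lists, 4))  # [13]
```

Every element of every list passes through the heap, which is the cost the later algorithms try to avoid.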

MergeOpt Algorithm [SK04]
- Long lists: the T-1 longest lists, probed with binary search
- Short lists: the remaining lists, merged to produce candidates

Example of MergeOpt
- List elements shown in the figure: 1 10 5 3 13 7 5 15 13 13 15 10 13
- Long lists: 3
- Short lists: 2
- Count threshold: ≥ 4 occurrences (T = 4)
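The split into long and short lists can be sketched as follows. An id must occur in at least one short list to reach T occurrences, because only T-1 lists are set aside as long; the short lists are therefore merged and each merged id is looked up in the long lists by binary search. The function name and the list arrangement are assumptions made for illustration, not taken from [SK04] or from the figure.

```python
import heapq
from bisect import bisect_left
from itertools import groupby


def merge_opt(lists: list[list[int]], T: int) -> list[int]:
    """Ids occurring in at least T sorted lists, MergeOpt-style."""
    by_length = sorted(lists, key=len, reverse=True)
    long_lists, short_lists = by_length[:T - 1], by_length[T - 1:]

    result = []
    # Merge only the short lists; each run of equal ids is one candidate.
    for sid, run in groupby(heapq.merge(*short_lists)):
        count = sum(1 for _ in run)
        for lst in long_lists:  # binary search in every long list
            i = bisect_left(lst, sid)
            if i < len(lst) and lst[i] == sid:
                count += 1
        if count >= T:
            result.append(sid)
    return result


# Illustrative lists (assumed arrangement): 3 long lists, 2 short lists for T = 4.
lists = [[1, 3, 5, 10, 13], [10, 13, 15], [5, 7, 13], [13], [15]]
print(merge_opt(lists, 4))  # [13]
```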

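ScanCount
- Keep an array of counters (# of occurrences), one per string id
- Scan the lists and increment the counter of each id by 1
- List elements shown in the figure: 1 10 5 3 13 7 5 15 13 13 15 10 13
- After the scan, id 13 has count 4 and id 15 has count 2
- Result: 13 (count threshold: ≥ 4 occurrences, T = 4)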
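A sketch of ScanCount under the assumption that string ids are integers in the range 0 to `num_ids` - 1 (the parameter name is hypothetical). The lists are the same illustrative arrangement of the slide's numbers, not necessarily the figure's layout.

```python
def scan_count(lists: list[list[int]], T: int, num_ids: int) -> list[int]:
    """Ids occurring in at least T lists, using one counter per id."""
    counts = [0] * num_ids
    result = []
    for lst in lists:
        for sid in lst:
            counts[sid] += 1
            if counts[sid] == T:  # report each id once, when it reaches T
                result.append(sid)
    return sorted(result)


lists = [[1, 3, 5, 10, 13], [10, 13, 15], [5, 7, 13], [13], [15]]
print(scan_count(lists, 4, 16))  # [13]
```

No heap or sorted order is needed, at the price of an array as large as the id space.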

List-Merging Algorithms
- HeapMerger, MergeOpt [SK04]
- ScanCount, MergeSkip, DivideSkip [LLL08, BK02]

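MergeSkip algorithm [BK02, LLL08]
- Keep the current element of each list in a min-heap
- Pop T-1 elements from the heap
- Jump in each popped list to the first element greater than or equal to the new top of the heap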

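Example of MergeSkip
- minHeap contents shown in the figure: 1, 5, 13, 10, 15 (with a Jump step)
- List elements shown in the figure: 1 10 5 3 13 7 5 15 17 13 15 10 15
- Count threshold: ≥ 4 occurrences (T = 4)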
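The skipping idea can be sketched as below. This is one reading of the algorithm, written for illustration: when the smallest id on the heap occurs fewer than T times, T-1 lists in total are popped, and each of them jumps (by binary search) to its first element that is not smaller than the new heap top. Any id below that top can occur only in the T-1 popped lists, so it cannot reach T occurrences. The function name and the input lists are assumptions.

```python
import heapq
from bisect import bisect_left


def merge_skip(lists: list[list[int]], T: int) -> list[int]:
    """Ids occurring in at least T sorted, duplicate-free lists."""
    pos = [0] * len(lists)  # current position in each list
    heap = [(lst[0], i) for i, lst in enumerate(lists) if lst]
    heapq.heapify(heap)
    result = []

    # With fewer than T lists left, no id can occur T times.
    while len(heap) >= T:
        top = heap[0][0]
        popped = []
        while heap and heap[0][0] == top:  # pop every copy of the top id
            popped.append(heapq.heappop(heap)[1])

        if len(popped) >= T:
            result.append(top)
            for i in popped:  # advance each popped list by one
                pos[i] += 1
                if pos[i] < len(lists[i]):
                    heapq.heappush(heap, (lists[i][pos[i]], i))
        else:
            while len(popped) < T - 1:  # pop until T-1 lists are out
                popped.append(heapq.heappop(heap)[1])
            target = heap[0][0]  # new top of the heap
            for i in popped:  # jump to the first element >= target
                pos[i] = bisect_left(lists[i], target, pos[i])
                if pos[i] < len(lists[i]):
                    heapq.heappush(heap, (lists[i][pos[i]], i))
    return result


lists = [[1, 3, 5, 10, 13], [10, 13, 15], [5, 7, 13], [13], [15]]
print(merge_skip(lists, 4))  # [13]
```

On these lists the first round pops 1, 5, and 10, sees 13 on top, and jumps the three popped lists straight to 13, so the ids 3 and 7 are never touched.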

DivideSkip Algorithm [LLL08]
- Short lists: MergeSkip
- Long lists: binary search

How many lists are treated as long lists?

Length Filtering
- Ed(s, t) ≤ 2
- s: length 10
- t: length 19
- Pruned by length only!
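The reasoning behind this filter is that each edit operation changes the length of a string by at most one, so two strings within edit distance k differ in length by at most k. A minimal sketch, with a hypothetical function name:

```python
def passes_length_filter(len_s: int, len_t: int, k: int) -> bool:
    """False means ed(s, t) <= k is impossible, judged by lengths alone."""
    return abs(len_s - len_t) <= k


print(passes_length_filter(10, 19, 2))  # False: the pair is pruned
```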

Positional Filtering
- Ed(s, t) ≤ 2
- s contains the positional gram (ab, 1)
- t contains the positional gram (ab, 12)
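A sketch of the positional check: two occurrences of the same gram are counted as a match only if their positions differ by at most k, since k edit operations can shift a gram by at most k positions. The function names and the 0-based positions produced by `positional_qgrams` are choices made here for illustration.

```python
def positional_qgrams(s: str, q: int) -> list[tuple[str, int]]:
    """(gram, position) pairs, positions counted from 0."""
    return [(s[i:i + q], i) for i in range(len(s) - q + 1)]


def may_match(pg1: tuple[str, int], pg2: tuple[str, int], k: int) -> bool:
    """True if the two positional grams can correspond under ed <= k."""
    (g1, p1), (g2, p2) = pg1, pg2
    return g1 == g2 and abs(p1 - p2) <= k


print(may_match(("ab", 1), ("ab", 12), 2))  # False: positions are 11 apart
```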

A filter tree
Combine filters with list-merging algorithms [LLL08]

Next… Variable-length grams (VGRAM) [LWY07, YWL08]

2-grams -> 3-grams?
- Query: “shtick”, ED(shtick, ?) ≤ 1
- Query 3-grams: sht, hti, tic, ick
- # of common grams >= 1
- Strings: 0 rich, 1 stick, 2 stich, 3 stuck, 4 static
3-gram inverted lists (string ids):
ati: 4
ich: 0, 2
ick: 1
ric: 0
sta: 4
sti: 1, 2
stu: 3
tat: 4
tic: 1, 2, 4
tuc: 3
uck: 3

Observation 1: dilemma of choosing “q”
Increasing “q” causes:
- Longer grams -> shorter lists
- Smaller # of common grams of similar strings
Compare with the 2-gram inverted lists (at, ch, ck, ic, ri, st, ta, ti, tu, uc) of the strings rich, stick, stich, stuck, static.

Observation 2: skewed distributions of gram frequencies
- DBLP: 276,699 article titles
- Popular 5-grams: ation (>114K times), tions, ystem, catio

VGRAM: Main idea
- Grams with variable lengths (between qmin and qmax)
  - zebra: ze(123)
  - corrasion: co(5213), cor(859), corr(171)
- Advantages
  - Reduce index size
  - Reduce running time
  - Adoptable by many algorithms

Challenges
- Generating variable-length grams?
- Constructing a high-quality gram dictionary?
- Relationship between string similarity and gram-set similarity?
- Adopting VGRAM in existing algorithms?

Challenge 1: String -> variable-length grams?
- Fixed-length 2-grams: universal
- Variable-length grams: universal, with the [2, 4]-gram dictionary ni, ivr, sal, uni, vers
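The slide does not spell out how a string is cut into variable-length grams, so the following is only a sketch of one plausible greedy rule, stated as an assumption: at each position take the longest dictionary gram that starts there (falling back to the qmin-gram when none matches), and drop a gram that lies entirely inside a gram already produced. The function name `vgram_decompose` and the 0-based positions are hypothetical.

```python
def vgram_decompose(s, dictionary, qmin, qmax):
    """Greedy variable-length decomposition (an assumed rule, see lead-in).
    Returns (start, gram) pairs."""
    grams = []
    for p in range(len(s) - qmin + 1):
        gram = s[p:p + qmin]  # fallback: the qmin-gram at p
        for length in range(min(qmax, len(s) - p), qmin - 1, -1):
            if s[p:p + length] in dictionary:  # longest dictionary match
                gram = s[p:p + length]
                break
        end = p + len(gram)
        # Skip the gram if an earlier gram covers its whole span.
        covered = any(start <= p and end <= start + len(g)
                      for start, g in grams)
        if not covered:
            grams.append((p, gram))
    return grams


dictionary = {"ni", "ivr", "sal", "uni", "vers"}
print(vgram_decompose("universal", dictionary, 2, 4))
# [(0, 'uni'), (2, 'iv'), (3, 'vers'), (6, 'sal')]
```

Under this rule "universal" yields four grams instead of the eight fixed-length 2-grams; "ni" is dropped because it lies inside "uni".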

Representing gram dictionary as a trie
Grams in the dictionary: ni, ivr, sal, uni, vers

Step 2: Constructing a gram dictionary
qmin = 2, qmax = 4
- Frequency-based [LWY07]
- Cost-based [YWL08]

Challenge 3: Edit operation’s effect on grams
- Example string: universal
- Fixed length q: k operations could affect k * q grams

Deletion affects variable-length grams
- Deletion at position i
- Affected: grams within the window from i-qmax+1 to i+qmax-1
- Not affected: grams outside this window, on either side

Main idea
- For a string, for each position, compute the number of grams that could be destroyed by an operation at this position
- Compute the number of grams possibly destroyed by k operations
- Store these numbers (for all data strings) as part of the index
- Example: vector of s = <2, 4, 6, 8, 9>; with 2 edit operations, at most 4 grams can be affected
- Use this number to do count filtering

Summary of VGRAM index

Challenge 4: adopting VGRAM
Easily adoptable by many algorithms
Basic interfaces:
- String s -> grams
- Strings s1, s2 such that ed(s1, s2) <= k -> min # of their common grams

Lower bound on # of common grams
- Example string: universal
- Fixed length (q): if ed(s1, s2) <= k, then their # of common grams >= (|s1| - q + 1) - k * q
- Variable lengths: # of grams of s1 - NAG(s1, k)
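The variable-length bound can be evaluated from the per-string vector stored in the index. In the sketch below, NAG(s, k) is read as the k-th entry of that vector (the number of grams possibly affected by k operations). The function name, the indexing convention, the handling of k beyond the vector length, and the gram count of 9 in the usage line are all assumptions made for illustration.

```python
def vgram_lower_bound(num_grams: int, nag_vector: list[int], k: int) -> int:
    """# of grams of s1 - NAG(s1, k), clamped at 0.
    nag_vector[j - 1] is assumed to hold NAG for j edit operations."""
    if k == 0:
        return num_grams
    # Beyond the stored entries, reuse the last one (assumption).
    nag = nag_vector[min(k, len(nag_vector)) - 1]
    return max(0, num_grams - nag)


# Vector <2, 4, 6, 8, 9>: 2 operations affect at most 4 grams.
print(vgram_lower_bound(9, [2, 4, 6, 8, 9], 2))  # 5, assuming s has 9 grams
```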

Example: algorithm using inverted lists
- Query: “shtick”, ED(shtick, ?) ≤ 1
- Strings: 0 rich, 1 stick, 2 stich, 3 stuck, 4 static
- With 2-grams: query grams sh, ht, ti, ic, ck; lists …, ck, ic, …, ti, …; lower bound = 3
- With [2, 4]-grams: query grams sh, ht, tick; lists …, ck, ic, ich, …, tick, …; lower bound = 1

End of part I
- Motivation
- Preliminaries
- Trie-based approach
- Gram-based algorithms
- Sketch-based algorithms
- Compression
- Selectivity estimation
- Transformations/Synonyms
- Conclusion
Part II