Efficient Approximate Search on String Collections, Part I
Marios Hadjieleftheriou, Chen Li

DBLP Author Search
http://www.informatik.uni-trier.de/~ley/db/indices/a-tree/index.html

Try their names (good luck!)
- UCSD: Yannis Papakonstantinou
- Case Western: Meral Ozsoyoglu
- AT&T Research: Marios Hadjieleftheriou
http://www.informatik.uni-trier.de/~ley/db/indices/a-tree/index.html

Better system?
http://dblp.ics.uci.edu/authors/

People Search at UC Irvine
http://psearch.ics.uci.edu/

Web Search
Actual queries gathered by Google: http://www.google.com/jobs/britney.html
- Errors in queries
- Errors in data
- Bring query and meaningful results closer together

Data Cleaning
- R: informix, microsoft, …
- S: infromix, mcrosoft, …

Problem Formulation
Find strings similar to a given string: dist(Q, D) <= δ
Example: find strings similar to “hadjeleftheriou”
Performance is important!
- 10 ms: 100 queries per second (QPS)
- 5 ms: 200 QPS

Outline
- Motivation
- Preliminaries
- Trie-based approach
- Gram-based algorithms
Part II:
- Sketch-based algorithms
- Compression
- Selectivity estimation
- Transformations/Synonyms
- Conclusion

Next… Preliminaries

Similarity Functions
- Similar to:
  - a domain-specific function
  - returns a similarity value between two strings
- Examples:
  - Edit distance
  - Hamming distance
  - Jaccard similarity
  - Soundex
  - TF/IDF, BM25, DICE
- See [KSS06] for an excellent survey

Edit Distance
A widely used metric to define string similarity
- ed(s1, s2) = minimum # of operations (insertion, deletion, substitution) to change s1 to s2
- Example:
  s1: Tom Hanks
  s2: Ton Hank
  ed(s1, s2) = 2
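The definition above can be sketched with the standard dynamic-programming recurrence. This is a minimal illustration, not code from the tutorial; the function name `edit_distance` is an assumption.

```python
def edit_distance(s1: str, s2: str) -> int:
    """Levenshtein distance, computed one DP row at a time."""
    prev = list(range(len(s2) + 1))  # distance from "" to each prefix of s2
    for i, c1 in enumerate(s1, start=1):
        curr = [i]  # distance from s1[:i] to ""
        for j, c2 in enumerate(s2, start=1):
            curr.append(min(
                prev[j] + 1,               # delete c1
                curr[j - 1] + 1,           # insert c2
                prev[j - 1] + (c1 != c2),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


print(edit_distance("Tom Hanks", "Ton Hank"))  # slide example: 2
```

Keeping only two rows uses O(|s2|) memory, which is enough when only the distance (not the edit script) is needed.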

State-of-the-art: Oracle 10g and older versions
- Supported by Oracle Text
- Create the table:
    CREATE TABLE engdict(word VARCHAR(20), len INT);
- Create preferences for text indexing:
    begin
      ctx_ddl.create_preference('STEM_FUZZY_PREF', 'BASIC_WORDLIST');
      ctx_ddl.set_attribute('STEM_FUZZY_PREF', 'FUZZY_MATCH', 'ENGLISH');
      ctx_ddl.set_attribute('STEM_FUZZY_PREF', 'FUZZY_SCORE', '0');
      ctx_ddl.set_attribute('STEM_FUZZY_PREF', 'FUZZY_NUMRESULTS', '5000');
      ctx_ddl.set_attribute('STEM_FUZZY_PREF', 'SUBSTRING_INDEX', 'TRUE');
      ctx_ddl.set_attribute('STEM_FUZZY_PREF', 'STEMMER', 'ENGLISH');
    end;
    /
    CREATE INDEX fuzzy_stem_subst_idx ON engdict ( word )
      INDEXTYPE IS ctxsys.context PARAMETERS ('Wordlist STEM_FUZZY_PREF');
- Usage:
    SELECT * FROM engdict
    WHERE CONTAINS(word, 'fuzzy(universisty, 70, 6, weight)', 1) > 0;
- Limitation: cannot handle errors in the first letters: Katherine versus Catherine

Microsoft SQL Server [CGG+05]
- Data cleaning tools available in SQL Server 2005
- Part of Integration Services
- Supports fuzzy lookups
- Uses data flow pipeline of transformations
- Similarity function: tokens with TF/IDF scores

Lucene
- Using Levenshtein Distance (Edit Distance)
- Example: roam~0.8
- Prefix pruning followed by a scan (Efficiency?)

Next… Trie-based approach [JLL+09]

Trie Indexing
- Strings: example, exemplar, exempt, sample
[Figure: trie built over these strings, with $ nodes as drawn on the slide]

Active nodes on Trie
- Query: “example”
- Edit-distance threshold = 2
- Active nodes (prefix: distance): examp: 2, exampl: 1, example: 0, exempl: 2, exempla: 2, sample: 2
[Figure: the trie with the active nodes annotated by their distances]

Initialization
- Q = ε
- Initial active nodes: all nodes within depth δ
- Active nodes (prefix: distance): ε: 0, e: 1, ex: 2, s: 1, sa: 2
[Figure: the trie with the initial active nodes annotated]

Incremental Algorithm
Return leaf nodes as answers.

Query prefix Q = e (first character of “example”)
- Active nodes for Q = ε (prefix: distance): ε: 0, e: 1, ex: 2, s: 1, sa: 2
- Active nodes for Q = e (prefix: distance): ε: 1, e: 0, ex: 1, exa: 2, exe: 2, s: 1, sa: 2
- Each new entry is derived from a base active node by one operation: delete, substitute, match, or insert (for example, ε by deleting e, s by substituting e/s, e by matching e, ex by inserting x)
[Figure: derivation table with columns Prefix, # Op, Base, Op; candidate entries with distance 3 are discarded]

Good and bad
- Advantages:
  - Trie size is small
  - Can do search as the user types
- Disadvantages:
  - Works for edit distance only

Next… Gram-based algorithms
- List-merging algorithms [LLL08]
- Variable-length grams (VGRAM) [LWY07, YWL08]

“q-grams” of strings
- String: universal
- 2-grams: un, ni, iv, ve, er, rs, sa, al

Edit operation’s effect on grams
- String: universal
- Fixed length: q
- k operations could affect k * q grams
- If ed(s1, s2) <= k, then their # of common grams >= (|s1| - q + 1) - k * q
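As a small illustration of the bound above, the following sketch extracts fixed-length grams (without padding characters, which is an assumption matching the “universal” example) and evaluates the lower bound. The function names are illustrative only.

```python
def qgrams(s: str, q: int) -> list:
    """All overlapping substrings of length q; a string has |s| - q + 1 of them."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def min_common_grams(len_s1: int, q: int, k: int) -> int:
    """Lower bound on common grams when ed(s1, s2) <= k (may be <= 0)."""
    return (len_s1 - q + 1) - k * q


print(qgrams("universal", 2))
# ['un', 'ni', 'iv', 've', 'er', 'rs', 'sa', 'al']
print(min_common_grams(len("shtick"), q=2, k=1))  # 3
```

When the bound drops to zero or below, sharing grams is no longer a necessary condition and the count filter cannot prune anything.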

q-gram inverted lists
  id  string
  0   rich
  1   stick
  2   stich
  3   stuck
  4   static
  2-gram  ids
  at      4
  ch      0, 2
  ck      1, 3
  ic      0, 1, 2, 4
  ri      0
  st      1, 2, 3, 4
  ta      4
  ti      1, 2, 4
  tu      3
  uc      3

Searching using inverted lists
- Query: “shtick”, ED(shtick, ?) ≤ 1
- Query 2-grams: sh, ht, ti, ic, ck
- # of common grams >= 3
[Figure: the 2-gram inverted lists for rich, stick, stich, stuck, static, with the lists of the query grams highlighted]
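The search on this slide can be sketched as an index build plus a counting pass. This is an illustrative sketch, with assumed function names; counting query grams with multiplicity can only over-count, so it never drops a true answer, and surviving candidates still need to be verified with the real edit distance.

```python
from collections import Counter, defaultdict


def qgrams(s, q):
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def build_index(strings, q):
    """Map each gram to the ascending list of ids of strings containing it."""
    index = defaultdict(list)
    for sid, s in enumerate(strings):
        for gram in set(qgrams(s, q)):  # each id appears once per list
            index[gram].append(sid)
    return index


def candidates(index, num_strings, query, q, k):
    """Ids that share at least (|query| - q + 1) - k*q grams with the query."""
    threshold = (len(query) - q + 1) - k * q
    if threshold <= 0:
        return list(range(num_strings))  # count filter cannot prune anything
    counts = Counter()
    for gram in qgrams(query, q):
        counts.update(index.get(gram, ()))
    return sorted(sid for sid, c in counts.items() if c >= threshold)


strings = ["rich", "stick", "stich", "stuck", "static"]
index = build_index(strings, 2)
print(index["ic"])  # [0, 1, 2, 4]
print([strings[i] for i in candidates(index, len(strings), "shtick", 2, 1)])
# ['stick']
```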

T-occurrence Problem
- Input: inverted lists of ids, each in ascending order
- Merge the lists
- Find elements whose occurrences ≥ T

Example
- T = 4
- Ids in the lists (figure layout lost): 1 10 5 3 13 7 5 15 13 13 15 10 13
- Result: 13

List-Merging Algorithms
- [SK04]: HeapMerger, MergeOpt
- [LLL08, BK02]: ScanCount, MergeSkip, DivideSkip

Heap-based Algorithm
- Push the current element of each list to a min-heap
- Count # of occurrences of each element using the heap
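A minimal sketch of the heap-based merge follows. The slide's list layout is not recoverable, so the five lists below are an assumed arrangement of the same ids (1, 3, 5, 5, 7, 10, 10, 13 x 4, 15, 15); the function name is also an assumption. Lists are assumed sorted and duplicate-free.

```python
import heapq


def heap_merge(lists, T):
    """Return ids occurring in at least T of the sorted lists."""
    # Heap entries: (id, list index, position in that list).
    heap = [(lst[0], i, 0) for i, lst in enumerate(lists) if lst]
    heapq.heapify(heap)
    results = []
    current, count = None, 0
    while heap:
        value, i, pos = heapq.heappop(heap)
        if value != current:
            current, count = value, 0  # a new id starts a new run
        count += 1
        if count == T:
            results.append(value)
        if pos + 1 < len(lists[i]):
            heapq.heappush(heap, (lists[i][pos + 1], i, pos + 1))
    return results


# Assumed layout of the slide's ids, not the original figure.
lists = [[1, 3, 5, 10, 13], [10, 13, 15], [5, 7, 13], [13], [15]]
print(heap_merge(lists, 4))  # [13]
```

Every element of every list passes through the heap, which is what the skipping algorithms later in the deck try to avoid.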

MergeOpt Algorithm [SK04]
- Long lists: the T-1 longest lists, probed with binary search
- Short lists: the remaining lists, merged

Example of MergeOpt
- Count threshold: T = 4
- Long lists: 3
- Short lists: 2
- Ids in the lists (figure layout lost): 1 10 5 3 13 7 5 15 13 13 15 10 13

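ScanCount
- Keep an array with one counter (# of occurrences) per string id
- Scan the lists and increment the counter of each id by 1
- Count threshold: T = 4
- Result: id 13, whose count reaches 4
[Figure: counter array indexed by string id, next to the inverted lists]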
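ScanCount can be sketched in a few lines. As before, the lists are an assumed arrangement of the ids on the slide, and `num_ids` (the size of the counter array) is a parameter introduced for the illustration.

```python
def scan_count(lists, T, num_ids):
    """Count occurrences of each id in one pass over all lists."""
    counts = [0] * num_ids  # one counter per string id
    results = []
    for lst in lists:
        for sid in lst:
            counts[sid] += 1
            if counts[sid] == T:  # report an id the moment it reaches T
                results.append(sid)
    return sorted(results)


# Assumed layout of the slide's ids, not the original figure.
lists = [[1, 3, 5, 10, 13], [10, 13, 15], [5, 7, 13], [13], [15]]
print(scan_count(lists, T=4, num_ids=16))  # [13]
```

The algorithm needs no heap and does not even require sorted lists, at the price of a counter array as large as the collection.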

List-Merging Algorithms
- [SK04]: HeapMerger, MergeOpt
- [LLL08, BK02]: ScanCount, MergeSkip, DivideSkip

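MergeSkip algorithm [BK02, LLL08]
- Keep the current element of each list in a min-heap
- Pop T-1 elements
- Jump: in each popped list, skip to the first element greater than or equal to the new top of the heap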

Example of MergeSkip
- Count threshold: T = 4
[Figure: inverted lists and the min-heap; popped lists jump ahead to the next heap top]
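The skipping idea can be sketched as below. This is one reading of MergeSkip, not the authors' code: names are assumptions, lists are assumed sorted and duplicate-free, and the example lists are an assumed arrangement rather than the figure's.

```python
import heapq
from bisect import bisect_left


def merge_skip(lists, T):
    """Return ids occurring in at least T lists, skipping hopeless ids."""
    heap = [(lst[0], i, 0) for i, lst in enumerate(lists) if lst]
    heapq.heapify(heap)
    results = []
    while len(heap) >= T:  # fewer than T live lists cannot yield an answer
        top = heap[0][0]
        popped = []
        while heap and heap[0][0] == top:
            popped.append(heapq.heappop(heap))
        if len(popped) >= T:
            results.append(top)
            for _, i, pos in popped:  # advance each list by one element
                if pos + 1 < len(lists[i]):
                    heapq.heappush(heap, (lists[i][pos + 1], i, pos + 1))
        else:
            # Pop until T-1 entries are out; ids below the new top can
            # occur only in these T-1 lists, so they cannot reach T.
            while heap and len(popped) < T - 1:
                popped.append(heapq.heappop(heap))
            if not heap:
                break
            new_top = heap[0][0]
            for _, i, pos in popped:  # jump with binary search
                j = bisect_left(lists[i], new_top, pos)
                if j < len(lists[i]):
                    heapq.heappush(heap, (lists[i][j], i, j))
    return results


# Assumed example lists, not the original figure.
lists = [[1, 3, 5, 10, 13], [10, 13, 15], [5, 7, 13], [13], [15]]
print(merge_skip(lists, 4))  # [13]
```

In this run the ids 3, 7, and the first 10 are never pushed to the heap: three lists jump directly to 13.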

DivideSkip Algorithm [LLL08]
- Short lists: MergeSkip
- Long lists: binary search
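The long/short split can be sketched as follows. To keep the example short, the short lists are merged with a plain counter rather than MergeSkip, so this shows only the division and the binary-search probes; `num_long` is a hypothetical parameter, and choosing it well is the question the next slide raises. With `num_long = T - 1` the split matches the MergeOpt slide.

```python
from bisect import bisect_left
from collections import Counter


def contains(sorted_list, x):
    j = bisect_left(sorted_list, x)
    return j < len(sorted_list) and sorted_list[j] == x


def divide_skip(lists, T, num_long):
    """Ids occurring in >= T lists; the num_long longest lists are only probed."""
    if not 0 <= num_long < T:
        raise ValueError("num_long must satisfy 0 <= num_long < T")
    by_length = sorted(lists, key=len, reverse=True)
    long_lists, short_lists = by_length[:num_long], by_length[num_long:]
    # Stand-in for MergeSkip on the short lists (plain counting).
    short_counts = Counter(sid for lst in short_lists for sid in lst)
    results = []
    for sid, c in short_counts.items():
        if c >= T - num_long:  # an answer needs this many short-list hits
            total = c + sum(contains(lst, sid) for lst in long_lists)
            if total >= T:
                results.append(sid)
    return sorted(results)


# Assumed example lists, not the original figure.
lists = [[1, 3, 5, 10, 13], [10, 13, 15], [5, 7, 13], [13], [15]]
print(divide_skip(lists, T=4, num_long=2))  # [13]
```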

How many lists are treated as long lists?

Length Filtering
- Condition: ed(s, t) ≤ 2
- s: length 10
- t: length 19
- The pair is pruned by length only: the lengths differ by more than 2
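The length filter follows from the fact that each edit operation changes the length by at most one. A minimal sketch, with an assumed function name and placeholder strings of the slide's lengths:

```python
def passes_length_filter(s: str, t: str, k: int) -> bool:
    """Necessary condition for ed(s, t) <= k: lengths differ by at most k."""
    return abs(len(s) - len(t)) <= k


s = "a" * 10  # placeholder string of length 10
t = "a" * 19  # placeholder string of length 19
print(passes_length_filter(s, t, 2))  # False: pruned without computing ed
```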

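Positional Filtering
- Condition: ed(s, t) ≤ 2
- s contains the positional gram (ab, 1)
- t contains the positional gram (ab, 12)
- The two occurrences of ab are more than 2 positions apart, so they cannot be counted as a common gram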
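A sketch of positional filtering: a gram of s is counted only if t has the same gram within k positions. The count is an upper bound on the grams that can survive k edits, so comparing it with the count-filter bound remains safe. The names and the example strings are assumptions; positions here are 0-based, which does not change position differences.

```python
from collections import defaultdict


def positional_qgrams(s, q):
    return [(s[i:i + q], i) for i in range(len(s) - q + 1)]


def positional_matches(s, t, q, k):
    """Number of grams of s that also occur in t at most k positions away."""
    positions = defaultdict(list)
    for gram, pos in positional_qgrams(t, q):
        positions[gram].append(pos)
    return sum(
        1
        for gram, pos in positional_qgrams(s, q)
        if any(abs(pos - pt) <= k for pt in positions.get(gram, ()))
    )


s = "ab" + "c" * 11  # "ab" at position 0
t = "d" * 11 + "ab"  # "ab" at position 11
print(positional_matches(s, t, q=2, k=2))  # 0: the shared gram is too far away
```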

Normalized weights [HKS08]
- Compute a weight for each string
  - L0: length of the string
  - L1, L2: depend on q-gram frequencies
- Similar strings have similar weights
  - A very strong pruning condition

Pruning using normalized weights
- Sort inverted lists based on string weights
- Search within a small weight range
- Shown to be effective (> 90% candidates pruned)

Next… Variable-length grams (VGRAM) [LWY07, YWL08]

2-grams -> 3-grams?
- Query: “shtick”, ED(shtick, ?) ≤ 1
- Query 3-grams: sht, hti, tic, ick
- # of common grams >= 1
- Strings: 0 rich, 1 stick, 2 stich, 3 stuck, 4 static
  3-gram  ids
  ati     4
  ich     0, 2
  ick     1
  ric     0
  sta     4
  sti     1, 2
  stu     3
  tat     4
  tic     1, 2, 4
  tuc     3
  uck     3

Observation 1: dilemma of choosing “q”
- Increasing “q” causes:
  - Longer grams, hence shorter lists
  - Smaller # of common grams of similar strings
[Table: the 2-gram inverted lists for rich, stick, stich, stuck, static]

Observation 2: skew distributions of gram frequencies
- DBLP: 276,699 article titles
- Popular 5-grams: ation (>114K times), tions, ystem, catio

VGRAM: Main idea
- Grams with variable lengths (between qmin and qmax)
  - zebra: ze(123)
  - corrasion: co(5213), cor(859), corr(171)
- Advantages
  - Reduce index size
  - Reduce running time
  - Adoptable by many algorithms

Challenges
- Generating variable-length grams?
- Constructing a high-quality gram dictionary?
- Relationship between string similarity and their gram-set similarity?
- Adopting VGRAM in existing algorithms?

Challenge 1: String -> Variable-length grams?
- Fixed-length 2-grams of: universal
- Variable-length grams of: universal
- [2, 4]-gram dictionary: ni, ivr, sal, uni, vers
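One way to generate variable-length grams from a dictionary is sketched below. It follows a common reading of the VGRAM generation step (longest dictionary match at each position, fall back to the qmin-gram, drop grams contained in an earlier one); the details are assumptions rather than a transcription of [LWY07], and a set stands in for the dictionary trie.

```python
def vgrams(s, dictionary, qmin, qmax):
    """Positional variable-length grams of s as (position, gram) pairs."""
    grams = []
    max_end = -1  # rightmost position covered by a gram emitted so far
    for p in range(len(s) - qmin + 1):
        gram = s[p:p + qmin]  # fallback when no dictionary gram matches
        for length in range(min(qmax, len(s) - p), qmin - 1, -1):
            if s[p:p + length] in dictionary:
                gram = s[p:p + length]  # longest dictionary gram at p
                break
        end = p + len(gram) - 1
        if end > max_end:  # otherwise contained in an earlier gram
            grams.append((p, gram))
            max_end = end
    return grams


dictionary = {"ni", "ivr", "sal", "uni", "vers"}
print(vgrams("universal", dictionary, qmin=2, qmax=4))
# [(0, 'uni'), (2, 'iv'), (3, 'vers'), (6, 'sal')]
```

The string yields 4 variable-length grams instead of 8 fixed-length 2-grams, which is where the index-size reduction comes from.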

Representing gram dictionary as a trie
- Dictionary grams: ni, ivr, sal, uni, vers
[Figure: trie whose root-to-leaf paths spell the dictionary grams]

Step 2: Constructing a gram dictionary
- qmin = 2, qmax = 4
- Frequency-based [LWY07]
- Cost-based [YWL08]

Challenge 3: Edit operation’s effect on grams
- String: universal
- Fixed length: q
- k operations could affect k * q grams

Deletion affects variable-length grams
- Deletion at position i
- Affected: the region from position i-qmax+1 to position i+qmax-1
- Not affected: the regions to the left and to the right of it

Main idea
- For a string, for each position, compute the number of grams that could be destroyed by an operation at this position
- Compute number of grams possibly destroyed by k operations
- Store these numbers (for all data strings) as part of the index
- Example: vector of s = <2, 4, 6, 8, 9>; with 2 edit operations, at most 4 grams can be affected
- Use this number to do count filtering

Summary of VGRAM index

Challenge 4: adopting VGRAM
Easily adoptable by many algorithms
Basic interfaces:
- String s -> grams
- Strings s1, s2 such that ed(s1, s2) <= k -> min # of their common grams

Lower bound on # of common grams
- Fixed length (q): if ed(s1, s2) <= k, then their # of common grams >= (|s1| - q + 1) - k * q
- Variable lengths: # of common grams >= (# of grams of s1) - NAG(s1, k)
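The two bounds can be compared in a few lines. This sketch assumes the stored vector is indexed so that entry k gives NAG(s, k), as in the earlier example <2, 4, 6, 8, 9>; the gram count of 9 used below is a made-up value for illustration, and the function names are assumptions.

```python
def lower_bound_fixed(len_s: int, q: int, k: int) -> int:
    """Fixed-length grams: (|s| - q + 1) - k * q, floored at 0."""
    return max(0, (len_s - q + 1) - k * q)


def lower_bound_vgram(num_grams: int, nag: list, k: int) -> int:
    """Variable-length grams: (# of grams of s) - NAG(s, k), floored at 0."""
    if k == 0:
        return num_grams
    if not 1 <= k <= len(nag):
        raise ValueError("NAG vector does not cover this k")
    return max(0, num_grams - nag[k - 1])


nag_vector = [2, 4, 6, 8, 9]  # vector from the slide
print(lower_bound_vgram(num_grams=9, nag=nag_vector, k=2))  # 9 - 4 = 5
print(lower_bound_fixed(len("shtick"), q=2, k=1))           # 3
```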

Example: algorithm using inverted lists
- Query: “shtick”, ED(shtick, ?) ≤ 1
- Strings: 0 rich, 1 stick, 2 stich, 3 stuck, 4 static
- 2-grams: query grams sh, ht, ti, ic, ck; lower bound = 3
- 2-4 grams: query grams sh, ht, tick; lower bound = 1
[Figure: inverted lists for the 2-grams (…, ck, ic, …, ti, …) and for the 2-4 grams (…, ck, ic, ich, …, tick, …)]

End of part I
- Motivation
- Preliminaries
- Trie-based approach
- Gram-based algorithms
Part II:
- Sketch-based algorithms
- Compression
- Selectivity estimation
- Transformations/Synonyms
- Conclusion