Efficient Merging and Filtering Algorithms for Approximate String Searches
Jiaheng Lu, University of California, Irvine
Joint work with Chen Li, Yiming Lu

Example: a movie database
Find movies starring Schwarrzenger.
Star             Title            Year  Genre
Keanu Reeves     The Matrix       1999  Sci-Fi
Samuel Jackson   Iron man         2008  Sci-Fi
Schwarzenegger   The Terminator   1984  Sci-Fi
Samuel Jackson   The man          2006  Crime
Chen Li, Jiaheng Lu, Yiming Lu

Data may not be clean
  • Data integration and cleaning: Relation R and Relation S each have a Star column
  • Star values shown: Keanu Reeves, Samuel Jackson, Samuel L. Jackson, Schwarzenegger

Problem definition: approximate string searches
  • Collection of strings s (Star): Keanu Reeves, Samuel Jackson, Schwarzenegger, …
  • Search query q: Schwarrzenger
  • Sim functions: edit distance, Jaccard coefficient, and cosine similarity
  • Output: strings s that satisfy Sim(q, s) ≤ δ (for a distance such as edit distance; for a similarity such as Jaccard or cosine the test is Sim(q, s) ≥ δ)
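As a baseline for the problem stated above, a brute-force search can compare q against every string in the collection. The sketch below uses the standard dynamic-programming edit distance; the names `edit_distance` and `naive_search` and the threshold δ = 3 are illustrative assumptions, not taken from the slides.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance with unit-cost insert, delete, and substitute."""
    prev = list(range(len(b) + 1))          # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        curr = [i]                          # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # substitute or match
        prev = curr
    return prev[-1]


def naive_search(strings, query, delta):
    """Return every string within edit distance delta of the query."""
    return [s for s in strings if edit_distance(query, s) <= delta]


# Hypothetical threshold delta = 3 for the misspelled query on the slide.
stars = ["Keanu Reeves", "Samuel Jackson", "Schwarzenegger"]
print(naive_search(stars, "Schwarrzenger", 3))  # -> ['Schwarzenegger']
```

This scans the whole collection for every query, which is the cost the gram-based index on the following slides is meant to avoid.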

Outline
  • Problem motivation
  • Preliminaries
      • Grams
      • Inverted lists
  • Merge algorithms
  • Filtering techniques
  • Conclusion

String Grams
  • q-grams (substrings of length q)
  • For example, the 2-grams of "universal": (un), (ni), (iv), (ve), (er), (rs), (sa), (al)
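The decomposition on this slide can be sketched in a few lines. The function name `qgrams` is an assumption made for illustration.

```python
def qgrams(s: str, q: int = 2) -> list[str]:
    """Return the overlapping q-grams of s, from left to right."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


print(qgrams("universal"))
# -> ['un', 'ni', 'iv', 've', 'er', 'rs', 'sa', 'al']
```

A string of length n yields n - q + 1 grams, so "universal" (9 characters) yields eight 2-grams.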

Inverted lists
  • Convert strings to gram inverted lists
id  string
0   rich
1   stick
2   stich
3   stuck
4   static
2-gram  inverted list (string ids)
at      4
ch      0, 2
ck      1, 3
ic      0, 1, 2, 4
ri      0
st      1, 2, 3, 4
ta      4
ti      1, 2, 4
tu      3
uc      3
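A minimal sketch of how such an index could be built from the five strings on this slide. The function name `build_inverted_lists` is an assumption; because ids are visited in increasing order, every list comes out sorted in ascending order.

```python
from collections import defaultdict


def build_inverted_lists(strings, q=2):
    """Map each q-gram to the sorted list of ids of the strings containing it."""
    index = defaultdict(list)
    for sid, s in enumerate(strings):
        # A set keeps an id from being added twice when a gram repeats in s.
        for gram in {s[i:i + q] for i in range(len(s) - q + 1)}:
            index[gram].append(sid)
    return dict(index)


index = build_inverted_lists(["rich", "stick", "stich", "stuck", "static"])
for gram in sorted(index):
    print(gram, index[gram])
```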

Main Example
  • Query: stick, with ed(s, q) ≤ 1
  • Grams of the query: (st, ti, ic, ck)
  • Data: 0 rich, 1 stick, 2 stich, 3 stuck, 4 static
  • Gram inverted lists (excerpt):
      ck: 1, 3
      ic: 0, 1, 2, 4
      st: 1, 2, 3, 4
      ta: 4
      ti: 1, 2, 4
      …
  • Merge the lists of the query grams: ids with count ≥ 2 are the candidates
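The merge step of this example can be sketched as follows. The slide only states the threshold value 2; the formula in `count_threshold` is the usual q-gram count bound (a string within edit distance k of the query shares at least (|query| - q + 1) - k·q grams with it) and is brought in here as an assumption. Function and variable names are illustrative.

```python
from collections import Counter

# Inverted lists of the query grams, copied from the slide.
lists = {
    "ck": [1, 3],
    "ic": [0, 1, 2, 4],
    "st": [1, 2, 3, 4],
    "ti": [1, 2, 4],
}


def count_threshold(query_len: int, q: int, k: int) -> int:
    """Lower bound on shared q-grams for strings within edit distance k."""
    return (query_len - q + 1) - k * q


def candidates(query_grams, lists, T):
    """Ids that occur on at least T of the query's inverted lists."""
    counts = Counter(sid for g in query_grams for sid in lists.get(g, []))
    return sorted(sid for sid, c in counts.items() if c >= T)


T = count_threshold(len("stick"), q=2, k=1)              # 4 grams - 2 = 2
print(T, candidates(["st", "ti", "ic", "ck"], lists, T))  # -> 2 [1, 2, 3, 4]
```

The candidates still have to be verified with the real edit distance; id 4 (static), for example, passes the count test but is more than one edit away from stick.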

Problem definition: Merge
  • Lists are sorted in ascending order
  • Find elements whose number of occurrences is ≥ T

Example
  • T = 4
  • [Figure: sorted lists holding the ids 1, 3, 5, 5, 7, 10, 10, 13, 13, 13, 13, 15, 15]
  • Result: 13

Contributions
  • Three new merge algorithms
  • New finding: wisely using filters

Outline
  • Problem motivation
  • Preliminaries
  • Merge algorithms
      • Two previous algorithms
      • Our proposed three algorithms
  • Filtering techniques
  • Conclusion

Five Merge Algorithms
  • Previous: HeapMerger, MergeOpt [Sarawagi, SIGMOD 2004]
  • New: ScanCount, MergeSkip, DivideSkip

Heap-based Algorithm
  • Push the current element of each list to a min-heap
  • Count the # of occurrences of each element using the heap
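A minimal sketch of the heap-based merge described on this slide. The name `heap_merge` is an assumption, and the five lists are a hypothetical split of the ids from the earlier T = 4 example, since the slide figure's exact layout did not survive.

```python
import heapq


def heap_merge(lists, T):
    """Return ids that occur on at least T of the sorted, duplicate-free lists."""
    # Each heap entry is (current id, list index, position in that list).
    heap = [(lst[0], i, 0) for i, lst in enumerate(lists) if lst]
    heapq.heapify(heap)
    result = []
    while heap:
        top = heap[0][0]
        count = 0
        # Pop every entry equal to the smallest id and advance those lists.
        while heap and heap[0][0] == top:
            _, i, pos = heapq.heappop(heap)
            count += 1
            if pos + 1 < len(lists[i]):
                heapq.heappush(heap, (lists[i][pos + 1], i, pos + 1))
        if count >= T:
            result.append(top)
    return result


# Hypothetical layout of the example ids (assumption, not from the slide).
lists = [[1, 3, 5, 10, 13], [10, 13, 15], [5, 7, 13], [13], [15]]
print(heap_merge(lists, 4))  # -> [13]
```

Every element of every list passes through the heap once, which is the cost the later algorithms try to avoid.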

MergeOpt Algorithm
  • Long lists: the T-1 longest lists, probed by binary search
  • Short lists: the remaining lists

Example of MergeOpt [Sarawagi et al 2004]
  • [Figure: five sorted lists holding the ids 1, 3, 5, 5, 7, 10, 10, 13, 13, 13, 13, 15, 15]
  • Long lists: 3
  • Short lists: 2
  • Count threshold T = 4 (count ≥ 4)
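The long/short split can be sketched as below. This is a simplified illustration of the idea, not the published MergeOpt procedure: it merges the short lists with `heapq.merge` and probes each candidate in every long list. The name `merge_opt` and the concrete list layout are assumptions.

```python
import heapq
from bisect import bisect_left
from collections import Counter


def contains(sorted_list, x):
    """Binary search for x in a sorted list."""
    i = bisect_left(sorted_list, x)
    return i < len(sorted_list) and sorted_list[i] == x


def merge_opt(lists, T):
    """Merge the short lists, then probe the T-1 longest lists by binary search."""
    by_length = sorted(lists, key=len, reverse=True)
    long_lists, short_lists = by_length[:T - 1], by_length[T - 1:]
    # An id needs at least one occurrence on a short list to reach T,
    # because the long lists can contribute at most T-1 occurrences.
    counts = Counter(heapq.merge(*short_lists))
    result = []
    for sid in sorted(counts):
        total = counts[sid] + sum(contains(lst, sid) for lst in long_lists)
        if total >= T:
            result.append(sid)
    return result


# Hypothetical layout of the example ids: 3 long lists and 2 short lists.
lists = [[1, 3, 5, 10, 13], [10, 13, 15], [5, 7, 13], [13], [15]]
print(merge_opt(lists, 4))  # -> [13]
```

With T = 4 only the two one-element lists are merged; ids 13 and 15 become candidates, and the binary searches confirm 13 (4 occurrences) and reject 15 (2 occurrences).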

Can we run faster?

Five Merge Algorithms
  • Previous: HeapMerger, MergeOpt
  • New: ScanCount, MergeSkip, DivideSkip

ScanCount Example
  • Keep one counter (# of occurrences) per string id: 1, 2, 3, …, 13, 14, 15
  • Scan the lists and increment the counter of each id by 1
  • [Figure: lists holding the ids 1, 3, 5, 5, 7, 10, 10, 13, 13, 13, 13, 15, 15; the counter of id 13 reaches 4 and the counter of id 15 reaches 2]
  • Count threshold T = 4 (count ≥ 4)
  • Result: 13
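The counting scheme above can be sketched as follows. The name `scan_count`, the parameter `num_strings`, and the concrete list layout are assumptions made for illustration; lists are assumed to hold each id at most once.

```python
def scan_count(lists, T, num_strings):
    """Return ids whose counter reaches T after scanning every list once."""
    counts = [0] * num_strings       # one counter per string id
    result = []
    for lst in lists:
        for sid in lst:
            counts[sid] += 1         # increment by 1
            if counts[sid] == T:     # report an id the moment it reaches T
                result.append(sid)
    return sorted(result)


# Hypothetical layout of the example ids; ids run from 0 to 15.
lists = [[1, 3, 5, 10, 13], [10, 13, 15], [5, 7, 13], [13], [15]]
print(scan_count(lists, 4, num_strings=16))  # -> [13]
```

No heap and no sorted order are needed; the price is a counter array as large as the string collection.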

Five Merge Algorithms
  • Previous: HeapMerger, MergeOpt
  • New: ScanCount, MergeSkip, DivideSkip

MergeSkip algorithm
  • Keep the current element of each list in a min-heap
  • Pop T-1 elements from the heap
  • Jump: move each popped list forward to its first element greater than or equal to the new top of the heap

Example of MergeSkip
  • [Figure: a min-heap holding 1, 5, 10, 13, 15 over lists with the ids 1, 3, 5, 5, 7, 10, 10, 13, 13, 15, 15, 15, 17, with a Jump step]
  • Count threshold T = 4 (count ≥ 4)

Skip is safe
  • # of occurrences of each skipped element ≤ T-1, so no skipped element can reach the threshold T
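The pop-and-jump idea of the MergeSkip slides can be sketched as below. This is an illustrative reading of the slides, not the authors' code; the name `merge_skip` and the list layout are assumptions. Skipped ids lie below the new heap top, so they can only occur on the T-1 popped lists.

```python
import heapq
from bisect import bisect_left


def merge_skip(lists, T):
    """Return ids that occur on at least T of the sorted, duplicate-free lists."""
    heap = [(lst[0], i, 0) for i, lst in enumerate(lists) if lst]
    heapq.heapify(heap)
    result = []
    while heap:
        top = heap[0][0]
        popped = []
        while heap and heap[0][0] == top:        # pop all copies of the top id
            popped.append(heapq.heappop(heap))
        if len(popped) >= T:
            result.append(top)
            for _, i, pos in popped:             # advance each list by one
                if pos + 1 < len(lists[i]):
                    heapq.heappush(heap, (lists[i][pos + 1], i, pos + 1))
            continue
        while heap and len(popped) < T - 1:      # pop until T-1 lists are out
            popped.append(heapq.heappop(heap))
        if not heap:
            break                                # fewer than T lists remain
        target = heap[0][0]                      # new top of the heap
        for _, i, pos in popped:
            # Jump to the first element >= target on each popped list.
            j = bisect_left(lists[i], target, pos)
            if j < len(lists[i]):
                heapq.heappush(heap, (lists[i][j], i, j))
    return result


# Hypothetical list layout (assumption, not from the slide figure).
lists = [[1, 3, 5, 10, 13], [10, 13, 15], [5, 7, 13], [13], [15]]
print(merge_skip(lists, 4))  # -> [13]
```

On this input the first round pops 1, 5, and 10, sees 13 on top of the heap, and jumps all three lists straight to 13, so ids 3 and 7 are never touched.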

Five Merge Algorithms
  • Previous: HeapMerger, MergeOpt
  • New: ScanCount, MergeSkip, DivideSkip

DivideSkip Algorithm
  • Long lists: binary search
  • Short lists: MergeSkip

How many lists are treated as long lists?
  • Long lists: lookup
  • Short lists: merge

Decide L value
  • A good balance in the tradeoff: # of long lists L = T / (μ log M + 1)
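Putting the last three slides together, a DivideSkip sketch could look like the following. Several points are assumptions rather than facts from the slides: M is taken to be the length of the longest list, the logarithm is base 2, μ = 0.01 is an arbitrary tuning value, L is rounded down and capped at T-1, and all function names are illustrative.

```python
import heapq
import math
from bisect import bisect_left


def num_long_lists(T: int, M: int, mu: float) -> int:
    """L = T / (mu * log M + 1), rounded down and capped at T - 1 (assumed)."""
    L = int(T / (mu * math.log2(max(M, 1)) + 1))
    return max(0, min(L, T - 1))     # keeps the short-list threshold >= 1


def merge_skip_counts(lists, T):
    """MergeSkip variant returning (id, count) pairs with count >= T."""
    heap = [(lst[0], i, 0) for i, lst in enumerate(lists) if lst]
    heapq.heapify(heap)
    found = []
    while heap:
        top = heap[0][0]
        popped = []
        while heap and heap[0][0] == top:
            popped.append(heapq.heappop(heap))
        if len(popped) >= T:
            found.append((top, len(popped)))
            for _, i, pos in popped:
                if pos + 1 < len(lists[i]):
                    heapq.heappush(heap, (lists[i][pos + 1], i, pos + 1))
            continue
        while heap and len(popped) < T - 1:
            popped.append(heapq.heappop(heap))
        if not heap:
            break
        target = heap[0][0]
        for _, i, pos in popped:
            j = bisect_left(lists[i], target, pos)   # jump to first id >= target
            if j < len(lists[i]):
                heapq.heappush(heap, (lists[i][j], i, j))
    return found


def divide_skip(lists, T, mu=0.01):
    """MergeSkip on the short lists, binary-search lookups on the L long lists."""
    by_length = sorted(lists, key=len, reverse=True)
    M = len(by_length[0]) if by_length else 1
    L = min(num_long_lists(T, M, mu), len(by_length))
    long_lists, short_lists = by_length[:L], by_length[L:]
    result = []
    for sid, count in merge_skip_counts(short_lists, T - L):
        for lst in long_lists:
            j = bisect_left(lst, sid)
            if j < len(lst) and lst[j] == sid:
                count += 1
        if count >= T:
            result.append(sid)
    return result


# Hypothetical list layout (assumption, not from the slides).
lists = [[1, 3, 5, 10, 13], [10, 13, 15], [5, 7, 13], [13], [15]]
print(num_long_lists(4, 5, 0.01), divide_skip(lists, 4))  # -> 3 [13]
```

A larger μ makes lookups look more expensive and therefore moves lists from the long group to the short group: with μ = 1.0 the same input gives L = 1 and the same result.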

Experimental data sets
  • DBLP data
  • IMDB data
  • Google Web corpus

Performance (DBLP)
  • DivideSkip is the best one

# of accessed elements (DBLP)
  • DivideSkip is the best one

Outline
  • Problem motivation
  • Preliminaries
  • Merge algorithms
  • Filtering techniques
      • Length, positional filters
      • Filter tree
  • Conclusion and future work

Length Filtering
  • Ed(s, t) ≤ 2
  • s: length 10
  • t: length 19
  • Pruned by length only: the lengths differ by 9 > 2
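The length test on this slide fits in one line. The function name `passes_length_filter` is an assumption; the check relies on the fact that each edit operation changes the length of a string by at most one.

```python
def passes_length_filter(s: str, t: str, k: int) -> bool:
    """ed(s, t) <= k is only possible if the lengths differ by at most k."""
    return abs(len(s) - len(t)) <= k


# Lengths 10 and 19 as on the slide; the contents are placeholders.
print(passes_length_filter("a" * 10, "b" * 19, k=2))  # -> False
```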

Positional Filtering
  • Ed(s, t) ≤ 2
  • s contains the positional gram (ab, 1)
  • t contains the positional gram (ab, 12)
  • The positions differ by 11 > 2, so these two grams cannot be matched
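A short sketch of the positional test, using 1-based positions as in the slide's (ab, 1) and (ab, 12). The names `positional_grams` and `can_match` are assumptions.

```python
def positional_grams(s: str, q: int = 2):
    """Return (gram, position) pairs with 1-based positions."""
    return [(s[i:i + q], i + 1) for i in range(len(s) - q + 1)]


def can_match(gram_a, gram_b, k: int) -> bool:
    """Two positional grams can correspond only if they are the same gram
    and their positions are at most k apart."""
    (text_a, pos_a), (text_b, pos_b) = gram_a, gram_b
    return text_a == text_b and abs(pos_a - pos_b) <= k


print(can_match(("ab", 1), ("ab", 12), k=2))  # -> False
```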

Filter tree
  • root
  • Length level: 1, 2, 3, …, n
  • Gram level: aa, ab, …, zy, zz
  • Position level: 1, 2, …, m
  • Inverted list: 5, 12, 17, 28, 44
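One way to picture the filter tree is as nested dictionaries keyed by length, then gram, then position, with an inverted list at each leaf. The sketch below is an assumption about the structure based on the level names on the slide; `build_filter_tree` and `lists_for_gram` are illustrative names, and the five strings are reused from the inverted-list example.

```python
from collections import defaultdict


def build_filter_tree(strings, q=2):
    """tree[length][gram][position] -> sorted list of string ids."""
    tree = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))
    for sid, s in enumerate(strings):
        for i in range(len(s) - q + 1):
            tree[len(s)][s[i:i + q]][i + 1].append(sid)   # 1-based positions
    return tree


def lists_for_gram(tree, gram, query_len, query_pos, k):
    """Collect the leaf lists that survive the length and positional filters."""
    out = []
    for length in range(query_len - k, query_len + k + 1):
        grams = tree.get(length)
        if grams is None or gram not in grams:
            continue
        for pos in range(query_pos - k, query_pos + k + 1):
            lst = grams[gram].get(pos)
            if lst:
                out.append(lst)
    return out


tree = build_filter_tree(["rich", "stick", "stich", "stuck", "static"])
# Gram "ic" sits at position 3 of the query "stick" (length 5), with k = 1.
print(lists_for_gram(tree, "ic", query_len=5, query_pos=3, k=1))
# -> [[0], [1, 2]]
```

Without filters the gram "ic" has the single list 0, 1, 2, 4; with both filters the lookup returns two shorter lists and drops id 4, which illustrates how filters shrink the total list size while increasing the number of lists to merge.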

Surprising experimental results (DBLP)
DivideSkip     Time (ms)
No filter      2.23
Length         0.76
Length+Pos     1.96
  • Why does adding the positional filter increase the running time?

Filters fragment inverted lists
  • Applying filters before the merge
      • Saving: reduce total lists size
      • Cost: (1) tree traversal, (2) more merging

Conclusion
  • Three new merge algorithms
      • We run faster
  • Interesting finding: Do not abuse filters!

Related work
  • Approximate string matching [Navarro 2001]
  • Fuzzy lookup in
  • Varied length Grams [Li et al 2007]

References
  1. [Arasu 2006] A. Arasu, V. Ganti, and R. Kaushik, “Efficient Exact Set-similarity Joins”, in VLDB 2006
  2. [Chaudhuri 2003] S. Chaudhuri, K. Ganjam, V. Ganti, and R. Motwani, “Robust and Efficient Fuzzy Match for Online Data Cleaning”, in SIGMOD 2003
  3. [Gravano 2001] L. Gravano, P. G. Ipeirotis, H. V. Jagadish, N. Koudas, S. Muthukrishnan, and D. Srivastava, “Approximate string joins in a database (almost) for free”, in VLDB 2001

References (continued)
  4. [Li 2007] C. Li, B. Wang, and X. Yang, “VGRAM: Improving performance of approximate queries on string collections using variable-length grams”, in VLDB 2007
  5. [Navarro 2001] G. Navarro, “A guided tour to approximate string matching”, in ACM Computing Surveys 2001
  6. [Sarawagi 2004] S. Sarawagi and A. Kirpal, “Efficient set joins on similarity predicates”, in ACM SIGMOD 2004