TinyLex: Static N-Gram Index Pruning with Perfect Recall
Derrick Coetzee, Microsoft Research
CC0 waiver: To the extent possible under law, I waive all copyright and related or neighboring rights to all content in this presentation.

Motivation
• Consider searching for a subsequence in a collection of genome sequences: …gcaagctttatagtgacaacaataaggtatcactcggtt…
• N-gram inverted indexes are the traditional solution, but have 10-100 times more terms than ordinary word-based inverted indexes
• TinyLex indexes achieve similar query performance with 7-17 times fewer terms
• TinyLex provides good worst-case query performance
TinyLex: Static N-Gram Index Pruning - Derrick Coetzee

Inverted indexes
1. Each wife had seven sacks,
2. Each sack had seven cats,
3. Each cat had seven kits.
4. Kits, cats, sacks, and wives.
each: {1, 2, 3}
had: {1, 2, 3}
seven: {1, 2, 3}
wife: {1, 4}
sack: {1, 2, 4}
cat: {2, 3, 4}
kit: {3, 4}

Inverted indexes
1. Each wife had seven sacks,
2. Each sack had seven cats,
3. Each cat had seven kits.
4. Kits, cats, sacks, and wives.
Query: sack and cat
sack: {1, 2, 4}
cat: {2, 3, 4}
{1, 2, 4} ∩ {2, 3, 4} = {2, 4}
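The lookup on this slide can be sketched in a few lines of Python. This is a minimal illustration, not code from the talk: the `STEMS` table is a hypothetical hand-written stand-in for a real stemmer, and the names `build_index` and `query_and` are made up for the example. Unlike the slide, the sketch also indexes the word "and", since it does no stop-word filtering.

```python
import re

DOCS = {
    1: "Each wife had seven sacks,",
    2: "Each sack had seven cats,",
    3: "Each cat had seven kits.",
    4: "Kits, cats, sacks, and wives.",
}

# Hypothetical stemming table (assumption): maps plurals to the slide's terms.
STEMS = {"sacks": "sack", "cats": "cat", "kits": "kit", "wives": "wife"}


def build_index(docs):
    """Map each normalized word to the set of documents containing it."""
    index = {}
    for doc_id, text in docs.items():
        for word in re.findall(r"[a-z]+", text.lower()):
            term = STEMS.get(word, word)
            index.setdefault(term, set()).add(doc_id)
    return index


def query_and(index, *terms):
    """Answer a conjunctive query by intersecting posting lists."""
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings) if postings else set()


index = build_index(DOCS)
print(sorted(index["sack"]))                    # [1, 2, 4]
print(sorted(index["cat"]))                     # [2, 3, 4]
print(sorted(query_and(index, "sack", "cat")))  # [2, 4]
```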

Limitations of inverted indexes
• Partial word or punctuation queries
  ◦ Searching a dictionary for all words ending in “ment”
  ◦ Searching for <b> in HTML files
  ◦ Searching for "%s" in C source files
  ◦ Searching for x^2/2 in LaTeX source files
• East Asian language text
  ◦ No spaces, word extraction is complex
• Phrase searching

Limitations of inverted indexes
Genome sequences:
1. gcaagctttatagtgacaac...
2. aataaggtatcactcggtta...
3. caattacccccacttcccct...
4. cattataaagaaatgatcaa...
Example query: Documents containing subsequence “cact”

N-gram inverted indexes
Simplified example: Two-letter alphabet
1. babbbbabab
2. aababaaabb
3. babaab
4. bbbbaabbbb
aaa: {2}
aab: {2, 3, 4}
aba: {1, 2, 3}
abb: {1, 2, 4}
baa: {2, 3, 4}
bab: {1, 2, 3}
bba: {1, 4}
bbb: {1, 4}

N-gram inverted indexes
Query: aaba → aab and aba
aab: {2, 3, 4}
aba: {1, 2, 3}
{2, 3, 4} ∩ {1, 2, 3} = {2, 3}
1. babbbbabab
2. aababaaabb
3. babaab (false positive)
4. bbbbaabbbb
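The query above can be reproduced with a short Python sketch. It builds a classical 3-gram index over the four example documents, intersects the posting lists of the query's 3-grams, and then reads each candidate to weed out false positives. The function names (`ngrams`, `build_index`, `candidates`, `verify`) are illustrative choices, not names from the talk.

```python
DOCS = {1: "babbbbabab", 2: "aababaaabb", 3: "babaab", 4: "bbbbaabbbb"}
N = 3  # fixed n-gram length of the classical index


def ngrams(text, n):
    """Return the set of all length-n substrings of text."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def build_index(docs, n):
    """Map every n-gram to the set of document ids containing it."""
    index = {}
    for doc_id, text in docs.items():
        for gram in ngrams(text, n):
            index.setdefault(gram, set()).add(doc_id)
    return index


def candidates(index, query, n):
    """Intersect the posting lists of all n-grams of the query."""
    grams = ngrams(query, n)
    if not grams:
        raise ValueError("query is shorter than the n-gram length")
    return set.intersection(*(index.get(g, set()) for g in grams))


def verify(docs, query, candidate_ids):
    """Read each candidate document and keep only the true matches."""
    return {d for d in candidate_ids if query in docs[d]}


index = build_index(DOCS, N)
cands = candidates(index, "aaba", N)
hits = verify(DOCS, "aaba", cands)
print(sorted(cands))         # [2, 3]
print(sorted(hits))          # [2]
print(sorted(cands - hits))  # [3]  (document 3 is the false positive)
```

The verification step is what makes false positives expensive: every candidate has to be read in full, whether or not it matches.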

Selecting n-gram length
1. babbbbabab
2. aababaaabb
3. babaab
4. bbbbaabbbb
length = 1
a: {1, 2, 3, 4}
b: {1, 2, 3, 4}
Small number of terms
Slow queries
• Long posting lists
• Too many false positives

Selecting n-gram length
1. babbbbabab
2. aababaaabb
3. babaab
4. bbbbaabbbb
length = 6
aababa: {2}  aabbbb: {4}  abaaab: {2}  ababaa: {2, 3}  ababab: {3}  abbbba: {1}
baaabb: {2}  baabbb: {4}  babaaa: {2}  babaab: {3}  bababa: {3}  babbbb: {1}
bbaabb: {4}  bbabab: {1}  bbbaab: {4}  bbbaba: {1}  bbbbaa: {4}  bbbbab: {1}
Fast queries
Too many terms
Queries must be ≥ 6 characters
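The trade-off between term count and posting list length can be checked directly on the example collection. The sketch below counts the distinct n-grams and the mean posting list length for a few fixed lengths. It uses the four documents as written on the slides and stops at length 4 for brevity; `build_index` and `summarize` are illustrative names.

```python
DOCS = {1: "babbbbabab", 2: "aababaaabb", 3: "babaab", 4: "bbbbaabbbb"}


def build_index(docs, n):
    """Map every length-n substring to the documents containing it."""
    index = {}
    for doc_id, text in docs.items():
        for i in range(len(text) - n + 1):
            index.setdefault(text[i:i + n], set()).add(doc_id)
    return index


def summarize(docs, n):
    """Return (number of terms, mean posting list length) for length n."""
    index = build_index(docs, n)
    total_postings = sum(len(postings) for postings in index.values())
    return len(index), total_postings / len(index)


for n in (1, 2, 3, 4):
    terms, mean_len = summarize(DOCS, n)
    print(f"n={n}: {terms} terms, mean posting list length {mean_len:.2f}")
# Term counts for n = 1, 2, 3, 4 are 2, 4, 8 and 14.
```

As the length grows, the number of terms rises while each posting list gets shorter, which is the tension the two slides describe.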

Overview
• Review of inverted n-gram indexes
• Example TinyLex index
• TinyLex index construction
• Results
• Disadvantages
• Questions

TinyLex
• Goal: fewer terms without sacrificing query performance
• Consider the n-grams “juggl” and “uggle”
  ◦ Almost exactly the same posting list in a typical English language collection
  ◦ Just put the n-gram “uggl” in the index, and leave out “juggl” and “uggle”
juggl: {2, 7, 33}
uggle: {2, 7, 33}
uggl: {2, 7, 33}

TinyLex
• Insight: The more false positives a term produces when it is queried for, the more information it adds when it is added to the index.
• Choose a false positive threshold t and choose the smallest possible set of index terms that satisfies it.
• Allow variable-length n-grams.

TinyLex: Example
1. babbbbabab
2. aababaaabb
3. babaab
4. bbbbaabbbb
aa: {2, 3, 4}
bb: {1, 2, 4}
aaa: {2}
aba: {1, 2, 3}
bab: {1, 2, 3}
bba: {1, 4}
bbb: {1, 4}
aaba: {2}
baab: {3, 4}
babb: {1}
In this example t = 1. At most 1 false positive is allowed for any query. Only 10 terms!

TinyLex: Example
1. babbbbabab
2. aababaaabb
3. babaab
4. bbbbaabbbb
Query: abaab → aba and baab
aba: {1, 2, 3}
baab: {3, 4}
{1, 2, 3} ∩ {3, 4} = {3}
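A query against a variable-length index can be sketched as follows. The index is copied from the ten-term example for t = 1. One assumption to flag: the sketch intersects the posting lists of every index term that occurs inside the query, whereas the slide uses only aba and baab. For this query the extra term aa does not change the result.

```python
# Index copied from the ten-term example (threshold t = 1).
TINYLEX_INDEX = {
    "aa": {2, 3, 4}, "bb": {1, 2, 4},
    "aaa": {2}, "aba": {1, 2, 3}, "bab": {1, 2, 3},
    "bba": {1, 4}, "bbb": {1, 4},
    "aaba": {2}, "baab": {3, 4}, "babb": {1},
}
ALL_DOCS = {1, 2, 3, 4}


def index_terms_in(query, index):
    """Return every index term that occurs as a substring of the query."""
    return sorted(term for term in index if term in query)


def candidates(query, index, all_docs):
    """Intersect the posting lists of all index terms inside the query."""
    result = set(all_docs)
    for term in index_terms_in(query, index):
        result &= index[term]
    return result


print(index_terms_in("abaab", TINYLEX_INDEX))        # ['aa', 'aba', 'baab']
print(candidates("abaab", TINYLEX_INDEX, ALL_DOCS))  # {3}
```

Because terms have different lengths, the query does not need a minimum length: a query containing no index term simply yields every document as a candidate.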

TinyLex: Nonoccurring terms
• The construction guarantees that if the query term occurs in the collection, it will have at most t - 1 false positives (zero in this case).
• If we observe t false positives, we can halt immediately.

TinyLex: Nonoccurring terms
Query: bbbbb → bbb and bbb
bbb: {1, 4}
{1, 4} ∩ {1, 4} = {1, 4}
1. babbbbabab (false positive)
A false positive can’t happen unless the query result is empty. Halt.
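The early-halt rule can be sketched as below. The sketch assumes the index really does satisfy the threshold guarantee for every occurring term, reuses the ten-term example index, and takes candidates from all index terms contained in the query (an assumption; the slide may select terms differently). `search` is an illustrative name.

```python
DOCS = {1: "babbbbabab", 2: "aababaaabb", 3: "babaab", 4: "bbbbaabbbb"}
TINYLEX_INDEX = {
    "aa": {2, 3, 4}, "bb": {1, 2, 4},
    "aaa": {2}, "aba": {1, 2, 3}, "bab": {1, 2, 3},
    "bba": {1, 4}, "bbb": {1, 4},
    "aaba": {2}, "baab": {3, 4}, "babb": {1},
}
T = 1  # false positive threshold used to build the index


def candidates(query, index, all_docs):
    """Intersect the posting lists of all index terms inside the query."""
    result = set(all_docs)
    for term, postings in index.items():
        if term in query:
            result &= postings
    return result


def search(query, docs, index, t):
    """Return (matching document ids, number of documents read)."""
    matches, false_positives, docs_read = set(), 0, 0
    for doc_id in sorted(candidates(query, index, set(docs))):
        docs_read += 1
        if query in docs[doc_id]:
            matches.add(doc_id)
        else:
            false_positives += 1
            if false_positives >= t:
                # An occurring term yields at most t - 1 false positives,
                # so the query term cannot occur anywhere: stop reading.
                return set(), docs_read
    return matches, docs_read


print(search("bbbbb", DOCS, TINYLEX_INDEX, T))  # (set(), 1)
print(search("abaab", DOCS, TINYLEX_INDEX, T))  # ({3}, 1)
```

For bbbbb the candidates are documents 1 and 4, but the search stops after reading document 1, so document 4 is never fetched.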

TinyLex: Benefits
• Similar query performance to classical n-gram indexes that have a much larger number of terms
• Worst-case bound on number of false positives
• Query can be any length


Constructing a TinyLex index
• The problem:
  ◦ Input: a set of documents, a threshold t
  ◦ Output: a list of terms such that any query for a term occurring in the collection will have at most t - 1 false positives

Constructing a TinyLex index
• Basic construction: for each n-gram length from 1 to max:
  ◦ Make a list of all n-grams in the collection and what documents they occur in.
  ◦ Perform a query on each term using the partially constructed index.
  ◦ If a term has too many false positives, add it to the index.
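The basic construction can be sketched in Python as follows. This is a naive, unoptimized reading of the three steps, not the talk's implementation. It assumes that "too many" means at least t false positives and that a query intersects the posting lists of every index term contained in the queried n-gram. It is run here for lengths 1 to 3 on the example collection, and `build_tinylex` is an illustrative name.

```python
DOCS = {1: "babbbbabab", 2: "aababaaabb", 3: "babaab", 4: "bbbbaabbbb"}


def ngram_postings(docs, n):
    """Map each n-gram in the collection to the documents containing it."""
    actual = {}
    for doc_id, text in docs.items():
        for i in range(len(text) - n + 1):
            actual.setdefault(text[i:i + n], set()).add(doc_id)
    return actual


def candidates(query, index, all_docs):
    """Query the partial index: intersect all index terms inside the query."""
    result = set(all_docs)
    for term, postings in index.items():
        if term in query:
            result &= postings
    return result


def build_tinylex(docs, t, max_len):
    """Add an n-gram whenever the partial index gives >= t false positives."""
    index = {}
    for n in range(1, max_len + 1):
        for gram, actual in sorted(ngram_postings(docs, n).items()):
            query_result = candidates(gram, index, set(docs))
            if len(query_result) - len(actual) >= t:
                index[gram] = actual
    return index


index = build_tinylex(DOCS, t=1, max_len=3)
print(sorted(index, key=lambda g: (len(g), g)))
# ['aa', 'bb', 'aaa', 'aba', 'bab', 'bba', 'bbb']
```

Terms added during one round cannot affect other n-grams of the same length, since two distinct strings of equal length are never substrings of each other, so the order within a round does not matter.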

Construction: Example
1. babbbbabab
2. aababaaabb
3. babaab
4. bbbbaabbbb
t = 1: if the difference between the query result size and the actual posting list size is at least 1, add the term to the index.
Index so far: (index empty)
1-grams   Query result    Actual
a         {1, 2, 3, 4}
b         {1, 2, 3, 4}

Construction: Example
1. babbbbabab
2. aababaaabb
3. babaab
4. bbbbaabbbb
Index so far: (index empty)
2-grams   Query result    Actual
aa        {1, 2, 3, 4}    {2, 3, 4}
ab        {1, 2, 3, 4}
ba        {1, 2, 3, 4}
bb        {1, 2, 3, 4}    {1, 2, 4}
Added to the index:
aa: {2, 3, 4}
bb: {1, 2, 4}

Construction: Example
1. babbbbabab
2. aababaaabb
3. babaab
4. bbbbaabbbb
Index so far:
aa: {2, 3, 4}
bb: {1, 2, 4}
3-grams   Query result    Actual
aaa       {2, 3, 4}       {2}
aab       {2, 3, 4}
aba       {1, 2, 3, 4}    {1, 2, 3}
abb       {1, 2, 4}
baa       {2, 3, 4}
bab       {1, 2, 3, 4}    {1, 2, 3}
bba       {1, 2, 4}       {1, 4}
bbb       {1, 2, 4}       {1, 4}
Added to the index:
aaa: {2}
aba: {1, 2, 3}
bab: {1, 2, 3}
bba: {1, 4}
bbb: {1, 4}

Construction: Example
1. babbbbabab
2. aababaaabb
3. babaab
4. bbbbaabbbb
Index so far:
aa: {2, 3, 4}
bb: {1, 2, 4}
aaa: {2}
aba: {1, 2, 3}
bab: {1, 2, 3}
bba: {1, 4}
bbb: {1, 4}
4-grams   Query result    Actual
aaab      {2}
aaba      {2, 3}          {2}
aabb      {2, 4}
abaa      {2, 3}
abab      {1, 2, 3}
abbb      {1, 4}
baaa      {2}
baab      {2, 3, 4}       {3, 4}
baba      {1, 2, 3}
babb      {1, 2}          {1}
bbaa      {4}
bbab      {1}
bbba      {1, 4}
bbbb      {1, 4}
Added to the index:
aaba: {2}
baab: {3, 4}
babb: {1}


Results
• Test set: 100 MB TREC WSJ collection
• 37000 documents, English text
• Same query performance with 7-17 times fewer terms
[Figure: mean query time (ms) versus number of terms (1E+3 to 1E+6) for the TinyLex index and the classical n-gram index]

Results
• Overall compressed index size 2-20% less
• TinyLex index has more information per term
[Figure: mean query time (ms) versus index size (MB, 0 to 100) for the TinyLex index and the classical n-gram index]

Results
• Dramatic 50x improvement in worst-case query performance for long queries
[Figure: worst query time (ms) versus query length in characters (0 to 90) for 6-grams and a TinyLex index of the same size]

See paper for
• Applications to phrase searching using variable-length word n-grams
• Making the construction more efficient
• Performance on genome sequences
• Empirical evaluation of scaling

Related work
• Suffix arrays (Manber and Myers 1991)
• agrep and GLIMPSE (Wu and Manber 1994)
  ◦ Faster queries, but indexes 3-10 times larger
  ◦ More general queries, but relies on a word concept
• n-Gram/2L (Kim et al. 2005)
  ◦ Orthogonal; examines fewer document offsets
• “Growing an n-gram language model” (Siivola and Pellom 2005)
  ◦ Similar idea applied to language modeling

Future work
• Faster construction time
  ◦ Currently about 10 times slower to construct than a classical n-gram index.
• Queries for nonoccurring terms are more expensive than with classical n-gram indexes (t documents must be read).
• Generalize to dynamic collections

Conclusions
• N-gram indexes enable practical queries for subsequences
• TinyLex indexes achieve similar query performance to classical n-gram indexes with 7-17 times fewer terms
• TinyLex yields good worst-case query performance by placing an upper bound on the number of false positives

Questions?