Using Fingerprints in n-Gram Indices
Stefan Selbach
selbach@informatik.uni-wuerzburg.de
Digital Libraries: Advanced Methods and Technologies, Digital Collections
17.09.2009
Overview
• Introduction
  – Inverted Index
  – N-Gram Index
  – Bitmaps
  – Signature Files
• n-Gram Fingerprints
  – in Combination with Posting Lists
• Fingerprint Compression
• Conclusion and Future Work
Thursday, September 17, 2009
INTRODUCTION
Inverted Index
• Very common index structure
• Term-oriented
• Every term is linked to its postings
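A minimal sketch of the structure described on this slide, assuming whitespace tokenization and postings stored as (document ID, word position) pairs. The function name `build_inverted_index` and the sample documents are illustrative and not taken from the talk.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map every term to the list of its postings (doc_id, word position)."""
    index = defaultdict(list)
    for doc_id, text in enumerate(docs):
        for pos, term in enumerate(text.lower().split()):
            index[term].append((doc_id, pos))
    return dict(index)

# Hypothetical sample documents
docs = ["skin disease of the skin", "rare skin condition"]
index = build_inverted_index(docs)
skin_postings = index["skin"]
```

Looking up a term is a single dictionary access; the postings come back in document order because the documents are scanned in order.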
n-Gram Index
• Uses n-grams as indexing terms
• Any kind of subsequence can be searched
• An n-gram is a subsequence of a text with length n
• Postings for longer subsequences can be calculated from the postings of the n-grams they contain
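One common way to calculate postings for a longer subsequence is to intersect the postings of its overlapping n-grams after shifting each list back by the n-gram's position in the query. The sketch below assumes character n-grams and (document ID, offset) postings; `build_ngram_index` and `postings` are hypothetical names, and the slide's own formula may be written differently.

```python
from collections import defaultdict

def build_ngram_index(docs, n=2):
    """Map every character n-gram to the set of its (doc_id, offset) postings."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for off in range(len(text) - n + 1):
            index[text[off:off + n]].add((doc_id, off))
    return index

def postings(index, query, n=2):
    """Postings of a query with len(query) >= n.

    The k-th n-gram of the query must occur k characters after the start
    of the match, so its postings are shifted back by k before intersecting.
    """
    if len(query) < n:
        raise ValueError("query must contain at least one n-gram")
    result = None
    for k in range(len(query) - n + 1):
        gram = query[k:k + n]
        shifted = {(d, off - k) for d, off in index.get(gram, ())}
        result = shifted if result is None else result & shifted
    return sorted(result)

# Hypothetical sample documents
docs = ["oxyuria", "rhinology and oxyuriasis"]
index = build_ngram_index(docs)
hits = postings(index, "oxyuria")
```

Because consecutive n-grams overlap, a position that survives every intersection is a true occurrence of the whole query.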
n-Gram Index
• Index structure is very similar to an inverted index
• Searching is more complex
Bitmaps
• Bitmaps are occurrence maps
• Each bit signals an occurrence of a specific term in a specific document
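As a small illustration, a term's bitmap can be held in one integer whose bit d is set when the term occurs in document d. This is only a sketch under that assumption; `build_bitmaps` and `docs_with_all` are hypothetical helper names.

```python
def build_bitmaps(docs):
    """One integer per term; bit d is set if the term occurs in document d."""
    bitmaps = {}
    for doc_id, text in enumerate(docs):
        for term in set(text.lower().split()):
            bitmaps[term] = bitmaps.get(term, 0) | (1 << doc_id)
    return bitmaps

def docs_with_all(bitmaps, terms, num_docs):
    """AND the bitmaps of all terms and list the documents whose bit survives."""
    combined = (1 << num_docs) - 1
    for term in terms:
        combined &= bitmaps.get(term, 0)
    return [d for d in range(num_docs) if (combined >> d) & 1]

# Hypothetical sample documents
docs = ["skin disease", "rare skin condition", "rare disease"]
bitmaps = build_bitmaps(docs)
both = docs_with_all(bitmaps, ["rare", "skin"], len(docs))
```

A conjunctive query then costs one bitwise AND per term, independent of how many postings each term has.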
Signature Files
N-GRAM FINGERPRINT
N-Gram Fingerprint
The idea: create fingerprints that
• Have a fixed size
• Contain information about the postings
N-Gram Fingerprint
A 2D fingerprint is a bit matrix
N-Gram Fingerprint
• Given two 1-grams w1 and w2 and their fingerprints B_w1 and B_w2, the fingerprint B_w1w2 can be approximated from B_w1 and B'_w2
• B'_w2 is constructed by cyclically shifting each column of B_w2 by one position to the left
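The following sketch shows one way the approximation could work. It rests on several assumptions that are not confirmed by the slide text: the two matrix axes are residue classes of the file ID and of the offset, the approximation is the bitwise AND of B_w1 with the shifted B'_w2, and the cyclic shift runs along the offset axis. The sizes `F` and `O` and all function names are hypothetical.

```python
F, O = 4, 8  # hypothetical numbers of residue classes (file ID, offset)

def fingerprint(postings):
    """Bit matrix as F integer rows; bit j of row i is set if some posting
    has file_id % F == i and offset % O == j."""
    rows = [0] * F
    for file_id, offset in postings:
        rows[file_id % F] |= 1 << (offset % O)
    return rows

def shift(rows, k=1):
    """Cyclically move bit j+k to bit j along the offset axis."""
    mask = (1 << O) - 1
    k %= O
    return [((r >> k) | (r << (O - k))) & mask for r in rows]

def approximate(b_w1, b_w2):
    """Assumed approximation: B_w1 AND B'_w2. It can contain false positives."""
    return [a & b for a, b in zip(b_w1, shift(b_w2))]

def char_postings(docs, ch):
    """(doc_id, offset) of every occurrence of the 1-gram ch."""
    return [(d, i) for d, text in enumerate(docs)
            for i, c in enumerate(text) if c == ch]

# Hypothetical sample documents
docs = ["abcab", "bab"]
b_a = fingerprint(char_postings(docs, "a"))
b_b = fingerprint(char_postings(docs, "b"))
approx_ab = approximate(b_a, b_b)
true_ab = fingerprint([(0, 0), (0, 3), (1, 1)])  # actual occurrences of "ab"
```

Under these assumptions every true occurrence of w1w2 sets a bit in the approximation, so the result can only err by containing extra bits. That is why candidates found this way still need a verification step.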
N-Gram Fingerprint
N-Gram Fingerprint: Search Speed
Query      Bitmatrix  Time for verification  Hits
rhinolo    219 ms     94 ms                  18
sanfilipo  290 ms     -                      0
itracon    266 ms     336 ms                 64
oxyuria    197 ms     48 ms                  6
Results from the "Online Encyclopedia of Dermatology" by P. Altmeyer
N-GRAM FINGERPRINTS IN COMBINATION WITH POSTING LISTS
Combining Fingerprints and Posting Lists
By combining fingerprints and posting lists:
• No verification step is needed
• Posting lists are partitioned into smaller subsets; each bit of the fingerprint corresponds to a separate posting list
• The cost of intersecting posting lists is reduced
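A sketch of the partitioning idea, assuming each fingerprint bit is addressed by the pair (file ID residue, offset residue) and holds the postings that fall into that cell. The sizes `F` and `O` and the names `partition` and `search_bigram` are hypothetical.

```python
from collections import defaultdict

F, O = 4, 8  # hypothetical numbers of residue classes (file ID, offset)

def partition(postings):
    """Split a posting list into one subset per fingerprint cell.
    A non-empty cell corresponds to a set bit in the fingerprint."""
    cells = defaultdict(set)
    for file_id, offset in postings:
        cells[(file_id % F, offset % O)].add((file_id, offset))
    return cells

def search_bigram(cells_w1, cells_w2):
    """Exact postings of w1w2: only matching cells are intersected."""
    hits = set()
    for (file_res, off_res), subset1 in cells_w1.items():
        # w2 must start one position after w1, i.e. in the next offset class
        subset2 = cells_w2.get((file_res, (off_res + 1) % O))
        if subset2:
            hits |= {(f, o) for f, o in subset1 if (f, o + 1) in subset2}
    return sorted(hits)

def char_postings(docs, ch):
    """(doc_id, offset) of every occurrence of the 1-gram ch."""
    return [(d, i) for d, text in enumerate(docs)
            for i, c in enumerate(text) if c == ch]

# Hypothetical sample documents
docs = ["abcab", "bab"]
cells_a = partition(char_postings(docs, "a"))
cells_b = partition(char_postings(docs, "b"))
hits_ab = search_bigram(cells_a, cells_b)
```

Cells whose counterpart is empty are skipped without touching any postings, and the surviving intersections operate on small subsets, so the result is exact and no separate verification pass is required.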
Combining Fingerprints and Posting Lists
Managing n-Gram Posting Lists
• A very large number of posting subsets has to be managed. For example:
  – 1024 residue classes for the file ID
  – 128 residue classes for the offset
  – 14,000 different n-grams
• Subsets are stored in a hash
• The hash value is a function of the residue classes
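With the example figures above, up to 14,000 × 1024 × 128 = 1,835,008,000 subsets can exist, which motivates a hash rather than a dense array. The talk does not state the hash function, so the sketch below uses a hypothetical mixed-radix key reduced modulo the table size. `subset_key`, `slot`, and the numeric n-gram ID are all assumptions.

```python
FILE_CLASSES = 1024     # residue classes for the file ID (from the slide)
OFFSET_CLASSES = 128    # residue classes for the offset (from the slide)
NUM_NGRAMS = 14_000     # different n-grams (from the slide)

def subset_key(ngram_id, file_id, offset):
    """Hypothetical unique key for the subset of one n-gram and one cell."""
    file_res = file_id % FILE_CLASSES
    off_res = offset % OFFSET_CLASSES
    return (ngram_id * FILE_CLASSES + file_res) * OFFSET_CLASSES + off_res

def slot(key, table_size):
    """Hypothetical hash: reduce the key to a slot of a smaller table."""
    return key % table_size

total_subsets = NUM_NGRAMS * FILE_CLASSES * OFFSET_CLASSES
example_key = subset_key(1, 1025, 129)
```

Two different keys can land in the same slot whenever the table is smaller than the key space, so any scheme of this kind needs some form of collision resolution.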
Managing n-Gram Posting Lists
Managing n-Gram Posting Lists
[Figure: hash collisions and collision resolving. Y axis: frequency (0 to 40000); x axis: number of ... (0 to 140); series: "... collisions" and "... comparisons after sorting".]
Results
• Performance improved by 40% compared to the setup without posting lists
Query      Bitmatrix  Time for verification  Hits
rhinolo    230 ms     10 ms                  18
sanfilipo  271 ms     -                      0
itracon    245 ms     15 ms                  64
oxyuria    210 ms     12 ms                  6
FINGERPRINT COMPRESSION
Fingerprint Compression
• Fingerprints with high or low densities do not contain much information
• Fingerprints can be compressed by reducing the resolution
• Dictionary-based compression
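One possible reading of "reducing the resolution" is to fold a fingerprint onto half as many offset residue classes by OR-ing the two halves, and to do so only when its density is very low or very high. The sketch below illustrates that reading. The folding rule, the default thresholds, and the function names are assumptions, and the talk's actual convolution may differ.

```python
def density(rows, width):
    """Fraction of set bits in a fingerprint of len(rows) x width bits."""
    return sum(bin(r).count("1") for r in rows) / (len(rows) * width)

def halve_resolution(rows, width):
    """Fold each row onto width // 2 bits (width must be even).
    Bit j of the result is set if bit j or bit j + width // 2 was set,
    which matches taking offsets modulo width // 2."""
    half = width // 2
    mask = (1 << half) - 1
    return [(r & mask) | (r >> half) for r in rows]

def compress(rows, width, low=0.025, high=0.975):
    """Fold only fingerprints whose density lies outside (low, high)."""
    d = density(rows, width)
    if d <= low or d >= high:
        return halve_resolution(rows, width), width // 2
    return rows, width

sparse = [0b00000001, 0, 0, 0]            # hypothetical 4 x 8 fingerprint
folded, new_width = compress(sparse, 8, low=0.05)
```

A folded fingerprint still marks every cell that was marked before, so it can add false positives but never hide a true one. Widening the thresholds trades more performance for a smaller index.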
Fingerprint Compression
• Results: fingerprint convolution
Density threshold for convolution  Performance loss  Fingerprint index reduction
no convolution                     0 %               0 %
0-0.025 and 0.975-1                3.1 %             23 %
0-0.05 and 0.95-1                  3.2 %             27 %
0-0.1 and 0.9-1                    10 %              29 %
0-0.2 and 0.8-1                    25 %              31 %
• In combination with the dictionary-based compression, the index size is reduced by an additional 30%
CONCLUSION AND FUTURE WORK
Conclusion
• Fingerprints improve the scalability of n-gram indices
• Fingerprints improve the performance of n-gram indices
• The index structure can be adjusted to user behavior, so that common queries can be processed more efficiently
• The fingerprints can be stored in a compressed index while losing only a minimum of performance
Future Work
• Combination of a term-based inverted index and an n-gram fingerprint index
• Profit from the advantages of using both terms and n-grams as indexing terms
  – Substring search
  – Ranking
  – Thesaurus information
Thank You!