Speaker: Alexander Behm Space-Constrained Gram-Based Indexing for Efficient Approximate String Search
Alexander Behm (1), Shengyue Ji (1), Chen Li (1), Jiaheng Lu (2)
(1) University of California, Irvine
(2) Renmin University of China
Speaker: Alexander Behm Motivation: Data Cleaning
Should clearly be “Niels Bohr”
Source: http://en.wikipedia.org/wiki/Heisenberg's_microscope, Jan 2008
Space-Constrained Gram-Based Indexing for Efficient Approximate String Search, ICDE 2009, Shanghai
Speaker: Alexander Behm Motivation: Record Linkage
First table (columns shown: Phone, Age, Name), Name values: Brad Pitt, Arnold Schwarzeneger, George Bush, Angelina Jolie, Forrest Whittaker
Second table (columns shown: Name, Hobbies, Address), Name values: Brad Pitt, Forest Whittacker, George Bush, Angelina Jolie, Arnold Schwarzenegger
No exact match!
Speaker: Alexander Behm Motivation: Query Relaxation
Actual queries gathered by Google: http://www.google.com/jobs/britney.html
Speaker: Alexander Behm What is Approximate String Search?
Query against collection: Find entries similar to “Arnold Schwarseneger”
String Collection: Brad Pitt, Forest Whittacker, George Bush, Angelina Jolie, Arnold Schwarzenegger, …
What do we mean by “similar to”?
- Edit Distance
- Jaccard Similarity
- Cosine Similarity
- Dice
- Etc.
How can we support these types of queries efficiently?
Speaker: Alexander Behm Approximate Query Answering
Sliding window over “irvine”: 2-grams {ir, rv, vi, in, ne}
Intuition: Similar strings share a certain number of grams
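A minimal sketch of the sliding-window step in Python. The function name `qgrams` is an assumption for illustration and does not come from the slides.

```python
def qgrams(s: str, q: int) -> list[str]:
    """Return the q-grams of s, obtained by sliding a window of width q."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


print(qgrams("irvine", 2))  # ['ir', 'rv', 'vi', 'in', 'ne']
```

A string of length n yields n - q + 1 grams, so “irvine” with q = 2 yields 5.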
Speaker: Alexander Behm Approximate Query Example
Query: “irvine”, Edit Distance 1
2-grams {ir, rv, vi, in, ne}
Look up the query grams in the inverted lists (2-gram → StringIDs)
[Figure: inverted lists of StringIDs for the 2-grams …, in, tf, vi, ir, ef, rv, ne, un, …; IDs as extracted: 1 3 4 5 7 9 5 9 1 5 1 2 3 9 7 9 5 6 9 1 2 4 5 6]
Count >= 3
Candidates = {1, 5, 9}
May have false positives
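The lookup-and-count step can be sketched as follows. The string collection, the helper names, and the use of gram sets (rather than multisets) are assumptions for illustration; the slide's own inverted lists are not reproduced here.

```python
from collections import Counter, defaultdict


def qgrams(s, q):
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def build_index(strings, q):
    """Map each q-gram to the ascending list of IDs of strings containing it."""
    index = defaultdict(list)
    for sid, s in enumerate(strings):
        for g in set(qgrams(s, q)):
            index[g].append(sid)
    return index


def candidates(index, query, q, t):
    """IDs that appear on at least t of the query grams' inverted lists."""
    counts = Counter()
    for g in set(qgrams(query, q)):
        counts.update(index.get(g, []))
    return sorted(sid for sid, c in counts.items() if c >= t)


strings = ["irvine", "irving", "marvin", "shanghai"]  # hypothetical collection
index = build_index(strings, q=2)
print(candidates(index, "irvine", q=2, t=3))  # [0, 1, 2]
```

Here “marvin” (ID 2) shares three grams with the query but is more than one edit away, so it is a false positive that a final edit-distance check would remove.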
Speaker: Alexander Behm T-Occurrence Problem
Merge the inverted lists, each sorted in ascending order
Find elements whose occurrences ≥ T
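One simple way to solve the T-occurrence problem is a heap-based merge of the sorted lists. This is only a sketch using the Python standard library, not necessarily the merging algorithm used in the talk; the lists below are illustrative.

```python
import heapq
from itertools import groupby


def t_occurrence(lists, t):
    """Return IDs that occur on at least t of the ascending-sorted lists."""
    merged = heapq.merge(*lists)  # one ascending stream over all lists
    return [sid for sid, grp in groupby(merged) if sum(1 for _ in grp) >= t]


lists = [[1, 3, 4, 5, 7, 9], [5, 9], [1, 5], [1, 2, 3, 9]]  # illustrative lists
print(t_occurrence(lists, 3))  # [1, 5, 9]
```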
Speaker: Alexander Behm Motivation: Compression
Inverted Index >> Source Data
Fit in memory? Space Budget?
Speaker: Alexander Behm Motivation: Related Work
- IR: lossless compression of inverted lists (disk-based)
- Delta representation + compact encoding
- Inverted lists in memory: decompression overhead
- Tune compression ratio?
- Overcome these limitations in our setting?
Speaker: Alexander Behm Main Contributions
Two lossy compression techniques
- Answer queries exactly
- Index fits into a space budget
- Queries faster on the compressed indexes
- Flexibility to choose space / time tradeoff
- Existing list-merging algorithms: re-use + compression-specific optimizations
Speaker: Alexander Behm Overview
- Motivation & Preliminaries
→ Approach 1: Discarding Lists
- Approach 2: Combining Lists
- Experiments & Conclusion
Speaker: Alexander Behm Approach 1: Discarding Lists
[Figure: inverted lists of StringIDs for the 2-grams …, in, tf, vi, ir, ef, rv, ne, un, …; IDs as extracted: 1 3 4 5 7 9 5 9 1 5 1 2 3 9 7 9 5 6 9 1 2 4 5 6]
Lists discarded, “Holes”
Speaker: Alexander Behm Effects on Queries
- Decrease lower bound T on common grams
- Smaller T → more false positives
- T <= 0 → “panic”, scan entire string collection
- Surprise: Fewer lists → Faster Queries (depends)
Speaker: Alexander Behm Query “shanghai”, Edit Distance 1
3-grams {sha, han, ang, ngh, gha, hai}
[Figure: 3-gram index with the grams uni, ing, sha, han, ang, ngh, gha, hai, ter, …, each marked as a hole gram or a regular gram]
Basis: Edit operations “destroy” q = 3 grams
No holes: T = #grams - ed * q = 6 - 1 * 3 = 3
With holes: T’ = T - #holes = 0 → Panic!
Really destroy q = 3 grams per edit operation?
Dynamic Programming for tighter T
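The simple bound on this slide can be worked through in a few lines. The function names are assumptions, and the hole count of 3 is inferred from T = 3 and T’ = 0 rather than stated on the slide. The tighter dynamic-programming bound is not sketched here.

```python
def lower_bound(num_grams: int, ed: int, q: int, num_holes: int = 0) -> int:
    """T = #grams - ed * q, reduced by the number of query grams that are holes."""
    return num_grams - ed * q - num_holes


def must_panic(t: int) -> bool:
    """With T <= 0 the count filter prunes nothing, so the collection is scanned."""
    return t <= 0


t = lower_bound(6, ed=1, q=3)                     # "shanghai" has 6 3-grams
t_holes = lower_bound(6, ed=1, q=3, num_holes=3)  # assumed: 3 of them are holes
print(t, must_panic(t))              # 3 False
print(t_holes, must_panic(t_holes))  # 0 True
```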
Speaker: Alexander Behm Choosing Lists to Discard
Effect on Query: Unaffected, Panic, Slower or Faster
- Good choice depends on query workload
- Space budget: Many combinations of grams
- Make a “reasonable” choice efficiently?
Speaker: Alexander Behm Choosing Lists to Discard
INPUT: Space Budget, Inverted lists, Workload
OUTPUT: Lists to discard
ALGORITHM: Greedy & Cost-Based
- Choose one list at a time
- Estimated impact ∆t for each list (…, in, tf, vi, ir, ef, rv, ne, un, …)
- Incremental update of the total estimated running time t over the workload (Query 1, Query 2, Query 3, …)
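A simplified sketch of a greedy, cost-based selection loop. The function signature, the toy sizes, and the toy cost model are all hypothetical; the actual algorithm's cost estimation and incremental updates are more involved than this.

```python
def choose_lists_to_discard(list_sizes, budget, estimate_time):
    """Greedily discard one list at a time until the index fits the budget.

    list_sizes:    dict mapping gram -> size of its inverted list
    budget:        maximum total size of the lists that are kept
    estimate_time: callable(set of discarded grams) -> estimated workload time
    """
    discarded = set()
    total = sum(list_sizes.values())
    while total > budget:
        remaining = [g for g in list_sizes if g not in discarded]
        if not remaining:
            break
        # Pick the list whose removal yields the lowest estimated workload time.
        best = min(remaining, key=lambda g: estimate_time(discarded | {g}))
        discarded.add(best)
        total -= list_sizes[best]
    return discarded


# Toy inputs (hypothetical): a fixed time penalty per discarded gram.
sizes = {"in": 6, "tf": 2, "vi": 2, "ir": 4}
penalty = {"in": 5.0, "tf": 0.5, "vi": 3.0, "ir": 1.0}
chosen = choose_lists_to_discard(
    sizes, budget=9, estimate_time=lambda d: 10.0 + sum(penalty[g] for g in d)
)
print(sorted(chosen))  # ['ir', 'tf']
```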
Speaker: Alexander Behm Estimating Query Times
- List-Merging: cost function, offline with linear regression
- Panic: #strings * avg similarity time
- Post-Processing: #candidates * avg similarity time
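A sketch of how these three components might combine into one per-query estimate. It assumes that a non-panic query costs list-merging plus post-processing, and it takes the list-merging estimate as a plain number, since the form of the regression-based cost function is not given on the slide. All names and values are hypothetical.

```python
def estimate_query_time(t, n_strings, n_candidates, merge_time, avg_sim_time):
    """Estimated time for one query, in the same unit as the time arguments."""
    if t <= 0:
        # Panic: every string in the collection is compared to the query.
        return n_strings * avg_sim_time
    # Otherwise: merge the lists, then verify each candidate.
    return merge_time + n_candidates * avg_sim_time


# Hypothetical values, in microseconds.
print(estimate_query_time(3, 1000, 20, merge_time=150, avg_sim_time=2))  # 190
print(estimate_query_time(0, 1000, 20, merge_time=150, avg_sim_time=2))  # 2000
```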
Speaker: Alexander Behm Estimating #candidates
Incremental-ScanCount Algorithm
List to discard: “un” = {1, 3, 4}
BEFORE: T = 3
  StringIDs: 0 1 2 3 4
  Counts:    2 3 0 1 4
  #candidates = 2
Decrement the counts of the StringIDs on the discarded list
AFTER: T’ = T - 1 = 2
  StringIDs: 0 1 2 3 4
  Counts:    2 2 0 0 3
  #candidates = 3
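The before/after numbers on this slide can be reproduced with a short sketch. The function names are assumptions; the counts, the discarded list, and the thresholds are the slide's values.

```python
def discard_list(counts, t, discarded_list):
    """Update ScanCount counts and the threshold after discarding one list."""
    new_counts = list(counts)
    for sid in discarded_list:
        new_counts[sid] -= 1  # this string loses one matching gram
    return new_counts, t - 1  # the lower bound drops by one as well


def num_candidates(counts, t):
    return sum(1 for c in counts if c >= t)


counts = [2, 3, 0, 1, 4]  # counts for StringIDs 0..4
t = 3
print(num_candidates(counts, t))  # 2

counts_after, t_after = discard_list(counts, t, [1, 3, 4])
print(counts_after, t_after, num_candidates(counts_after, t_after))
# [2, 2, 0, 0, 3] 2 3
```

Only the strings on the discarded list need their counts touched, which is what makes the update incremental.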
Speaker: Alexander Behm Overview
- Motivation & Preliminaries
- Approach 1: Discarding Lists
→ Approach 2: Combining Lists
- Experiments & Conclusion
Speaker: Alexander Behm Approach 2: Combining Lists
[Figure: inverted lists of StringIDs for the 2-grams …, in, tf, vi, ir, ef, rv, ne, un, …; IDs as extracted: 1 3 4 5 7 9 5 6 9 1 2 3 9 1 3 9 7 9 6 9 1 2 4 5 6]
Lists combined
Speaker: Alexander Behm Effects on Queries
- Lower bound T is unchanged (no new panics)
- Lists become longer:
  - More time to traverse lists
  - More false positives
Speaker: Alexander Behm Speeding Up Queries
Query 3-grams {sha, han, ang, ngh, gha, hai}
Combined lists: one with refcount = 2, one with refcount = 3
Traverse physical lists once.
Count for StringIDs increases by refcount.
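A sketch of the refcount idea: each physical (combined) list is traversed once, and every StringID on it is credited with the number of query grams that map to that list. The gram-to-list mapping and the list contents below are hypothetical.

```python
from collections import Counter


def scan_count_combined(query_grams, gram_to_list, physical_lists):
    """Count gram matches per StringID, visiting each physical list only once."""
    # refcount: how many query grams share each physical list
    refcount = Counter(gram_to_list[g] for g in query_grams if g in gram_to_list)
    counts = Counter()
    for list_id, rc in refcount.items():
        for sid in physical_lists[list_id]:
            counts[sid] += rc
    return counts


query_grams = ["sha", "han", "ang", "ngh", "gha", "hai"]
gram_to_list = {"sha": "A", "han": "A",              # refcount 2
                "ang": "B", "ngh": "B", "gha": "B",  # refcount 3
                "hai": "C"}
physical_lists = {"A": [1, 4], "B": [4, 7], "C": [4]}
print(dict(scan_count_combined(query_grams, gram_to_list, physical_lists)))
# {1: 2, 4: 6, 7: 3}
```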
Speaker: Alexander Behm Choosing Lists to Combine
- Discovering candidate gram pairs
  - Frequent (q+1)-grams → correlated adjacent q-grams
  - Locality-Sensitive Hashing (LSH)
- Selecting candidate pairs to combine
  - Basis: estimated cost on query workload
  - Similar to DiscardLists
  - Different Incremental-ScanCount algorithm
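One possible reading of the “frequent (q+1)-grams” heuristic, sketched below: a (q+1)-gram that occurs often implies that its two overlapping q-grams tend to appear together, so their inverted lists are likely similar. The function name, the frequency threshold, and the sample strings are assumptions; the LSH alternative is not sketched.

```python
from collections import Counter


def candidate_pairs(strings, q, min_freq):
    """Pairs of adjacent q-grams taken from (q+1)-grams seen at least min_freq times."""
    freq = Counter(s[i:i + q + 1] for s in strings for i in range(len(s) - q))
    return sorted((g[:q], g[1:]) for g, n in freq.items() if n >= min_freq)


strings = ["shanghai", "shanty", "chang"]  # hypothetical collection
print(candidate_pairs(strings, q=2, min_freq=2))
# [('an', 'ng'), ('ha', 'an'), ('sh', 'ha')]
```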
Speaker: Alexander Behm Overview
- Motivation & Preliminaries
- Approach 1: Discarding Lists
- Approach 2: Combining Lists
→ Experiments & Conclusion
Speaker: Alexander Behm Experiments
Datasets:
- Google WebCorpus Word Grams
- IMDB Actors
- DBLP Titles
Overview:
- Performance & Scalability of DiscardLists & CombineLists
- Comparison with IR compression & VGRAM
- Changing workloads
10k Queries: Zipf distributed, from dataset
q = 3, Edit Distance = 2 (also Jaccard & Cosine)
Speaker: Alexander Behm Experiments
DiscardLists: Runtime decreases!
CombineLists: Runtime decreases!
Speaker: Alexander Behm Comparison with IR compression
[Figure: comparison with Carryover-12; plot labels: Compressed, Uncompressed]
Speaker: Alexander Behm Comparison with variable-length grams, VGRAM
[Figure: plot labels: Uncompressed, Compressed]
Speaker: Alexander Behm Future Work
- Combine: DiscardLists, CombineLists and IR compression
- Filters for partitioning, global vs. local decisions
- Dealing with updates to index
Speaker: Alexander Behm Conclusions
Two lossy compression techniques
- Answer queries exactly
- Index fits into a space budget
- Queries faster on the compressed indexes
- Flexibility to choose space / time tradeoff
- Existing list-merging algorithms: re-use + compression-specific optimizations
Speaker: Alexander Behm Thank You!
This work is part of The Flamingo Project: http://flamingo.ics.uci.edu
Speaker: Alexander Behm More Experiments
What if the workload changes from the training workload?