Efficient Bitmap Indexing Techniques for Very Large Datasets
Kesheng John Wu, Ekow Otoo, Arie Shoshani
March, 2002
Problem Statement
• Main objective: map logical requests to qualifying objects
  - A logical request:
    • 20001015 <= eventTime & 200 < energy < 300 …
  - Objects:
    • Set of object ids;
    • Set of files containing the objects;
    • Offsets within the files, …
Application: STAR

OID  dst     hist    mEventNumber  mEventTime       mRunNumber  NLb
0    159625  159627  2635          20000827.011759  1239029     1341
1    159625  159627  2636          20000827.011759  1239029     1470
2    159625  159627  2637          20000827.011759  1239029     1663

OID  n_clus_tpc_in[13]  numberOfPrimaryTracks  ChargedParticles_Means[1]  PrimaryVertexX  qxb[2]  zdc2Energy
0    909                1228                   266                        .56             -26.40  48
1    1243               1415                   317                        .46             -29.08  53
2    1285               1533                   281                        .53             -6.754  8

A portion of the STAR tag dataset: 3 events with 12 attributes, out of millions of events with 502 attributes.
Application: Combustion
• Direct numerical simulation of the auto-ignition process (solution of complex partial differential equations)
• A dozen or more variables are computed at each time step and each grid point
• Number of grid points: 2D 600 X 600 >>> 3D 1000 X 1000 X 1000
• Time steps: 100 >>> 1000s
• Data size: 1 GB >>> 10 TB
• Task: identify features and track them across time steps
• E.g. find the flame front across time
  - Find “600 < temp < 700” for 1 billion points per time step, and find the overlap between time steps
• Use compressed bitmaps to accelerate both feature extraction and feature tracking
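The feature extraction and tracking steps on this slide can be sketched with uncompressed bitmaps. This is only an illustration: the function name `range_bitmap`, the use of Python integers as bitmaps, and the eight temperature values per time step are assumptions made for the example, not data from the simulation.

```python
def range_bitmap(values, low, high):
    """Return an int whose bit i is 1 iff low < values[i] < high."""
    bits = 0
    for i, v in enumerate(values):
        if low < v < high:
            bits |= 1 << i
    return bits

# Hypothetical temperatures at the same 8 grid points in two time steps.
temp_t0 = [550, 620, 650, 710, 690, 400, 605, 800]
temp_t1 = [560, 640, 720, 705, 660, 610, 590, 780]

# Feature extraction: one bitmap per time step for 600 < temp < 700.
front_t0 = range_bitmap(temp_t0, 600, 700)
front_t1 = range_bitmap(temp_t1, 600, 700)

# Feature tracking: points inside the feature in both time steps.
overlap = front_t0 & front_t1
overlap_points = [i for i in range(len(temp_t0)) if (overlap >> i) & 1]
print(overlap_points)
```

The overlap is a single bitwise AND, which is the operation a compressed bitmap scheme has to make fast.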
Building a Bitmap Index
1. Partition each property into bins (binning)
   - e.g. for 0 < NLb < 4000, 20 equal-size bins: [0, 200) [200, 400) …
2. Generate a bit vector for each bin (encoding)
   - Bit i of bit vector j is 1 iff NLb[i] is in bin j
3. Compress each bit vector
[Figure: groups of bit vectors for property 1, property 2, ..., property n]
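Steps 1 and 2 above can be sketched as follows, using the three NLb values from the STAR sample. The function name `build_bitmap_index` and the choice of Python integers as bit vectors are assumptions for illustration; step 3 (compression) is left out.

```python
def build_bitmap_index(values, low, high, num_bins):
    """Equal-width binning plus one bit vector per bin.

    Assumes low <= v < high for every value. Bit i of bitmaps[j]
    is 1 iff values[i] falls in bin j.
    """
    width = (high - low) / num_bins
    bitmaps = [0] * num_bins
    for i, v in enumerate(values):
        j = min(int((v - low) / width), num_bins - 1)
        bitmaps[j] |= 1 << i
    return bitmaps

nlb = [1341, 1470, 1663]          # NLb column of the STAR sample
index = build_bitmap_index(nlb, 0, 4000, 20)

for j, bm in enumerate(index):
    if bm:
        print(f"bin [{j * 200}, {(j + 1) * 200}): {bm:03b}")
```

Each object sets exactly one bit across all bins, so most bit vectors are mostly zeros, which is what makes them compressible.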
Advantages of Bitmap Index
• Bitmap index: a specialized index that takes advantage of read-mostly data
  - Data produced by scientific experiments can be appended in large groups
• Fast operations
  - “Predicate queries” can be performed with bitwise logical operations
    • Predicate ops: =, <, >, <=, >=, range
    • Logical ops: AND, OR, XOR, NOT
  - They are well supported by hardware
• Easy to compress, potentially small index size
• Each individual bitmap is small, and frequently used ones can be cached in memory
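As a minimal sketch of how a predicate query reduces to bitwise operations: a range predicate on one attribute is the OR of the bitmaps of the bins it covers, and a conjunction across attributes is an AND. The helper names, bin boundaries, and attribute values below are hypothetical, and the query is chosen so that it lines up exactly with bin boundaries.

```python
def bin_bitmaps(values, edges):
    """One bitmap (Python int) per bin [edges[j], edges[j+1])."""
    bitmaps = [0] * (len(edges) - 1)
    for i, v in enumerate(values):
        for j in range(len(edges) - 1):
            if edges[j] <= v < edges[j + 1]:
                bitmaps[j] |= 1 << i
    return bitmaps

def or_bins(bitmaps, first, last):
    """OR together the bitmaps of bins first..last (inclusive)."""
    result = 0
    for j in range(first, last + 1):
        result |= bitmaps[j]
    return result

# Hypothetical attribute values for six objects.
energy = [150, 250, 290, 310, 220, 90]
event_time = [5, 12, 18, 25, 31, 14]

energy_idx = bin_bitmaps(energy, [0, 100, 200, 300, 400])
time_idx = bin_bitmaps(event_time, [0, 10, 20, 30, 40])

# 200 <= energy < 300  AND  10 <= event_time < 30
hits = or_bins(energy_idx, 2, 2) & or_bins(time_idx, 1, 2)
matching = [i for i in range(len(energy)) if (hits >> i) & 1]
print(matching)
```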
Operation-efficient Compression Methods
• Best known: byte-aligned bitmap code (BBC)
  - Uses run-length encoding (next slide)
  - Byte alignment, optimized for space efficiency
  - Decoding on bit level, not optimal for operations
  - Used in Oracle
• We developed a new word-aligned scheme: WAH
  - Uses run-length encoding
  - Word alignment
  - Designed for minimal decoding to gain speed
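The slide does not spell out the WAH word layout. The sketch below assumes the commonly described 32-bit variant: a literal word has its top bit 0 and carries 31 bitmap bits, while a fill word has its top bit 1, the next bit giving the fill value, and the low 30 bits counting how many 31-bit groups the fill spans. The function names are illustrative, and the input length is assumed to be a multiple of 31.

```python
GROUP = 31               # bitmap bits carried by one 32-bit word
MAX_FILL = 0x3FFFFFFF    # largest group count a fill word can hold

def wah_encode(bits):
    """Encode a list of 0/1 values (length a multiple of 31)."""
    words = []
    for start in range(0, len(bits), GROUP):
        group = bits[start:start + GROUP]
        value = int("".join(map(str, group)), 2)
        if value == 0 or value == (1 << GROUP) - 1:
            header = (1 << 31) | ((1 if value else 0) << 30)
            # Extend the previous word if it is a fill of the same bit.
            if (words and words[-1] >> 30 == header >> 30
                    and (words[-1] & MAX_FILL) < MAX_FILL):
                words[-1] += 1
            else:
                words.append(header | 1)
        else:
            words.append(value)          # literal word, top bit is 0
    return words

def wah_decode(words):
    """Expand words back into a list of 0/1 values."""
    bits = []
    for w in words:
        if w >> 31:                      # fill word
            bits.extend([(w >> 30) & 1] * (GROUP * (w & MAX_FILL)))
        else:                            # literal word
            bits.extend(int(c) for c in format(w, "031b"))
    return bits

sample = [0] * 62 + [1] + [0] * 30 + [1] * 31
encoded = wah_encode(sample)
print([hex(w) for w in encoded])
```

Because every word is either a whole literal group or a whole number of groups, a logical operation can step through two encoded bitmaps word by word without bit-level decoding, which is the speed argument made on the slide.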
Operation-efficient Compression Methods
• Based on variations of run-length compression
  Uncompressed: 000000111100000...0000001000011110000...000000
  Compressed:   12, 4, 1000, 1, 8, 1000
• Store very short sequences as-is
• Advantage: AND, OR, and COUNT operations can be performed on compressed data
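To illustrate operating on compressed data, here is a simplified run-length sketch. It stores alternating run lengths, starting with a possibly empty run of 0s, and does not store short sequences as-is the way the slide describes, so it is a stand-in for the idea rather than BBC or WAH. All function names are assumptions.

```python
def rle_encode(bits):
    """Alternating run lengths: runs[0] counts 0s, runs[1] counts 1s, ..."""
    runs, current, length = [], 0, 0
    for b in bits:
        if b == current:
            length += 1
        else:
            runs.append(length)
            current, length = b, 1
    runs.append(length)
    return runs

def rle_count(runs):
    """COUNT: number of 1 bits, read directly from the runs of 1s."""
    return sum(runs[1::2])

def rle_and(a, b):
    """AND two run lists of equal total length without decompressing."""
    result = [0]                 # result[k] is a run of bit value k % 2
    i = j = 0
    left_a, left_b = a[0], b[0]
    while True:
        while left_a == 0 and i + 1 < len(a):
            i += 1
            left_a = a[i]
        while left_b == 0 and j + 1 < len(b):
            j += 1
            left_b = b[j]
        if left_a == 0 or left_b == 0:
            break
        step = min(left_a, left_b)
        bit = (i % 2) & (j % 2)
        if bit == (len(result) - 1) % 2:
            result[-1] += step   # same bit value: extend the last run
        else:
            result.append(step)  # bit value flips: start a new run
        left_a -= step
        left_b -= step
    return result

x = [0, 0, 1, 1, 1, 0, 1, 1]
y = [0, 1, 1, 0, 1, 0, 0, 1]
both = rle_and(rle_encode(x), rle_encode(y))
print(both, rle_count(both))
```

The AND walks the two run lists in steps of whole runs, so its cost depends on the number of runs rather than the number of bits.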
Trade-off of Compression Schemes
[Figure: speed versus space for uncompressed bitmaps, WAH, BBC, gzip, PacBits, and ExpGol]
Information About the Test Machines
• Hardware and system
  - Sun Enterprise 450 (UltraSPARC II, 400 MHz)
  - 4 GB RAM
  - VERITAS Volume Manager (striped disk)
• Real application data from STAR
  - More than 2 million objects, 12 attributes
• Synthetic data
  - 100 million objects, 10 attributes
• Terms
  - Compression ratio: size of the compressed bitmaps divided by the size of the uncompressed bitmaps
  - Times reported are wall-clock times in seconds
Logical Operation Time (Synthetic Data)
• 10X improvement
Logical Operation Time (STAR Data)
• Also 10X improvement
Encoding Schemes – Main Idea
[Figure: equality encoding, range encoding, and interval encoding illustrated over 12 bins, numbered 1 to 12]
• Interval and range encoding: operate on 2 bins only!
Total Effect of Compression and Encoding Schemes
• Bottom line on queries
  - Compression scheme determines efficiency of logical operations
  - Encoding scheme determines number of operations
    • Range & interval: only one logical operation over 2 bitmaps
    • Equality: many operations, depending on number of bins
  - But space may be a consideration
• What is the trade-off?
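The difference in operation counts can be sketched for equality and range encoding. Interval encoding is not covered here. The function names and the bin assignments are hypothetical, and the range-encoded index keeps its last (all ones) bitmap for simplicity, although a real index would typically not store it.

```python
def equality_encode(bin_ids, num_bins):
    """E[j] has bit i set iff object i falls in bin j."""
    bitmaps = [0] * num_bins
    for i, b in enumerate(bin_ids):
        bitmaps[b] |= 1 << i
    return bitmaps

def range_encode(bin_ids, num_bins):
    """R[j] has bit i set iff object i falls in a bin <= j."""
    bitmaps = [0] * num_bins
    for i, b in enumerate(bin_ids):
        for j in range(b, num_bins):
            bitmaps[j] |= 1 << i
    return bitmaps

def query_equality_encoded(E, lo, hi):
    """Bins lo..hi: OR of hi - lo + 1 bitmaps, i.e. hi - lo operations."""
    result = E[lo]
    for j in range(lo + 1, hi + 1):
        result |= E[j]
    return result, hi - lo

def query_range_encoded(R, lo, hi):
    """Bins lo..hi: at most one operation over 2 bitmaps.

    R[lo - 1] is a subset of R[hi], so XOR removes exactly the
    objects in bins below lo.
    """
    if lo == 0:
        return R[hi], 0
    return R[hi] ^ R[lo - 1], 1

# Hypothetical bin number (0..11) for each of 8 objects.
bin_ids = [0, 3, 7, 11, 5, 3, 9, 4]
E = equality_encode(bin_ids, 12)
R = range_encode(bin_ids, 12)

eq_result, eq_ops = query_equality_encoded(E, 3, 7)
rng_result, rng_ops = query_range_encoded(R, 3, 7)
print(eq_ops, rng_ops, eq_result == rng_result)
```

The space side of the trade-off also shows up here: each object sets one bit in the equality-encoded index but up to 12 bits in the range-encoded one, so the range-encoded bitmaps are denser and compress less well.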
Interval Encoding Is Better Overall (WAH Compression)
• Points on the graphs represent 10, 20, 30, 50, and 100 bins
• Average time for random range queries
Timing Results

Method                          Index (X data)  Time (sec)  Speed
ORACLE scan                     0               6           0.1
ORACLE B-tree                   3.6             0.95        0.6
Native vertical partition scan  0               0.57        1
20 bins                         0.18            0.11        5
50 bins                         0.43            0.07        8
100 bins                        0.90            0.05        11
Summary
• Compressed bitmap indices are effective for range queries
• Better compression scheme
  - 50% more space, but 12 times faster
• Among the different encoding schemes
  - Interval encoding is the overall winner
Future Work
• Support NULL values and categorical values
• On-line update: add new data and update the index without interrupting request processing
• Recovery mechanism for robustness
• Potential new applications: climate, astrophysics, biology (microarrays)
• Study non-uniform binning strategies
• Study more encoding schemes
• Integrate with a conventional database system: to better handle metadata, to provide a more versatile front-end
How Many Bins for Continuous Domains?
[Figure: a two-dimensional query region over Range(x) and Range(y), with the edge bins marked on its boundary]
• More bins: fewer objects in edge bins
• Searching edge bins: skip-scan over “attribute vertical partition”
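The edge-bin effect can be sketched in one dimension. Bins that lie entirely inside the query range are answered from the index alone, while objects in bins that straddle a query boundary must be checked against the raw values. The function name, the use of id lists in place of bitmaps, and the sample values are assumptions for illustration.

```python
def range_query_with_edge_bins(values, edges, low, high):
    """Return (ids with low <= values[i] < high, raw values checked).

    edges are sorted bin boundaries; bin j covers [edges[j], edges[j+1]).
    Lists of object ids stand in for the per-bin bitmaps.
    """
    num_bins = len(edges) - 1
    bins = [[] for _ in range(num_bins)]
    for i, v in enumerate(values):
        for j in range(num_bins):
            if edges[j] <= v < edges[j + 1]:
                bins[j].append(i)

    hits, checked = [], 0
    for j in range(num_bins):
        lo_j, hi_j = edges[j], edges[j + 1]
        if low <= lo_j and hi_j <= high:
            hits.extend(bins[j])          # bin fully inside the range
        elif lo_j < high and low < hi_j:
            for i in bins[j]:             # edge bin: consult raw values
                checked += 1
                if low <= values[i] < high:
                    hits.append(i)
    return sorted(hits), checked

values = [3, 12, 18, 25, 33, 41, 47, 52, 58, 66, 71, 79]
coarse = range_query_with_edge_bins(values, [0, 20, 40, 60, 80], 15, 50)
fine = range_query_with_edge_bins(values, list(range(0, 90, 10)), 15, 50)
print(coarse, fine)
```

Both binnings return the same objects, but the finer one reads fewer raw values, which is the trade-off against the larger index that more bins require.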