
Efficient Bitmap Indexes for Very Large Datasets
John Wu, Ekow Otoo, Arie Shoshani
Lawrence Berkeley National Laboratory
September, 2002

Outline
• Introduction
  — Example application: high-energy physics data
  — Task: range queries on high-dimensional data
  — Approach: bitmap index
  — To make it work: compression, encoding, binning
• New compression scheme
  — Best known scheme (BBC): CPU bound
  — Improve CPU efficiency: 10X
• Compressed bitmap index
  — Index size smaller than B-tree
  — Answer queries faster than B-tree, …
• Applying bitmaps to a feature tracking problem

Example I: High-energy Physics
Typical data processing steps:
• Collect raw data: collision events, … (done once)
• Generate summary data (done once): 10-100 attributes per event
• Access data according to summary attributes (performed by many scientists): 20001015<=Run & 200<Energy<300 …
Selected attributes of STAR summary data (tags). Actual size (January 2002): 20 million objects, 502 attributes
OID Run Event NLb 01239029 2635 11239029 2636 21239029 2637 tpc 1341 909 1470 1243 1663 1285 Tracks Particles 1228 1415 1533 266 317 281 Vertex. 56. 46. 53 qxb[2] Energy -26.40 -29.08 -6.754 48 53 8

Range Queries on High-dimensional Data
• Typical query: partial range query 20001015<=Run & 200<Energy<300 …
• Characteristics of data
  — Large: millions or billions of records
  — High-dimensional: hundreds of attributes per object
  — Appends in batches
  — Most attributes are not categorical (integer, floating-point values)
• Known solutions
  — Sequential scan
  — R-tree etc. are usually slower than sequential scan
  — Bitmap index is faster in some cases

Basic Bitmap Index
A bitmap index is efficient for processing range queries on read-only data (P. O’Neil, 1987).
[Figure: the basic bitmap index, with bitmaps labeled NLb=0, NLb=1, …, NLb=6 shown next to the NLb, Qxb[2], and eventTime columns]
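The basic bitmap index keeps one bitmap per distinct attribute value, and a range query becomes a bitwise OR over the bitmaps of the qualifying values, followed by a bitwise AND across attributes. The following is a minimal sketch of that idea, assuming Python integers as uncompressed bitmaps; the function names and the sample column values are hypothetical and are not the STAR data.

```python
from collections import defaultdict

def build_equality_index(values):
    """Return {value: bitmap}; bit i of a bitmap is set when values[i] == value."""
    index = defaultdict(int)
    for row, v in enumerate(values):
        index[v] |= 1 << row
    return dict(index)

def range_query(index, lo, hi):
    """Rows with lo <= value <= hi: OR together the bitmaps of matching values."""
    result = 0
    for v, bitmap in index.items():
        if lo <= v <= hi:
            result |= bitmap
    return result

def rows_of(bitmap):
    """Decode a bitmap back into a sorted list of row numbers."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]

# Hypothetical column values for illustration only.
nlb = [0, 1, 5, 3, 1, 2, 0, 4, 1]
energy = [210, 250, 90, 305, 260, 299, 201, 150, 400]

nlb_index = build_equality_index(nlb)
energy_index = build_equality_index(energy)

# Multi-attribute condition: 1 <= NLb <= 2 AND 200 < Energy < 300
# (integer energies, so the open range is written as 201..299).
hits = range_query(nlb_index, 1, 2) & range_query(energy_index, 201, 299)
print(rows_of(hits))
```

With high-cardinality attributes this layout needs one bitmap per distinct value, which is why index size and query time become the concern on the following slides.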

Features of Bitmap Index
1. The main operations are bitwise logical operations, which are fast
2. Index sizes are small for categorical attributes with low cardinality
3. Each individual bitmap is small, and frequently used ones can be cached in memory
4. Scientific datasets have mostly non-categorical attributes
  1. Index size may be large
  2. Query processing may be slow

Effective Bitmap Index
To make the bitmap index effective for scientific datasets:
1. Binning: reduce the number of bitmaps
  — Say 0 <= NLb < 4000, we can use 20 equal-size bins: [0, 200), [200, 400), [400, 600), …
2. Encoding: reduce the number of bitmaps or reduce the number of operations
  — Basic: equality encoding generates one bitmap for each bin (shown above)
  — Other: range encoding, interval encoding, …
3. Compression: reduce the size of each bitmap, may also speed up the logical operations
  — Find an efficient compression scheme to reduce query processing time
  — This talk only addresses the issue of compression
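The binning step can be illustrated with the slide's own numbers (0 <= NLb < 4000, 20 equal-size bins of width 200). This is a hedged sketch, not the authors' implementation: the function names are hypothetical, bitmaps are plain Python integers, and the handling of partially covered edge bins (checking the raw values) is an assumption about how exact answers would be obtained from a binned index.

```python
def bin_number(value, lo=0, hi=4000, nbins=20):
    """Map a value in [lo, hi) to its equal-width bin, e.g. 250 -> bin 1."""
    if not lo <= value < hi:
        raise ValueError("value outside the binned range")
    width = (hi - lo) / nbins
    return int((value - lo) // width)

def build_binned_index(values, lo=0, hi=4000, nbins=20):
    """One bitmap per bin (equality encoding over bins)."""
    bitmaps = [0] * nbins
    for row, v in enumerate(values):
        bitmaps[bin_number(v, lo, hi, nbins)] |= 1 << row
    return bitmaps

def query_binned(bitmaps, values, qlo, qhi, lo=0, hi=4000):
    """Rows with qlo <= value < qhi, using the raw values only for edge bins."""
    width = (hi - lo) / len(bitmaps)
    hits = 0
    for b, bitmap in enumerate(bitmaps):
        b_lo, b_hi = lo + b * width, lo + (b + 1) * width
        if b_hi <= qlo or b_lo >= qhi:
            continue                      # bin does not overlap the query
        if qlo <= b_lo and b_hi <= qhi:
            hits |= bitmap                # bin fully covered: all rows qualify
        else:
            # Edge bin: only some rows qualify, so check the raw values.
            for row in range(bitmap.bit_length()):
                if (bitmap >> row) & 1 and qlo <= values[row] < qhi:
                    hits |= 1 << row
    return hits

# Hypothetical NLb values for illustration.
nlb = [150, 250, 399, 400, 3999, 610, 590]
index = build_binned_index(nlb)
print(bin(query_binned(index, nlb, 250, 600)))
```

Fewer bins mean fewer bitmaps but more raw-value checks in the edge bins, which is the trade-off binning introduces.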

Efficient Compression Schemes
Word-Aligned Hybrid Code

Efficient Compression Schemes
Best known compression scheme for bitmap indexes: byte-aligned bitmap code (BBC)
— Uses run-length encoding
— Encodes/decodes bitmaps 8 bits (one byte) at a time
— Compresses nearly as well as LZ77 (gzip)
— Bitwise logical operations can be performed on compressed bitmaps directly
— Operations are usually faster compared to other compression schemes, e.g., ExpGol, …
— Even faster than operating on uncompressed bitmaps in some cases
— Used in ORACLE

Operations With BBC Are CPU Bound
CPU time is about 80% of total time on a system with 20 MB/s disk suite
Two independent implementations of BBC show similar behavior
Operation measured: read two files from disk and perform one logical operation in memory
Bitwise logical operations on BBC compressed bitmaps are CPU bound
→ Reduce CPU time

Word-Aligned Hybrid Code
Word-aligned hybrid code (WAH)
— Uses run-length encoding for long sequences of identical bits
— Encodes/decodes bitmaps in word-size chunks
— Designed for minimal decoding to gain speed

Word-Aligned Hybrid Code
Example bitmap, 1023 bits: 10000000000111000000000000000…00000000000000001111111111111
• Group the bits into 33 31-bit groups
• Merge neighboring groups with identical bits (31*31 bits)
• Encode each group using one word: 01000… 11111 001… 111 (literal word, fill word, literal word; the run length of the fill word is 31)
WAH includes three words
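The grouping, merging, and encoding steps can be sketched in a few lines. This is a minimal sketch under stated assumptions: 32-bit words whose most significant bit is 0 for a literal word and 1 for a fill word, with the next bit of a fill word holding the fill value and the low 30 bits counting merged 31-bit groups. The slide does not spell out this layout, the names `wah_encode` and `wah_decode` are illustrative, and the input length is assumed to be a multiple of 31. The exact bit pattern of the first and last groups is a reading of the slide's figure.

```python
GROUP = 31                 # payload bits per 32-bit word
MAX_RUN = (1 << 30) - 1    # largest run length a fill word can hold

def wah_encode(bits):
    """Encode a '0'/'1' string (length a multiple of 31) into 32-bit words."""
    if len(bits) % GROUP:
        raise ValueError("this sketch needs a multiple of 31 bits")
    words = []
    for i in range(0, len(bits), GROUP):
        group = bits[i:i + GROUP]
        if group in ("0" * GROUP, "1" * GROUP):
            fill_bit = int(group[0])
            prev = words[-1] if words else 0
            same_fill = (prev >> 31) == 1 and ((prev >> 30) & 1) == fill_bit
            if same_fill and (prev & MAX_RUN) < MAX_RUN:
                words[-1] = prev + 1          # extend the current run
            else:
                words.append((1 << 31) | (fill_bit << 30) | 1)
        else:
            words.append(int(group, 2))       # literal word, top bit is 0
    return words

def wah_decode(words):
    """Expand the words back into the original bit string."""
    out = []
    for w in words:
        if w >> 31:                           # fill word
            out.append(str((w >> 30) & 1) * (GROUP * (w & MAX_RUN)))
        else:                                 # literal word
            out.append(format(w, "031b"))
    return "".join(out)

# The 1023-bit example: one mixed group, 31 all-zero groups, one mixed group.
bits = "1" + "0" * 10 + "111" + "0" * 17 + "0" * (31 * 31) + "0" * 18 + "1" * 13
words = wah_encode(bits)
print(len(bits), [hex(w) for w in words])
```

Because every word is either a whole literal group or a whole run, logical operations can walk two compressed bitmaps word by word without unpacking individual bits, which is the source of the reduced CPU cost claimed for WAH.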

Information About the Test Setup
• Hardware and system
  — Sun Enterprise 450 (UltraSPARC II 400 MHz)
  — VERITAS volume manager (striped disk), measured IO speed 20 MB/s
• Real application data from STAR
  — About 2.2 million records, 500 attributes
• Synthetic data
  — 100 million records, 10 attributes
• Terms
  — Compression ratio: ratio of compressed bitmap size to uncompressed bitmap size
  — Times reported are wall-clock times in seconds

Fraction of Time Spent in CPU
[Figures: on a 2 MB/s disk system; on a 20 MB/s disk system]
Compared to two implementations of BBC, WAH spends a smaller fraction of time in CPU

Logical Operation Time
[Figure: synthetic data, 100 million records]
WAH is 2-20 times faster than BBC

Logical Operation Time
[Figure: STAR data, 2.2 million records]
WAH is 2-60 times faster than BBC

Trade-off of Compression Schemes
[Figure: speed versus space for uncompressed, WAH, BBC, gzip, PacBits, and ExpGol, with an arrow marking the "better" direction]

Performance of the Full Queries Using the Basic Bitmap Index
Bitmap index setup:
• One bitmap per value (no bins)
• Equality encoding
What is being measured:
• Time: answering range queries (not individual logical operations)
  — High cardinality attributes from STAR

WAH index scales linearly with data size
[Figure: range queries over different datasets (STAR: 2.2 mil, Combustion: 25, Synthetic: 100); axis annotations: 1 sec, 100 MB]
Query processing time is proportional to index size

Multi-attribute Range Queries: High Cardinality Attributes
[Figures: 2 attributes per query; 5 attributes per query]
• WAH compressed indexes are 10X faster than ORACLE, 5X faster than our BBC
• P scan scans a vertical projection of the data table, the simplest option for processing partial range queries on high-dimensional data
• Queries on the 12 most queried attributes, average cardinality 222,000

Summary of Tests on STAR Data
Exact answers Indexing Size Method (X data) P Scan 0 Native 20 bins 0.18 vertical 0.43 partition 50 bins 100 bins 0.90 (WAH) No bins 1.65 Scan B-tree ORACLE Bitmap (no bins) 0 3.6 0.98 Time relative (sec) to p scan 0.57 1 0.11 5 0.07 8 0.05 11 6.5 0.95 0.66 0.09 0.6 0.86 Approximate answers Time relative (sec) to p scan 0.01 60 60 60 WAH vs. BBC
Our bitmap index can be 100X faster than ORACLE: 10X due to compression scheme, 10X due to binning

Using Bitmaps for Feature Tracking
Adapting Compressed Bitmaps to Operations Outside of the Bitmap Index

Example II: Combustion
• Direct numerical simulation of autoignition process (solution of complex partial differential equations; data computed once but never modified)
• A simple model has 12 variables per cell, a realistic model may have hundreds
• Number of grid points: 2D 600 x 600 >>> 3D 1000 x 1000
• Time steps: 100 >>> 1000s
• Data size: 1 GB >>> 10 TB
• Task: identify features and track them across time steps

Tasks
• Cell identification
  — Identify cells with values satisfying specified conditions
  — Typically a partial range query, like “600<temp<700 & HO2>10^-7”
• Region growing (feature identification)
  — Connect neighboring cells into connected regions
• Feature tracking
  — Identify common cells in connected regions from different time steps

Basic Approach
• Cell identification
  — Scan data and perform comparisons
  — Solution is represented as a list of cell IDs
• Region growing
  — For each cell in the above list, search all its neighbors
  — Each region is a list of cell IDs
• Feature tracking
  — Sort cell IDs of each region and match cell IDs to identify common cells
  — Use bounding boxes to reduce unnecessary operations

Our Approach
• Cell identification
  — Vertically partition the data
  — Use bitmap index to speed up searches
  — Solutions are represented as compressed bitmaps
• Region growing
  — Convert the compressed bitmaps into line segments
  — Connect neighboring line segments into regions
  — Convert each region into a compressed bitmap
• Feature tracking
  — Use bitwise AND to identify common cells
  — Use bounding boxes to reduce unnecessary operations
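The region-growing and feature-tracking steps above can be sketched on a tiny 2D grid. This is an illustrative sketch only: plain Python integers stand in for compressed bitmaps, cells are numbered row by row, neighbors are the four edge-adjacent cells, the segment matching is a simple quadratic loop, and all function names and the sample grids are hypothetical. Bounding boxes are left out.

```python
def from_grid(rows):
    """Build a row-major cell bitmap from strings where '#' marks a selected cell."""
    width, bitmap = len(rows[0]), 0
    for r, line in enumerate(rows):
        for c, ch in enumerate(line):
            if ch == "#":
                bitmap |= 1 << (r * width + c)
    return bitmap

def to_segments(bitmap, width, height):
    """Split the bitmap into horizontal runs (row, first_col, last_col)."""
    segments = []
    for r in range(height):
        row_bits = (bitmap >> (r * width)) & ((1 << width) - 1)
        c = 0
        while c < width:
            if (row_bits >> c) & 1:
                start = c
                while c < width and (row_bits >> c) & 1:
                    c += 1
                segments.append((r, start, c - 1))
            else:
                c += 1
    return segments

def grow_regions(bitmap, width, height):
    """Connect overlapping segments in adjacent rows; return one bitmap per region."""
    segs = to_segments(bitmap, width, height)
    parent = list(range(len(segs)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, (r1, s1, e1) in enumerate(segs):
        for j in range(i + 1, len(segs)):
            r2, s2, e2 = segs[j]
            if r2 - r1 == 1 and s1 <= e2 and s2 <= e1:
                parent[find(i)] = find(j)

    regions = {}
    for i, (r, s, e) in enumerate(segs):
        run = ((1 << (e - s + 1)) - 1) << (r * width + s)
        root = find(i)
        regions[root] = regions.get(root, 0) | run
    return sorted(regions.values())

def track(regions_a, regions_b):
    """Pairs (i, j) of regions that share at least one cell (bitwise AND is nonzero)."""
    return [(i, j) for i, a in enumerate(regions_a)
            for j, b in enumerate(regions_b) if a & b]

step0 = from_grid(["##...", ".#...", "....#", "...##"])
step1 = from_grid([".....", ".##..", ".....", "..###"])
r0, r1 = grow_regions(step0, 5, 4), grow_regions(step1, 5, 4)
print(track(r0, r1))
```

Working on line segments touches each run of cells once instead of each cell, and the tracking step reduces to one AND per candidate pair of regions.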

Preliminary Performance Data
• Cell identification
  — Horizontal partition: 75 seconds
  — Vertical partition: 5 seconds
  — Bitmap index: 0.1 seconds
• Region growing
  — Point-based algorithm: 8 seconds
  — Line-based algorithm: 1.7 seconds
• Feature tracking
  — Comparing cell IDs: 10 seconds
  — Bitmap operations: 0.2 seconds
• Total time (sec): 93 (75 + 8 + 10), 23 (5 + 8 + 10), 2.0 (0.1 + 1.7 + 0.2)
69 time steps, 600 x 600 grid, condition HO2>10^-7
Compressed bitmaps can be efficiently used for feature tracking

Summary
• The size of the WAH compressed bitmap index is modest even in the worst case
  — For most high cardinality attributes with N records, the index size is about 2N words, and never more than 4N words
• The WAH compressed index is efficient on attributes of any cardinality
  — On range queries, it is faster than uncompressed bitmap index (3X), BBC compressed index (2~20X), B+-tree index (20~200X), and scanning vertically partitioned table (4~50X)
• Compressed bitmaps can also be efficiently used for feature tracking

Sizes of Compressed Bitmap Indexes
[Figure: index sizes for 10^8 records]
Test attribute: 1, 2, 3, …, 1, 2, 3, … (worst case in terms of index size)
B+-tree size (observed): 3~4 x 10^8 words
WAH compressed index is not larger than B+-tree

Summary of Tests on STAR Data (I)
Size (MB) Low Query 1-attribute cardinality processing 2-attribute case (seconds) 5-attribute Size (MB) High Query 1-attribute cardinality processing 2-attribute case (seconds) 5-attribute B+-tree P scan 370 0 0.90 0.51 2.10 0.56 2.14 0.67 408 0 0.95 0.51 2.15 0.56 2.23 0.67 Bitmap index Oracle BBC WAH 7 4 7 0.005 0.015 0.004 0.026 0.006 0.043 0.083 0.017 111 118 186 0.01 0.03 0.05 0.39 0.17 0.04 2.42 0.76 0.17
• Compressed bitmap index is more efficient for range queries than B+-tree or no index (p scan)
• A WAH compressed index uses more space than a BBC compressed index, but is more efficient

Multi-attribute Range Queries: Low Cardinality Attributes
[Figures: 2 attributes per query; 5 attributes per query]
• WAH compressed indexes are faster than BBC compressed indexes (3X) and uncompressed indexes (3X)
• Query box is the relative volume of the box formed by the query condition
• 12 lowest cardinality attributes of STAR, average attribute cardinality 26

Total Effect of Compression and Encoding Schemes
• Bottom line on queries
  — Compression scheme determines efficiency of logical operations
  — Encoding scheme determines number of operations
    • Range & interval: only one logical operation over 2 bitmaps
    • Equality: many operations depending on number of bins
  — But, space may be a consideration
• What is the trade-off?
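The difference in operation counts between equality and range encoding can be made concrete. The sketch below assumes the usual definition of range encoding (bitmap i marks every row whose bin number is at most i) and keeps the final all-ones bitmap for simplicity; bitmaps are uncompressed Python integers, and the function names and sample bin numbers are hypothetical. Interval encoding is not shown.

```python
def equality_encode(bins, nbins):
    """E[i] marks rows whose bin number equals i."""
    E = [0] * nbins
    for row, b in enumerate(bins):
        E[b] |= 1 << row
    return E

def range_encode(bins, nbins):
    """R[i] marks rows whose bin number is <= i."""
    R = [0] * nbins
    for row, b in enumerate(bins):
        for i in range(b, nbins):
            R[i] |= 1 << row
    return R

def query_equality(E, lo, hi):
    """Rows with lo <= bin <= hi; reads hi - lo + 1 bitmaps."""
    result = 0
    for i in range(lo, hi + 1):
        result |= E[i]
    return result, hi - lo + 1

def query_range(R, lo, hi):
    """Same query with at most two bitmaps and one logical operation."""
    if lo == 0:
        return R[hi], 1
    # R[lo - 1] is a subset of R[hi], so XOR removes exactly the rows below lo.
    return R[hi] ^ R[lo - 1], 2

# Hypothetical bin numbers for eight rows, five bins.
bins = [0, 3, 1, 4, 2, 3, 0, 4]
E, R = equality_encode(bins, 5), range_encode(bins, 5)
print(query_equality(E, 1, 3), query_range(R, 1, 3))
```

The space side of the trade-off is visible in `range_encode`: each row sets a bit in many bitmaps, so the range-encoded bitmaps are denser and tend to compress less well than equality-encoded ones.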

Interval Encoding Is Better Overall (WAH Compression)
[Figure: average time for random range queries; points on the graphs represent 10, 20, 30, 50, 100 bins]

Storing Bitmaps As Files Is Efficient
• BMI: store bitmaps in Objectivity
• IBIS: store bitmaps in files
• IBIS answers queries about 4 times faster than BMI using WAH
• BMI with WAH is up to ten times faster than BMI with BBC
Joint work with Kurt Stockinger (CERN)