Cube Computation and Indexes for Data Warehouses CPS
Cube Computation and Indexes for Data Warehouses CPS 196. 03 Notes 7 1
Processing ROLAP servers vs. MOLAP servers l Index Structures l Cube computation l What to Materialize? l Algorithms l Client Query & Analysis Metadata Warehouse Integration Source 2
ROLAP Server l Relational OLAP Server tools utilities ROLAP server Special indices, tuning; Schema is “denormalized” relational DBMS 3
MOLAP Server Ci ty Multi-Dimensional OLAP Server B A M. D. tools Product l milk soda eggs soap 1 utilities multidimensional server Sales 2 3 4 Date could also sit on relational DBMS 4
Pr od TV PC VCR sum 1 Qtr 2 Qtr Date 3 Qtr 4 Qtr Total annual sales sum of TV in U. S. A Canada Mexico Country uc t MOLAP sum 5
MOLAP C c 3 61 62 63 64 c 2 45 46 47 48 c 1 29 30 31 32 c 0 b 3 B b 2 B 13 14 15 16 28 9 24 b 1 5 b 0 1 2 3 4 a 0 a 1 a 2 a 3 20 44 40 36 60 56 52 A 6
Challenges in MOLAP l Storing large arrays for efficient access u Row-major, column major u Chunking u Compressing sparse arrays Creating array data from data in tables l Efficient techniques for Cube computation l Topics are discussed in the paper for reading 7
Index Structures l Traditional Access Methods u B-trees, l hash tables, R-trees, grids, … Popular in Warehouses u inverted lists u bit map indexes u join indexes u text indexes 8
. . . Inverted Lists age index inverted lists data records 9
Using Inverted Lists l Query: u Get people with age = 20 and name = “fred” List for age = 20: r 4, r 18, r 34, r 35 l List for name = “fred”: r 18, r 52 l Answer is intersection: r 18 l 10
. . . Bit Maps age index bit maps data records 11
Bitmap Index on a particular column l Each value in the column has a bit vector: bit-op is fast l The length of the bit vector: # of records in the base table l The i-th bit is set if the i-th row of the base table has the value for the indexed column l not suitable for high cardinality domains Base table Index on Region Index on Type l 12
Using Bit Maps l Query: u Get people with age = 20 and name = “fred” List for age = 20: 1101100000 l List for name = “fred”: 0100000001 l Answer is intersection: 0100000 l Good if domain cardinality small l Bit vectors can be compressed l 13
Join • “Combine” SALE, PRODUCT relations • In SQL: SELECT * FROM SALE, PRODUCT WHERE. . . 14
Join Indexes join index 15
Cube Computation for Data Warehouses 16
Counting Exercise l How many cuboids are there in a cube? u The full or nothing case u When dimension hierarchies are present l What is the size of each cuboid? 17
Lattice of Cuboids 129 city, product day 2 all product city, date product, date city, product, date day 1 18
Dimension Hierarchies all state city 19
Dimension Hierarchies all city, product city, date city, product, date state, date state, product, date not all arcs shown. . . 20
Efficient Data Cube Computation l Data cube can be viewed as a lattice of cuboids u The bottom-most cuboid is the base cuboid u The top-most cuboid (apex) contains only one cell u How many cuboids in an n-dimensional cube with L levels? l Materialization of data cube u Materialize every (cuboid) (full materialization), none (no materialization), or some (partial materialization) u Selection è of which cuboids to materialize Based on size, sharing, access frequency, etc. 21
Derived Data l Derived Warehouse Data u indexes u aggregates u materialized views (next slide) When to update derived data? l Incremental vs. refresh l 22
Idea of Materialized Views l Define new warehouse tables/arrays does not exist at any source 23
Efficient OLAP Processing l Determine which operations should be performed on available cuboids u Transform drill, roll, etc. into corresponding SQL and/or OLAP operations, e. g. , dice = selection + projection l Determine which materialized cuboid(s) should be selected for OLAP: u Let the query to be processed be on {brand, province_or_state} with the condition “year = 2004”, and there are 4 materialized cuboids available: 1) {year, item_name, city} 2) {year, brand, country} 3) {year, brand, province_or_state} 4) {item_name, province_or_state} where year = 2004 Which should be selected to process the query? l Explore indexing structures & compressed vs. dense arrays in MOLAP 24
What to Materialize? Store in warehouse results useful for common queries l Example: total sales l day 2 . . . day 1 129 materialize 25
Materialization Factors Type/frequency of queries l Query response time l Storage cost l Update cost l Will study a concrete algorithm later 26
Iceberg Cube l Computing only the cuboid cells whose count or other aggregates satisfying the condition like HAVING COUNT(*) >= minsup l Motivation u Only a small portion of cube cells may be “above the water’’ in a sparse cube u Only calculate “interesting” cells—data above certain threshold 27
Challenges in MOLAP l Storing large arrays for efficient access u Row-major, column major u Chunking u Compressing sparse arrays Creating array data from data in tables l Efficient techniques for Cube computation l Topics are discussed in the paper for reading 28
- Slides: 28