Online Analytical Processing OLAP An Overview Kian Win

Overview • • Motivation Multi-Dimensional Data Model Research Areas Optimizations – Materializing multiple aggregates

Motivation • Aggregation, summarization and exploration • Of historical data • To help management

Different Goal • Aggregation, summarization and exploration • Of historical data • To help

Different Requirements • OLTP – On-Line Transaction Processing • OLAP – On-Line Analytical Processing

Query Language Extensions • In the real world, data is stored in RDBs.

Query Language Extensions Problems with GROUP BY 1. histograms SELECT sales, prod_name, population FROM

Query Language Extensions Problems with GROUP BY 1. histograms 2. rollup/drilldown Product Category Drinks

Query Language Extensions Problems with GROUP BY 1. histograms 2. rollup/drilldown Product Category Product

Query Language Extensions Problems with GROUP BY 1. histograms 2. rollup/drilldown 3. cross tabulations

Query Language Extensions Reducing the number of attributes Product Category Product Name Month Sales

Query Language Extensions Reducing the number of attributes • introduce a new value: “ALL”

Query Language Extensions General approach • GROUP BY (1 D) Sales by Product Name

Query Language Extensions General approach • GROUP BY (1 D) Product Category Feb Mar

Query Language Extensions General approach • GROUP BY (1 D) • Cross Tab (2

Query Language Extensions General approach • a CUBE relation, with aggregation function f(. )

Query Language Extensions The new operators: CUBE, ROLLUP SELECT prod_category, prod_name, month, SUM(sales) AS

Query Language Extensions The new operators: CUBE, ROLLUP SELECT prod_category, month, day, state, prod_name,

Research Areas • • • SQL language extensions Server architecture Parallel processing Index structures

Simultaneous Multi-Dimensional Aggregates • Y. Zhao, P. Deshpande, J. Naughton An Array-Based Algorithm for

Multiple Aggregates Aggregate on… Product City Month Sales Coke San Diego Feb 06 12

Multiple Aggregates City / Product Aggregate on… San Diego Los Angeles Total Altoids 90

Chunking 64 Partition transactional data into array chunks 13 14 15 16 9 10

Naïve Algorithm Dimension A 64 13 14 15 16 9 10 11 12 42

Naïve Algorithm 64 13 14 15 16 9 10 11 12 42 Dimension B

Single Pass Algorithm 64 AB 1 2 3 13 14 15 16 9 10

Single Pass Algorithm 64 AB 13 9 10 11 12 5 6 7 8

Single Pass Algorithm AB 13 Array Chunk 9 10 11 12 5 6 7

Multi Pass Algorithm Recursively aggregate ABCD ABC AB A ACD ABD AC BC AD

Implementing Data Cubes • Biggest problem for data warehouses: the size • Space /

Implementing Data Cubes • V. Harinarayan, A. Rajaraman, J. D. Ullman: Implementing Data Cubes

Implementing Data Cubes Month Day Year Month Time Id City State Week City Id

Implementing Data Cubes • casted as particular case of the rewriting using views problem

Implementing Data Cubes • but cube dimensions are usually hierarchical product_name city day X

Implementing Data Cubes • Def. cost of answering Q = # of rows in

Implementing Data Cubes • For a set S and a view v B(v, S)

Discussion • Questions from the audience…

Slides: 50

Download presentation

Online Analytical Processing (OLAP) An Overview Kian Win Ong, Nicola Onose Mar 3 rd 2006

Overview • • Motivation Multi-Dimensional Data Model Research Areas Optimizations – Materializing multiple aggregates simultaneously – Materialization strategy

Motivation • Aggregation, summarization and exploration • Of historical data • To help management make informed decisions

Different Goal • Aggregation, summarization and exploration • Of historical data • To help management make informed decisions Product Branch Time Price Coke (0. 5 gallon) Convoy Street 2006 -03 -01 09: 00: 01 $1. 00 Pepsi (0. 5 gallon) UTC 2006 -03 -01 09: 00: 01 $1. 03 Coke (1 gallon) UTC 2006 -03 -01 09: 00: 02 $1. 50 Altoids Costa Verde 2006 -03 -01 09: 01: 33 $0. 30 . . . • Find the total sales for each product and month • Find the percentage change in the total monthly sales for each product

Different Requirements • OLTP – On-Line Transaction Processing • OLAP – On-Line Analytical Processing OLTP Tasks OLAP Day to day operation High level decision support Gigabytes Terabytes Time span Recent, up-to-date Spanning over months / years Size of working set Tens of records, accessed through primary keys Consolidated data from multiple databases Structured / repetitive Ad-hoc, exploratory queries Transaction throughput Query latency Size of database Workload Performance

Overview • • Motivation Multi-Dimensional Data Model Research Areas Optimizations – Materializing multiple aggregates simultaneously – Materialization strategy

Query Language Extensions • In the real world, data is stored in RDBs.

Query Language Extensions • In the real world, data is stored in RDBs. • How to express N-dimensional problems using 2 D tables?

Query Language Extensions • In the real world, data is stored in RDBs. • How to express N-dimensional problems using 2 D tables? • Can we combine OLAP and SQL queries? • Jim Gray et al: Data Cube: A Relational Aggregation Operator 1997

Query Language Extensions Problems with GROUP BY 1. histograms SELECT sales, prod_name, population FROM sales_history GROUP BY Population(City, State) as population

Query Language Extensions Problems with GROUP BY 1. histograms 2. rollup/drilldown Product Category Drinks Product Name Coke Heineken non relational representation Month Sales Feb 30. 3 Mar 93. 9 Feb 34. 8 Mar 123. 8 Sales by Cat. , by Name Sales by Cat. 124. 2 158. 6 282. 8

Query Language Extensions Problems with GROUP BY 1. histograms 2. rollup/drilldown Product Category Product Name relational, but the rollup is huge Month Sales by Cat. , by Name Sales by Cat. Drinks Coke Feb 30. 3 124. 2 282. 8 Drinks Coke Mar 93. 9 124. 2 282. 8 Drinks Heineken Feb 34. 8 158. 6 282. 8 Drinks Heineken Mar 123. 8 158. 6 282. 8

Query Language Extensions Problems with GROUP BY 1. histograms 2. rollup/drilldown 3. cross tabulations 2 -D aggregation is more compact and more natural: Drinks Feb Mar Total Coke 30. 3 93. 9 124. 2 Heineken 34. 8 123. 8 158. 6 Total 65. 1 217. 7 282. 8

Query Language Extensions Reducing the number of attributes Product Category Product Name Month Sales Drinks Coke Feb 30. 3 Drinks Coke Mar 93. 9 Drinks Coke ALL 124. 2 Drinks Heineken Feb 34. 8 Drinks Heineken Mar 123. 8 Drinks Heineken ALL 158. 6 Drinks ALL 282. 8 Drinks ALL Feb 65. 1 Drinks ALL Mar 217. 7

Query Language Extensions Reducing the number of attributes • introduce a new value: “ALL” Drinks Feb Mar Total (ALL) Coke 30. 3 93. 9 124. 2 Heineken 34. 8 123. 8 158. 6 Total (ALL) 65. 1 217. 7 282. 8 “ALL” = the set over which we aggregate

Query Language Extensions General approach • GROUP BY (1 D) Sales by Product Name Feb Mar Coke 30. 3 93. 9 Heineken 34. 8 123. 8 SUM 65. 1 217. 7

Query Language Extensions General approach • GROUP BY (1 D) Product Category Feb Mar Product Name Month Sales Drinks Coke Feb 30. 3 Drinks Coke Mar 93. 9 Drinks Coke ALL 124. 2 Drinks Heineken Feb 34. 8 Drinks Heineken Mar 123. 8 ALL Drinks Heineken ALL 158. 6 • Cross Tab (2 D) Drinks the corresponding relation: Coke 30. 3 93. 9 124. 2 Drinks ALL Feb 65. 1 Heineken 34. 8 123. 8 158. 6 Drinks ALL Mar 217. 7 ALL 65. 1 217. 7 282. 8 Drinks ALL 282. 8

Query Language Extensions General approach • GROUP BY (1 D) • Cross Tab (2 D) • Cube (3 D) By cat. and name (does it make sense? ) By cat. and month By month and name Product Category Product Name Month Sales Drinks Coke Feb 30. 3 Drinks Coke Mar 93. 9 Drinks Coke ALL 124. 2 … … Snacks Doritos Feb 123. 8 Snacks Doritos Mar 158. 6 Snacks Doritos ALL 65. 1 … … ALL ALL 964. 0

Query Language Extensions General approach • GROUP BY (1 D) • Cross Tab (2 D) • Cube (3 D) • Any hypercube can be represented as a relation!

Query Language Extensions General approach • a CUBE relation, with aggregation function f(. ) (x 1, x 2, …, xn-1, xn, f() ) ……………… (x 1, xn-1, …, xn, ALL, f() ) ……………… (x 1, x 2, …, ALL, xn, f() ) ……………… • after ROLLUP , reduce to a linear # of tuples (x 1, x 2, …, xn-1, xn, f() ) ………………… (x 1, xn-1, …, xn, ALL, f() ) ………………… (x 1, x 2, …, ALL, f() ) ………………… (ALL, …, ALL, f() )

Query Language Extensions The new operators: CUBE, ROLLUP SELECT prod_category, prod_name, month, SUM(sales) AS sales FROM sales_history GROUP BY CUBE prod_category, prod_name, month Product Category Product Name Month Sales Drinks Coke Feb 30. 3 Drinks Coke Mar 93. 9 Drinks Coke ALL 124. 2 … … Drinks ALL Feb 99. 8 … … ALL ALL 964. 0 Idea: Group by the CUBE list. Union the aggregates. Introduce the ALL values.

Query Language Extensions The new operators: CUBE, ROLLUP SELECT prod_category, month, day, state, prod_name, SUM(sales) AS sales FROM sales_history GROUP BY prod_category ROLLUP month, day CUBE city, state Product Category Drinks Snacks Month Day State Product Name Sales Feb 26 CA Coke 12. 3 Feb 26 CA Heineken 5. 4 … … … Feb 26 CA ALL 30. 4 Feb 26 ALL Coke … … … Feb 26 Doritos 12. 0 … … CA

Overview • • Motivation Multi-Dimensional Data Model Research Areas Optimizations – Materializing multiple aggregates simultaneously – Materialization strategy

Research Areas • • • SQL language extensions Server architecture Parallel processing Index structures Materialized views

Overview • • Motivation Multi-Dimensional Data Model Research Areas Optimizations – Materializing multiple aggregates simultaneously – Materialization strategy

Simultaneous Multi-Dimensional Aggregates • Y. Zhao, P. Deshpande, J. Naughton An Array-Based Algorithm for Simultaneous Multidimensional Aggregates SIGMOD 1997 • Optimization to calculate multiple aggregates simultaneously • Useful for materialization of aggregate views

Multiple Aggregates Aggregate on… Product City Month Sales Coke San Diego Feb 06 12 Pepsi Los Angeles Feb 06 13 Doritos San Diego Mar 06 72 Altoids San Diego Mar 06 65 . . . Month / Product Feb Mar Total Altoids 36 131 167 Coke 37 138 175 Doritos 21 136 157 Heineken 44 110 154 Pepsi 31 122 153 Pringles 37 126 164 Total 206 764 970

Multiple Aggregates City / Product Aggregate on… San Diego Los Angeles Total Altoids 90 77 167 Coke 89 86 175 Doritos 74 83 157 Product City Month Sales Heineken 74 80 154 Coke San Diego Feb 06 12 Pepsi 68 85 153 Pepsi Los Angeles Feb 06 13 Pringles 73 90 164 Doritos San Diego Mar 06 72 Total 469 501 970 Altoids San Diego Mar 06 65 Month / Product . . . Month / City Feb Mar Total Altoids 36 131 167 Coke 37 138 175 Doritos 21 136 157 Heineken 44 110 154 Los Angeles 112 358 469 Pepsi 31 122 153 San Diego 95 407 501 Pringles 37 126 164 Total 206 764 970

Multiple Aggregates Aggregate on… Product City Month Sales Coke San Diego Feb 06 12 Pepsi Los Angeles Feb 06 13 Doritos San Diego Mar 06 72 Altoids San Diego Mar 06 65 . . . 1. 2. 3. 4. 5. 6. 7. Sales by Product / City Sales by Product / Month Sales by Month / City Sales by Product Sales by City Sales by Month Sales (Total) Is it possible to • make a single pass over the transactional table? • calculate multiple aggregates simultaneously?

Chunking 64 Partition transactional data into array chunks 13 14 15 16 9 10 11 12 42 Dimension B City 5 6 7 36 8 Array Chunk 20 1 1 2 3 4 Dimension C 12 Month Dimension A Product City Month Sales Coke San Diego Feb 06 12

Naïve Algorithm Dimension A 64 13 14 15 16 9 10 11 12 42 Dimension B 5 6 7 36 8 20 1 2 3 4 Dimension C Pivot on AB aggregate on all C Dimension A

Naïve Algorithm 64 13 14 15 16 9 10 11 12 42 Dimension B 5 6 7 36 8 20 1 2 3 4 Dimension C Pivot on AB aggregate on all C Pivot on AC aggregate on all B Pivot on BC aggregate on all A Dimension A

Single Pass Algorithm 64 AB 1 2 3 13 14 15 16 9 10 11 12 4 AC 42 B 5 6 7 36 8 20 1 2 3 4 Dimension C BC Dimension A 12 34 Make a single pass over data

Single Pass Algorithm 64 AB 13 9 10 11 12 5 6 7 8 1 2 3 4 AC 13 14 15 16 9 10 11 12 42 B 5 6 7 36 8 20 159 13 2 6 10 3 7 11 4 5 12 1 2 13 56 78 12 34 4 Dimension C BC 9 10 11 12 3 Dimension A Simultaneously maintain multiple aggregates

Single Pass Algorithm 64 AB 13 9 10 11 12 5 6 7 8 1 2 3 4 AC 13 14 15 16 9 10 11 12 42 B 5 6 7 36 8 20 1 5 9 13 2 6 10 3 7 11 4 5 12 1 2 13 5678 1234 4 Dimension C BC 9 10 11 12 3 Dimension A Write out completed aggregates

Single Pass Algorithm 64 AB 13 9 10 11 12 5 6 7 8 1 2 3 4 AC 13 14 15 16 9 10 11 12 42 B 5 6 7 36 8 20 1 5 9 13 2 6 10 3 7 11 4 5 12 1 2 4 Dimension C BC Dimension A 13 3 Only allocate memory that is necessary

Single Pass Algorithm AB 13 Array Chunk 9 10 11 12 5 6 7 8 ABC 1 2 3 4 4 x 4 x 4 AC 1 5 9 13 BC 2 6 10 3 7 11 AB AC BC 16 x 4 4 x 4 x 4 A B C 4 4 4 5 12 4 x 4 all 1 13 Minimum memory spanning tree

Multi Pass Algorithm Recursively aggregate ABCD ABC AB A ACD ABD AC BC AD C B all BCD BD CD D

Overview • • Motivation Multi-Dimensional Data Model Research Areas Optimizations – Materializing multiple aggregates simultaneously – Materialization strategy

Implementing Data Cubes • Biggest problem for data warehouses: the size • Space / time trade-off: accelerate queries by materializing the cube

Implementing Data Cubes • Biggest problem for data warehouses: the size • Space / time trade-off: accelerate queries by materializing the cube • The size of the relations gets even bigger!

Implementing Data Cubes • Biggest problem for data warehouses: the size • Space / time trade-off: accelerate queries by materializing the cube • The size of the relations gets even bigger! • M(ultidimensional)OLAP: good query performance, but bad scalability • R(elational)OLAP: very scalable; query performance improved by materializing (partial) results

Implementing Data Cubes • V. Harinarayan, A. Rajaraman, J. D. Ullman: Implementing Data Cubes Efficiently SIGMOD 1996 Presents a materialization strategy for the cells of the cube.

Implementing Data Cubes Month Day Year Month Time Id City State Week City Id Product Id Sales Product Id Name Category Id Category Name

Implementing Data Cubes • casted as particular case of the rewriting using views problem • what cells to materialize what SQL views to materialize

Implementing Data Cubes • casted as particular case of the rewriting using views problem • what cells to materialize what SQL views to materialize ptc pt tc pc t p c p = product t = time c = city none • simple idea: Q 1 depends on Q 2 (Q 1≤Q 2) if Q 1 can be fully answered using the results of Q 2

Implementing Data Cubes • but cube dimensions are usually hierarchical product_name city day X product_category week month X state none year none p = product t = time c = city • direct-product lattice ptc pt pcatt … … tc pwc … pc pmc pts pyc … ps …

Implementing Data Cubes • Def. cost of answering Q = # of rows in the table of ancestor(Q) • It can be estimated w/o materializing the views • Assume that all queries are identical to some view in the lattice

Implementing Data Cubes • For a set S and a view v B(v, S) = ∑w≤v, (w not in S) max{cost(w)-cost(v), 0} • Greedy algorithm for selecting k views to materialize from the lattice: • 1. S : = {top view} 2. For i=1 to k, add v to S s. t. B(v, S) is maximized The greedy algorithm is an (e-1)/e ≈ 0. 63 approx. of the optimum.

Discussion • Questions from the audience…