Data Cube A Relational Aggregation Operator Generalizing GroupBy

Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross. Tab, and Sub-Totals Jim Gray Adam Bosworth Andrew Layman Hamid Pirahesh Microsoft IBM Presented by: Changwu Li Based on notes by Jim Gray 9/9/2021 1

The Data Analysis Cycle n User extracts data from database with query Spread Sheet Extract analyze Table 1015 visualize n Then visualizes, analyzes data with desktop tools 9/9/2021 Size vs Speed Price vs Speed 104 Cache Nearline 1 Tape Offline Tape Main 1012 102 Secondary Disc Size(B) Online $/MB Online Secondary 9 10 Tape Disc Tape 100 Main 106 Nearline Tape Offline Tape 10 -2 Cache 103 10 -4 -9 -6 -3 0 3 10 10 10 Access Time (seconds) 2

Relational Aggregate Operators n n SQL has several aggregate operators: u sum(), min(), max(), count(), avg() Other systems extend this with many others: u stat functions, financial functions, . . . The basic idea is: u Combine all values in a column into a single scalar value. Syntax select sum(units) from inventory; 9/9/2021 3

Relational Group By Operator n n n Group By allows aggregates over table sub-groups Result is a new table Syntax: select deptno, sum(salary) from emp group by deptno 9/9/2021 4

Problems With This Design n Users Want Histograms F() G() H() n Users want sub-totals and totals u n n sum drill-down & roll-up reports Users want Cross. Tabs AIR HOTEL FOOD MISC M T W T F S S� Conventional wisdom � u These are not relational operators u They are in many report writers and query engines 9/9/2021 5

A cross tab example Table 5 a: Ford Sales Cross Tab 9/9/2021 Ford 1994 1995 black 50 85 total (ALL) 135 white 10 75 85 total (ALL) 60 160 220 6

How to solve this problem? Answer: cube 9/9/2021 7

The Idea: Think of the N-dimensional Cube Each Attribute is a Dimension n N-dimensional Aggregate (sum(), max(), . . . ) u fits relational model exactly: « a 1, a 2, . . , a. N, f() Super-aggregate over N-1 Dimensional sub-cubes « ALL, a 2, . . , a. N , f() « a 3 , ALL, a 3, . . , a. N , f() «. . . « a 1, a 2, . . , ALL, f() u this is the N-1 Dimensional cross-tab. Super-aggregate over N-2 Dimensional sub-cubes « ALL, a 3, . . , a. N , f() «. . . « a 1, a 2 , . . , ALL, f() 9/9/2021 8

An Example CUBE 9/9/2021 9

Why the ALL Value? n n n Need a new “Null” value (overloads the null indicator) Value must not already be in the aggregated domain Can’t use NULL since may aggregate on it. Think of ALL as a token representing the set u All(color)={red, white, blue}, u All(year)={1990, 1991, 1992}, u All(model)={Chevy, Ford} Follow “set of values” semantics. 9/9/2021 10

CUBE operator: Syntax n Proposed syntax: select model, make, year, sum(sales) from car_sales where model in {“chevy”, “ford”} and year between 1990 and 1994 group by model, make, year with cube having sum(sales) > 0; n Note: Group By operator repeats aggregate list u in select list u in group by list 9/9/2021 11

How To Compute the Cube? n n If each attribute has Ni values CUBE has P (Ni+1) values Compute N-D cube with hash if fits in RAM Compute N-D cube with sort if overflows RAM Same comments apply to subcubes: u compute N-D-1 subcube from N-D cube. u Aggregate on “biggest” domain first when >1 deep u Aggregate functions need hidden variables: « e. g. average needs sum and count. 9/9/2021 12

Example: n Compute 2 D core of 2 x 3 cube n Then compute 1 D edges n Then compute 0 D point n Works for algebraic and distributive functions Saves “lots” of calls 9/9/2021 13

Real world implementation n n Both Oracle 9 i and SQL server 2000 An example in Oracle 9 i: select deptno, job, sum(sal) as salary from emp group by cube(deptno, job) 9/9/2021 14

DEPTNO JOB SALARY -------------10 CLERK 1300 10 MANAGER 2450 10 PRESIDENT 5000 10 8750 20 ANALYST 6000 20 CLERK 1900 20 MANAGER 2975 20 10875 30 CLERK 950 30 MANAGER 2850 30 SALESMAN 5600 30 9400 ANALYST 6000 CLERK 4150 MANAGER 8275 PRESIDENT 5000 SALESMAN 5600 2902 9/9/2021 15

1999 ACM Turing Award Jim Gray 9/9/2021 16