PrefixCube: Prefix-sharing Condensed Data Cube, Jianlin Feng

PrefixCube: Prefix-sharing Condensed Data Cube
Jianlin Feng, Qiong Fang, Hulin Ding
Huazhong Univ. of Sci. & Tech.
fengjl@mail.hust.edu.cn
Nov 12, 2004
DOLAP 2004

Outline
• Introduction
• Related Work
• ODM: Ordered Datacube Model
• BST-Condensed Cube
• Prefix-sharing Condensed Cube
• Comparisons
• Conclusions

Introduction
• Data Cube (ICDE’96)
  – N-dimensional cube(A1, A2, …, AN)
  – 2^N cuboids, i.e. GROUP-BYs
• The Huge Size Problem
  – When the base relation R is sparse, the size of a cuboid can be close to the size of R.
  – Even the I/O cost of storing the cube result tuples becomes dominant.
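A minimal sketch of the two points above, assuming a tiny hypothetical relation R over three dimensions (the data and the helper names `cuboids` and `cuboid_size` are illustrative and do not come from the talk). It enumerates the 2^N GROUP-BYs and shows that, when no dimension value repeats, every non-empty cuboid has as many tuples as R itself.

```python
from itertools import combinations


def cuboids(dimensions):
    """Return every GROUP-BY subset of the dimensions, including the empty one."""
    dims = list(dimensions)
    return [combo for k in range(len(dims) + 1) for combo in combinations(dims, k)]


def cuboid_size(rows, dim_indexes):
    """Number of distinct groups (cube tuples) in the cuboid over dim_indexes."""
    return len({tuple(row[i] for i in dim_indexes) for row in rows})


# Hypothetical sparse base relation R over (A, B, C): no dimension value repeats.
R = [(1, 7, 3), (2, 5, 9), (4, 6, 8)]

all_cuboids = cuboids(range(3))            # 2**3 = 8 GROUP-BYs
sizes = {c: cuboid_size(R, c) for c in all_cuboids}

for c, size in sizes.items():
    print(c, size)
```

Only the empty GROUP-BY (the grand total) collapses to a single tuple here; every other cuboid is as large as R.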

Related Work
• Condensed Cube (ICDE’02)
• Dwarf (SIGMOD’02)
• Quotient Cube (VLDB’02)
• QC-Tree (SIGMOD’03)
• Basic idea: remove the redundancies that exist among cube tuples.
  – prefix redundancy
  – suffix redundancy

Prefix Redundancy
• Given an example cube(A, B, C):
  – Each value of dimension A occurs in 4 cuboids: cuboid(A), cuboid(AB), cuboid(AC) and cuboid(ABC).
  – It may occur many times within each of these cuboids except cuboid(A).
• Inter-cuboid and intra-cuboid prefix redundancy
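The counting argument can be checked with a small sketch. The relation `ROWS` and the helpers `cuboids_containing` and `occurrences` are hypothetical names introduced only for illustration.

```python
from itertools import combinations

DIMS = ("A", "B", "C")


def cuboids_containing(dim, dimensions):
    """List the cuboids (as tuples of dimension names) that group by `dim`."""
    others = [d for d in dimensions if d != dim]
    found = []
    for k in range(len(others) + 1):
        for rest in combinations(others, k):
            chosen = set(rest) | {dim}
            found.append(tuple(d for d in dimensions if d in chosen))
    return found


def occurrences(rows, group_by, dim, value):
    """Count the cube tuples of cuboid(group_by) whose `dim` equals `value`."""
    groups = {tuple(row[d] for d in group_by) for row in rows}
    pos = group_by.index(dim)
    return sum(1 for g in groups if g[pos] == value)


# Hypothetical base relation; the value A=1 appears in two base tuples.
ROWS = [
    {"A": 1, "B": 1, "C": 1},
    {"A": 1, "B": 2, "C": 1},
    {"A": 2, "B": 2, "C": 2},
]

with_a = cuboids_containing("A", DIMS)     # inter-cuboid redundancy: 4 cuboids
counts = {c: occurrences(ROWS, c, "A", 1) for c in with_a}  # intra-cuboid repeats
print(with_a)
print(counts)
```

In this toy relation the value A=1 is stored once in cuboid(A) but twice in cuboid(AB) and cuboid(ABC), which is the intra-cuboid part of the redundancy.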

Suffix Redundancy
• Occurs when cube tuples belonging to different cuboids are actually aggregated from the same group of base relation tuples.
• An extreme case:
  – Let the source relation R have only a single tuple r(a1, a2, …, an, m).
  – The 2^n cube tuples can be condensed into one physical tuple (a1, a2, …, an, V), where V = aggr(r),
  – together with some information indicating that it is a representative tuple.
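The extreme case is easy to reproduce. The sketch below assumes SUM as the aggregate and uses "*" for ALL; `full_cube` is an illustrative brute-force helper, not an algorithm from the talk.

```python
from itertools import combinations

ALL = "*"


def full_cube(rows, n_dims):
    """Brute-force full cube with SUM; rows are (d1, ..., dn, measure) tuples."""
    cube = {}
    for row in rows:
        dims, measure = row[:n_dims], row[n_dims]
        for k in range(n_dims + 1):
            for kept in combinations(range(n_dims), k):
                key = tuple(dims[i] if i in kept else ALL for i in range(n_dims))
                cube[key] = cube.get(key, 0) + measure
    return cube


# Source relation with one single tuple r(a1, a2, a3, m).
cube = full_cube([("a1", "a2", "a3", 10)], 3)

for key, value in sorted(cube.items()):
    print(key, value)
```

All 2^3 cube tuples carry the same aggregate, so one physical tuple plus a marker is enough to represent them.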

Thinking…
• Condensed cube
  – It condenses the cube tuples that are aggregated from one single base tuple into one physical tuple in order to reduce the cube’s size.
• Dwarf
  – Besides suffix coalescing, i.e. multi-base-tuple condensing, it also realizes full prefix-sharing so as to reduce the cube size more effectively.

Motivation
• How can we further reduce the condensed cube’s size while taking into account the characteristics of the queries we intend to answer, namely range queries?
• By augmenting BST-condensing with the removal of intra-cuboid prefix redundancy!

Ordered Datacube Model
• The value ALL (or *) is encoded as 0.
• A dimension D and its cardinality C
  – Each dimension value is mapped one-to-one to an integer between 1 and C inclusive.
• N dimensions form an N-dimensional space.
• The origin O(0, 0, …, 0) represents the grand total.
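A minimal sketch of this encoding. The slide only requires a one-to-one mapping onto 1..C; using the sorted order of the values is an assumption made here, and `build_encoder` and `encode_cell` are hypothetical helper names.

```python
ALL = 0  # the value ALL (or *) is encoded as 0


def build_encoder(values):
    """Map each distinct dimension value to an integer in 1..C (sorted order assumed)."""
    ordered = sorted(set(values))
    return {value: code for code, value in enumerate(ordered, start=1)}


def encode_cell(cell, encoders):
    """Encode one cube tuple; '*' stands for ALL in the input."""
    return tuple(ALL if v == "*" else enc[v] for v, enc in zip(cell, encoders))


# Hypothetical dimensions: city (C = 3) and year (C = 2).
city = build_encoder(["tokyo", "paris", "paris", "lima"])
year = build_encoder([2003, 2004])

print(encode_cell(("paris", "*"), [city, year]))   # a tuple of cuboid(city)
print(encode_cell(("*", "*"), [city, year]))       # the origin, i.e. the grand total
```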

Ordered Datacube Model
• Under ODM, a range query against a data cube reduces to a sub-query against one particular cuboid in the cube, or to a union of such sub-queries.
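One possible reading of this reduction, sketched under assumptions the slide does not spell out: a range query gives one inclusive (lo, hi) interval per dimension in ODM coordinates, and an interval that starts at 0 covers the ALL coordinate as well as ordinary values. Under that reading, each dimension contributes either "aggregated away" or a proper range, and every combination is a sub-query against exactly one cuboid. The function name `split_range_query` is hypothetical.

```python
from itertools import product


def split_range_query(ranges):
    """Split a range query into per-cuboid sub-queries.

    ranges: one inclusive (lo, hi) pair per dimension, with 0 <= lo <= hi.
    Each sub-query holds, per dimension, None (dimension is ALL) or a (lo, hi)
    range with lo >= 1 (dimension is grouped in the target cuboid).
    """
    per_dim = []
    for lo, hi in ranges:
        options = []
        if lo == 0:
            options.append(None)               # coordinate 0 means ALL
        if hi >= 1:
            options.append((max(lo, 1), hi))   # ordinary values 1..C
        per_dim.append(options)
    return [list(choice) for choice in product(*per_dim)]


# Touches a single cuboid: the one grouped by the first dimension only.
print(split_range_query([(2, 5), (0, 0)]))
# Needs a union of two sub-queries, one per cuboid.
print(split_range_query([(0, 3), (1, 4)]))
```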

BST-Condensed Cube
• Base Single Tuple (BST)
  – t1 is a BST on SD {A} and on SD {B}
  – t2 is a BST on SD {B}
• A unique minimal BST-condensed cube, MinCube, is obtained by fully exploiting each BST with all of its SDs.
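The slide's example table is not reproduced here, so the sketch below uses a hypothetical relation built to show the same pattern. It assumes the definition from the condensed cube work: a base tuple is a BST on a dimension set SD if it is the only tuple in its partition when the base relation is partitioned on SD. The helper name `bst_single_dims` is illustrative.

```python
from collections import Counter
from itertools import combinations


def bst_single_dims(rows, dim_names):
    """For each row index, list every dimension set SD on which the row is a BST."""
    result = {idx: [] for idx in range(len(rows))}
    n = len(dim_names)
    for k in range(1, n + 1):
        for sd in combinations(range(n), k):
            # Partition the relation on SD and count the tuples per partition.
            counts = Counter(tuple(row[i] for i in sd) for row in rows)
            for idx, row in enumerate(rows):
                if counts[tuple(row[i] for i in sd)] == 1:
                    result[idx].append(tuple(dim_names[i] for i in sd))
    return result


# Hypothetical relation over (A, B, C); t1 is row 0, t2 is row 1.
rows = [(1, 1, 1), (2, 2, 1), (2, 3, 1)]
sds = bst_single_dims(rows, ("A", "B", "C"))

print(sds[0])   # t1: alone on {A} and on {B}, hence on their supersets too
print(sds[1])   # t2: alone on {B}, but not on {A}
```

Every superset of a reported SD is reported as well, since a tuple that is alone in a partition stays alone when the partition is refined.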

BU-BST Condensed Cube
• BottomUpBST algorithm (ICDE’02)
• Each BST corresponds to only one SD.
• Compared with MinCube, it is easier to compute, and normal cube tuples are easier to restore from the condensed cube.
• Note: BST condensing is a special kind of prefix-sharing! A group of cube tuples that share a prefix is represented by one BST!

A BU-BST Condensed Cube Example
Note:
– Intra-cuboid prefix redundancy: ct3 and ct4
– Inter-cuboid prefix redundancy: ct2, ct3 and ct5

Prefix-sharing Condensed Cube: PrefixCube
• PrefixCube = BST condensing + intra-cuboid prefix-sharing
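To give a feel for what intra-cuboid prefix-sharing saves, here is a rough sketch that stores the tuples of one cuboid in a prefix tree and counts the stored dimension values. It is not the storage layout used by PrefixCube, which the slides do not describe in text; the cuboid contents and the helper names are hypothetical, and measures are left out.

```python
def flat_cells(tuples):
    """Dimension values stored when every cube tuple is written out in full."""
    return sum(len(t) for t in tuples)


def build_prefix_tree(tuples):
    """Nested dicts in which tuples with a common prefix share the same path."""
    root = {}
    for t in sorted(tuples):
        node = root
        for value in t:
            node = node.setdefault(value, {})
    return root


def shared_cells(node):
    """Dimension values stored when each shared prefix is kept only once."""
    return sum(1 + shared_cells(child) for child in node.values())


# Hypothetical tuples of cuboid(A, B), in ODM coordinates.
cuboid_ab = [(1, 1), (1, 2), (1, 3), (2, 1)]
tree = build_prefix_tree(cuboid_ab)

print(flat_cells(cuboid_ab), shared_cells(tree))
```

The three tuples starting with A=1 store that value once instead of three times, which accounts for the difference between the two counts.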

A PrefixCube Example

Corresponding Dwarf

PrefixCube vs. Dwarf

                              PrefixCube        Dwarf
Prefix-sharing                Intra-cuboid      Inter- and intra-cuboid
Suffix coalescing             BST condensing    Multi-tuple condensing
Compression ratio             Lower             Higher
Saving extra value ALL?       No                Yes
Tuples clustered by cuboid?   Yes               No

PrefixCube does not blindly aim at the highest compression ratio; it is intended to strike a good compromise among cube size reduction ratio, restoring and updating costs, and query characteristics!

Effectiveness of Size Reduction
• Datasets
  – Synthetic datasets with uniform distribution
  – # of tuples: 1, 000
[Figure panels: (a) Cardinality = 100, (b) Cardinality = 1000]

Effectiveness of Size Reduction
• PrefixBUC
  – Full cube (computed by BUC)
  – Prefix-sharing

Impact of Data Density
• Datasets
  – Uniform distribution
  – # of dimensions: 6
  – Cardinality of dimensions: 100
  – # of tuples: range from 1, 000 to 1, 000

Impact of Data Skewness
• Datasets
  – Zipf distribution
  – # of tuples: 1, 000
  – Cardinality of dimensions: range from 1, 000 to 500 with 100 interval
  – Zipf factor: from 0 to 0.8 in steps of 0.2

Real-world Dataset
• Datasets
  – Weather datasets
  – # of tuples: 1,015,367

Conclusion
• A new cube structure, PrefixCube, was proposed by augmenting BU-BST condensing with intra-cuboid prefix-sharing.
  – It greatly reduces the data cube’s size compared with the BU-BST condensed cube.
  – It also reduces the impact of data skew on BU-BST condensing.
  – It achieves a fairly stable size reduction on both dense and sparse datasets.

The End
Thank you! Any questions?