Mining Frequent Closed Cubes in 3D Datasets
Liping Ji, Kian-Lee Tan, Anthony K. H. Tung
Computer Science Department, National University of Singapore

Motivation
- Frequent Closed Pattern (FCP) mining: great importance, wide application
- Previous work is all limited to 2D FCP mining
  - biological data: gene-time, gene-sample
  - market basket data: transaction-itemset
- Extend 2D FCP mining to the 3D context
  - biological data: gene-sample-time
  - marketing data: region-time-items

Background
- Frequent Pattern (FP) and Frequent Closed Pattern (FCP)
- Minimum support threshold: minsup = 2
- Example table: transactions t1 to t4 over items a1 to a5, with the resulting itemsets labelled FP and FCP

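The difference between an FP and an FCP can be illustrated with a brute-force sketch. The transactions below are hypothetical toy data (not the table from the slide), and the function names `frequent_patterns` and `closed_patterns` are illustrative only. A pattern counts as closed here when no proper superset has the same support.

```python
from itertools import combinations

def frequent_patterns(transactions, minsup):
    """Return {itemset: support} for every itemset with support >= minsup."""
    items = sorted(set().union(*transactions))
    result = {}
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            candidate = frozenset(combo)
            # Support = number of transactions that contain the candidate.
            support = sum(1 for t in transactions if candidate <= t)
            if support >= minsup:
                result[candidate] = support
    return result

def closed_patterns(frequent):
    """Keep the frequent itemsets that have no proper superset with equal support."""
    return {
        p: s for p, s in frequent.items()
        if not any(p < q and s == sq for q, sq in frequent.items())
    }

# Hypothetical toy data, chosen only for illustration.
transactions = [
    frozenset({"a1", "a2", "a3"}),
    frozenset({"a1", "a2"}),
    frozenset({"a3"}),
]
fp = frequent_patterns(transactions, minsup=2)
fcp = closed_patterns(fp)
```

With this toy input, {a1} and {a2} are frequent but not closed, because {a1, a2} has the same support; only {a3} and {a1, a2} remain as FCPs.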
Background
- Binary mapping: the transactions T (t1 to t4) over the items I (a1 to a5) are mapped to a binary matrix TI with one row per item and one column per transaction; a cell is 1 if the transaction contains the item and 0 otherwise

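A minimal sketch of the binary mapping, assuming rows are items and columns are transactions as in the TI matrix. The three transactions are hypothetical, and `binary_matrix` is an illustrative name.

```python
def binary_matrix(transactions, items):
    """Rows are items, columns are transactions; 1 marks containment."""
    return [[1 if item in t else 0 for t in transactions] for item in items]

# Hypothetical transactions t1..t3 over items a1..a3.
transactions = [{"a1", "a2"}, {"a2"}, {"a1", "a3"}]
items = ["a1", "a2", "a3"]
matrix = binary_matrix(transactions, items)
```

Here the row for a1 is `[1, 0, 1]`, since a1 occurs in the first and third transactions.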
Frequent Closed Cube
- 3D dataset: height, row, and column dimensions (figure labels: Height, Row, Column, Slice)
- Slices by height dimension: h1, h2, h3
- Closed cube: maximal (shown on slices h1, h2, h3)

Frequent Closed Cube
- Definition: Frequent Closed Cube (FCC)
  - Maximal: cannot be extended in any dimension
  - Frequent: satisfies the minH, minR, and minC thresholds

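The definition can be checked directly on a small dataset. This is a sketch under stated assumptions: the dataset is a nested list `data[h][r][c]` of 0/1 values, a cube is a triple of index sets whose cells are all 1, and the names `all_ones` and `is_frequent_closed_cube` are hypothetical.

```python
from itertools import product

def all_ones(data, H, R, C):
    """True if every cell of the sub-cube H x R x C equals 1."""
    return all(data[h][r][c] == 1 for h, r, c in product(H, R, C))

def is_frequent_closed_cube(data, H, R, C, min_h, min_r, min_c):
    n_h, n_r, n_c = len(data), len(data[0]), len(data[0][0])
    # Frequent: each dimension meets its minimum size threshold.
    if len(H) < min_h or len(R) < min_r or len(C) < min_c:
        return False
    if not all_ones(data, H, R, C):
        return False
    # Maximal: adding any single height, row, or column must break the cube.
    if any(all_ones(data, [h], R, C) for h in set(range(n_h)) - set(H)):
        return False
    if any(all_ones(data, H, [r], C) for r in set(range(n_r)) - set(R)):
        return False
    if any(all_ones(data, H, R, [c]) for c in set(range(n_c)) - set(C)):
        return False
    return True

# Hypothetical 2 x 2 x 3 dataset (heights x rows x columns).
data = [
    [[1, 1, 0], [1, 1, 1]],
    [[1, 1, 1], [1, 1, 0]],
]
full = is_frequent_closed_cube(data, {0, 1}, {0, 1}, {0, 1}, 2, 2, 2)
partial = is_frequent_closed_cube(data, {0, 1}, {0}, {0, 1}, 2, 1, 2)
```

The second cube fails the check because it can still be extended with row 1, so it is not maximal.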
RSM vs. CubeMiner
- Representative Slice Mining (RSM): extends existing 2D FCP mining algorithms to FCC mining
- CubeMiner: operates on the 3D space directly

RSM
- Representative Slice (RS) generation: enumerate all possible combinations of slices
- 2D FCP mining from each RS
- Post-pruning to remove unclosed cubes: if a 2D FCP is contained in other slices besides its contributing slices, it is unclosed and hence removed; otherwise, it is retained

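The three RSM steps can be sketched as follows. This is a brute-force illustration, not the authors' implementation: it assumes the representative slice of a slice combination is the cell-wise AND of those slices, replaces a real 2D FCP miner with exhaustive enumeration of row subsets, and assumes all thresholds are at least 1. The names `closed_2d_patterns` and `rsm` are hypothetical.

```python
from itertools import combinations

def closed_2d_patterns(matrix, min_r, min_c):
    """Brute-force 2D FCP mining: closed (rows, cols) blocks of 1s."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    found = set()
    for k in range(1, n_rows + 1):
        for seed in combinations(range(n_rows), k):
            # Columns shared by the seed rows, then every row covering them.
            cols = frozenset(c for c in range(n_cols)
                             if all(matrix[r][c] for r in seed))
            rows = frozenset(r for r in range(n_rows)
                             if all(matrix[r][c] for c in cols))
            if len(rows) >= min_r and len(cols) >= min_c:
                found.add((rows, cols))
    return found

def rsm(data, min_h, min_r, min_c):
    n_h, n_rows, n_cols = len(data), len(data[0]), len(data[0][0])
    cubes = set()
    for k in range(min_h, n_h + 1):
        for heights in combinations(range(n_h), k):
            # Step 1: representative slice = cell-wise AND of the chosen slices.
            rs = [[int(all(data[h][r][c] for h in heights))
                   for c in range(n_cols)] for r in range(n_rows)]
            # Step 2: 2D FCP mining on the representative slice.
            for rows, cols in closed_2d_patterns(rs, min_r, min_c):
                # Step 3: post-pruning. A pattern that also holds in a
                # non-contributing slice is unclosed and is dropped.
                others = set(range(n_h)) - set(heights)
                if any(all(data[h][r][c] for r in rows for c in cols)
                       for h in others):
                    continue
                cubes.add((frozenset(heights), rows, cols))
    return cubes

# Hypothetical 2 x 2 x 3 dataset (heights x rows x columns).
data = [
    [[1, 1, 0], [1, 1, 1]],
    [[1, 1, 1], [1, 1, 0]],
]
cubes = rsm(data, min_h=2, min_r=2, min_c=2)
```

The outer loop grows with the number of slice combinations, which is why the enumerated dimension matters for RSM's cost.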
RSM
- Slices by height dimension: h1, h2, h3

CubeMiner Principle

CubeMiner: Cutters
- Slice h1
- Cutters from h1

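A sketch of how cutters might be derived from one slice. It assumes that a cutter is a triple (height, row, set of columns holding 0 in that row), which is consistent with the cutters (h1, r1, c4) and (h1, r2, c4 c5) used in the splitting tree but is not spelled out on this slide. The slice values are hypothetical, indices are 0-based, and `cutters_from_slice` is an illustrative name.

```python
def cutters_from_slice(h, slice_matrix):
    """Return one (h, r, zero_columns) triple per row that contains a 0."""
    cutters = []
    for r, row in enumerate(slice_matrix):
        zeros = frozenset(c for c, value in enumerate(row) if value == 0)
        if zeros:  # rows made entirely of 1s produce no cutter
            cutters.append((h, r, zeros))
    return cutters

# Hypothetical slice: row 0 has a 0 in column 3, row 1 in columns 3 and 4.
slice_h1 = [
    [1, 1, 1, 0, 1],
    [1, 1, 1, 0, 0],
    [1, 1, 1, 1, 1],
]
cutters = cutters_from_slice(0, slice_h1)
```

In 1-based slide notation, the two resulting cutters read (h1, r1, c4) and (h1, r2, c4 c5).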
Mining FCC: CubeMiner Splitting Tree
- Root: (h1 h2 h3, r1 r2 r3 r4, c1 c2 c3 c4 c5)
- Cutter checking: check whether the cutter is applicable (A.)
  - Subset of the node: A.
  - Otherwise: N.A.
- First cutter (h1, r1, c4): A. at the root
  - Left tree: remove the cutter's left atom h1 from the parent node, giving (h2 h3, r1~r4, c1~c5)
  - Middle tree: remove the cutter's middle atom r1 from the parent node, giving (h1~h3, r2~r4, c1~c5)
  - Right tree: remove the cutter's right atom c4 from the parent node, giving (h1~h3, r1~r4, c1 c2 c3 c5)
- Next cutter (h1, r2, c4 c5): N.A. for the left child, A. for the middle child
  - Children of the middle child: (h2 h3, r2~r4, c1~c5), (h1~h3, r3 r4, c1~c5), (h1~h3, r2~r4, c1~c3)
- Subset cube
- Left track checking

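The three-way split in the splitting tree can be reproduced with set differences. This minimal sketch covers only the split itself, assuming nodes and cutters are triples of sets; cutter applicability checking and the pruning steps (subset cube, left track checking) are left out, and `split` is a hypothetical name.

```python
def split(node, cutter):
    """Return the left, middle, and right children of `node` under `cutter`."""
    H, R, C = node
    cut_h, cut_r, cut_c = cutter
    left = (H - cut_h, R, C)    # remove the cutter's left atom
    middle = (H, R - cut_r, C)  # remove the cutter's middle atom
    right = (H, R, C - cut_c)   # remove the cutter's right atom
    return left, middle, right

root = (
    frozenset({"h1", "h2", "h3"}),
    frozenset({"r1", "r2", "r3", "r4"}),
    frozenset({"c1", "c2", "c3", "c4", "c5"}),
)
cutter1 = (frozenset({"h1"}), frozenset({"r1"}), frozenset({"c4"}))
cutter2 = (frozenset({"h1"}), frozenset({"r2"}), frozenset({"c4", "c5"}))

left, middle, right = split(root, cutter1)
# The second cutter is applied to the middle child, as in the tree.
grandchildren = split(middle, cutter2)
```

Applying `cutter2` to the middle child yields (h2 h3, r2~r4, c1~c5), (h1~h3, r3 r4, c1~c5), and (h1~h3, r2~r4, c1~c3), matching the second level of the tree.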
Parallelism
- RSM
  - Task: mining of each representative slice
- CubeMiner
  - Task: mining of each branch
- Processor
  - Initial: keeps a copy of the whole dataset
  - Independent and concurrent, with little communication cost

Mining FCC: Experiments
- Real data: yeast cell-cycle regulated genes
  - Elutriation experiments: 14*9*7161
  - CDC15 experiments: 19*9*7761
- Synthetic data: IBM data generator
  - Synthetic 1: H*R*C = (8~20)*20*1000
  - Synthetic 2: H*R*C = 100*10000

Experiments: Optimizing CubeMiner
- Optimal: sort slices in decreasing order of the number of zeros
- Prunes off infrequent cubes early
- Dataset: Elutriation (14*9*7161)

Experiments: Optimizing RSM
- Optimal: enumerate slices along the smallest dimension
- Slice enumeration takes a relatively long processing time
- Dataset: Elutriation (14*9*7161)

Experiments: RSM vs. CubeMiner
- As the smallest dimension grows, CubeMiner outperforms RSM
- Dataset: synthetic data (varying the size of the height dimension)

Experiments: Parallelism
- As the degree of parallelism increases, the response time decreases
- Optimal number of processors
- Dataset: CDC15 (varying the number of processors)

Conclusion
- Notion of Frequent Closed Cube
- RSM: efficient when one of the dimensions is small
- CubeMiner: superior for large datasets
- Parallel RSM and CubeMiner

Thank You!