Fast Vertical Mining Using Diffsets
Mohammed J. Zaki and Karam Gouda
The Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2003
2021/2/21
Presenter: 吳建良

Abstract
  • Vertical data format
  • Diffset
      • Can be incorporated into previous vertical mining methods
      • Reduces the memory required to store intermediate results
      • Increases performance

Notation
  I                  A set of items
  T                  Database of transactions
  tid                Identifier of a transaction
  itemset            A set X ⊆ I
  tidset             A set of tids
  k-itemset          An itemset with k items
  σ(X)               The support of an itemset X
  frequent itemset   An itemset whose support ≧ min_sup
  Fk                 The set of frequent k-itemsets
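
The notation above can be exercised on a small database. The sketch below is illustrative only: the six transactions, the threshold `MIN_SUP = 3`, and the names `support` and `frequent` are assumptions, not taken from the slides.

```python
from itertools import combinations

# Hypothetical toy database T: tid -> set of items (an assumption for illustration).
T = {1: set("ACTW"), 2: set("CDW"), 3: set("ACTW"),
     4: set("ACDW"), 5: set("ACDTW"), 6: set("CDT")}
ITEMS = sorted(set().union(*T.values()))   # the item set I
MIN_SUP = 3                                # assumed absolute min_sup


def support(X):
    """sigma(X): number of transactions that contain every item of X."""
    return sum(1 for items in T.values() if set(X) <= items)


def frequent(k):
    """F_k: all k-itemsets whose support is at least MIN_SUP."""
    return ["".join(c) for c in combinations(ITEMS, k) if support(c) >= MIN_SUP]


print(support("AC"), frequent(2))
```

This brute-force version scans every transaction for every candidate, which is the cost the vertical format is meant to avoid.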

Notation cont.
  • Powerset P(I)
      • The search space for enumeration
  • Maximal frequent itemset
      • A frequent itemset that is not a subset of any other frequent itemset
  • Closed frequent itemset X
      • A frequent itemset X for which no proper superset has the same support as X
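
To make the two definitions concrete, here is a brute-force sketch over the whole powerset. The toy database, `MIN_SUP = 3`, and the helper names are assumptions for illustration; a real miner would not enumerate P(I) like this.

```python
from itertools import combinations

# Hypothetical toy database (tid -> items); not taken from the slides.
T = {1: set("ACTW"), 2: set("CDW"), 3: set("ACTW"),
     4: set("ACDW"), 5: set("ACDTW"), 6: set("CDT")}
MIN_SUP = 3  # assumed absolute threshold
items = sorted(set().union(*T.values()))


def support(X):
    return sum(1 for t in T.values() if set(X) <= t)


# Every frequent itemset with its support, found by enumerating P(I).
freq = {}
for k in range(1, len(items) + 1):
    for c in combinations(items, k):
        s = support(c)
        if s >= MIN_SUP:
            freq[frozenset(c)] = s

# Maximal: no proper superset is frequent.
maximal = [X for X in freq if not any(X < Y for Y in freq)]
# Closed: no proper superset has the same support.
closed = [X for X, s in freq.items()
          if not any(X < Y and freq[Y] == s for Y in freq)]


def show(sets):
    return sorted("".join(sorted(X)) for X in sets)


print(show(maximal), show(closed))
```

On this data every maximal itemset is also closed, which holds in general; the converse does not.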

Example

Data Format
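
As a minimal sketch of the two layouts, the code below turns a horizontal database (tid to items) into the vertical format (item to tidset). The transactions and the function name `to_vertical` are hypothetical.

```python
from collections import defaultdict

# Horizontal format: one row per transaction (hypothetical toy data).
horizontal = {1: "ACTW", 2: "CDW", 3: "ACTW", 4: "ACDW", 5: "ACDTW", 6: "CDT"}


def to_vertical(db):
    """Return the vertical format: item -> set of tids containing that item."""
    vertical = defaultdict(set)
    for tid, items in db.items():
        for item in items:
            vertical[item].add(tid)
    return dict(vertical)


vertical = to_vertical(horizontal)
for item in sorted(vertical):
    print(item, sorted(vertical[item]))
```

In the vertical layout the support of a single item is just the length of its tidset, with no further database scan.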

Lattice Decomposition: Prefix-Based Classes
  • Define an equivalence relation θk on the lattice P(I):
      for all X, Y ∈ P(I), X ≡θk Y ⇔ p(X, k) = p(Y, k)
    where p(X, k) = X[1:k], the k-length prefix of X
  • θk is a prefix-based equivalence relation
  • It breaks the original search space into independent subproblems
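
The decomposition can be sketched by grouping itemsets on their k-length prefix. The item alphabet "ACDTW" and the names `prefix` and `classes` are assumptions; itemsets are written as strings with items in sorted order.

```python
from collections import defaultdict
from itertools import combinations


def prefix(X, k):
    """p(X, k): the k-length prefix of X (X[1:k] in the slide's 1-based notation)."""
    return X[:k]


def classes(itemsets, k):
    """Partition itemsets of length >= k into prefix-based equivalence classes."""
    groups = defaultdict(list)
    for X in itemsets:
        if len(X) >= k:
            groups[prefix(X, k)].append(X)
    return dict(groups)


items = "ACDTW"  # assumed item alphabet
lattice = ["".join(c) for n in range(1, len(items) + 1)
           for c in combinations(items, n)]

by_first = classes(lattice, 1)            # classes of theta_1
by_second = classes(by_first["A"], 2)     # class [A] split again by theta_2
print({p: len(members) for p, members in by_first.items()})
```

Each class can then be mined on its own, since extending a member never requires an itemset from another class.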

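Subset Search Tree
  Each node is an itemset, followed in braces by the items that can still extend it.
  Level 0: {} {A, C, D, T, W}
  Level 1: A {C, D, T, W}   C {D, T, W}   D {T, W}   T {W}   W
  Level 2: AC {D, T, W}   AD {T, W}   AT {W}   AW   CD {T, W}   CT {W}   CW   DT {W}   DW   TW
  Level 3: ACD {T, W}   ACT {W}   ACW   ADT {W}   ADW   ATW   CDT {W}   CDW   CTW   DTW
  Level 4: ACDT {W}   ACDW   ACTW   ADTW   CDTW
  Level 5: ACDTW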

Tidset for Pattern Counting
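
A minimal sketch of counting with tidsets: the tidset of an itemset is the intersection of its items' tidsets, and its size is the support. The vertical database and the names `tidset` and `support` are hypothetical.

```python
# Assumed vertical database (item -> tidset); hypothetical toy data.
t = {"A": {1, 3, 4, 5}, "C": {1, 2, 3, 4, 5, 6}, "D": {2, 4, 5, 6},
     "T": {1, 3, 5, 6}, "W": {1, 2, 3, 4, 5}}


def tidset(itemset):
    """t(X) for a non-empty itemset X: intersect the tidsets of its items."""
    return set.intersection(*(t[item] for item in itemset))


def support(itemset):
    """sigma(X) = |t(X)|."""
    return len(tidset(itemset))


# Within class [A], t(ACT) can be built from two class members: t(AC) and t(AT).
t_ACT = tidset("AC") & tidset("AT")
print(sorted(t_ACT), support("ACTW"))
```

Every intermediate tidset has to be kept while its class is being extended, which is the memory cost that diffsets aim to reduce.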

Diffset
  • The difference between the prefix tidset and a class member's tidset
  • Consider a class with prefix P
      • Let t(X) denote the tidset of element X
      • Let d(X) denote the diffset of element X with respect to the prefix tidset
      • Let PX and PY be class members of P
  • Support: σ(PX) = σ(P) - |d(PX)|, where d(PX) = t(P) - t(X)
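
The definition can be tried on one class. This sketch assumes a hypothetical vertical database and takes P = A as the prefix; it computes d(PX) = t(P) - t(X) for each class member and derives the support from the diffset size alone.

```python
# Assumed vertical database (item -> tidset); hypothetical, not from the slides.
t = {"A": {1, 3, 4, 5}, "C": {1, 2, 3, 4, 5, 6}, "D": {2, 4, 5, 6},
     "T": {1, 3, 5, 6}, "W": {1, 2, 3, 4, 5}}

P = "A"              # class prefix
t_P = t[P]           # prefix tidset t(P)
sup_P = len(t_P)     # sigma(P)

diffsets, supports = {}, {}
for X in "CDTW":                           # class members PX
    d_PX = t_P - t[X]                      # d(PX) = t(P) - t(X)
    diffsets[P + X] = d_PX
    supports[P + X] = sup_P - len(d_PX)    # sigma(PX) = sigma(P) - |d(PX)|

for name in diffsets:
    print(name, sorted(diffsets[name]), supports[name])
```

The diffsets here hold 0 to 2 tids each, while the corresponding tidsets hold 2 to 4.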

Diffset cont.
  • Define the diffset: d(PXY) = t(PX) - t(PY)
      • Then σ(PXY) = σ(PX) - |d(PXY)|
  • How to calculate d(PXY) using d(PX) and d(PY)?
      • d(PXY) = d(PY) - d(PX)
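
The answer to the question on this slide is the identity d(PXY) = d(PY) - d(PX). It can be checked with a short set-algebra derivation that uses only t(PX) = t(P) ∩ t(X), so that d(PX) = t(P) - t(X) = t(P) - t(PX). The symbols are those of the slides; the intermediate steps are a reconstruction, not the slide's own wording.

```latex
\begin{align*}
d(PXY) &= t(PX) \setminus t(PY) \\
       &= \bigl(t(P) \setminus t(PY)\bigr) \setminus \bigl(t(P) \setminus t(PX)\bigr)
          && \text{since } t(PX),\ t(PY) \subseteq t(P) \\
       &= d(PY) \setminus d(PX), \\[4pt]
\sigma(PXY) &= |t(PX) \cap t(PY)| \\
            &= |t(PX)| - |t(PX) \setminus t(PY)| \\
            &= \sigma(PX) - |d(PXY)|.
\end{align*}
```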

Diffset cont.
  [Figure: regions labeled t(P), t(X), t(Y), d(PY), d(PXY), t(PXY)]

Diffset Example
  • Diffset calculation
  • Support calculation
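
A worked instance of both calculations, as a sketch on a hypothetical vertical database. Tidsets are used once to seed class [A]; after that, d(PXY) = d(PY) - d(PX) and σ(PXY) = σ(PX) - |d(PXY)| need only diffsets.

```python
# Assumed vertical database (item -> tidset); hypothetical toy data.
t = {"A": {1, 3, 4, 5}, "C": {1, 2, 3, 4, 5, 6}, "D": {2, 4, 5, 6},
     "T": {1, 3, 5, 6}, "W": {1, 2, 3, 4, 5}}

# Seed class [A]: 2-itemset diffsets and supports computed from tidsets.
d = {"A" + x: t["A"] - t[x] for x in "CTW"}
sup = {name: len(t["A"]) - len(diff) for name, diff in d.items()}

# Diffset calculation: d(PXY) = d(PY) - d(PX).
d["ACT"] = d["AT"] - d["AC"]       # P = A,  X = C, Y = T
d["ACW"] = d["AW"] - d["AC"]       # P = A,  X = C, Y = W
d["ACTW"] = d["ACW"] - d["ACT"]    # P = AC, X = T, Y = W

# Support calculation: sigma(PXY) = sigma(PX) - |d(PXY)|.
sup["ACT"] = sup["AC"] - len(d["ACT"])
sup["ACW"] = sup["AC"] - len(d["ACW"])
sup["ACTW"] = sup["ACT"] - len(d["ACTW"])

print(sup)
```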

Diffset Intersection Example

Diffset Example
  • Total size
      • Tidset database size = 76 tids
      • Diffset database size = 22 tids
  • Size by length

    k-itemset (k)         2     3     4
    Avg. tidset length    3.8   3.2   3
    Avg. diffset length   1     0.6   0
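
The per-length averages can be recomputed with a short sketch. It assumes a hypothetical six-transaction database over A, C, D, T, W and an absolute min_sup of 3; under those assumptions the frequent 2-, 3- and 4-itemsets give averages of 3.75, 3.2, 3 for tidsets and 1, 0.6, 0 for diffsets, in line with the table after rounding.

```python
from itertools import combinations

# Assumed vertical database (item -> tidset); hypothetical toy data.
t = {"A": {1, 3, 4, 5}, "C": {1, 2, 3, 4, 5, 6}, "D": {2, 4, 5, 6},
     "T": {1, 3, 5, 6}, "W": {1, 2, 3, 4, 5}}
MIN_SUP = 3  # assumed absolute threshold


def tidset(X):
    """t(X): intersection of the item tidsets of a non-empty itemset X."""
    return set.intersection(*(t[i] for i in X))


def diffset(X):
    """d(PXY) = t(PX) - t(PY), which equals t(PX) - t(Y); needs len(X) >= 2."""
    return tidset(X[:-1]) - t[X[-1]]


avg_tidset, avg_diffset = {}, {}
for k in (2, 3, 4):
    F_k = [X for X in combinations(sorted(t), k) if len(tidset(X)) >= MIN_SUP]
    avg_tidset[k] = sum(len(tidset(X)) for X in F_k) / len(F_k)
    avg_diffset[k] = sum(len(diffset(X)) for X in F_k) / len(F_k)

print(avg_tidset, avg_diffset)
```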

dEclat: Diffset Based Mining
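
A compact sketch in the spirit of dEclat, not the authors' implementation. It assumes diffsets are used from the first level, with d(X) = (all tids) - t(X) for single items, and recurses over prefix classes using d(PXY) = d(PY) - d(PX). The database, `mine`, and `declat` are hypothetical names and data.

```python
def declat(klass, min_sup, out):
    """klass: list of (itemset, diffset, support) triples sharing a prefix."""
    for i, (Xi, d_i, sup_i) in enumerate(klass):
        out[Xi] = sup_i
        next_class = []
        for Xj, d_j, _ in klass[i + 1:]:
            R = Xi + Xj[-1]                # candidate PXY from PX and PY
            d_R = d_j - d_i                # d(PXY) = d(PY) - d(PX)
            sup_R = sup_i - len(d_R)       # sigma(PXY) = sigma(PX) - |d(PXY)|
            if sup_R >= min_sup:
                next_class.append((R, d_R, sup_R))
        if next_class:
            declat(next_class, min_sup, out)


def mine(db, min_sup):
    """db: tid -> string of items. Returns {frequent itemset: support}."""
    all_tids = set(db)
    items = sorted(set().union(*map(set, db.values())))
    root = []
    for x in items:
        t_x = {tid for tid, its in db.items() if x in its}
        if len(t_x) >= min_sup:
            # Item-level diffset relative to the whole database.
            root.append((x, all_tids - t_x, len(t_x)))
    out = {}
    declat(root, min_sup, out)
    return out


# Hypothetical toy database and threshold.
DB = {1: "ACTW", 2: "CDW", 3: "ACTW", 4: "ACDW", 5: "ACDTW", 6: "CDT"}
result = mine(DB, 3)
print(sorted(result.items()))
```

Only diffsets and support counts are passed down the recursion, so no tidset is stored below the first level.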

Experimental Results
  [Figure: Average diffset / tidset size by length]

Experimental Results cont.

  Database      Min_sup (%)  # Items  # Records  Max Length  Avg. Diffset Size  Avg. Tidset Size  Reduction Ratio
  chess         0.5          76       3196       16          26                 1820              70
  connect       90           130      67557      12          143                62204             435
  mushroom      5            120      8124       17          60                 622               10
  Pumsb*        35           7117     49046      15          301                18977             63
  pumsb         90           7117     49046      8           330                45036             136
  T10I4D100K    0.025        -        100000     11          14                 86                6
  T20I6D100K    0.1          -        100000     14          31                 230               11
  T40I10D100K   0.5          -        100000     18          96                 755               8

Experimental Results cont.

Experimental Results cont.