CLOSET: An Efficient Algorithm for Mining Frequent Closed Itemsets
Jian Pei, Jiawei Han and Runying Mao
Intelligent Database Systems Research Lab, School of Computing Science, Simon Fraser University
Email: {peijian, han, rmao}@cs.sfu.ca
http://www.cs.sfu.ca/~{peijian, han, rmao}

Outline
  • Why mine frequent closed itemsets?
  • CLOSET: an efficient method
  • Performance study and experimental results
  • Conclusions

Mining Frequent Itemsets
  • Given a transaction database and a support threshold, mining frequent itemsets means finding the complete set of frequent itemsets
  • Mining frequent itemsets is essential for many data mining tasks, e.g. association rule mining
  • Mining frequent itemsets and the association rules over them often generates a large number of frequent itemsets and rules
      • Harms efficiency
      • Hard to understand

From Frequent Itemsets to Frequent Closed Itemsets
  • Mining frequent closed itemsets has the same power as mining the complete set of frequent itemsets, but it substantially reduces the number of redundant rules generated
      • Increases both efficiency and effectiveness
  • Example: TDB = {(a1 a2 … a100), (a1 a2 … a50)}, min_sup = 1, min_conf = 50%
      • 2^100 − 1 frequent itemsets: a1, …, a100, a1a2, …, a99a100, …, a1a2…a100. A tremendous number of association rules!
      • 2 frequent closed itemsets: a1a2…a100 and a1a2…a50
      • 1 rule: a1a2…a50 → a51a52…a100
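The counts on this slide can be checked by brute force on a scaled-down version of the example. The sketch below assumes 10 items instead of 100 (so that enumerating every subset stays cheap); the function names are illustrative and not part of CLOSET.

```python
from itertools import combinations

def support(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)

def frequent_itemsets(transactions, min_sup):
    """Brute force: test every non-empty subset of the item universe."""
    items = sorted(set().union(*transactions))
    result = {}
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            s = support(frozenset(combo), transactions)
            if s >= min_sup:
                result[frozenset(combo)] = s
    return result

def closed_only(freq):
    """Keep the itemsets that have no proper superset with the same support."""
    return {x: s for x, s in freq.items()
            if not any(x < y and s == sy for y, sy in freq.items())}

# Assumption: n = 10 stands in for the slide's 100 items.
n = 10
full = frozenset(range(1, n + 1))        # plays the role of a1 a2 ... a100
half = frozenset(range(1, n // 2 + 1))   # plays the role of a1 a2 ... a50
tdb = [full, half]

freq = frequent_itemsets(tdb, min_sup=1)
closed = closed_only(freq)
confidence = closed[full] / closed[half]  # confidence of the single rule half -> full - half

print(len(freq), len(closed), confidence)
```

With n = 10 there are 2^10 − 1 frequent itemsets but still only two closed ones, and the one remaining rule has confidence 50%, mirroring the slide.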

What Is Frequent Closed Itemset? An itemset X is a closed itemset if there

What Is Frequent Closed Itemset? An itemset X is a closed itemset if there exists no itemset Y such that every transaction having X contains Y A closed itemset X is frequent if its support passes the given support threshold The concept is firstly proposed by Pasquier et al. in ICDT’ 99 and Information Systems Vol. 24, No. 1, 1999
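An equivalent way to read the definition: X is closed exactly when it equals the intersection of all transactions that contain it. The sketch below uses that view; the helper names are illustrative, and the five transactions are the ones from the example slide later in this deck.

```python
def closure(itemset, transactions):
    """Intersection of all transactions that contain `itemset`."""
    covering = [t for t in transactions if itemset <= t]
    if not covering:
        return None  # the itemset occurs nowhere; closure is left undefined here
    return frozenset.intersection(*covering)

def is_frequent_closed(itemset, transactions, min_sup):
    """True if `itemset` meets min_sup and equals its own closure."""
    sup = sum(1 for t in transactions if itemset <= t)
    return sup >= min_sup and closure(itemset, transactions) == itemset

tdb = [frozenset("acdef"), frozenset("abe"), frozenset("cef"),
       frozenset("acdf"), frozenset("cef")]

print(sorted(closure(frozenset("c"), tdb)))          # c always co-occurs with f
print(is_frequent_closed(frozenset("c"), tdb, 2))
print(is_frequent_closed(frozenset("cf"), tdb, 2))
```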

How to Generate Rules on Frequent Closed Itemsets? Rule X Y is an association

How to Generate Rules on Frequent Closed Itemsets? Rule X Y is an association rule on frequent closed itemsets if n n n Both X and X Y are frequent closed itemsets There exists no frequent closed itemset Z such that X Z (X Y) The confidence of the rule passes the given threshold Given rules X Y and X Y Z, the rule X Y Z is redundant!
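The three conditions translate fairly directly into a filter over pairs of closed itemsets. The following is a minimal sketch under that reading; `rules_on_closed` is an illustrative name, and the supports in `chain` are made up to show the redundancy case.

```python
def rules_on_closed(fci, min_conf):
    """fci maps each frequent closed itemset (frozenset) to its support.

    Returns a list of (X, Y, confidence) for rules X -> Y.
    """
    rules = []
    for x, sup_x in fci.items():
        for xy, sup_xy in fci.items():
            if not x < xy:
                continue  # X must be a proper subset of X ∪ Y
            if any(x < z < xy for z in fci):
                continue  # a closed Z lies strictly between X and X ∪ Y
            conf = sup_xy / sup_x
            if conf >= min_conf:
                rules.append((x, xy - x, conf))
    return rules

# Hypothetical closed itemsets forming a chain a ⊂ ab ⊂ abc.
chain = {frozenset("a"): 4, frozenset("ab"): 3, frozenset("abc"): 2}
for x, y, conf in rules_on_closed(chain, min_conf=0.5):
    print("".join(sorted(x)), "->", "".join(sorted(y)), round(conf, 2))
```

On the chain, a → b and ab → c are produced, while a → bc is skipped because ab sits between a and abc, even though its confidence (2/4) would pass the threshold.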

How to Mine Frequent Closed Itemsets? A-Close [PBTL 99] n n n Using the

How to Mine Frequent Closed Itemsets? A-Close [PBTL 99] n n n Using the A-priori framework Pruning redundancies in candidates Post-processing to generate complete but non -duplicate result Ch. ARM [Za. Hs 00] n n Exploring a vertical data format Finding frequent closet itemsets by computing intersections of sets of transaction ids for itemsets CLOSET: our method presented here
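To make the vertical data format concrete, the sketch below stores one set of transaction ids per item and obtains the support of an itemset by intersecting those sets. It only illustrates the data layout, not CHARM's search strategy; the function names are illustrative, and the transactions are the ones from the example slide later in this deck.

```python
from collections import defaultdict

def vertical(transactions):
    """Map each item to the set of ids of the transactions containing it."""
    tidsets = defaultdict(set)
    for tid, items in transactions.items():
        for item in items:
            tidsets[item].add(tid)
    return dict(tidsets)

def tidset(itemset, tidsets):
    """Ids of the transactions containing every item of a non-empty itemset."""
    return set.intersection(*(tidsets[i] for i in itemset))

tdb = {10: "acdef", 20: "abe", 30: "cef", 40: "acdf", 50: "cef"}
v = vertical(tdb)

print(sorted(tidset("fa", v)))    # transactions containing both f and a
print(sorted(tidset("cfad", v)))  # the same transactions, so the supports are equal
```

Two itemsets with identical tid-sets occur in exactly the same transactions, which is the situation in which the smaller one cannot be closed.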

How CLOSET Works? An Example Transaction Items ID 10 a, c, d, e, f

How CLOSET Works? An Example Transaction Items ID 10 a, c, d, e, f 20 30 40 50 min_sup =2 a, b, e c, e, f a, c, d, f c, e, f Step 1. Find frequent items List of frequent items in support descending order f_list=<c: 4, e: 4, f: 4, a: 3, d: 2>
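Step 1 amounts to one counting pass followed by a sort. A minimal sketch follows; `build_f_list` is an illustrative name, and breaking ties alphabetically is an assumption that happens to reproduce the order shown on the slide.

```python
from collections import Counter

def build_f_list(transactions, min_sup):
    """Frequent items with their supports, in support-descending order."""
    counts = Counter(item for t in transactions for item in set(t))
    frequent = [(item, c) for item, c in counts.items() if c >= min_sup]
    # Assumption: ties are broken alphabetically (the slide does not state a tie rule).
    return sorted(frequent, key=lambda pair: (-pair[1], pair[0]))

tdb = ["acdef", "abe", "cef", "acdf", "cef"]
f_list = build_f_list(tdb, min_sup=2)
print(f_list)
```

Item b occurs only once and is therefore dropped from the list.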

Divide Search Space All frequent closed itemsets can be divided into 5 non-overlap subsets

Divide Search Space All frequent closed itemsets can be divided into 5 non-overlap subsets based on f_lsit n n n The The The ones ones containing containing Transaction ID Items 10 a, c, d, e, f 20 a, b, e 30 c, e, f 40 a, c, d, f 50 c, e, f d a but no d f but no a nor d e but no f, a nor d only c f_list=<c: 4, e: 4, f: 4, a: 3, d: 2>
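One way to see why the five subsets do not overlap: each itemset belongs to the subset named by whichever of its items comes last in f_list. The sketch below applies that rule to the six frequent closed itemsets reported in the summary of the example; `subset_label` is an illustrative name.

```python
f_list = ["c", "e", "f", "a", "d"]
rank = {item: pos for pos, item in enumerate(f_list)}

def subset_label(itemset):
    """The item of `itemset` that comes last in f_list; it names the subset."""
    return max(itemset, key=lambda item: rank[item])

# Frequent closed itemsets from the summary slide of the running example.
fcis = ["acdf", "a", "ae", "cf", "cef", "e"]

groups = {item: [] for item in reversed(f_list)}
for x in fcis:
    groups[subset_label(x)].append(x)

for item, members in groups.items():
    print(item, members)
```

The subset for "only c" comes out empty, which agrees with the later observation that c alone is not closed.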

Find Subsets of Frequent Closed Itemsets by Constructing Conditional Databases Let a be a

Find Subsets of Frequent Closed Itemsets by Constructing Conditional Databases Let a be a frequent item in TDB. The aconditional database, denoted as TDB|a, is the subset of transactions in TDB containing a, and all occurrences of infrequent items, item a, and items following a in f_list are omitted Let b be a frequent item in X-conditional database TDB|X, the b. X-conditional database, denoted as TDB|b. X, is the subset of transactions in TDB|X containing b and all the occurrences of local infrequent items, item b, and items following j in local f_list. X are omitted
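Both definitions reduce to the same operation: select the transactions containing the item, then keep only the frequent items that precede it in the relevant f_list. A minimal sketch, with `project` as an illustrative name and plain Python sets standing in for whatever storage is actually used:

```python
def project(db, item, f_list):
    """Return the `item`-conditional database of `db`.

    db     : list of transactions, each a set of items
    f_list : frequent items of `db`, in the order used for projection
    """
    keep = set(f_list[:f_list.index(item)])  # frequent items preceding `item`
    return [t & keep for t in db if item in t]

tdb = [set("acdef"), set("abe"), set("cef"), set("acdf"), set("cef")]
f_list = ["c", "e", "f", "a", "d"]

tdb_d = project(tdb, "d", f_list)
tdb_a = project(tdb, "a", f_list)
# In TDB|a the items c, e and f each occur twice; the local order c, e, f is assumed.
tdb_ea = project(tdb_a, "e", ["c", "e", "f"])

for name, db in [("TDB|d", tdb_d), ("TDB|a", tdb_a), ("TDB|ea", tdb_ea)]:
    print(name, ["".join(sorted(t, key="cefad".index)) for t in db])
```

A transaction that loses all of its items is kept as an empty set here, so the length of a projected database still equals the support of its prefix.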

Find Frequent Closed Itemsets Containing d TDB cefad ea cef cfad cef Local frequent

Find Frequent Closed Itemsets Containing d TDB cefad ea cef cfad cef Local frequent items: c, f, a TDB|d (d: 2) cefa cfa F. C. I. : cfad: 2 Every transaction having d also contains c, f and a TDB|a (a: 3) cef e cf f_list: <c: 4, e: 4, f: 4, a: 3, d: 2> TDB|f (f: 4) ce: 3 c F. C. I. : cf: 4, cef: 3 F. C. I. : a: 3 TDB|ea (ea: 2) c F. C. I. : ea: 2 TDB|e (e: 4) c: 3 F. C. I. : e: 4

Find Frequent Closed Itemsets Containing a but No d Frequent closed itemsets containing a

Find Frequent Closed Itemsets Containing a but No d Frequent closed itemsets containing a but no d can be further partitioned into subsets w. Ones having af but no d w. Ones having ae but no d nor f w. Ones having ac but no d, e nor f TDB|d (d: 2) cefa cfa F. C. I. : cfad: 2 TDB cefad ea cef cfad cef TDB|a (a: 3) cef e cf f_list: <c: 4, e: 4, f: 4, a: 3, d: 2> TDB|f (f: 4) ce: 3 c F. C. I. : cf: 4, cef: 3 F. C. I. : a: 3 sup(fa)=sup(cfad) No FCI having fa or ca but no d TDB|ea (ea: 2) c F. C. I. : ea: 2 TDB|e (e: 4) c: 3 F. C. I. : e: 4

Find Frequent Closed Itemsets Containing f but No a Nor d TDB cefad ea

Find Frequent Closed Itemsets Containing f but No a Nor d TDB cefad ea cef cfad cef TDB|d (d: 2) cefa cfa F. C. I. : cfad: 2 TDB|a (a: 3) cef e cf f_list: <c: 4, e: 4, f: 4, a: 3, d: 2> TDB|f (f: 4) ce: 3 c F. C. I. : cf: 4, cef: 3 F. C. I. : a: 3 TDB|ea (ea: 2) c F. C. I. : ea: 2 TDB|e (e: 4) c: 3 F. C. I. : e: 4

Find Frequent Closed Itemsets Containing e but No f, a Nor d TDB cefad

Find Frequent Closed Itemsets Containing e but No f, a Nor d TDB cefad ea cef cfad cef TDB|d (d: 2) cefa cfa F. C. I. : cfad: 2 TDB|a (a: 3) cef e cf f_list: <c: 4, e: 4, f: 4, a: 3, d: 2> TDB|f (f: 4) ce: 3 c F. C. I. : cf: 4, cef: 3 F. C. I. : a: 3 TDB|ea (ea: 2) c F. C. I. : ea: 2 TDB|e (e: 4) c: 3 F. C. I. : e: 4

Find Frequent Closed Itemsets Containing Only c sup(c)=sup(cf), c is not a closed itemset

Find Frequent Closed Itemsets Containing Only c sup(c)=sup(cf), c is not a closed itemset In summary, the set of frequent closed itemsets is {acdf: 2, a: 3, ae: 2, cf: 4, cef: 3, e: 4}
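The whole walkthrough can be condensed into a short recursive sketch. It is a simplification under stated assumptions, not the authors' implementation: conditional databases are plain lists of Python sets rather than FP-trees, ties in f_list are broken alphabetically, and the names `closet_sketch` and `subsumed` are illustrative.

```python
from collections import Counter

def subsumed(itemset, sup, fci):
    """True if an already-found closed itemset contains `itemset` with equal support."""
    return any(itemset <= y and sup == s for y, s in fci.items())

def closet_sketch(db, prefix, min_sup, fci):
    """Mine closed itemsets from `db`, the `prefix`-conditional database."""
    n = len(db)  # support of `prefix`
    counts = Counter(i for t in db for i in t)
    # Items present in every transaction are merged into the prefix.
    prefix = prefix | {i for i, c in counts.items() if c == n}
    if prefix and not subsumed(prefix, n, fci):
        fci[prefix] = n
    # Local f_list: remaining frequent items, support descending, ties alphabetical.
    f_list = sorted((i for i, c in counts.items() if min_sup <= c < n),
                    key=lambda i: (-counts[i], i))
    for pos in range(len(f_list) - 1, -1, -1):  # last item of f_list first
        item = f_list[pos]
        new_prefix = prefix | {item}
        # Skip a branch that a found closed itemset with the same support covers.
        if subsumed(new_prefix, counts[item], fci):
            continue
        keep = set(f_list[:pos])
        projected = [t & keep for t in db if item in t]
        closet_sketch(projected, new_prefix, min_sup, fci)
    return fci

tdb = [frozenset("acdef"), frozenset("abe"), frozenset("cef"),
       frozenset("acdf"), frozenset("cef")]
result = closet_sketch(tdb, frozenset(), 2, {})
for itemset, sup in result.items():
    print("".join(sorted(itemset)), sup)
```

Processing f_list from its last item backwards matters: a closed itemset that involves a later item is found first, so that a later branch such as c alone can be recognised as covered by cf:4 and skipped.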

Optimization 1: Compress Transactional & Conditional Databases Using FP-trees FP-tree compresses databases for frequent

Optimization 1: Compress Transactional & Conditional Databases Using FP-trees FP-tree compresses databases for frequent itemsets Conditional databases can be derived from FP-tree efficiently Please refer our SIGMOD’ 00 paper for details

Optimization 2: Extract Items Appearing in Every Transaction of Conditional Database Let Y be

Optimization 2: Extract Items Appearing in Every Transaction of Conditional Database Let Y be the set of items appearing in every transaction of the X-conditional database, X Y is a potential frequent closed itemset This optimization takes effect before constructing the FP-tree for the conditional database Benefits n n Reduce the size of FP-tree Reduce the levels of recursions
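On the running example this is what happens for X = d: every transaction of TDB|d contains c, f and a, so cfad is proposed at once and only item e is left for further processing. A small sketch, with illustrative names and sets in place of an FP-tree:

```python
def items_in_every_transaction(cond_db):
    """Items shared by all transactions of a non-empty conditional database."""
    return frozenset.intersection(*cond_db)

x = frozenset("d")
tdb_d = [frozenset("cefa"), frozenset("cfa")]  # TDB|d from the example

y = items_in_every_transaction(tdb_d)
candidate = x | y                      # potential frequent closed itemset X ∪ Y
remaining = [t - y for t in tdb_d]     # what is left to build a tree from

print("".join(sorted(candidate)), [sorted(t) for t in remaining])
```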

Optimization 3: Directly Extract Frequent Closed Itemsets From FP-tree
  • Benefits
      • Identify frequent closed itemsets quickly
      • Reduce the size of the remaining FP-tree to be examined
      • Reduce the levels of recursion
  • Example (single-path FP-tree): root → a:7 → b:7 → c:7 → d:5 → e:4 → f:4
      • Extracted itemsets: abc:7, abcd:5, abcdef:4
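The figure suggests a simple reading for a single path: a prefix of the path is emitted wherever the node count is about to drop, or where the path ends. The sketch below implements only that reading; it assumes the path is given as a list of (item, count) pairs and leaves out any check against itemsets found earlier, which a full miner would still need.

```python
def closed_from_single_path(path, min_sup):
    """path: (item, count) pairs from the root down; counts never increase."""
    result = []
    for k, (item, count) in enumerate(path):
        is_last = k == len(path) - 1
        # Emit the prefix ending here if the next node has a smaller count.
        if count >= min_sup and (is_last or path[k + 1][1] < count):
            result.append(("".join(i for i, _ in path[:k + 1]), count))
    return result

path = [("a", 7), ("b", 7), ("c", 7), ("d", 5), ("e", 4), ("f", 4)]
print(closed_from_single_path(path, min_sup=2))
```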

Optimization 4: Prune Search Branches If X Y, sup(X)=sup(Y) and Y is a frequent

Optimization 4: Prune Search Branches If X Y, sup(X)=sup(Y) and Y is a frequent closed itemset, there is no need to search for X-conditional database for frequent closed itemset n Any frequent closed itemset having X must contain Y-X as well Benefits n Avoid search for subsumed frequent itemsets

Scaling up CLOSET in Large Database Using projected databases in place of FP-trees Partition-based

Scaling up CLOSET in Large Database Using projected databases in place of FP-trees Partition-based projection TDB|d (d: 2) cefa cfa F. C. I. : cfad: 2 TDB cefad ea cef cfad cef TDB|a (a: 3) cef e cf f_list: <c: 4, e: 4, f: 4, a: 3, d: 2> TDB|f (f: 4) ce: 3 c F. C. I. : cf: 4, cef: 3 F. C. I. : a: 3 TDB|ea (ea: 2) c F. C. I. : ea: 2 TDB|e (e: 4) c: 3 F. C. I. : e: 4

Performance Study Test takers n A-Close n Ch. ARM n CLOSET Datasets n Synthetic

Performance Study Test takers n A-Close n Ch. ARM n CLOSET Datasets n Synthetic dataset T 25 I 20 D 100 k with 10 k items n Connect-4 n Pumsb

Compactness of Frequent Closed Itemsets
Example: Dataset Connect-4

    Support        #FCI     #FI        #FI/#FCI
    64179 (95%)    812      2205       2.72
    60801 (90%)    3486     27127      7.78
    54046 (80%)    15107    533975     35.35
    47290 (70%)    35875    4129839    115.12
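The last column is simply the number of frequent itemsets divided by the number of frequent closed itemsets. A quick arithmetic check, using only the counts from the table:

```python
# (support threshold, #FCI, #FI) as reported for Connect-4 on this slide
rows = [
    ("95%", 812, 2205),
    ("90%", 3486, 27127),
    ("80%", 15107, 533975),
    ("70%", 35875, 4129839),
]

ratios = [round(fi / fci, 2) for _, fci, fi in rows]
for (threshold, _, _), ratio in zip(rows, ratios):
    print(threshold, ratio)
```

The ratio grows as the support threshold is lowered, which is the compactness the slide is pointing at.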

Scalability With Support Threshold on Dataset T25I20D100k

Scalability With Support Threshold on Dataset Connect-4


Scalability With Support Threshold on Dataset Pumsb


Size Scaleup on Datasets


Conclusions CLOSET is an FP-tree-based database projection method for efficient mining of frequent closed

Conclusions CLOSET is an FP-tree-based database projection method for efficient mining of frequent closed itemsets in large databases n n n Applying FP-tree structure Developing techniques to identify frequent closed itemsets quickly Exploring a partition-based projection mechanism for scalable mining CLOSET can be straightforwardly extended to mine max-patterns

References R. Agarwal, C. Aggarwal and V. V. V. Prasad. A tree projection algorithm

References R. Agarwal, C. Aggarwal and V. V. V. Prasad. A tree projection algorithm for generation of frequent itemsets. In Journal of Parallel and Distributed Computing, (to appear), 2000 R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In Proc. VLDB’ 94, Chile, September 1994 R. J. Bayardo. Efficiently mining long patterns from databases. In Proc. SIGMOD’ 98, WA, June 1998 J. Han, J. Pei and Y. Yin. Mining frequent patterns without candidate generation. In Proc. SIGMOD’ 00, TX, May 2000 H. Mannila, H. Toivonen and A. I. Verkamo. Efficient algorithms for discovering association rules. In Proc. KDD’ 94, WA, July 1994 N. Pasquier, Y. Bastide, R. Taouil, and L. Lakhal. Discovering frequent closed itemsets for association rules. In Proc. ICDT’ 99, Israel, January 1999. Nicolas Pasquier, Yves Bastide, Rafik Taouil, Lotfi Lakhal: Efficient Mining of Association Rules Using Closed Itemset Lattices. In Information Systems, Vol. 24, No. 1, 1999 M. J. Zaki and C. Hsiao. Ch. ARM: An efficient algorithm for closed association rule mining. In Tech. Rep. 99 -10, Computer Science, Rensselaer Polytechnic Institute, 1999.