Fast Algorithms for Mining Association Rules CS 401


Fast Algorithms for Mining Association Rules
CS 401 Final Presentation
Presented by Lin Yang, University of Missouri-Rolla
Paper by Rakesh Agrawal and Ramakrishnan Srikant, IBM Research Center


Outline
• Problem: mining association rules between items in a large database
• Solution: two new algorithms, Apriori and AprioriTid
• Examples
• Comparison with other algorithms (SETM and AIS)
• Conclusions


Introduction
• Mining association rules: given a set of transactions D, the problem of mining association rules is to generate all association rules that have support and confidence greater than the user-specified minimum support (called minsup) and minimum confidence (called minconf), respectively.


Terms and Concepts
• Association rules, support and confidence
Example: 98% of customers who buy bread also buy milk; bread implies milk 98% of the time.
Let I = {i1, i2, ..., im} be a set of items. Let D be a set of transactions, where each transaction T is a set of items such that T ⊆ I. An association rule is an implication of the form X ⇒ Y, where X ⊂ I, Y ⊂ I, and X ∩ Y = ∅. The rule X ⇒ Y holds in the transaction set D with confidence c if c% of transactions in D that contain X also contain Y. The rule X ⇒ Y has support s in the transaction set D if s% of transactions in D contain X ∪ Y.
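These definitions can be illustrated with a short sketch (the transactions and item names below are made up for illustration; support and confidence are returned as fractions rather than percentages):

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(x, y, transactions):
    """Fraction of X-containing transactions that also contain Y."""
    return support(x | y, transactions) / support(x, transactions)

# Toy transaction set D (illustrative only):
D = [
    {"bread", "milk"},
    {"bread", "milk", "eggs"},
    {"bread", "butter"},
    {"milk", "eggs"},
]

print(support({"bread", "milk"}, D))       # 0.5
print(confidence({"bread"}, {"milk"}, D))  # ~0.667
```

Here {bread} ⇒ {milk} has support 50% (2 of 4 transactions contain both) and confidence about 67% (2 of the 3 bread transactions also contain milk).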


Problem Decomposition
• Find all sets of items (itemsets) whose transaction support is above the minimum support. The support for an itemset is the number of transactions that contain it. Itemsets with minimum support are called large itemsets.
• Use the large itemsets to generate the desired rules.
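The second step can be sketched as follows (a minimal illustration, not the paper's exact rule-generation procedure; the itemset counts below are invented, and downward closure guarantees every subset of a large itemset appears in the map):

```python
from itertools import combinations

def generate_rules(large, minconf):
    """For each large itemset l and nonempty proper subset a of l,
    emit the rule a => (l - a) when support(l)/support(a) >= minconf.
    `large` maps frozenset itemsets to their support counts."""
    rules = []
    for l, sup_l in large.items():
        if len(l) < 2:
            continue
        for r in range(1, len(l)):
            for a in combinations(sorted(l), r):
                a = frozenset(a)
                conf = sup_l / large[a]  # subsets of large itemsets are large
                if conf >= minconf:
                    rules.append((a, l - a, conf))
    return rules

# Toy support counts (illustrative, not from the slides):
large = {
    frozenset({"bread"}): 3,
    frozenset({"milk"}): 2,
    frozenset({"bread", "milk"}): 2,
}
for x, y, conf in generate_rules(large, minconf=0.6):
    print(set(x), "=>", set(y), round(conf, 2))
```

With these counts, both {bread} ⇒ {milk} (confidence 2/3) and {milk} ⇒ {bread} (confidence 1.0) clear a 60% confidence threshold.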


Discover Large Itemsets
• Step 1: Make multiple passes over the data. In the first pass, count item occurrences to determine the large 1-itemsets, i.e., those with minimum support
• Step 2: In each later pass, use the large itemsets from the previous pass as a seed set to generate candidate itemsets, and count their actual support
• Step 3: Determine which candidate itemsets are large; they become the seed for the next pass
• Continue until no new large itemsets are found


Algorithm Apriori
Lk: set of large k-itemsets (those with minimum support). Each member of this set has two fields: i) itemset and ii) support count.
Ck: set of candidate k-itemsets (potentially large itemsets). Each member of this set has two fields: i) itemset and ii) support count.

1) L1 = {large 1-itemsets};
2) for (k = 2; Lk-1 ≠ ∅; k++) do begin
3)    Ck = apriori-gen(Lk-1);   // new candidates
4)    forall transactions t ∈ D do begin
5)       Ct = subset(Ck, t);    // candidates contained in t
6)       forall candidates c ∈ Ct do
7)          c.count++;
8)    end
9)    Lk = {c ∈ Ck | c.count ≥ minsup};
10) end
11) Answer = ∪k Lk;
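A compact sketch of this pass structure (assuming minsup is an absolute transaction count, as in the example slides; candidate generation is inlined here rather than split into a separate apriori-gen function):

```python
from itertools import combinations

def apriori(transactions, minsup):
    """Return {itemset: support count} for all large itemsets."""
    # Pass 1: count individual items to get L1.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    large = {s: c for s, c in counts.items() if c >= minsup}
    answer = dict(large)
    k = 2
    while large:
        # Candidate generation: join L(k-1) with itself, then prune
        # candidates having a (k-1)-subset that is not large.
        prev = set(large)
        Ck = set()
        for p in prev:
            for q in prev:
                u = p | q
                if len(u) == k and all(frozenset(s) in prev
                                       for s in combinations(u, k - 1)):
                    Ck.add(u)
        # Count candidate support with one pass over the data.
        counts = {c: sum(1 for t in transactions if c <= t) for c in Ck}
        large = {s: c for s, c in counts.items() if c >= minsup}
        answer.update(large)
        k += 1
    return answer

# Toy database (illustrative):
D = [{1, 2, 3}, {1, 2, 4}, {1, 3}, {2, 3}]
print(apriori(D, 2))
```

With minsup = 2, the loop terminates at k = 3: {1, 2, 3} is generated as a candidate but appears in only one transaction, so L3 is empty.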


Apriori Candidate Generation
• Join step:
insert into Ck
select p.item1, p.item2, ..., p.itemk-1, q.itemk-1
from Lk-1 p, Lk-1 q
where p.item1 = q.item1, ..., p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1;
• Next, in the prune step, we delete all itemsets c ∈ Ck such that some (k-1)-subset of c is not in Lk-1:
forall itemsets c ∈ Ck do
   forall (k-1)-subsets s of c do
      if (s ∉ Lk-1) then
         delete c from Ck;
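The join and prune steps above can be sketched as follows (itemsets are represented as sorted tuples; the example L2 is an assumption chosen so that the prune step has candidates to remove):

```python
from itertools import combinations

def apriori_gen(L_prev):
    """Join L(k-1) with itself, then prune. `L_prev` is a list of
    large (k-1)-itemsets, each a sorted tuple of items."""
    k_minus_1 = len(L_prev[0])
    # Join: p and q agree on their first k-2 items, and p's last
    # item is smaller than q's last item.
    joined = [p + (q[-1],)
              for p in L_prev for q in L_prev
              if p[:-1] == q[:-1] and p[-1] < q[-1]]
    # Prune: keep c only if every (k-1)-subset of c is in L(k-1).
    large = set(L_prev)
    return [c for c in joined
            if all(s in large for s in combinations(c, k_minus_1))]

# Example L2 (assumed; {3,4} is included so some joins survive pruning):
L2 = [(1, 2), (1, 3), (1, 4), (1, 5), (2, 3), (2, 4), (3, 4), (3, 5)]
print(apriori_gen(L2))
# → [(1, 2, 3), (1, 2, 4), (1, 3, 4), (1, 3, 5), (2, 3, 4)]
```

The join produces eight 3-itemsets, and pruning drops {1,2,5}, {1,4,5}, and {3,4,5} because {2,5} and {4,5} are not in L2.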


An Example of Apriori
L1 = {1, 2, 3, 4, 5, 6}. The candidate set generated by the algorithm is
C2 = {{1,2}, {1,3}, {1,4}, {1,5}, {1,6}, {2,3}, {2,4}, {2,5}, {2,6}, {3,4}, {3,5}, {3,6}, {4,5}, {4,6}, {5,6}}.
From this candidate set we generate the large itemsets whose support is at least 2:
L2 = {{1,2}, {1,3}, {1,4}, {1,5}, {2,3}, {2,4}, {3,4}, {3,5}}.
The join step then gives
C3 = {{1,2,3}, {1,2,4}, {1,2,5}, {1,3,4}, {1,3,5}, {1,4,5}, {2,3,4}, {3,4,5}}.
Then the prune step will delete the itemset {1,2,5},


An Example of Apriori (continued)
The prune step also deletes {1,4,5} and {3,4,5}, because {2,5} and {4,5} are not in L2. Counting support, suppose
L3 = {{1,2,3}, {1,2,4}, {1,3,4}, {1,3,5}, {2,3,4}},
i.e., each of these itemsets has support of at least 2. The join step gives C4 = {{1,2,3,4}, {1,3,4,5}}; the prune step deletes {1,3,4,5} because {1,4,5} is not in L3. We are then left with only {1,2,3,4} in C4. If the support of {1,2,3,4} is less than 2, then L4 = {} and the algorithm stops generating large itemsets.


Advantages
• The Apriori algorithm generates the candidate itemsets for a pass by using only the itemsets found large in the previous pass, without considering the transactions in the database. The basic intuition is that any subset of a large itemset must itself be large. Therefore, the candidate itemsets having k items can be generated by joining large itemsets having k-1 items, and deleting those that contain any subset that is not large. This procedure results in the generation of a much smaller number of candidate itemsets.


Algorithm AprioriTid
• The AprioriTid algorithm also uses the apriori-gen function to determine the candidate itemsets before the pass begins. The interesting feature of this algorithm is that the database D is not used for counting support after the first pass. Rather, the set Ck' is used for this purpose: the set of candidate k-itemsets in which the TIDs of the generating transactions are kept associated with the candidates (a TID is the unique identifier associated with each transaction).
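A simplified sketch of this idea (an illustration under stated assumptions, not the paper's exact Ck' data structure: the containment test here checks all (k-1)-subsets, whereas the paper checks only the two generating subsets; the toy TIDs and transactions are invented):

```python
from itertools import combinations

def aprioritid_pass(c_bar, Ck):
    """c_bar maps TID -> set of candidate (k-1)-itemsets present in
    that transaction; Ck holds the candidate k-itemsets (frozensets).
    Returns the new encoding plus support counts for Ck, so the raw
    database never needs to be rescanned."""
    counts = {c: 0 for c in Ck}
    new_c_bar = {}
    for tid, present_prev in c_bar.items():
        present = set()
        for c in Ck:
            # c is contained in the transaction iff all of its
            # (k-1)-subsets are present in the previous encoding.
            if all(frozenset(s) in present_prev
                   for s in combinations(c, len(c) - 1)):
                present.add(c)
                counts[c] += 1
        if present:  # transactions with no candidates drop out entirely
            new_c_bar[tid] = present
    return new_c_bar, counts

# Toy database encoded as 1-itemsets per TID (illustrative):
c1_bar = {
    100: {frozenset({1}), frozenset({3}), frozenset({4})},
    200: {frozenset({2}), frozenset({3}), frozenset({5})},
    300: {frozenset({1}), frozenset({2}), frozenset({3}), frozenset({5})},
    400: {frozenset({2}), frozenset({5})},
}
C2 = [frozenset(p) for p in [(1, 2), (1, 3), (1, 5), (2, 3), (2, 5), (3, 5)]]
c2_bar, counts = aprioritid_pass(c1_bar, C2)
print(counts[frozenset({2, 5})])  # 3
```

As the passes proceed, transactions that contain no surviving candidates disappear from the encoding, which is why AprioriTid's counting work can shrink sharply in later passes.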


Comparison with Other Algorithms
• Parameter settings for the synthetic datasets:
|T| = average size of the transactions
|I| = average size of the maximal potentially large itemsets
|D| = number of transactions

Name            |T|   |I|   |D|     Size (MB)
T5.I2.D100K      5     2    100K    2.4
T10.I2.D100K    10     2    100K    4.4
T10.I4.D100K    10     4    100K    4.4
T20.I2.D100K    20     2    100K    8.4
T20.I4.D100K    20     4    100K    8.4
T20.I6.D100K    20     6    100K    8.4


Relative Performance (1-6)
• In Diagrams 1-6 we show the execution times for the six datasets given in the table on the last slide, for decreasing values of minimum support. As the minimum support decreases, the execution times of all the algorithms increase because of the increase in the total number of candidate and large itemsets.
• Apriori beat AIS for all problem sizes, by factors ranging from 2 for high minimum support to more than an order of magnitude for low levels of support. AIS always did considerably better than SETM.
• For SETM, we have only plotted the execution times for dataset T5.I2.D100K. The execution times for SETM for the two datasets with an average transaction size of 10 are given in Relative Performance (7). For the three datasets with transaction sizes of 20, SETM took too long to execute and we aborted those runs, as the trends were clear.
• For small problems, AprioriTid did about as well as Apriori, but performance degraded to about twice as slow for large problems.


Relative Performance (7)
We did not plot the execution times of SETM on the corresponding graphs because they are too large compared to the execution times of the other algorithms. Clearly, Apriori beats SETM by more than an order of magnitude for large datasets.

Execution times (in seconds):

Dataset T10.I2.D100K
Minimum support:   2.0%   1.5%   1.0%   0.75%   0.5%
SETM                 74    161    838   1262    1878
Apriori             4.4    5.3   11.0   14.5    15.3

Dataset T10.I4.D100K
Minimum support:   2.0%   1.5%   1.0%   0.75%   0.5%
SETM                 41     91    659    929    1639
Apriori             3.8    4.8   11.2   17.4    19.3

Conclusion ¨ We presented two new algorithms, Apriori and Apriori. Tid, for discovering all

Conclusion
• We presented two new algorithms, Apriori and AprioriTid, for discovering all significant association rules between items in a large database of transactions. We compared these algorithms to the previously known algorithms AIS and SETM. The experimental results show that the proposed algorithms always outperform AIS and SETM. The performance gap increases with problem size, ranging from a factor of three for small problems to more than an order of magnitude for large problems.