DATA MINING Introductory and Advanced Topics Part II


































- Slides: 34
DATA MINING Introductory and Advanced Topics Part II – Association Rules Margaret H. Dunham Department of Computer Science and Engineering Southern Methodist University Companion slides for the text by Dr. M. H. Dunham, Data Mining, Introductory and Advanced Topics, Prentice Hall, 2002. Part II - Association Rules © Prentice Hall 1
Association Rules Outline Goal: Provide an overview of basic Association Rule mining techniques n Association Rules Problem Overview – Large itemsets n Association Rules Algorithms – Apriori – Sampling – Partitioning – Parallel Algorithms Part II – Association Rules © Prentice Hall 2
Example: Market Basket Data n Items frequently purchased together: Bread Peanut. Butter n Uses: – Placement – Advertising – Sales – Coupons n Objective: increase sales and reduce costs Part II – Association Rules © Prentice Hall 3
Association Rule Definitions Set of items: I={I 1, I 2, …, Im} n Transactions: D={t 1, t 2, …, tn}, tj I n Itemset: {Ii 1, Ii 2, …, Iik} I n Support of an itemset: Percentage of transactions which contain that itemset. n Large (Frequent) itemset: Itemset whose number of occurrences is above a threshold. n Part II – Association Rules © Prentice Hall 4
Association Rules Example I = { Beer, Bread, Jelly, Milk, Peanut. Butter} Support of {Bread, Peanut. Butter} is 60% Part II – Association Rules © Prentice Hall 5
Association Rule Definitions Association Rule (AR): implication X Y where X, Y I and X Y = ; n Support of AR (s) X Y: Percentage of transactions that contain X Y n Confidence of AR (a) X Y: Ratio of number of transactions that contain X Y to the number that contain X n Part II – Association Rules © Prentice Hall 6
Association Rules Ex (cont’d) Part II – Association Rules © Prentice Hall 7
Association Rule Problem Given a set of items I={I 1, I 2, …, Im} and a database of transactions D={t 1, t 2, …, tn} where ti={Ii 1, Ii 2, …, Iik} and Iij I, the Association Rule Problem is to identify all association rules X Y with a minimum support and confidence. n Link Analysis n NOTE: Support of X Y is same as support of X Y. n Part II – Association Rules © Prentice Hall 8
Association Rule Techniques 1. 2. Find Large Itemsets. Generate rules from frequent itemsets. Part II – Association Rules © Prentice Hall 9
Algorithm to Generate ARs Part II – Association Rules © Prentice Hall 10
Apriori Large Itemset Property: Any subset of a large itemset is large. n Contrapositive: If an itemset is not large, none of its supersets are large. n Part II – Association Rules © Prentice Hall 11
Large Itemset Property Part II – Association Rules © Prentice Hall 12
Apriori Ex (cont’d) s=30% Part II – Association Rules a = 50% © Prentice Hall 13
Apriori Algorithm 1. 2. 3. 4. 5. 6. 7. 8. C 1 = Itemsets of size one in I; Determine all large itemsets of size 1, L 1; i = 1; Repeat i = i + 1; Ci = Apriori-Gen(Li-1); Count Ci to determine Li; until no more large itemsets found; Part II – Association Rules © Prentice Hall 14
Apriori-Gen Generate candidates of size i+1 from large itemsets of size i. n Only generates a candidate if all subsets are large n Approach used: join large itemsets of size i if they agree on i-1 n Part II – Association Rules © Prentice Hall 15
Apriori-Gen Example Part II – Association Rules © Prentice Hall 16
Apriori-Gen Example (cont’d) Part II – Association Rules © Prentice Hall 17
Apriori Adv/Disadv n Advantages: – Uses large itemset property. – Easily parallelized – Easy to implement. n Disadvantages: – Assumes transaction database is memory resident. – Requires up to m database scans. Part II – Association Rules © Prentice Hall 18
Sampling n n Large databases Sample the database and apply Apriori to the sample. Potentially Large Itemsets (PL): Large itemsets from sample Negative Border (BD - ): – Generalization of Apriori-Gen applied to itemsets of varying sizes. – Minimal set of itemsets which are not in PL, but whose subsets are all in PL. Part II – Association Rules © Prentice Hall 19
Negative Border Example PL Part II – Association Rules © Prentice Hall PL BD-(PL) 20
Sampling Algorithm Ds = sample of Database D; 2. PL = Large itemsets in Ds using smalls; 3. C = PL BD-(PL); 4. Count C in Database using s; 5. ML = large itemsets in BD-(PL); 6. If ML = then done 7. else C = repeated application of BD-; Part II – Association © Prentice Hall 8. Rules Count C in Database; 1. 21
Sampling Example n n n n Find AR assuming s = 20% Ds = { t 1, t 2} Smalls = 10% PL = {{Bread}, {Jelly}, {Peanut. Butter}, {Bread, Jelly}, {Bread, Peanut. Butter}, {Jelly, Peanut. Butter}, {Bread, Jelly, Peanut. Butter}} BD-(PL)={{Beer}, {Milk}} ML = {{Beer}, {Milk}} Repeated application of BD- generates all remaining itemsets Part II – Association Rules © Prentice Hall 22
Sampling Adv/Disadv n Advantages: – Reduces number of database scans to one in the best case and two in worst. – Scales better. n Disadvantages: – Potentially large number of candidates in second pass Part II – Association Rules © Prentice Hall 23
Partitioning Divide database into partitions D 1, D 2, …, Dp n Apply Apriori to each partition n Any large itemset must be large in at least one partition. n Part II – Association Rules © Prentice Hall 24
Partitioning Algorithm 1. 2. 3. 4. 5. Divide D into partitions D 1, D 2, …, Dp; For I = 1 to p do Li = Apriori(Di); C = L 1 … L p; Count C on D to generate L; Part II – Association Rules © Prentice Hall 25
Partitioning Example L 1 ={{Bread}, {Jelly}, {Peanut. Butter}, {Bread, Jelly}, {Bread, Peanut. Butter}, {Jelly, Peanut. Butter}, {Bread, Jelly, Peanut. Butter}} D 1 D 2 S=10% Part II – Association Rules L 2 ={{Bread}, {Milk}, {Peanut. Butter}, {Bread, Milk}, {Bread, Peanut. Butter}, {Milk, Peanut. Butter}, {Bread, Milk, Peanut. Butter}, {Beer, Bread}, {Beer, Milk}} © Prentice Hall 26
Partitioning Adv/Disadv n Advantages: – Adapts to available main memory – Easily parallelized – Maximum number of database scans is two. n Disadvantages: – May have many candidates during second scan. Part II – Association Rules © Prentice Hall 27
Parallelizing AR Algorithms n n Based on Apriori Techniques differ: – What is counted at each site – How data (transactions) are distributed n Data Parallelism – Data partitioned – Count Distribution Algorithm n Task Parallelism – – Data and candidates partitioned Data Distribution Algorithm Part II – Association Rules © Prentice Hall 28
Count Distribution Algorithm(CDA) 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. 12. 13. 14. Place data partition at each site. In Parallel at each site do C 1 = Itemsets of size one in I; Count C 1; Broadcast counts to all sites; Determine global large itemsets of size 1, L 1; i = 1; Repeat i = i + 1; Ci = Apriori-Gen(Li-1); Count Ci; Broadcast counts to all sites; Determine global large itemsets of size i, Li; until no more large itemsets found; Part II – Association Rules © Prentice Hall 29
CDA Example Part II – Association Rules © Prentice Hall 30
Data Distribution Algorithm(DDA) 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. 12. 13. 14. 15. Place data partition at each site. In Parallel at each site do Determine local candidates of size 1 to count; Broadcast local transactions to other sites; Count local candidates of size 1 on all data; Determine large itemsets of size 1 for local candidates; Broadcast large itemsets to all sites; Determine L 1; i = 1; Repeat i = i + 1; Ci = Apriori-Gen(Li-1); Determine local candidates of size i to count; Count, broadcast, and find Li; until no more large itemsets found; Part II – Association Rules © Prentice Hall 31
DDA Example Part II – Association Rules © Prentice Hall 32
Incremental Association Rules Generate ARs in a dynamic database. n Problem: algorithms assume static database n Objective: n – Know large itemsets for D – Find large itemsets for D {D D} Must be large in either D or D D n Save Li and counts n Part II – Association Rules © Prentice Hall 33
Note on ARs n Many applications outside market basket data analysis – Prediction (telecom switch failure) – Web usage mining n Many different types of association rules – Temporal – Spatial – Causal Part II – Association Rules © Prentice Hall 34