COMP 5331: Association Rule Mining
Prepared by Raymond Wong
Presented by Raymond Wong
raywong@cse

Introduction

Supermarket Application

Each customer (e.g., Raymond, David, Emily, Derek) has a purchase history, or transaction, made up of items such as apple, coke, coffee, diaper, milk, biscuit and beer.

We want to find some associations between items.

An interesting association: diaper and beer are usually bought together. Why? Is it strange?

Reasons: this pattern occurs frequently in the early evening.

[Figure: timeline with the labels Morning (home, "Please buy diapers"), Daytime (office, working…) and Early Evening (office to home).]

Introduction

Applications of Association Rule Mining
- Supermarket
- Web mining
- Medical analysis
- Bioinformatics
- Network analysis (e.g., denial-of-service (DoS))
- Programming pattern finding

Outline
- Association Rule Mining
  - Problem Definition
  - NP-hardness
  - Algorithm Apriori
    - Properties
    - Algorithm

Association Rule Mining

TID  A  B  C  D  E  Items
t1   1  0  0  1  0  A, D
t2   1  1  0  1  1  A, B, D, E
t3   0  1  1  0  0  B, C
t4   1  1  1  1  1  A, B, C, D, E
t5   0  1  1  0  1  B, C, E

Single items (or simply items): A, B, C, D, E

Itemsets:
- {B, C}: 2-itemset
- {A, B, C}: 3-itemset
- {B, C, D}: 3-itemset
- {A}: 1-itemset

Association Rule Mining

The support of an itemset is the number of transactions that contain every item in the itemset.

Large itemsets (also called frequent itemsets): itemsets with support >= a threshold (e.g., 3).

With threshold 3 on the transaction table t1 to t5:
- {A}: support = 3, large
- {B}: support = 4, large
- {B, C}: support = 3, large (a 3-frequent itemset of size 2)
- {A, B, C}: support = 1, NOT large (a 1-frequent itemset of size 3)

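The support counts above can be checked with a minimal Python sketch. It assumes the five transactions t1 to t5 are stored as sets of items; the names TRANSACTIONS, support and is_large are illustrative and do not come from the slides.

```python
# Example database from the slides: each transaction is the set of items bought.
TRANSACTIONS = {
    "t1": {"A", "D"},
    "t2": {"A", "B", "D", "E"},
    "t3": {"B", "C"},
    "t4": {"A", "B", "C", "D", "E"},
    "t5": {"B", "C", "E"},
}


def support(itemset, transactions=TRANSACTIONS):
    """Number of transactions that contain every item of `itemset`."""
    items = set(itemset)
    return sum(1 for bought in transactions.values() if items <= bought)


def is_large(itemset, threshold=3):
    """An itemset is "large" (frequent) if its support reaches the threshold."""
    return support(itemset) >= threshold


for s in [{"A"}, {"B"}, {"B", "C"}, {"A", "B", "C"}]:
    print(sorted(s), support(s), is_large(s))
```

The subset test `items <= bought` is what "the transaction contains the itemset" means here.
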
Association Rule Mining

Association rules:

For a rule X → Y, the support is the support of the itemset X ∪ Y, and the confidence is support(X ∪ Y) / support(X).

- {B, C} → E: Support = 2, Confidence = 2/3 = 66.7%
- B → C: Support = 3, Confidence = 3/4 = 75%

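As a worked check of the two rules, here is a small Python sketch. It assumes the same five transactions; the function names rule_support and confidence are illustrative, and exact fractions are used so that 2/3 and 3/4 appear without rounding.

```python
from fractions import Fraction

# Transactions t1..t5 from the example table.
TRANSACTIONS = [
    {"A", "D"},
    {"A", "B", "D", "E"},
    {"B", "C"},
    {"A", "B", "C", "D", "E"},
    {"B", "C", "E"},
]


def support(itemset):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for bought in TRANSACTIONS if set(itemset) <= bought)


def rule_support(lhs, rhs):
    """Support of the rule lhs -> rhs: support of the union of both sides."""
    return support(set(lhs) | set(rhs))


def confidence(lhs, rhs):
    """Fraction of the transactions containing lhs that also contain rhs.

    Assumes support(lhs) > 0; otherwise the ratio is undefined.
    """
    return Fraction(rule_support(lhs, rhs), support(lhs))


print(rule_support({"B", "C"}, {"E"}), confidence({"B", "C"}, {"E"}))  # 2 2/3
print(rule_support({"B"}, {"C"}), confidence({"B"}, {"C"}))            # 3 3/4
```
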
Association Rule Mining

Problem: We want to find some “interesting” association rules, i.e., association rules with
1. Support >= a threshold (e.g., 3)
2. Confidence >= another threshold (e.g., 50%)

Examples:
- {B, C} → E: Support = 2, Confidence = 2/3 = 66.7%
- B → C: Support = 3, Confidence = 3/4 = 75%

How can we find all “interesting” association rules?
- Step 1: find all “large” itemsets (i.e., itemsets with support >= 3) (e.g., itemset {B, C} has support = 3).
- Step 2: from all “large” itemsets found in Step 1, find the association rules with confidence >= 50%.

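The two steps can be sketched in Python by brute force, purely as an illustration. This is not the Apriori algorithm: it tests every possible itemset, which is only feasible for a tiny example. The names large_itemsets and interesting_rules are assumptions made for this sketch.

```python
from itertools import combinations

TRANSACTIONS = [
    {"A", "D"},
    {"A", "B", "D", "E"},
    {"B", "C"},
    {"A", "B", "C", "D", "E"},
    {"B", "C", "E"},
]
ITEMS = sorted(set().union(*TRANSACTIONS))


def support(itemset):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for bought in TRANSACTIONS if set(itemset) <= bought)


def large_itemsets(min_support):
    """Step 1 (brute force): test every non-empty itemset."""
    return [
        frozenset(c)
        for k in range(1, len(ITEMS) + 1)
        for c in combinations(ITEMS, k)
        if support(c) >= min_support
    ]


def interesting_rules(min_support, min_confidence):
    """Step 2: split each large itemset into lhs -> rhs, keep confident rules."""
    rules = []
    for itemset in large_itemsets(min_support):
        for r in range(1, len(itemset)):
            for lhs in combinations(sorted(itemset), r):
                rhs = itemset - set(lhs)
                conf = support(itemset) / support(lhs)
                if conf >= min_confidence:
                    rules.append((set(lhs), set(rhs), support(itemset), conf))
    return rules


for lhs, rhs, sup, conf in interesting_rules(3, 0.5):
    print(sorted(lhs), "->", sorted(rhs), sup, conf)
```

Because the rule support equals the support of the whole itemset, every rule built from a large itemset already meets the support threshold, so Step 2 only has to check confidence.
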
NP-Completeness

Recall Step 1: find all “large” itemsets (i.e., itemsets with support >= 3).

Problem: to find all “large” itemsets (i.e., itemsets with support >= 3).

Equivalently: to find all “large” J-itemsets (i.e., J-itemsets with support >= 3) for each positive integer J.

NP-Completeness

Finding Large J-itemsets
- INSTANCE: A database of transaction records, and positive integers f and J.
- QUESTION: Is there an f-frequent itemset of size J?

[Table: an example 0/1 transaction database over the items Egg, Rice, Oil and Juice.]

NP-Completeness

A known NP-complete problem: Balanced Complete Bipartite Subgraph
- INSTANCE: Bipartite graph G = (V, E), positive integer K ≤ |V|.
- QUESTION: Are there two disjoint subsets V1, V2 ⊆ V such that |V1| = |V2| = K and such that, for each u ∈ V1 and each v ∈ V2, {u, v} ∈ E?

[Figure: a bipartite graph with vertices A, B, C, D on one side and E, F, G, H on the other.]

NP-Completeness

We can transform the graph problem into the itemset problem. Here V1 and V2 denote the two sides of the bipartite graph.
- For each vertex in V1, create a transaction.
- For each vertex in V2, create an item.
- For each edge (u, v), create a purchase of item v in transaction u.
- Set f = K.
- Set J = K.

The question becomes: is there a K-frequent itemset of size K?

[Figure: the bipartite graph on vertices A, B, C, D and E, F, G, H, next to the 0/1 transaction table built from it.]

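The transformation can be sketched in Python as follows. The graph used here is hypothetical, since the edges of the graph drawn on the slide are not recoverable from the text. The function names are illustrative, and "f-frequent" is assumed to mean "support at least f".

```python
from itertools import combinations


def graph_to_transactions(side_one, edges):
    """One transaction per vertex u of one side; it buys item v for each edge (u, v)."""
    transactions = {u: set() for u in side_one}
    for u, v in edges:
        transactions[u].add(v)
    return transactions


def has_frequent_itemset(transactions, items, f, size):
    """Brute force: is there an itemset with `size` items and support >= f?"""
    for candidate in combinations(sorted(items), size):
        count = sum(1 for bought in transactions.values() if set(candidate) <= bought)
        if count >= f:
            return True
    return False


# Hypothetical bipartite graph (NOT the one drawn on the slide).
side_one = ["A", "B", "C", "D"]  # these vertices become transactions
side_two = ["E", "F", "G", "H"]  # these vertices become items
edges = [("A", "E"), ("A", "F"), ("B", "E"), ("B", "F"), ("C", "G"), ("D", "H")]

db = graph_to_transactions(side_one, edges)
for K in (2, 3):
    # Set f = K and J = K, as in the transformation.
    print(K, has_frequent_itemset(db, side_two, f=K, size=K))
```

In this graph, A and B are both adjacent to E and F, so a balanced complete bipartite subgraph exists for K = 2 and the itemset {E, F} has support 2. No such subgraph exists for K = 3. The brute-force check is exponential in general, which is consistent with the problem being NP-hard.
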
NP-Completeness

- It is easy to verify that solving the transformed instance of Finding Large K-itemsets is equivalent to solving the original instance of Balanced Complete Bipartite Subgraph.
- Since the transformation takes polynomial time, Finding Large K-itemsets is NP-hard.

Methods to prove that a problem P is NP-hard
- Step 1: Find an existing NP-complete problem (e.g., Balanced Complete Bipartite Subgraph).
- Step 2: Transform this NP-complete problem to P (in polynomial time).
- Step 3: Show that solving the “transformed” problem is equivalent to solving the “original” NP-complete problem.

Apriori

Suppose we want to find all “large” itemsets (e.g., itemsets with support >= 3).

{B, C} is large: the support of {B, C} is 3. Is {B} large? Is {C} large?

Yes: every transaction that contains {B, C} also contains {B} and contains {C}, so their supports are at least 3.

Property 1: If an itemset S is large, then any proper subset of S must be large.

Apriori

Suppose we want to find all “large” itemsets (e.g., itemsets with support >= 3).

{B, C, E} is NOT large: the support of {B, C, E} is 2. Is {A, B, C, E} large? Is {B, C, D, E} large?

No: every transaction that contains a superset of {B, C, E} also contains {B, C, E}, so the support of the superset is at most 2.

Property 2: If an itemset S is NOT large, then any proper superset of S must NOT be large.

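Both properties can be checked on the example table with a short Python sketch. It only illustrates the two examples given on the slides and is not a proof; the helper proper_subsets is a name introduced here.

```python
from itertools import combinations

TRANSACTIONS = [
    {"A", "D"},
    {"A", "B", "D", "E"},
    {"B", "C"},
    {"A", "B", "C", "D", "E"},
    {"B", "C", "E"},
]


def support(itemset):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for bought in TRANSACTIONS if set(itemset) <= bought)


def proper_subsets(itemset):
    """All non-empty proper subsets of `itemset`."""
    items = sorted(itemset)
    return [set(c) for k in range(1, len(items)) for c in combinations(items, k)]


# Property 1: {B, C} is large, so each of its proper subsets is large too.
for subset in proper_subsets({"B", "C"}):
    print(sorted(subset), support(subset))

# Property 2: {B, C, E} is not large, so none of its supersets can be large.
for superset in [{"A", "B", "C", "E"}, {"B", "C", "D", "E"}]:
    print(sorted(superset), support(superset))
```
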
Apriori

Suppose we want to find all “large” itemsets (e.g., itemsets with support >= 3).

Item  Count
A     3
B     4
C     3
D     3
E     3

Thus, {A}, {B}, {C}, {D} and {E} are “large” itemsets of size 1 (or “large” 1-itemsets).

We set L1 = {{A}, {B}, {C}, {D}, {E}}.

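A minimal Python sketch of this first pass, assuming the transactions are stored as sets. The function name large_1_itemsets is illustrative.

```python
from collections import Counter

TRANSACTIONS = [
    {"A", "D"},
    {"A", "B", "D", "E"},
    {"B", "C"},
    {"A", "B", "C", "D", "E"},
    {"B", "C", "E"},
]


def large_1_itemsets(transactions, min_support):
    """Count each item in one scan and keep the items that reach the threshold."""
    counts = Counter(item for bought in transactions for item in bought)
    large = [(item,) for item in sorted(counts) if counts[item] >= min_support]
    return counts, large


counts, L1 = large_1_itemsets(TRANSACTIONS, min_support=3)
print(dict(sorted(counts.items())))
print(L1)
```

Each 1-itemset is stored as a one-element tuple so that larger itemsets can later be represented as sorted tuples.
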
Apriori

Suppose we want to find all “large” itemsets (e.g., itemsets with support >= 3).

Thus, {A}, {B}, {C}, {D} and {E} are “large” itemsets of size 1 (or “large” 1-itemsets). We set L1 = {{A}, {B}, {C}, {D}, {E}}.

Overall procedure:
- Start from L1.
- Large 2-itemset generation:
  - Candidate Generation (1. Join Step, 2. Prune Step) produces C2.
  - “Large” Itemset Generation (Counting Step) produces L2.
- Large 3-itemset generation:
  - Candidate Generation produces C3.
  - “Large” Itemset Generation produces L3.
- …

Candidate Generation
- Join Step
- Prune Step

Join Step

Property 1: If an itemset S is large, then any proper subset of S must be large.
Property 2: If an itemset S is NOT large, then any proper superset of S must NOT be large.

Suppose we know that itemset {B, C} and itemset {B, E} are large (i.e., they are in L2).
It is possible that itemset {B, C, E} is also large (i.e., it is a candidate in C3).

Join Step
- Input: Lk-1, the set of all large (k-1)-itemsets
- Output: Ck, a set of candidate k-itemsets
- Algorithm:

    insert into Ck
    select p.item1, p.item2, …, p.itemk-1, q.itemk-1
    from Lk-1 p, Lk-1 q
    where p.item1 = q.item1, p.item2 = q.item2, …, p.itemk-2 = q.itemk-2,
          p.itemk-1 < q.itemk-1

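The join can be sketched in Python as below, assuming each itemset is stored as a tuple of items in sorted order. That representation and the name join_step are choices made for this sketch. The L2 used in the example is the set of large 2-itemsets computed from the example table with threshold 3.

```python
def join_step(large_prev):
    """Join L_{k-1} with itself.

    Two (k-1)-itemsets p and q are merged when they agree on their first
    k-2 items and the last item of p is smaller than the last item of q.
    """
    prev = sorted(large_prev)
    candidates = []
    for i, p in enumerate(prev):
        for q in prev[i + 1:]:
            if p[:-1] == q[:-1] and p[-1] < q[-1]:
                candidates.append(p + (q[-1],))
    return candidates


# L2 for the example table with threshold 3.
L2 = [("A", "D"), ("B", "C"), ("B", "E")]
print(join_step(L2))  # [('B', 'C', 'E')]
```

Only {B, C} and {B, E} share their first item, so the join produces the single candidate {B, C, E}.
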
Prune Step

Property 1: If an itemset S is large, then any proper subset of S must be large.
Property 2: If an itemset S is NOT large, then any proper superset of S must NOT be large.

Suppose we know that itemset {B, C} and itemset {B, E} are large (i.e., they are in L2).
It is possible that itemset {B, C, E} is also large (i.e., it is a candidate in C3).

Suppose we know that {C, E} is not large. Then, by Property 2, we can prune {B, C, E} from C3.

Prune Step

    forall itemsets c ∈ Ck (from Join Step) do
        forall (k-1)-subsets s of c do
            if (s not in Lk-1) then
                delete c from Ck

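A Python sketch of the prune loop, again assuming itemsets are sorted tuples. The name prune_step is illustrative.

```python
from itertools import combinations


def prune_step(candidates, large_prev):
    """Keep a candidate k-itemset only if all its (k-1)-subsets are in L_{k-1}."""
    known = set(large_prev)
    return [
        c
        for c in candidates
        if all(s in known for s in combinations(c, len(c) - 1))
    ]


# L2 for the example table with threshold 3, and the candidate from the join.
L2 = {("A", "D"), ("B", "C"), ("B", "E")}
C3 = [("B", "C", "E")]
print(prune_step(C3, L2))  # []
```

Here the subset {C, E} is missing from L2, so {B, C, E} is removed and C3 becomes empty. Since `combinations` preserves the order of its input, the subsets of a sorted tuple are themselves sorted tuples and can be looked up directly.
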
Counting Step
- After candidate generation (i.e., Join Step and Prune Step), we are given a set of candidate itemsets.
- We need to verify whether these candidate itemsets are large or not.
- We have to scan the database to obtain the count of each itemset in the candidate set.
- Algorithm:
  - For each itemset c in Ck, obtain the count of c (from the database).
  - If the count of c is smaller than the given threshold, remove it from Ck.
  - The remaining itemsets in Ck correspond to Lk.

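Putting the Counting Step together with candidate generation gives the level-by-level loop. The following Python sketch is one possible arrangement under stated assumptions (itemsets as sorted tuples, transactions as sets). The function names count_step, generate_candidates and apriori are introduced here and are not from the slides.

```python
from itertools import combinations

TRANSACTIONS = [
    {"A", "D"},
    {"A", "B", "D", "E"},
    {"B", "C"},
    {"A", "B", "C", "D", "E"},
    {"B", "C", "E"},
]


def count_step(candidates, transactions, min_support):
    """Scan the database once and keep the candidates that are large."""
    counts = {c: 0 for c in candidates}
    for bought in transactions:
        for c in candidates:
            if set(c) <= bought:
                counts[c] += 1
    return {c: n for c, n in counts.items() if n >= min_support}


def generate_candidates(large_prev):
    """Join Step followed by Prune Step, on itemsets stored as sorted tuples."""
    prev = sorted(large_prev)
    joined = [
        p + (q[-1],)
        for i, p in enumerate(prev)
        for q in prev[i + 1:]
        if p[:-1] == q[:-1]
    ]
    known = set(prev)
    return [
        c for c in joined
        if all(s in known for s in combinations(c, len(c) - 1))
    ]


def apriori(transactions, min_support):
    """Return {itemset: support} for every large itemset, level by level."""
    items = sorted(set().union(*transactions))
    level = count_step([(i,) for i in items], transactions, min_support)
    result = dict(level)
    while level:
        candidates = generate_candidates(level)
        level = count_step(candidates, transactions, min_support)
        result.update(level)
    return result


for itemset, count in sorted(apriori(TRANSACTIONS, 3).items()):
    print(itemset, count)
```

On the example table with threshold 3, the loop stops at level 3: the only joined candidate, {B, C, E}, is pruned, so no database scan is needed for C3.
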