Frequent Itemset Mining Methods

The Apriori algorithm

• Finds frequent itemsets using candidate generation.
• Seminal algorithm proposed by R. Agrawal and R. Srikant in 1994.
• Uses an iterative approach known as a level-wise search, where frequent k-itemsets are used to explore (k+1)-itemsets (a minimal code sketch of this loop follows below).
• Apriori property, used to reduce the search space: all nonempty subsets of a frequent itemset must also be frequent.
  - If P(I) < min_sup, then I is not frequent.
  - Since P(I ∪ A) <= P(I) < min_sup, the itemset I ∪ A is not frequent either.
• Antimonotone property: if a set cannot pass a test, all of its supersets will fail the same test as well.
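A minimal sketch of the level-wise loop described above, assuming transactions are given as sets of item IDs. The helper `naive_gen` is illustrative only (it unions pairs of frequent itemsets); the proper join + prune construction is sketched after the next slide.

```python
from collections import Counter

def naive_gen(prev_frequent, k):
    # Illustrative candidate generation: union pairs of frequent (k-1)-itemsets.
    # The slide's join + prune version is sketched after the next slide.
    return {a | b for a in prev_frequent for b in prev_frequent if len(a | b) == k}

def apriori(transactions, min_sup):
    """Level-wise search: frequent k-itemsets are used to explore (k+1)-itemsets."""
    transactions = [frozenset(t) for t in transactions]
    counts = Counter(frozenset([i]) for t in transactions for i in t)
    Lk = {s for s, c in counts.items() if c >= min_sup}        # L1
    frequent, k = set(Lk), 2
    while Lk:                                                  # stop when Lk is empty
        Ck = naive_gen(Lk, k)
        counts = Counter(c for t in transactions for c in Ck if c <= t)  # one scan per level
        Lk = {c for c in Ck if counts[c] >= min_sup}
        frequent |= Lk
        k += 1
    return frequent
```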

Using the Apriori property in the algorithm

• Let us look at how Lk-1 is used to find Lk, for k >= 2. There are two steps (see the code sketch after this list):
• Join:
  - To find Lk, a set Ck of candidate k-itemsets is generated by joining Lk-1 with itself.
  - The items within a transaction or itemset are sorted in lexicographic order; for a (k-1)-itemset li: li[1] < li[2] < … < li[k-1].
  - Two members of Lk-1 are joinable if their first k-2 items are in common: l1 and l2 are joined if (l1[1] = l2[1]) and (l1[2] = l2[2]) and … and (l1[k-2] = l2[k-2]) and (l1[k-1] < l2[k-1]); the last condition avoids duplicates.
  - The resulting itemset formed by joining l1 and l2 is (l1[1], l1[2], …, l1[k-2], l1[k-1], l2[k-1]).
• Prune:
  - Ck is a superset of Lk; Lk contains exactly those candidates from Ck that are frequent.
  - Scanning the database to determine the count of each candidate in Ck is heavy computation.
  - To reduce the size of Ck, the Apriori property is used: if any (k-1)-subset of a candidate k-itemset is not in Lk-1, then the candidate cannot be frequent either, so it can be removed from Ck (subset testing, e.g. with a hash tree).
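A sketch of the join and prune steps above, assuming each itemset is stored as a lexicographically sorted tuple; the name `apriori_gen` is the textbook's, the implementation is illustrative.

```python
from itertools import combinations

def apriori_gen(L_prev, k):
    """Generate candidate k-itemsets from the frequent (k-1)-itemsets L_prev."""
    L_prev = set(L_prev)
    Ck = set()
    for l1 in L_prev:
        for l2 in L_prev:
            # Join: first k-2 items equal, and l1's last item precedes l2's
            if l1[:k - 2] == l2[:k - 2] and l1[-1] < l2[-1]:
                candidate = l1 + (l2[-1],)
                # Prune: every (k-1)-subset must itself be in L_prev
                if all(s in L_prev for s in combinations(candidate, k - 1)):
                    Ck.add(candidate)
    return Ck
```

For example, `apriori_gen({("I1", "I2"), ("I1", "I3"), ("I2", "I3")}, 3)` yields `{("I1", "I2", "I3")}`, since all three 2-subsets are frequent.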

Example:

TID    List of item_IDs
T100   I1, I2, I5
T200   I2, I4
T300   I2, I3
T400   I1, I2, I4
T500   I1, I3
T600   I2, I3
T700   I1, I3
T800   I1, I2, I3, I5
T900   I1, I2, I3

• Scan D for the count of each candidate:
  - C1: I1 (6), I2 (7), I3 (6), I4 (2), I5 (2)
• Compare each candidate's support count with the minimum support count (min_sup = 2):
  - L1: I1 (6), I2 (7), I3 (6), I4 (2), I5 (2)
• Generate C2 candidates from L1 and scan D for the count of each candidate:
  - C2: {I1, I2} (4), {I1, I3} (4), {I1, I4} (1), …
• Compare each candidate's support count with the minimum support count:
  - L2: {I1, I2} (4), {I1, I3} (4), {I1, I5} (2), {I2, I3} (4), {I2, I4} (2), {I2, I5} (2)
• Generate C3 candidates from L2 using the join and prune steps:
  - Join: L2 x L2 = {{I1, I2, I3}, {I1, I2, I5}, {I1, I3, I5}, {I2, I3, I4}, {I2, I3, I5}, {I2, I4, I5}}
  - Prune: C3 = {{I1, I2, I3}, {I1, I2, I5}}
• Scan D for the count of each candidate:
  - C3: {I1, I2, I3} (2), {I1, I2, I5} (2)
• Compare each candidate's support count with the minimum support count:
  - L3: {I1, I2, I3} (2), {I1, I2, I5} (2)
• Generate C4 candidates from L3:
  - C4 = L3 x L3 = {{I1, I2, I3, I5}}
  - This itemset is pruned because its subset {I2, I3, I5} is not frequent, so C4 = ∅ and the algorithm terminates. (The first two levels of this trace are reproduced in code below.)
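The first two levels of the trace above, reproduced on the nine example transactions with min_sup = 2; a sketch, with the naive pair join standing in for the full join step.

```python
from collections import Counter
from itertools import combinations

# The nine transactions from the example table
D = [{"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"},
     {"I1", "I2", "I4"}, {"I1", "I3"}, {"I2", "I3"},
     {"I1", "I3"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3"}]
MIN_SUP = 2

def support_counts(candidates):
    """Scan D once and count how many transactions contain each candidate."""
    return Counter(c for t in D for c in candidates if set(c) <= t)

# C1 -> L1: every item meets min_sup, so all five survive
C1 = {(i,) for t in D for i in t}
L1 = {c for c, n in support_counts(C1).items() if n >= MIN_SUP}
print(sorted(L1))

# C2 -> L2: {I1,I4} (count 1) and pairs with count 0 are filtered out
C2 = {p for p in combinations(sorted({i for (i,) in L1}), 2)}
L2 = {c for c, n in support_counts(C2).items() if n >= MIN_SUP}
print(sorted(L2))   # the six pairs listed on the slide
```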

Generating association rules from frequent itemsets

• Find the frequent itemsets in the transaction database D.
• Generate strong association rules:
  - confidence(A => B) = P(B|A) = support_count(A ∪ B) / support_count(A)
  - support_count(A ∪ B): the number of transactions containing the itemset A ∪ B
  - support_count(A): the number of transactions containing the itemset A
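The confidence formula above, written out as a small sketch over the example database:

```python
def confidence(A, B, transactions):
    """conf(A => B) = support_count(A ∪ B) / support_count(A)."""
    A, AB = set(A), set(A) | set(B)
    count_A = sum(1 for t in transactions if A <= set(t))
    count_AB = sum(1 for t in transactions if AB <= set(t))
    return count_AB / count_A if count_A else 0.0

D = [{"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"},
     {"I1", "I2", "I4"}, {"I1", "I3"}, {"I2", "I3"},
     {"I1", "I3"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3"}]
print(confidence({"I1", "I5"}, {"I2"}, D))   # 1.0, i.e. 100%
```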

• For every nonempty subset s of l, output the rule s => (l - s) if support_count(l) / support_count(s) >= min_conf.
• Example: let l = {I1, I2, I5}.
  - The nonempty proper subsets are {I1, I2}, {I1, I5}, {I2, I5}, {I1}, {I2}, {I5}.
  - Generating association rules:
    I1 and I2 => I5    conf = 2/4 = 50%
    I1 and I5 => I2    conf = 2/2 = 100%
    I2 and I5 => I1    conf = 2/2 = 100%
    I1 => I2 and I5    conf = 2/6 = 33%
    I2 => I1 and I5    conf = 2/7 = 29%
    I5 => I1 and I2    conf = 2/2 = 100%
  - If min_conf is 70%, then only the second, third, and last rules above are output.
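A sketch of this rule-generation loop. The helper name `rules_from_itemset` is illustrative, and the support counts are hard-coded from the worked example rather than recomputed.

```python
from itertools import combinations

def rules_from_itemset(l, support_count, min_conf):
    """Emit s => (l - s) for every nonempty proper subset s of l whose
    confidence support_count(l) / support_count(s) meets min_conf."""
    l = frozenset(l)
    for r in range(1, len(l)):
        for s in map(frozenset, combinations(l, r)):
            conf = support_count[l] / support_count[s]
            if conf >= min_conf:
                yield sorted(s), sorted(l - s), conf

# Support counts taken from the worked example above
sc = {frozenset(k): v for k, v in {
    ("I1",): 6, ("I2",): 7, ("I5",): 2,
    ("I1", "I2"): 4, ("I1", "I5"): 2, ("I2", "I5"): 2,
    ("I1", "I2", "I5"): 2}.items()}

for lhs, rhs, conf in rules_from_itemset({"I1", "I2", "I5"}, sc, 0.7):
    print(lhs, "=>", rhs, f"{conf:.0%}")   # prints the three surviving rules
```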

Improving the efficiency of Apriori

• Hash-based technique: reduce the size of the candidate k-itemsets Ck, for k > 1.
  - Generate all of the 2-itemsets for each transaction and hash them into the buckets of a hash table structure, e.g. H(x, y) = ((order of x) × 10 + (order of y)) mod 7 (see the sketch below).
  - A 2-itemset whose bucket count is below the support threshold cannot be frequent, so it can be removed from C2.
• Transaction reduction: a transaction that does not contain any frequent k-itemsets cannot contain any frequent (k+1)-itemsets, so it can be skipped in later scans.
• Partitioning: partition the data to find candidate itemsets.
• Sampling: mine on a subset S of the given data D.
  - Search for frequent itemsets in the sample S instead of D.
  - Use a lower support threshold to reduce the chance of missing frequent itemsets.
• Dynamic itemset counting: add candidate itemsets at different points during a scan.
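A sketch of the hash-based bucket counting, assuming "order of x" means an item's numeric suffix (order of I1 is 1, and so on); the function names are illustrative.

```python
from collections import Counter
from itertools import combinations

def H(x, y):
    # The slide's hash: ((order of x) * 10 + (order of y)) mod 7,
    # assuming order("Ij") = j, i.e. the numeric suffix of the item ID.
    return (int(x[1:]) * 10 + int(y[1:])) % 7

D = [{"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"},
     {"I1", "I2", "I4"}, {"I1", "I3"}, {"I2", "I3"},
     {"I1", "I3"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3"}]
MIN_SUP = 2

# One pass over D: hash every 2-itemset of every transaction into a bucket
buckets = Counter()
for t in D:
    for x, y in combinations(sorted(t), 2):
        buckets[H(x, y)] += 1

# A 2-itemset whose bucket count is below min_sup cannot be frequent,
# so it is dropped from C2 without a dedicated counting pass.
def may_be_frequent(x, y):
    return buckets[H(x, y)] >= MIN_SUP

print(may_be_frequent("I1", "I2"))   # True: its bucket count is at least its support
```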

Mining frequent itemsets without candidate generation

• The candidate generate-and-test method (Apriori):
  - Reduces the size of the candidate sets, giving good performance.
  - However, it may still need to generate a huge number of candidate sets.
  - It may also need to repeatedly scan the database and check a large set of candidates by pattern matching.
• The frequent-pattern growth method (FP-growth) avoids candidate generation by compressing the database into a frequent-pattern tree (FP-tree).

Example: mining the FP-tree built from the nine transactions above.

• I5:
  - I5 is the suffix, so its two prefix paths are (I2, I1: 1) and (I2, I1, I3: 1).
  - Generation of the conditional FP-tree: (I2: 2, I1: 2); I3 is removed because its count is below 2.
  - The combinations of frequent patterns: {I2, I5: 2}, {I1, I5: 2}, {I2, I1, I5: 2}.
• I4:
  - For I4 there are two prefix paths: {{I2, I1: 1}, {I2: 1}}.
  - Conditional FP-tree: (I2: 2).
  - The frequent pattern: {I2, I4: 2}.
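A sketch of this suffix-mining step. The helper `mine_suffix` is illustrative and assumes the conditional FP-tree collapses to a single path, which holds for I5 and I4 here; the full FP-growth algorithm recurses on multi-path trees such as I3's.

```python
from collections import Counter
from itertools import combinations

def mine_suffix(cond_pattern_base, suffix, min_sup):
    """Given a conditional pattern base [(prefix_path, count), ...] for `suffix`,
    keep items meeting min_sup (the conditional FP-tree) and, assuming the
    tree is a single path, emit every item combination + suffix as a pattern."""
    counts = Counter()
    for path, n in cond_pattern_base:
        for item in path:
            counts[item] += n
    survivors = [i for i, n in counts.items() if n >= min_sup]  # e.g. I3 is dropped
    patterns = {}
    for r in range(1, len(survivors) + 1):
        for combo in combinations(survivors, r):
            # On a single path, a combination's count is its rarest item's count
            patterns[frozenset(combo) | {suffix}] = min(counts[i] for i in combo)
    return patterns

# I5's conditional pattern base from the example
base_I5 = [(("I2", "I1"), 1), (("I2", "I1", "I3"), 1)]
print(mine_suffix(base_I5, "I5", 2))
# -> {I2, I5}: 2, {I1, I5}: 2, {I2, I1, I5}: 2
```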

Item   Conditional Pattern Base            Conditional FP-tree        Frequent Patterns Generated
I5     {{I2, I1: 1}, {I2, I1, I3: 1}}      (I2: 2, I1: 2)             {I2, I5: 2}, {I1, I5: 2}, {I2, I1, I5: 2}
I4     {{I2, I1: 1}, {I2: 1}}              (I2: 2)                    {I2, I4: 2}
I3     {{I2, I1: 2}, {I2: 2}, {I1: 2}}     (I2: 4, I1: 2), (I1: 2)    {I2, I3: 4}, {I1, I3: 4}, {I2, I1, I3: 2}
I1     {{I2: 4}}                           (I2: 4)                    {I2, I1: 4}

Mining frequent itemsets using the vertical data format

• Transform the horizontal data format of the transaction database D into a vertical data format:

Itemset   TID_set
I1        {T100, T400, T500, T700, T800, T900}
I2        {T100, T200, T300, T400, T600, T800, T900}
I3        {T300, T500, T600, T700, T800, T900}
I4        {T200, T400}
I5        {T100, T800}
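A minimal sketch of the transformation and of why it helps: in the vertical format, the support of any itemset is the size of the intersection of its items' TID-sets, so no further scans of D are needed after the one transformation pass.

```python
from collections import defaultdict

# Horizontal format: each TID maps to its set of items
D = {"T100": {"I1", "I2", "I5"}, "T200": {"I2", "I4"}, "T300": {"I2", "I3"},
     "T400": {"I1", "I2", "I4"}, "T500": {"I1", "I3"}, "T600": {"I2", "I3"},
     "T700": {"I1", "I3"}, "T800": {"I1", "I2", "I3", "I5"}, "T900": {"I1", "I2", "I3"}}

# Vertical format: each item maps to the set of TIDs containing it
tidsets = defaultdict(set)
for tid, items in D.items():
    for item in items:
        tidsets[item].add(tid)

# support({I1, I2}) = |TID_set(I1) ∩ TID_set(I2)|
sup_I1_I2 = len(tidsets["I1"] & tidsets["I2"])
print(sorted(tidsets["I5"]), sup_I1_I2)   # ['T100', 'T800'] 4
```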

Thank you