GenMax
From: “Efficiently Mining Maximal Frequent Itemsets” by Karam Gouda & Mohammed J. Zaki
Zeev Dvir – dvirzeev@post.tau.ac.il

The Problem
• Given a large database of item transactions, find all frequent itemsets.
• A frequent itemset is a set of items that occurs in at least a user-specified percentage of the transactions in the database.
• We call this percentage min_sup (for minimum support).

• A maximal frequent itemset is a frequent itemset that has no frequent proper superset.
• FI := frequent itemsets; MFI := maximal frequent itemsets
• Fact: |MFI| ≤ |FI|, and in practice |MFI| << |FI|.
GenMax is an algorithm that finds the exact MFI.

Example
Min_sup = 3
[Figure: item/tid table for items A to D over transactions 1 to 7, with the itemset lattice over {A, B, C, D}, from ABCD down to the single items A, B, C, D]
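To make the FI/MFI distinction concrete, here is a brute-force sketch. It is not GenMax; it simply enumerates every itemset. The transactions are an assumption: they are read off the tidsets listed on the "Vertical Database" slide, and min_sup is treated as an absolute count of 3. The function names are illustrative only.

```python
from itertools import combinations

# Assumed data: transactions reconstructed from the tidsets on the
# "Vertical Database" slide (tid -> items).
TRANSACTIONS = {
    1: {"A", "B", "C"},
    2: {"C", "D"},
    3: {"A", "B", "C"},
    4: {"A", "B", "C", "D"},
    5: {"A"},
    6: {"B", "D"},
    7: {"C"},
}
MIN_SUP = 3  # absolute count, as in the example


def support(itemset):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for items in TRANSACTIONS.values() if itemset <= items)


def frequent_itemsets():
    """Enumerate every non-empty itemset and keep the frequent ones."""
    all_items = sorted(set().union(*TRANSACTIONS.values()))
    return [
        frozenset(combo)
        for size in range(1, len(all_items) + 1)
        for combo in combinations(all_items, size)
        if support(frozenset(combo)) >= MIN_SUP
    ]


def maximal(itemsets):
    """Keep only the itemsets that have no proper superset in `itemsets`."""
    return [s for s in itemsets if not any(s < other for other in itemsets)]


FI = frequent_itemsets()
MFI = maximal(FI)
print(len(FI), sorted("".join(sorted(m)) for m in MFI))  # 8 ['ABC', 'D']
```

On this data there are 8 frequent itemsets but only 2 maximal ones, ABC and D. Every other frequent itemset is a subset of one of them.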

Some Useful Definitions
• The combine-set of an itemset I is the set of items that can each be added to I to create a frequent itemset.
• In the previous example, the combine-set of the itemset {A} is {B, C}.
• The combine-set of the empty itemset is called F1; it is the set of frequent items, that is, the frequent itemsets of size 1.
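The definition can be checked with a few lines of code. This is a minimal sketch, assuming the item tidsets listed on the "Vertical Database" slide and min_sup as an absolute count of 3; `tidset` and `combine_set` are illustrative names, not functions from the paper.

```python
# Assumed data: item -> set of transaction ids, as on the "Vertical Database" slide.
TIDSETS = {
    "A": {1, 3, 4, 5},
    "B": {1, 3, 4, 6},
    "C": {1, 2, 3, 4, 7},
    "D": {2, 4, 6},
}
MIN_SUP = 3
ALL_TIDS = set().union(*TIDSETS.values())


def tidset(itemset):
    """Transactions that contain every item of `itemset`."""
    tids = set(ALL_TIDS)
    for item in itemset:
        tids &= TIDSETS[item]
    return tids


def combine_set(itemset):
    """Items whose addition to `itemset` still yields a frequent itemset."""
    return {
        item
        for item in TIDSETS
        if item not in itemset and len(tidset(itemset | {item})) >= MIN_SUP
    }


print(sorted(combine_set(frozenset())))       # F1: ['A', 'B', 'C', 'D']
print(sorted(combine_set(frozenset({"A"}))))  # ['B', 'C']
```

D is not in the combine-set of {A} because AD occurs only in transaction 4.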

Improvement
• At each level, sort the combine-set (C) in increasing order of support.
• An itemset with low support has a smaller chance of producing a large combine-set at the next level.
• The sooner we prune the tree, the more work we save.
• This heuristic was first used in MaxMiner.
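The ordering heuristic can be sketched inside a simple backtracking search for maximal itemsets. This is a simplified illustration, not the paper's pseudocode: support is counted by scanning the transactions, the data is assumed from the tidsets on the "Vertical Database" slide, and `mine_maximal` is a hypothetical name.

```python
# Assumed data (transactions 1..7) and an absolute min_sup of 3.
TRANSACTIONS = [
    {"A", "B", "C"}, {"C", "D"}, {"A", "B", "C"}, {"A", "B", "C", "D"},
    {"A"}, {"B", "D"}, {"C"},
]
MIN_SUP = 3


def support(itemset):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in TRANSACTIONS if itemset <= t)


def mine_maximal(itemset, candidates, mfi):
    """Extend `itemset` with items from `candidates`; collect maximal sets in `mfi`."""
    # Combine-set: candidates that keep the itemset frequent.
    combine = [x for x in candidates if support(itemset | {x}) >= MIN_SUP]
    # Heuristic from the slide: try the lowest-support extensions first.
    combine.sort(key=lambda x: support(itemset | {x}))
    if not combine:
        # Leaf: record the itemset unless a known maximal set already covers it.
        if itemset and not any(itemset <= m for m in mfi):
            mfi.append(itemset)
        return
    for i, x in enumerate(combine):
        rest = combine[i + 1:]
        # If even itemset + x + all remaining items is covered, so is every
        # later branch at this level, and the loop can stop.
        if any((itemset | {x} | set(rest)) <= m for m in mfi):
            return
        mine_maximal(itemset | {x}, rest, mfi)


mfi = []
mine_maximal(frozenset(), sorted(set().union(*TRANSACTIONS)), mfi)
print(["".join(sorted(m)) for m in mfi])  # ['D', 'ABC']
```

D has the lowest support, so it is explored first and becomes a leaf immediately. Once ABC is found, the remaining branches AC and BC are cut by the superset check.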

Bottlenecks
1. Superset checking: the best algorithms for superset checking give an amortized bound of [...] per operation. That is costly if the MFI contains many itemsets.
2. Frequency testing: how can we make frequency testing faster?

Optimizing Superset Checking
• A technique called “Progressive Focusing” narrows down the group of potential supersets as the recursive calls are made.
• LMFI := local MFI
• Before each recursive call, we construct the LMFI for the next call from the current LMFI and the newly added item.
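One way to read the last bullet is sketched below, under the assumption that the next LMFI keeps exactly those itemsets of the current LMFI that contain the newly added item. This is safe because any superset of the extended itemset must contain that item. The maximal itemsets used here are hypothetical, and the sketch leaves out how newly found maximal itemsets are merged back after a recursive call.

```python
def project_lmfi(lmfi, new_item):
    """Keep only the maximal itemsets that contain `new_item`."""
    return [m for m in lmfi if new_item in m]


def has_superset(itemset, lmfi):
    """True if some itemset in `lmfi` contains all of `itemset`."""
    return any(itemset <= m for m in lmfi)


# Hypothetical maximal itemsets found so far.
mfi_so_far = [frozenset("FGHI"), frozenset("FGJ"), frozenset("KLM"), frozenset("GK")]

# Descend along the path F -> FG -> FGH, narrowing the LMFI at each step.
lmfi_f = project_lmfi(mfi_so_far, "F")   # FGHI, FGJ
lmfi_fg = project_lmfi(lmfi_f, "G")      # FGHI, FGJ
lmfi_fgh = project_lmfi(lmfi_fg, "H")    # FGHI

print(len(mfi_so_far), len(lmfi_f), len(lmfi_fgh))     # 4 2 1
print(has_superset(frozenset("FGH"), lmfi_fgh))        # True
```

A superset check for an extension of FGH now scans one itemset instead of four.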

LMFI Example
[Figure: search tree over the items F, G, H, I, J with nodes FG, FGH, FGHI, FGHJ, FGI, …]

Frequency Testing Optimization
• GenMax uses a “vertical database format”.
• For each item, we store the set of all transactions that contain it.
• This set is called a tidset (transaction ID set).
• This makes support computations easier, because we do not have to scan the entire database.

Vertical Database
Items A to D over transactions 1 to 7, stored as one tidset per item:
A: {1, 3, 4, 5}
B: {1, 3, 4, 6}
C: {1, 2, 3, 4, 7}
D: {2, 4, 6}
t(A) = {1, 3, 4, 5}
t(AC) = {1, 3, 4}
supp(I) = |t(I)|
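The conversion to the vertical format and the support computation can be sketched as follows. The horizontal database here is an assumption, reconstructed from the tidsets on this slide; `to_vertical` and `t` are illustrative names.

```python
from collections import defaultdict

# Assumed horizontal database: tid -> items in that transaction.
HORIZONTAL = {
    1: {"A", "B", "C"},
    2: {"C", "D"},
    3: {"A", "B", "C"},
    4: {"A", "B", "C", "D"},
    5: {"A"},
    6: {"B", "D"},
    7: {"C"},
}


def to_vertical(db):
    """Build one tidset per item from a horizontal database."""
    tidsets = defaultdict(set)
    for tid, items in db.items():
        for item in items:
            tidsets[item].add(tid)
    return dict(tidsets)


def t(itemset, tidsets):
    """Tidset of a non-empty itemset: intersection of its items' tidsets."""
    items = list(itemset)
    result = set(tidsets[items[0]])
    for item in items[1:]:
        result &= tidsets[item]
    return result


VERTICAL = to_vertical(HORIZONTAL)
print(sorted(VERTICAL["A"]))          # [1, 3, 4, 5]
print(sorted(t("AC", VERTICAL)))      # [1, 3, 4]
print(len(t("AC", VERTICAL)))         # supp(AC) = 3
```

After the one-time conversion, the support of AC comes from a single set intersection rather than a pass over all seven transactions.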

Example: the itemset AB has the combine-set {C, …, E}, which yields the itemsets ABC, ABD, …, ABE with tidsets t(ABC), …, t(ABE).
Each item y in the combine-set of an itemset I actually represents the itemset I ∪ {y}, and stores the tidset associated with it.
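A sketch of a combine-set that carries tidsets is given below. It relies on the identity t(I ∪ {x, y}) = t(I ∪ {x}) ∩ t(I ∪ {y}), so each level needs only the tidsets stored at the previous level. The data is assumed from the "Vertical Database" slide, and `extend` is a hypothetical helper name.

```python
MIN_SUP = 3

# Combine-set of the empty itemset: each entry is (item y, t({y})).
F1 = [
    ("A", {1, 3, 4, 5}),
    ("B", {1, 3, 4, 6}),
    ("C", {1, 2, 3, 4, 7}),
    ("D", {2, 4, 6}),
]


def extend(combine, index, min_sup):
    """Combine-set of I + x, where x = combine[index] and `combine` belongs to I.

    Each entry (y, tids) in `combine` stores tids = t(I + y).
    """
    _, tids_x = combine[index]
    extended = []
    for y, tids_y in combine[index + 1:]:
        tids_xy = tids_x & tids_y  # t(I + x + y), no database scan needed
        if len(tids_xy) >= min_sup:
            extended.append((y, tids_xy))
    return extended


combine_a = extend(F1, 0, MIN_SUP)          # entries stand for AB and AC
combine_ab = extend(combine_a, 0, MIN_SUP)  # entry stands for ABC
print([y for y, _ in combine_a])            # ['B', 'C']
print(combine_ab)                           # [('C', {1, 3, 4})]
```

D drops out of the combine-set of A because t(AD) = {4} is below min_sup.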

Additional Optimization
• Diffsets: do not store entire tidsets, only the differences between tidsets (described in “Fast Vertical Mining Using Diffsets”).
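The idea can be illustrated with a small sketch. It assumes the usual diffset definitions, d(PX) = t(P) - t(PX), together with the update rules d(PXY) = d(PY) - d(PX) and supp(PXY) = supp(PX) - |d(PXY)|. The data is taken from the "Vertical Database" slide, and the variable names are illustrative.

```python
# Assumed data: item tidsets over transactions 1..7.
ALL_TIDS = {1, 2, 3, 4, 5, 6, 7}
TIDSETS = {
    "A": {1, 3, 4, 5},
    "B": {1, 3, 4, 6},
    "C": {1, 2, 3, 4, 7},
    "D": {2, 4, 6},
}

# Level 1, empty prefix: d(X) = all tids - t(X), supp(X) = |all tids| - |d(X)|.
diffsets = {item: ALL_TIDS - tids for item, tids in TIDSETS.items()}
supports = {item: len(ALL_TIDS) - len(d) for item, d in diffsets.items()}

# Level 2, prefix A: use only diffsets, never the full tidsets.
d_AD = diffsets["D"] - diffsets["A"]   # d(AD) = d(D) - d(A)
supp_AD = supports["A"] - len(d_AD)    # supp(AD) = supp(A) - |d(AD)|

d_AB = diffsets["B"] - diffsets["A"]
supp_AB = supports["A"] - len(d_AB)

print(sorted(d_AD), supp_AD)  # [1, 3, 5] 1
print(sorted(d_AB), supp_AB)  # [5] 3
```

The diffset of AB holds a single tid where the tidset t(AB) would hold three, which is where the memory saving comes from when supports are high.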

Experimental Results
• GenMax is compared with MaxMiner and MAFIA-PP.
• MaxMiner and MAFIA-PP give the exact MFI, while MAFIA gives a superset of the MFI.
• The databases used in the experiments are grouped according to the MFI length distribution.

Type I Datasets

Type II Datasets

Type III Datasets

Type IV Datasets