COMP 5331: FP-Tree
Prepared by Raymond Wong
Presented by Raymond Wong (raywong@cse)
Large Itemset Mining
- Frequent itemset mining problem: find all "large" (or frequent) itemsets, i.e., itemsets whose support is at least a given threshold (in this example, support >= 3).

TID   Items Bought
100   a, b, c, d, e, f, g, h
200   a, f, g
300   b, d, e, f, j
400   a, b, d, i, k
500   a, b, e, g
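To make the notion of support concrete, here is a minimal sketch that counts the support of an itemset on the five example transactions. The names `transactions`, `support` and `is_large` are illustrative choices, not part of the slides.

```python
# Example database from the slide: TID -> set of items bought.
transactions = {
    100: {"a", "b", "c", "d", "e", "f", "g", "h"},
    200: {"a", "f", "g"},
    300: {"b", "d", "e", "f", "j"},
    400: {"a", "b", "d", "i", "k"},
    500: {"a", "b", "e", "g"},
}


def support(itemset, db):
    """Number of transactions that contain every item of `itemset`."""
    items = set(itemset)
    return sum(1 for bought in db.values() if items <= bought)


def is_large(itemset, db, threshold=3):
    """An itemset is 'large' (frequent) if its support reaches the threshold."""
    return support(itemset, db) >= threshold


print(support({"a", "g"}, transactions), is_large({"d", "e"}, transactions))
```

Every other step in the deck is a faster way of answering the same question for all itemsets at once.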
Apriori
- Large 2-itemset generation: L1 -> candidate generation (1. join step, 2. prune step) -> C2 -> "large" itemset generation (counting step) -> L2
- Large 3-itemset generation: L2 -> candidate generation -> C3 -> "large" itemset generation -> L3 -> ...
- Disadvantage 1: It is costly to handle a large number of candidate sets.
- Disadvantage 2: It is tedious to repeatedly scan the database and check the candidate patterns.
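The join and prune steps can be sketched as follows. This is a simplified variant that joins any two large (k-1)-itemsets whose union has size k, rather than the usual prefix-based join; after pruning it yields the same candidates. The function name `generate_candidates` is an assumption made for illustration.

```python
from itertools import combinations


def generate_candidates(large_prev, k):
    """Build the size-k candidate set C_k from the large (k-1)-itemsets."""
    prev = {frozenset(s) for s in large_prev}
    candidates = set()
    for x in prev:
        for y in prev:
            union = x | y
            if len(union) != k:
                continue  # join step: keep only unions of exactly k items
            # Prune step: every (k-1)-subset of a candidate must itself be large.
            if all(frozenset(sub) in prev for sub in combinations(union, k - 1)):
                candidates.add(union)
    return candidates


# Large 1-itemsets and large 2-itemsets of the running example (threshold = 3).
L1 = [{"a"}, {"b"}, {"d"}, {"e"}, {"f"}, {"g"}]
L2 = [{"a", "b"}, {"a", "g"}, {"b", "d"}, {"b", "e"}]
C2 = generate_candidates(L1, 2)
C3 = generate_candidates(L2, 3)
print(len(C2), len(C3))
```

Each candidate in C2 still has to be counted against the database, which is the repeated scanning named as Disadvantage 2.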
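FP-tree
- Store all essential information from the database in a data structure called the FP-tree (Frequent Pattern Tree), so that the database does not have to be rescanned for each set of candidates. Building the tree takes two scans in total (see Complexity).
- The FP-tree is concise and is used directly to generate the large itemsets.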
FP-tree
- Step 1: Deduce the ordered frequent items. Items with the same frequency are ordered alphabetically.
- Step 2: Construct the FP-tree from the ordered frequent items.
- Step 3: From the FP-tree, construct the conditional FP-tree for each item (or itemset).
- Step 4: Determine the frequent patterns.
FP-tree, Step 1: item frequencies and ordered frequent items (threshold = 3)

Item   Frequency
a      4
b      4
c      1
d      3
e      3
f      3
g      3
h      1
i      1
j      1
k      1

The frequent items, in descending order of frequency with ties broken alphabetically, are a, b, d, e, f, g.

TID   Items Bought              (Ordered) Frequent Items
100   a, b, c, d, e, f, g, h    a, b, d, e, f, g
200   a, f, g                   a, f, g
300   b, d, e, f, j             b, d, e, f
400   a, b, d, i, k             a, b, d
500   a, b, e, g                a, b, e, g
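Step 1 can be sketched in a few lines: count each item, keep those that reach the threshold, and rewrite every transaction in the resulting global order. The function name `ordered_frequent_items` and its return shape are assumptions made for this illustration.

```python
from collections import Counter

transactions = {
    100: ["a", "b", "c", "d", "e", "f", "g", "h"],
    200: ["a", "f", "g"],
    300: ["b", "d", "e", "f", "j"],
    400: ["a", "b", "d", "i", "k"],
    500: ["a", "b", "e", "g"],
}


def ordered_frequent_items(db, threshold):
    """Return (global item order, {TID: ordered frequent items})."""
    # First scan: count in how many transactions each item occurs.
    counts = Counter(item for items in db.values() for item in set(items))
    frequent = {item: c for item, c in counts.items() if c >= threshold}
    # Descending frequency; ties are broken alphabetically.
    order = sorted(frequent, key=lambda item: (-frequent[item], item))
    position = {item: pos for pos, item in enumerate(order)}
    ordered = {
        tid: sorted((i for i in set(items) if i in position), key=position.get)
        for tid, items in db.items()
    }
    return order, ordered


order, ordered = ordered_frequent_items(transactions, threshold=3)
print(order)
print(ordered[300])
```

Using one global order for every transaction is what lets transactions with common items share a prefix in the tree.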
FP-tree: Step 2
Construct the FP-tree from the ordered frequent items.
FP-tree, Step 2: inserting the ordered frequent items one transaction at a time (threshold = 3)

- After TID 100 (a, b, d, e, f, g): root -> a:1 -> b:1 -> d:1 -> e:1 -> f:1 -> g:1
- After TID 200 (a, f, g): a becomes a:2, with a new branch a -> f:1 -> g:1
- After TID 300 (b, d, e, f): a new branch from the root, root -> b:1 -> d:1 -> e:1 -> f:1
- After TID 400 (a, b, d): the shared prefix becomes a:3 -> b:2 -> d:2
- After TID 500 (a, b, e, g): the shared prefix becomes a:4 -> b:3, with a new branch b -> e:1 -> g:1

Final FP-tree (each header-table entry a, b, d, e, f, g is the head of a node-link that chains all nodes of that item):

root
  a:4
    b:3
      d:2
        e:1
          f:1
            g:1
      e:1
        g:1
    f:1
      g:1
  b:1
    d:1
      e:1
        f:1
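The insertion procedure of Step 2 can be sketched as below. The `Node` class, the `build_fp_tree` function and the use of a plain list per item for the node-links are illustrative assumptions; the slides do not prescribe a particular representation.

```python
class Node:
    """One FP-tree node: an item, its count, its parent and its children."""

    def __init__(self, item, parent=None):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}  # item -> child Node


def build_fp_tree(ordered_transactions):
    """Insert each ordered transaction along a path that starts at the root."""
    root = Node(None)
    header = {}  # item -> list of nodes carrying that item (the node-links)
    for items in ordered_transactions:
        node = root
        for item in items:
            child = node.children.get(item)
            if child is None:  # no shared prefix any more: start a new branch
                child = Node(item, node)
                node.children[item] = child
                header.setdefault(item, []).append(child)
            child.count += 1
            node = child
    return root, header


# Ordered frequent items of TID 100 to 500 from Step 1.
ordered = [list("abdefg"), list("afg"), list("bdef"), list("abd"), list("abeg")]
root, header = build_fp_tree(ordered)
a = root.children["a"]
print(a.count, a.children["b"].count, [n.count for n in header["d"]])
```

Inserting one transaction touches one node per frequent item, which is the insertion cost stated in the Complexity slide.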
FP-tree: Step 3
From the FP-tree, construct the conditional FP-tree for each item (or itemset).
Cond. FP-tree on "g" (support of g = 3, threshold = 3)
- Conditional pattern base of "g": { (a:1, b:1, d:1, e:1, f:1, g:1), (a:1, b:1, e:1, g:1), (a:1, f:1, g:1) }
- Item frequencies in this base: a 3, b 2, d 1, e 2, f 2, g 3
- After removing the infrequent items b, d, e, f: { (a:1, g:1), (a:1, g:1), (a:1, g:1) }
- Conditional FP-tree: root -> a:3 (header table: a)
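Collecting a conditional pattern base amounts to following the node-links of an item and walking from each node up to the root. The sketch below rebuilds the example tree and does exactly that. Unlike the slides, it records only the prefix path and leaves the item itself (e.g. g:1) out of each entry; the names `Node`, `build_fp_tree`, `conditional_pattern_base` and `conditional_items` are assumptions made for illustration.

```python
from collections import Counter


class Node:
    def __init__(self, item, parent=None):
        self.item, self.parent = item, parent
        self.count, self.children = 0, {}


def build_fp_tree(ordered_transactions):
    root, header = Node(None), {}
    for items in ordered_transactions:
        node = root
        for item in items:
            if item not in node.children:
                node.children[item] = Node(item, node)
                header.setdefault(item, []).append(node.children[item])
            node = node.children[item]
            node.count += 1
    return root, header


def conditional_pattern_base(header, item):
    """For each node of `item`, return (prefix path from the root, node count)."""
    base = []
    for node in header.get(item, []):
        prefix, parent = [], node.parent
        while parent.item is not None:  # stop at the root
            prefix.append(parent.item)
            parent = parent.parent
        base.append((prefix[::-1], node.count))
    return base


def conditional_items(base, threshold):
    """Items that remain frequent inside the conditional pattern base."""
    counts = Counter()
    for prefix, count in base:
        for it in prefix:
            counts[it] += count
    return {it: c for it, c in counts.items() if c >= threshold}


ordered = [list("abdefg"), list("afg"), list("bdef"), list("abd"), list("abeg")]
root, header = build_fp_tree(ordered)
base_g = conditional_pattern_base(header, "g")
print(base_g, conditional_items(base_g, 3))
```

The surviving items and their counts are exactly the nodes of the conditional FP-tree when, as in this example, they lie on a single path.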
Cond. FP-tree on "f" (support of f = 3, threshold = 3)
- Conditional pattern base of "f": { (a:1, b:1, d:1, e:1, f:1), (a:1, f:1), (b:1, d:1, e:1, f:1) }
- Item frequencies in this base: a 2, b 2, d 2, e 2, f 3, g 0
- After removing the infrequent items a, b, d, e: { (f:1), (f:1), (f:1) }
- Conditional FP-tree: root only (no item other than f is frequent)
Cond. FP-tree on "e" (support of e = 3, threshold = 3)
- Conditional pattern base of "e": { (a:1, b:1, d:1, e:1), (a:1, b:1, e:1), (b:1, d:1, e:1) }
- Item frequencies in this base: a 2, b 3, d 2, e 3, f 0, g 0
- After removing the infrequent items a, d: { (b:1, e:1), (b:1, e:1), (b:1, e:1) }
- Conditional FP-tree: root -> b:3 (header table: b)
Cond. FP-tree on "d" (support of d = 3, threshold = 3)
- Conditional pattern base of "d": { (a:2, b:2, d:2), (b:1, d:1) }
- Item frequencies in this base: a 2, b 3, d 3, e 0, f 0, g 0
- After removing the infrequent item a: { (b:2, d:2), (b:1, d:1) }
- Conditional FP-tree: root -> b:3 (header table: b)
Cond. FP-tree on "b" (support of b = 4, threshold = 3)
- Conditional pattern base of "b": { (a:3, b:3), (b:1) }
- Item frequencies in this base: a 3, b 4, d 0, e 0, f 0, g 0
- No item is removed: { (a:3, b:3), (b:1) }
- Conditional FP-tree: root -> a:3 (header table: a)
Cond. FP-tree on "a" (support of a = 4, threshold = 3)
- Conditional pattern base of "a": { (a:4) }
- Item frequencies in this base: a 4, b 0, d 0, e 0, f 0, g 0
- Conditional FP-tree: root only (a has no prefix items)
FP-tree: Step 4
Determine the frequent patterns.
FP-tree, Step 4: frequent patterns generated from each conditional FP-tree
For each item: (1) before generating its cond. tree, generate the item itself; (2) after generating its cond. tree, generate the itemsets the tree yields.

Cond. FP-tree on   Tree          1. Before generating the tree   2. After generating the tree
"g" (3)            root -> a:3   {g} (support = 3)               {a, g} (support = 3)
"f" (3)            root          {f} (support = 3)               no itemset
"e" (3)            root -> b:3   {e} (support = 3)               {b, e} (support = 3)
"d" (3)            root -> b:3   {d} (support = 3)               {b, d} (support = 3)
"b" (4)            root -> a:3   {b} (support = 4)               {a, b} (support = 3)
"a" (4)            root          {a} (support = 4)               no itemset
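The ten itemsets can be cross-checked with a brute-force count over the original transactions. This is a verification sketch only, not the FP-tree method: it enumerates every combination of items, which is feasible here because the example has just 11 distinct items. The function name `frequent_itemsets_bruteforce` is hypothetical.

```python
from itertools import combinations

transactions = [
    set("abcdefgh"),  # TID 100
    set("afg"),       # TID 200
    set("bdefj"),     # TID 300
    set("abdik"),     # TID 400
    set("abeg"),      # TID 500
]


def frequent_itemsets_bruteforce(db, threshold):
    """Count the support of every possible itemset and keep the large ones."""
    items = sorted(set().union(*db))
    result = {}
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            count = sum(1 for t in db if set(combo) <= t)
            if count >= threshold:
                result[frozenset(combo)] = count
    return result


found = frequent_itemsets_bruteforce(transactions, threshold=3)
for itemset, count in sorted(found.items(), key=lambda kv: sorted(kv[0])):
    print(sorted(itemset), count)
```

The result should agree with the patterns listed per conditional FP-tree: six single items and the four pairs {a, b}, {a, g}, {b, d}, {b, e}.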
Complexity
- Complexity in building the FP-tree
  - Two scans of the transaction DB
    - Collect the frequent items
    - Construct the FP-tree
  - Cost to insert one transaction: the number of frequent items in this transaction

Size of the FP-tree
- The size of the FP-tree is bounded by the overall number of occurrences of the frequent items in the database.

Height of the Tree
- The height of the tree is bounded by the maximum number of frequent items in any transaction in the database.
Compression
- With respect to the total number of items stored, is the FP-tree more compressed than the original database?
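The size and height bounds, and the compression question, can be checked on the running example. The sketch below keeps only the shape of the tree (nested dictionaries, no counts), which is enough to count nodes and measure height; `build_trie`, `node_count` and `height` are names assumed for this illustration.

```python
def build_trie(ordered_transactions):
    """Shape of the FP-tree only: nested dicts, one key per child item."""
    root = {}
    for items in ordered_transactions:
        node = root
        for item in items:
            node = node.setdefault(item, {})
    return root


def node_count(node):
    """Number of item nodes below `node` (the root itself is not counted)."""
    return sum(1 + node_count(child) for child in node.values())


def height(node):
    """Length of the longest path from `node` down to a leaf."""
    return max((1 + height(child) for child in node.values()), default=0)


transactions = [list("abcdefgh"), list("afg"), list("bdefj"), list("abdik"), list("abeg")]
ordered = [list("abdefg"), list("afg"), list("bdef"), list("abd"), list("abeg")]

tree = build_trie(ordered)
items_in_db = sum(len(t) for t in transactions)       # items stored in the database
frequent_occurrences = sum(len(t) for t in ordered)   # bound on the tree size
longest = max(len(t) for t in ordered)                # bound on the tree height
print(node_count(tree), frequent_occurrences, items_in_db, height(tree), longest)
```

On this example the tree has 14 item nodes against 20 occurrences of frequent items and 25 items bought in total, so it stores fewer items than the database. Each node also carries a count and links, so whether it takes less memory depends on the representation, and a database whose transactions share few prefixes compresses much less.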
Details of the Algorithm
Procedure FP-growth(Tree, α)
  if Tree contains a single path P then
    for each combination (denoted by β) of the nodes in the path P do
      generate pattern β ∪ α with support = minimum support of nodes in β
  else
    for each a_i in the header table of Tree do
      generate pattern β = a_i ∪ α with support = a_i.support
      construct β's conditional pattern base and then β's conditional FP-tree Tree_β
      if Tree_β ≠ ∅ then
        call FP-growth(Tree_β, β)
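A compact recursive sketch of the procedure is given below. It is a simplification under stated assumptions: it always takes the general (else) branch and omits the single-path shortcut, which changes the running time but not the result, and it rebuilds each conditional tree from a list of (items, count) pairs. The names `Node`, `build_tree` and `fp_growth` are illustrative.

```python
from collections import Counter


class Node:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count, self.children = 0, {}


def build_tree(weighted_paths, threshold):
    """Build an FP-tree from (items, count) pairs; return (root, header)."""
    counts = Counter()
    for items, count in weighted_paths:
        for item in items:
            counts[item] += count
    frequent = {i: c for i, c in counts.items() if c >= threshold}
    root, header = Node(None, None), {}
    for items, count in weighted_paths:
        # Keep frequent items only, in descending frequency (ties alphabetical).
        path = sorted((i for i in items if i in frequent),
                      key=lambda i: (-frequent[i], i))
        node = root
        for item in path:
            if item not in node.children:
                node.children[item] = Node(item, node)
                header.setdefault(item, []).append(node.children[item])
            node = node.children[item]
            node.count += count
    return root, header


def fp_growth(weighted_paths, threshold, alpha=frozenset()):
    """Return {itemset: support} for every frequent itemset that extends alpha."""
    _, header = build_tree(weighted_paths, threshold)
    patterns = {}
    for item, nodes in header.items():
        beta = alpha | {item}
        patterns[beta] = sum(n.count for n in nodes)  # support of beta
        base = []  # conditional pattern base of beta
        for n in nodes:
            prefix, parent = [], n.parent
            while parent.item is not None:
                prefix.append(parent.item)
                parent = parent.parent
            if prefix:
                base.append((prefix, n.count))
        if base:  # recurse only when there is something to condition on
            patterns.update(fp_growth(base, threshold, beta))
    return patterns


db = [("abcdefgh", 1), ("afg", 1), ("bdefj", 1), ("abdik", 1), ("abeg", 1)]
result = fp_growth(db, threshold=3)
print(sorted("".join(sorted(s)) for s in result))
```

Called on the example database with threshold 3, the sketch returns the same itemsets as the per-item walkthrough of Steps 3 and 4.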