
What Is Frequent Pattern Analysis?
- Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a data set
- Motivation: finding inherent regularities in data
  - What products were often purchased together?
  - What are the subsequent purchases after buying a PC?
  - What kinds of DNA are sensitive to this new drug?
  - Can we automatically classify web documents?
- Applications: basket data analysis, cross-marketing, catalog design, sale campaign analysis, Web log (click-stream) analysis, and DNA sequence analysis

Frequent Itemsets
- A set of items is called an itemset; an itemset with k items is called a k-itemset
- The occurrence frequency (support count) of an itemset is the number of transactions that contain the itemset
- If the support of an itemset satisfies a minimum support threshold, it is called a frequent itemset
- confidence(A ⇒ B) = P(B|A) = support(A ∪ B) / support(A)
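To make these definitions concrete, here is a minimal Python sketch (function names and layout are my own, not from the slides) that computes support and confidence over the five-transaction example on the next slide:

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs, transactions):
    """confidence(A => B) = P(B|A) = support(A u B) / support(A)."""
    return support(set(lhs) | set(rhs), transactions) / support(lhs, transactions)

transactions = [{"A", "B", "D"}, {"A", "C", "D"}, {"A", "D", "E"},
                {"B", "E", "F"}, {"B", "C", "D", "E", "F"}]
print(support({"A", "D"}, transactions))       # 0.6
print(confidence({"A"}, {"D"}, transactions))  # 1.0
print(confidence({"D"}, {"A"}, transactions))  # 0.75
```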

Basic Concepts: Frequent Patterns and Association Rules

  Transaction-id   Items bought
  10               A, B, D
  20               A, C, D
  30               A, D, E
  40               B, E, F
  50               B, C, D, E, F

- Itemset X = {x1, …, xk}
- Find all the rules X ⇒ Y with minimum support and confidence
  - support, s: probability that a transaction contains X ∪ Y
  - confidence, c: conditional probability that a transaction having X also contains Y
- Let sup_min = 50%, conf_min = 50%
- Frequent patterns: {A:3, B:3, D:4, E:3, AD:3}
- Association rules: A ⇒ D (60%, 100%), D ⇒ A (60%, 75%)

Two-Step Process of Association Mining
- Find all frequent itemsets: those occurring at least as frequently as the minimum support
- Generate strong association rules from the frequent itemsets: rules that satisfy both minimum support and minimum confidence

Closed Patterns and Max-Patterns
- A long pattern contains a combinatorial number of sub-patterns; e.g., {a1, …, a100} contains (100 choose 1) + (100 choose 2) + … + (100 choose 100) = 2^100 − 1 ≈ 1.27 × 10^30 sub-patterns!
- Solution: mine closed patterns and max-patterns instead
- An itemset X is closed if X is frequent and there exists no super-pattern Y ⊃ X with the same support as X
- An itemset X is a max-pattern if X is frequent and there exists no frequent super-pattern Y ⊃ X
- Closed patterns are a lossless compression of frequent patterns: they reduce the number of patterns and rules
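The two definitions translate directly into a check over a mined pattern set. A small illustrative sketch (names and the toy data are my own):

```python
def closed_and_max(freq):
    """freq: {frozenset(itemset): support count} for all frequent itemsets.
    Closed: no frequent superset has the same support.
    Maximal: no frequent superset exists at all."""
    closed, maximal = [], []
    for X, sup in freq.items():
        supersets = [Y for Y in freq if X < Y]
        if not any(freq[Y] == sup for Y in supersets):
            closed.append(X)
        if not supersets:
            maximal.append(X)
    return closed, maximal

freq = {frozenset("a"): 4, frozenset("ab"): 4, frozenset("abc"): 3}
closed, maximal = closed_and_max(freq)
# closed  -> [{a, b}, {a, b, c}]   ({a} is absorbed by {a, b}: same support)
# maximal -> [{a, b, c}]
```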

Scalable Methods for Mining Frequent Patterns
- The downward closure property of frequent patterns
  - Any subset of a frequent itemset must be frequent
  - If {beer, diaper, nuts} is frequent, so is {beer, diaper}
  - i.e., every transaction having {beer, diaper, nuts} also contains {beer, diaper}
- Scalable mining methods: three major approaches
  - Apriori (Agrawal & Srikant @VLDB'94)
  - Frequent-pattern growth (FP-growth: Han, Pei & Yin @SIGMOD'00)
  - Vertical data format approach (CHARM: Zaki & Hsiao @SDM'02)

Frequent Pattern Mining
Frequent pattern mining can be classified in various ways:
- Based on the completeness of patterns to be mined
- Based on the levels of abstraction
- Based on the number of data dimensions
- Based on the types of values handled
- Based on the kinds of rules to be mined
- Based on the kinds of patterns to be mined

Apriori: A Candidate Generation-and-Test Approach
- Apriori pruning principle: if there is any itemset which is infrequent, its supersets should not be generated/tested! (Agrawal & Srikant @VLDB'94; Mannila et al. @KDD'94)
- Method:
  - Initially, scan the DB once to get the frequent 1-itemsets
  - Generate length-(k+1) candidate itemsets from length-k frequent itemsets
  - Test the candidates against the DB
  - Terminate when no frequent or candidate set can be generated

The Apriori Algorithm: An Example (sup_min = 2)

  Database TDB:
    Tid   Items
    10    A, C, D
    20    B, C, E
    30    A, B, C, E
    40    B, E

  C1 (after 1st scan):  {A}:2  {B}:3  {C}:3  {D}:1  {E}:3
  L1:                   {A}:2  {B}:3  {C}:3  {E}:3
  C2 (join of L1):      {A,B}  {A,C}  {A,E}  {B,C}  {B,E}  {C,E}
  C2 (after 2nd scan):  {A,B}:1  {A,C}:2  {A,E}:1  {B,C}:2  {B,E}:3  {C,E}:2
  L2:                   {A,C}:2  {B,C}:2  {B,E}:3  {C,E}:2
  C3:                   {B,C,E}
  L3 (after 3rd scan):  {B,C,E}:2

The Apriori Algorithm
- Pseudo-code (Ck: candidate itemsets of size k; Lk: frequent itemsets of size k):

    L1 = {frequent items};
    for (k = 1; Lk != ∅; k++) do begin
        Ck+1 = candidates generated from Lk;
        for each transaction t in database do
            increment the count of all candidates in Ck+1 that are contained in t;
        Lk+1 = candidates in Ck+1 with min_support;
    end
    return ∪k Lk;
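A compact Python rendering of this pseudo-code, offered only as a sketch (identifiers are my own; supports are absolute counts):

```python
from collections import defaultdict
from itertools import combinations

def generate_candidates(Lk, k):
    """Self-join Lk, then prune candidates with an infrequent k-subset
    (the Apriori property). The union-based join over-generates slightly;
    the pruning step removes the extras."""
    joined = {a | b for a in Lk for b in Lk if len(a | b) == k + 1}
    return {c for c in joined
            if all(frozenset(s) in Lk for s in combinations(c, k))}

def apriori(transactions, min_sup):
    """Return {itemset: support count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]
    counts = defaultdict(int)
    for t in transactions:                      # 1st scan: count single items
        for item in t:
            counts[frozenset([item])] += 1
    Lk = {s: c for s, c in counts.items() if c >= min_sup}
    frequent, k = dict(Lk), 1
    while Lk:
        Ck = generate_candidates(set(Lk), k)    # length-(k+1) candidates
        counts = defaultdict(int)
        for t in transactions:                  # one DB scan per level
            for c in Ck:
                if c <= t:
                    counts[c] += 1
        Lk = {s: c for s, c in counts.items() if c >= min_sup}
        frequent.update(Lk)
        k += 1
    return frequent

# On the example database above: apriori([{"A","C","D"}, {"B","C","E"},
# {"A","B","C","E"}, {"B","E"}], 2) yields {B,C,E}: 2, {B,E}: 3, etc.
```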

Important Details of Apriori
- How to generate candidates?
  - Step 1: self-joining Lk
  - Step 2: pruning
- How to count supports of candidates?
- Example of candidate generation:
  - L3 = {abc, abd, acd, ace, bcd}
  - Self-joining L3 * L3: abcd from abc and abd; acde from acd and ace
  - Pruning: acde is removed because ade is not in L3
  - C4 = {abcd}
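Running the candidate-generation helper from the sketch above on this example reproduces the result:

```python
L3 = {frozenset(s) for s in ("abc", "abd", "acd", "ace", "bcd")}
C4 = generate_candidates(L3, 3)    # helper from the Apriori sketch above
print(sorted("".join(sorted(c)) for c in C4))   # ['abcd']
# acde is pruned because ade is not in L3; the looser union-based join
# also proposes abce, which pruning removes since bce is not in L3.
```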

How to Count Supports of Candidates?
- Why is counting supports of candidates a problem?
  - The total number of candidates can be very huge
  - One transaction may contain many candidates
- Method:
  - Candidate itemsets are stored in a hash tree
  - A leaf node of the hash tree contains a list of itemsets and counts
  - An interior node contains a hash table
  - Subset function: finds all the candidates contained in a transaction

  TID    List of item_IDs
  T100   I1, I2, I5
  T200   I2, I4
  T300   I2, I3
  T400   I1, I2, I4
  T500   I1, I3
  T600   I2, I3
  T700   I1, I3
  T800   I1, I2, I3, I5
  T900   I1, I2, I3

Generating Association Rules from Frequent Itemsets
- Strong association rules satisfy both minimum support and minimum confidence
- For each frequent itemset l, generate all nonempty subsets of l
- For every nonempty subset s of l, output the rule "s ⇒ (l − s)" if support_count(l) / support_count(s) >= min_conf, where min_conf is the minimum confidence threshold
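A sketch of this rule-generation loop in Python (names are my own; `freq` is the {itemset: support count} map produced by Apriori, which by downward closure always contains every subset of a frequent itemset):

```python
from itertools import chain, combinations

def rules(freq, min_conf):
    """Yield strong rules s => (l - s) from frequent itemsets."""
    for l, sup_l in freq.items():
        if len(l) < 2:
            continue
        subsets = chain.from_iterable(
            combinations(l, r) for r in range(1, len(l)))  # nonempty proper subsets
        for s in map(frozenset, subsets):
            conf = sup_l / freq[s]     # support_count(l) / support_count(s)
            if conf >= min_conf:
                yield s, l - s, conf
```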

Generating Association Rules: An Exercise
- Considering the frequent itemset l = {I1, I2, I5} from the transaction table above and a minimum confidence of 70%, find the strong association rules
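A worked check of this exercise, reusing the `apriori` and `rules` sketches above on the nine-transaction table:

```python
db = [{"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"}, {"I1", "I2", "I4"},
      {"I1", "I3"}, {"I2", "I3"}, {"I1", "I3"}, {"I1", "I2", "I3", "I5"},
      {"I1", "I2", "I3"}]
l = frozenset({"I1", "I2", "I5"})          # support count 2 (T100, T800)
for lhs, rhs, conf in rules(apriori(db, 2), 0.70):
    if lhs | rhs == l:
        print(sorted(lhs), "=>", sorted(rhs), f"{conf:.0%}")
# Strong rules: {I1,I5} => {I2} 100%, {I2,I5} => {I1} 100%, {I5} => {I1,I2} 100%
# Rejected: I1^I2 => I5 (50%), I1 => I2^I5 (33%), I2 => I1^I5 (29%)
```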

Improving the Efficiency of Apriori
- Hash-based technique: while generating the candidate 1-itemsets, we can also generate all of the 2-itemsets for each transaction, hash them into the buckets of a hash table structure, and increase the corresponding bucket counts
- Transaction reduction: a transaction that does not contain any frequent k-itemsets cannot contain any frequent (k+1)-itemsets, so it can be removed from further scans
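A minimal sketch of the hash-based bucket counting (the bucket count and hashing choices are illustrative assumptions):

```python
from itertools import combinations

def bucket_counts(transactions, n_buckets=7):
    """During the 1-itemset scan, hash every 2-itemset of each transaction
    into a bucket and count hits. A 2-itemset falling in a bucket whose
    total is below min_sup cannot be frequent, so it can be dropped from C2."""
    buckets = [0] * n_buckets
    for t in transactions:
        for pair in combinations(sorted(t), 2):
            buckets[hash(pair) % n_buckets] += 1
    return buckets
```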

Improving the Efficiency of Apriori
- Partitioning:
  - Requires only two database scans
  - Consists of two phases:
    - Phase I: divide the transactions into n non-overlapping partitions (local min sup = min_sup × number of transactions in the partition); local frequent itemsets are found in each partition
    - Any itemset frequent in D must be frequent in at least one partition
    - Phase II: scan D to determine the actual support of each candidate and obtain the global frequent itemsets

Improving the Efficiency of Apriori
- Sampling:
  - Pick a random sample S of D and search for frequent itemsets in S
  - To lessen the possibility of missing some globally frequent itemsets, lower the minimum support
  - The rest of the database is then scanned to find the actual frequencies of each itemset
  - If the itemsets frequent in the sample contain all the frequent itemsets in D, then only one scan of D is required
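A sketch of the sampling idea under stated assumptions (the 10% sample size and the 20% threshold reduction are arbitrary illustration, not prescribed by the slides):

```python
import random

def sample_frequent(transactions, min_sup_frac, sample_frac=0.10):
    """Mine a random sample at a lowered threshold; the resulting
    candidates must then be verified against the full database
    (e.g., reusing the `apriori` sketch above)."""
    n = max(1, int(sample_frac * len(transactions)))
    S = random.sample(transactions, n)
    lowered = max(1, int(0.8 * min_sup_frac * len(S)))  # lowered min support
    return apriori(S, lowered)
```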

Dynamic Itemset Counting
- In this technique, candidate itemsets are added at different points during a scan
- The database is partitioned into blocks marked by start points
- New candidate itemsets can be added at any start point
- The algorithm requires fewer database scans than Apriori

Challenges of Frequent Pattern Mining
- Challenges:
  - Multiple scans of the transaction database
  - Huge number of candidates
  - Tedious workload of support counting for candidates
- Improving Apriori: general ideas
  - Reduce passes of transaction database scans
  - Shrink the number of candidates
  - Facilitate support counting of candidates

Partition: Scan Database Only Twice
- Any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB
  - Scan 1: partition the database and find local frequent patterns
  - Scan 2: consolidate the global frequent patterns
- A. Savasere, E. Omiecinski, and S. Navathe. An efficient algorithm for mining association rules in large databases. In VLDB'95

DHP: Reduce the Number of Candidates
- A k-itemset whose corresponding hashing bucket count is below the threshold cannot be frequent
  - Candidates: a, b, c, d, e
  - Hash entries: {ab, ad, ae}, {bd, be, de}, …
  - Frequent 1-itemsets: a, b, d, e
  - ab is not a candidate 2-itemset if the sum of the counts of {ab, ad, ae} is below the support threshold
- J. Park, M. Chen, and P. Yu. An effective hash-based algorithm for mining association rules. In SIGMOD'95

Sampling for Frequent Patterns
- Select a sample of the original database and mine frequent patterns within the sample using Apriori
- Scan the database once to verify the frequent itemsets found in the sample; only the borders of the closure of frequent patterns are checked
  - Example: check abcd instead of ab, ac, …, etc.
- Scan the database again to find missed frequent patterns
- H. Toivonen. Sampling large databases for association rules. In VLDB'96

DIC: Reduce Number of Scans
- Once both A and D are determined frequent, the counting of AD begins
- Once all length-2 subsets of BCD are determined frequent, the counting of BCD begins
- (Figure: the itemset lattice from {} up to ABCD, with transactions streaming past; Apriori counts 1-itemsets, then 2-itemsets, …, while DIC starts counting longer itemsets mid-scan)
- S. Brin, R. Motwani, J. Ullman, and S. Tsur. Dynamic itemset counting and implication rules for market basket data. In SIGMOD'97

Bottleneck of Frequent-Pattern Mining
- Multiple database scans are costly
- Mining long patterns needs many passes of scanning and generates lots of candidates
  - To find the frequent itemset i1 i2 … i100:
    - Number of scans: 100
    - Number of candidates: (100 choose 1) + (100 choose 2) + … + (100 choose 100) = 2^100 − 1 ≈ 1.27 × 10^30!
- Bottleneck: candidate generation-and-test
- Can we avoid candidate generation?

Mining Frequent Patterns Without Candidate Generation
- Grow long patterns from short ones using local frequent items
  - "abc" is a frequent pattern
  - Get all transactions having "abc": DB|abc
  - "d" is a local frequent item in DB|abc ⇒ abcd is a frequent pattern

Construct FP-tree from a Transaction Database (min_support = 3)

  TID   Items bought                 (ordered) frequent items
  100   f, a, c, d, g, i, m, p       f, c, a, m, p
  200   a, b, c, f, l, m, o          f, c, a, b, m
  300   b, f, h, j, o, w             f, b
  400   b, c, k, s, p                c, b, p
  500   a, f, c, e, l, p, m, n       f, c, a, m, p

1. Scan the DB once and find the frequent 1-itemsets (single-item patterns)
2. Sort the frequent items in frequency-descending order: the f-list = f-c-a-b-m-p
3. Scan the DB again and construct the FP-tree

Header table: f:4, c:4, a:3, b:3, m:3, p:3
(Figure: FP-tree rooted at {}, with path f:4 → c:3 → a:3 → m:2 → p:2, a side branch b:1 → m:1 under a:3, a branch b:1 under f:4, and a separate path c:1 → b:1 → p:1)
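A sketch of this two-scan construction in Python (class and variable names are my own; the header table keeps the node-links):

```python
from collections import defaultdict

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}

def build_fptree(transactions, min_sup):
    counts = defaultdict(int)
    for t in transactions:                       # scan 1: item frequencies
        for item in t:
            counts[item] += 1
    flist = [i for i, c in sorted(counts.items(), key=lambda kv: -kv[1])
             if c >= min_sup]                    # f-list, descending frequency
    rank = {item: r for r, item in enumerate(flist)}
    root, header = FPNode(None, None), defaultdict(list)
    for t in transactions:                       # scan 2: insert ordered items
        node = root
        for item in sorted((i for i in t if i in rank), key=rank.get):
            if item not in node.children:
                node.children[item] = FPNode(item, node)
                header[item].append(node.children[item])   # node-link
            node = node.children[item]
            node.count += 1
    return root, header, flist
```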

Benefits of the FP-tree Structure
- Completeness
  - Preserves complete information for frequent pattern mining
  - Never breaks a long pattern of any transaction
- Compactness
  - Reduces irrelevant info: infrequent items are gone
  - Items in frequency-descending order: the more frequently occurring, the more likely to be shared
  - Never larger than the original database (not counting node-links and the count fields)
  - For the Connect-4 DB, the compression ratio could be over 100

Partition Patterns and Databases
- Frequent patterns can be partitioned into subsets according to the f-list
  - F-list = f-c-a-b-m-p
  - Patterns containing p
  - Patterns having m but no p
  - …
  - Patterns having c but no a, b, m, or p
  - Pattern f
- This partitioning is complete and non-redundant

Find Patterns Having p From p's Conditional Database
- Start at the frequent-item header table of the FP-tree
- Traverse the FP-tree by following the node-links of each frequent item p
- Accumulate all of the transformed prefix paths of item p to form p's conditional pattern base

  Conditional pattern bases:
    item   conditional pattern base
    c      f:3
    a      fc:3
    b      fca:1, f:1, c:1
    m      fca:2, fcab:1
    p      fcam:2, cb:1

From Conditional Pattern Bases to Conditional FP-trees
- For each pattern base:
  - Accumulate the count for each item in the base
  - Construct the FP-tree for the frequent items of the pattern base
- m's conditional pattern base: fca:2, fcab:1
- m's conditional FP-tree: the single path {} → f:3 → c:3 → a:3 (b is infrequent and dropped)
- All frequent patterns relating to m: m, fm, cm, am, fcm, fam, cam, fcam

Recursion: Mining Each Conditional FP-tree
- Conditional pattern base of "am": (fc:3); am-conditional FP-tree: {} → f:3 → c:3
- Conditional pattern base of "cm": (f:3); cm-conditional FP-tree: {} → f:3
- Conditional pattern base of "cam": (f:3); cam-conditional FP-tree: {} → f:3

A Special Case: Single Prefix Path in FP-tree
- Suppose a (conditional) FP-tree T has a shared single prefix path P
- Mining can be decomposed into two parts:
  - Reduction of the single prefix path into one node
  - Concatenation of the mining results of the two parts
- (Figure: a tree whose prefix path a1:n1 → a2:n2 → a3:n3 is split off from the branching part containing b1:m1, c1:k1, c2:k2, c3:k3)

Mining Frequent Patterns With FP-trees
- Idea: frequent pattern growth
  - Recursively grow frequent patterns by pattern and database partition
- Method (see the sketch below):
  - For each frequent item, construct its conditional pattern base, and then its conditional FP-tree
  - Repeat the process on each newly created conditional FP-tree
  - Until the resulting FP-tree is empty, or it contains only one path; a single path will generate all the combinations of its sub-paths, each of which is a frequent pattern
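A recursive sketch of this method, building on `build_fptree` above. For simplicity it rebuilds a tree from each conditional pattern base rather than projecting in place, so it illustrates the idea rather than the optimized algorithm:

```python
def fpgrowth(transactions, min_sup, suffix=()):
    """Return {pattern: support count} mined by FP-growth."""
    _, header, flist = build_fptree(transactions, min_sup)
    patterns = {}
    for item in reversed(flist):                 # least frequent first
        sup = sum(node.count for node in header[item])
        pattern = (item,) + suffix
        patterns[frozenset(pattern)] = sup
        cond_db = []                             # conditional pattern base
        for node in header[item]:
            path, p = [], node.parent
            while p is not None and p.item is not None:
                path.append(p.item)
                p = p.parent
            cond_db.extend([path] * node.count)  # prefix path, weighted by count
        if cond_db:
            patterns.update(fpgrowth(cond_db, min_sup, pattern))
    return patterns
```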

Scaling FP-growth by DB Projection
- What if the FP-tree cannot fit in memory? Use DB projection
- First partition the database into a set of projected DBs
- Then construct and mine an FP-tree for each projected DB
- Parallel projection vs. partition projection techniques
  - Parallel projection is space-costly

Partition-based Projection
- Parallel projection needs a lot of disk space
- Partition projection saves it
- (Figure: the transaction DB {fcamp, fcabm, fb, cbp, fcamp} is split into p-, m-, b-, a-, c-, and f-projected DBs; e.g., the m-projected DB {fcab, fca, fca} is in turn split into am-, cm-, … projected DBs)

FP-Growth vs. Apriori: Scalability With the Support Threshold
- (Figure: run time as a function of the support threshold, data set T25I20D10K)

FP-Growth vs. Tree-Projection: Scalability with the Support Threshold
- (Figure: run time as a function of the support threshold, data set T25I20D100K)

Why Is FP-Growth the Winner?
- Divide-and-conquer:
  - Decomposes both the mining task and the DB according to the frequent patterns obtained so far
  - Leads to focused search of smaller databases
- Other factors:
  - No candidate generation, no candidate test
  - Compressed database: the FP-tree structure
  - No repeated scan of the entire database
  - Basic operations are counting local frequent items and building sub-FP-trees; no pattern search and matching

Implications of the Methodology
- Mining closed frequent itemsets and max-patterns
  - CLOSET (DMKD'00)
- Mining sequential patterns
  - FreeSpan (KDD'00), PrefixSpan (ICDE'01)
- Constraint-based mining of frequent patterns
  - Convertible constraints (KDD'00, ICDE'01)
- Computing iceberg data cubes with complex measures
  - H-tree and H-cubing algorithm (SIGMOD'01)

MaxMiner: Mining Max-Patterns

  Tid   Items
  10    A, B, C, D, E
  20    B, C, D, E
  30    A, C, D, F

- 1st scan: find the frequent items: A, B, C, D, E
- 2nd scan: find support for the potential max-patterns AB, AC, AD, AE, ABCDE; BC, BD, BE, BCDE; CD, CE, CDE
- Since BCDE is a max-pattern, there is no need to check BCD, BDE, CDE in a later scan
- R. Bayardo. Efficiently mining long patterns from databases. In SIGMOD'98

Mining Frequent Closed Patterns: CLOSET (min_sup = 2)

  TID   Items
  10    a, c, d, e, f
  20    a, b, e
  30    c, e, f
  40    a, c, d, f
  50    c, e, f

- F-list: list of all frequent items in support-ascending order: d-a-f-e-c
- Divide the search space:
  - Patterns having d
  - Patterns having a but no d, etc.
- Find frequent closed patterns recursively
  - Every transaction having d also has c, f, and a, so cfad is a frequent closed pattern
- J. Pei, J. Han & R. Mao. CLOSET: An Efficient Algorithm for Mining Frequent Closed Itemsets. DMKD'00

CLOSET+: Mining Closed Itemsets by Pattern-Growth
- Itemset merging: if Y appears in every occurrence of X, then Y is merged with X
- Sub-itemset pruning: if Y ⊃ X and sup(X) = sup(Y), then X and all of X's descendants in the set-enumeration tree can be pruned
- Hybrid tree projection:
  - Bottom-up physical tree projection
  - Top-down pseudo tree projection
- Item skipping: if a local frequent item has the same support in several header tables at different levels, it can be pruned from the header tables at the higher levels
- Efficient subset checking

CHARM: Mining by Exploring the Vertical Data Format
- Vertical format: t(AB) = {T11, T25, …}
  - tid-list: the list of transaction ids containing an itemset
- Deriving closed patterns based on vertical intersections:
  - t(X) = t(Y): X and Y always happen together
  - t(X) ⊂ t(Y): a transaction having X always has Y
- Using diffsets to accelerate mining:
  - Only keep track of differences of tids
  - t(X) = {T1, T2, T3}, t(XY) = {T1, T3}
  - diffset(XY, X) = {T2}
- Eclat/MaxEclat (Zaki et al. @KDD'97), VIPER (P. Shenoy et al. @SIGMOD'00), CHARM (Zaki & Hsiao @SDM'02)
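A small sketch of the vertical representation and a diffset (the data and names are illustrative):

```python
from collections import defaultdict

def tidlists(transactions):
    """Vertical layout: item -> set of ids of transactions containing it."""
    t = defaultdict(set)
    for tid, items in enumerate(transactions):
        for item in items:
            t[item].add(tid)
    return t

t = tidlists([{"A", "B"}, {"A", "C"}, {"A", "B", "C"}])
t_AB = t["A"] & t["B"]       # support by intersection, no DB rescan: {0, 2}
diffset = t["A"] - t_AB      # diffset(AB, A) = {1}
# sup(AB) = sup(A) - |diffset(AB, A)| = 3 - 1 = 2
```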

Further Improvements of Mining Methods
- AFOPT (Liu et al. @KDD'03)
  - A "push-right" method for mining condensed frequent pattern (CFP) trees
- Carpenter (Pan et al. @KDD'03)
  - Mines data sets with few rows but numerous columns
  - Constructs a row-enumeration tree for efficient mining

Visualization of Association Rules: Plane Graph
- (Figure: plane-graph visualization of association rules)

Visualization of Association Rules: Rule Graph
- (Figure: rule-graph visualization of association rules)

Visualization of Association Rules (SGI/MineSet 3.0)
- (Figure: screenshot of association-rule visualization in SGI/MineSet 3.0)

Chapter 5: Mining Frequent Patterns, Associations and Correlations
- Basic concepts and a road map
- Efficient and scalable frequent itemset mining methods
- Mining various kinds of association rules
- From association mining to correlation analysis
- Constraint-based association mining
- Summary

Mining Various Kinds of Association Rules
- Mining multilevel associations
- Mining multidimensional associations
- Mining quantitative associations
- Mining interesting correlation patterns

Mining Multiple-Level Association Rules
- Items often form hierarchies
- Flexible support settings
  - Items at the lower levels are expected to have lower support
- Exploration of shared multi-level mining (Agrawal & Srikant @VLDB'95, Han & Fu @VLDB'95)
- Uniform support: Level 1 min_sup = 5%, Level 2 min_sup = 5%
- Reduced support: Level 1 min_sup = 5%, Level 2 min_sup = 3%
- Example: Milk [support = 10%]; 2% Milk [support = 6%]; Skim Milk [support = 4%]

Multi-level Association: Redundancy Filtering
- Some rules may be redundant due to "ancestor" relationships between items
- Example:
  - milk ⇒ wheat bread [support = 8%, confidence = 70%]
  - 2% milk ⇒ wheat bread [support = 2%, confidence = 72%]
- We say the first rule is an ancestor of the second rule
- A rule is redundant if its support is close to the "expected" value, based on the rule's ancestor

Mining Multi-Dimensional Associations
- Single-dimensional rules: buys(X, "milk") ⇒ buys(X, "bread")
- Multi-dimensional rules: 2 or more dimensions or predicates
  - Inter-dimension association rules (no repeated predicates): age(X, "19-25") ∧ occupation(X, "student") ⇒ buys(X, "coke")
  - Hybrid-dimension association rules (repeated predicates): age(X, "19-25") ∧ buys(X, "popcorn") ⇒ buys(X, "coke")
- Categorical attributes: finite number of possible values, no ordering among values; data cube approach
- Quantitative attributes: numeric, implicit ordering among values; discretization, clustering, and gradient approaches

Mining Quantitative Associations
Techniques can be categorized by how numerical attributes, such as age or salary, are treated:
1. Static discretization based on predefined concept hierarchies (data cube methods)
2. Dynamic discretization based on data distribution (quantitative rules, e.g., Agrawal & Srikant @SIGMOD'96)
3. Clustering: distance-based associations (e.g., Yang & Miller @SIGMOD'97); one-dimensional clustering, then association
4. Deviation (such as Aumann and Lindell @KDD'99), e.g., Sex = female ⇒ Wage: mean = $7/hr (overall mean = $9)

Static Discretization of Quantitative Attributes
- Discretized prior to mining using concept hierarchies
- Numeric values are replaced by ranges
- In a relational database, finding all frequent k-predicate sets requires k or k+1 table scans
- A data cube is well suited for mining:
  - The cells of an n-dimensional cuboid correspond to the predicate sets
  - (Figure: the cuboid lattice over (age), (income), (buys), with the 2-D cuboids (age, income), (age, buys), (income, buys) and the 3-D cuboid (age, income, buys))
- Mining from data cubes can be much faster

Quantitative Association Rules
- Proposed by Lent, Swami and Widom @ICDE'97
- Numeric attributes are dynamically discretized
  - Such that the confidence or compactness of the rules mined is maximized
- 2-D quantitative association rules: A_quan1 ∧ A_quan2 ⇒ A_cat
- Cluster adjacent association rules to form general rules using a 2-D grid
- Example: age(X, "34-35") ∧ income(X, "30-50K") ⇒ buys(X, "high resolution TV")

Mining Other Interesting Patterns
- Flexible support constraints (Wang et al. @VLDB'02)
  - Some items (e.g., diamond) may occur rarely but are valuable
  - Customized sup_min specification and application
- Top-k closed frequent patterns (Han et al. @ICDM'02)
  - Hard to specify sup_min, but top-k with length_min is more desirable
  - Dynamically raise sup_min in FP-tree construction and mining, and select the most promising paths to mine

Chapter 5: Mining Frequent Patterns, Associations and Correlations
- Basic concepts and a road map
- Efficient and scalable frequent itemset mining methods
- Mining various kinds of association rules
- From association mining to correlation analysis
- Constraint-based association mining
- Summary

Interestingness Measure: Correlations (Lift)
- play basketball ⇒ eat cereal [40%, 66.7%] is misleading
  - The overall percentage of students eating cereal is 75%, which is higher than 66.7%
- play basketball ⇒ not eat cereal [20%, 33.3%] is more accurate, although with lower support and confidence
- Measure of dependent/correlated events: lift(A, B) = P(A ∪ B) / (P(A) × P(B))

                Basketball   Not basketball   Sum (row)
  Cereal        2000         1750             3750
  Not cereal    1000         250              1250
  Sum (col.)    3000         2000             5000
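Working the contingency table through the lift formula, as a quick illustrative computation:

```python
def lift(n_ab, n_a, n_b, n):
    """lift(A, B) = P(A and B) / (P(A) * P(B))."""
    return (n_ab / n) / ((n_a / n) * (n_b / n))

print(lift(2000, 3000, 3750, 5000))  # ~0.89 < 1: basketball and cereal
                                     # are negatively correlated
print(lift(1000, 3000, 1250, 5000))  # ~1.33 > 1: basketball and "not
                                     # cereal" are positively correlated
```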

Are lift and χ² Good Measures of Correlation?
- "Buy walnuts ⇒ buy milk [1%, 80%]" is misleading if 85% of customers buy milk
- Support and confidence are not good at representing correlations
- Many interestingness measures have been proposed (Tan, Kumar, Srivastava @KDD'02)

              Milk     No Milk   Sum (row)
  Coffee      m, c     ~m, c     c
  No Coffee   m, ~c    ~m, ~c    ~c
  Sum (col.)  m        ~m        all

  DB   m, c   ~m, c   m, ~c    ~m, ~c    lift   all-conf   coh    χ²
  A1   1000   100     100      10,000    9.26   0.91       0.83   9055
  A2   100    1000    1000     100,000   8.44   0.09       0.05   670
  A3   1000   100     10,000   100,000   9.18   0.09       0.09   8172
  A4   1000   1000    1000     1000      1      0.5        0.33   0

Which Measures Should Be Used?
- lift and χ² are not good measures for correlations in large transactional DBs
- all-conf or coherence could be good measures (Omiecinski @TKDE'03)
  - Both all-conf and coherence have the downward closure property
  - Efficient algorithms can be derived for mining (Lee et al. @ICDM'03)

Chapter 5: Mining Frequent Patterns, Associations and Correlations
- Basic concepts and a road map
- Efficient and scalable frequent itemset mining methods
- Mining various kinds of association rules
- From association mining to correlation analysis
- Constraint-based association mining
- Summary

Constraint-based (Query-Directed) Mining
- Finding all the patterns in a database autonomously? Unrealistic!
  - The patterns could be too many but not focused!
- Data mining should be an interactive process
  - The user directs what is to be mined using a data mining query language (or a graphical user interface)
- Constraint-based mining
  - User flexibility: provides constraints on what is to be mined
  - System optimization: explores such constraints for efficient mining

Constraints in Data Mining
- Knowledge type constraint: classification, association, etc.
- Data constraint, using SQL-like queries: e.g., find product pairs sold together in stores in Chicago in Dec. '02
- Dimension/level constraint: in relevance to region, price, brand, customer category
- Rule (or pattern) constraint: small sales (price < $10) triggers big sales (sum > $200)
- Interestingness constraint: strong rules with min_support ≥ 3%, min_confidence ≥ 60%

Constrained Mining vs. Constraint-Based Search
- Constrained mining vs. constraint-based search/reasoning
  - Both aim at reducing the search space
  - Finding all patterns satisfying constraints vs. finding some (or one) answer in constraint-based search in AI
  - Constraint-pushing vs. heuristic search
  - How to integrate them is an interesting research problem
- Constrained mining vs. query processing in DBMS
  - Database query processing requires finding all answers
  - Constrained pattern mining shares a similar philosophy with pushing selections deeply into query processing

Anti-Monotonicity in Constraint Pushing (TDB, min_sup = 2)

  TID   Transaction
  10    a, b, c, d, f
  20    b, c, d, f, g, h
  30    a, c, d, e, f
  40    c, e, f, g

  Item:    a    b   c    d    e    f    g    h
  Profit:  40   0   -20  10   -30  30   20   -10

- Anti-monotonicity: when an itemset S violates the constraint, so does any of its supersets
- sum(S.price) ≤ v is anti-monotone; sum(S.price) ≥ v is not anti-monotone
- Example: C: range(S.profit) ≤ 15 is anti-monotone
  - Itemset ab violates C
  - So does every superset of ab
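A tiny sketch of how such an anti-monotone constraint is checked for pruning, using the profit table above (helper names are my own):

```python
PROFIT = {"a": 40, "b": 0, "c": -20, "d": 10,
          "e": -30, "f": 30, "g": 20, "h": -10}

def violates_range(itemset, v=15):
    """C: range(S.profit) <= v. Anti-monotone: once an itemset violates C,
    every superset also violates it, so the whole branch can be pruned."""
    vals = [PROFIT[i] for i in itemset]
    return max(vals) - min(vals) > v

print(violates_range({"a", "b"}))   # True: range 40 - 0 = 40 > 15,
                                    # so ab and all its supersets are pruned
```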

Monotonicity for Constraint Pushing (TDB and profit table as above, min_sup = 2)
- Monotonicity: when an itemset S satisfies the constraint, so does any of its supersets
- sum(S.price) ≥ v is monotone; min(S.price) ≤ v is monotone
- Example: C: range(S.profit) ≥ 15
  - Itemset ab satisfies C
  - So does every superset of ab

Succinctness
- Succinctness:
  - Given A1, the set of items satisfying a succinctness constraint C, any set S satisfying C is based on A1, i.e., S contains a subset belonging to A1
  - Idea: whether an itemset S satisfies constraint C can be determined based on the selection of items alone, without looking at the transaction database
  - min(S.price) ≤ v is succinct; sum(S.price) ≥ v is not succinct
- Optimization: if C is succinct, C is pre-counting pushable

The Apriori Algorithm: Example
- (Figure: database D is scanned to give C1, filtered to L1, joined to C2, scanned and filtered to L2, then C3 is scanned and filtered to L3)

Naïve Algorithm: Apriori + Constraint
- (Figure: the same Apriori flow, with the constraint sum(S.price) < 5 checked only against the final frequent itemsets)

The Constrained Apriori Algorithm: Push an Anti-monotone Constraint Deep
- (Figure: the Apriori flow with the anti-monotone constraint sum(S.price) < 5 pruning candidates as soon as they are generated)

The Constrained Apriori Algorithm: Push a Succinct Constraint Deep
- (Figure: the Apriori flow with the succinct constraint min(S.price) <= 1; item selection happens up front, so some candidates are "not immediately to be used")

Converting "Tough" Constraints (TDB and profit table as above, min_sup = 2)
- Convert tough constraints into anti-monotone or monotone constraints by properly ordering the items
- Examine C: avg(S.profit) ≥ 25
  - Order items in value-descending order: <a, f, g, d, b, h, c, e>
  - If an itemset afb violates C, so does afbh, afb*
  - It becomes anti-monotone!

Strongly Convertible Constraints
- avg(X) ≥ 25 is convertible anti-monotone w.r.t. the item-value-descending order R: <a, f, g, d, b, h, c, e>
  - If an itemset af violates a constraint C, so does every itemset with af as a prefix, such as afd
- avg(X) ≥ 25 is convertible monotone w.r.t. the item-value-ascending order R⁻¹: <e, c, h, b, d, g, f, a>
  - If an itemset d satisfies a constraint C, so do the itemsets df and dfa, which have d as a prefix
- Thus, avg(X) ≥ 25 is strongly convertible
(Item profits as in the table above.)

Can Apriori Handle Convertible Constraints?
- A convertible constraint that is neither monotone, anti-monotone, nor succinct cannot be pushed deep into an Apriori mining algorithm
  - Within the level-wise framework, no direct pruning based on the constraint can be made
  - Itemset df violates constraint C: avg(X) ≥ 25
  - Since adf satisfies C, Apriori needs df to assemble adf, so df cannot be pruned
- But such constraints can be pushed into the frequent-pattern growth framework!
(Item values as in the profit table above.)

Mining With Convertible Constraints (C: avg(X) ≥ 25, min_sup = 2)

  TID   Transaction (items in order R)
  10    a, f, d, b, c
  20    f, g, d, b, c
  30    a, f, d, c, e
  40    f, g, h, c, e

- List the items in every transaction in value-descending order R: <a, f, g, d, b, h, c, e>
  - C is convertible anti-monotone w.r.t. R
- Scan TDB once
  - Remove infrequent items: item h is dropped
  - Itemsets a and f are good, …
- Projection-based mining
  - Impose an appropriate order on item projection
  - Many tough constraints can be converted into (anti-)monotone ones

Handling Multiple Constraints
- Different constraints may require different, or even conflicting, item orderings
- If there exists an order R such that both C1 and C2 are convertible w.r.t. R, then there is no conflict between the two convertible constraints
- If there is a conflict in the ordering of items:
  - Try to satisfy one constraint first
  - Then use that order for the other constraint to mine frequent itemsets in the corresponding projected database

What Constraints Are Convertible?

  Constraint                                        Convertible anti-monotone   Convertible monotone   Strongly convertible
  avg(S) ≤ v, avg(S) ≥ v                            Yes                         Yes                    Yes
  median(S) ≤ v, median(S) ≥ v                      Yes                         Yes                    Yes
  sum(S) ≤ v (items could be of any value, v ≥ 0)   Yes                         No                     No
  sum(S) ≥ v (items could be of any value, v ≥ 0)   No                          Yes                    No
  sum(S) ≥ v (items could be of any value, v ≤ 0)   Yes                         No                     No
  …

Constraint-Based Mining: A General Picture

  Constraint                      Anti-monotone   Monotone      Succinct
  v ∈ S                           no              yes           yes
  S ⊆ V                           yes             no            yes
  min(S) ≥ v                      yes             no            yes
  max(S) ≥ v                      no              yes           yes
  count(S) ≤ v                    yes             no            weakly
  count(S) ≥ v                    no              yes           weakly
  sum(S) ≤ v (∀a ∈ S, a ≥ 0)      yes             no            no
  sum(S) ≥ v (∀a ∈ S, a ≥ 0)      no              yes           no
  range(S) ≤ v                    yes             no            no
  range(S) ≥ v                    no              yes           no
  avg(S) θ v, θ ∈ {=, ≤, ≥}       convertible     convertible   no
  support(S) ≥ ξ                  yes             no            no
  support(S) ≤ ξ                  no              yes           no

A Classification of Constraints
- (Figure: Venn-style diagram relating the constraint classes: anti-monotone, monotone, succinct, convertible anti-monotone, convertible monotone, strongly convertible, and inconvertible)

Chapter 5: Mining Frequent Patterns, Associations and Correlations
- Basic concepts and a road map
- Efficient and scalable frequent itemset mining methods
- Mining various kinds of association rules
- From association mining to correlation analysis
- Constraint-based association mining
- Summary

Frequent-Pattern Mining: Summary
- Frequent pattern mining: an important task in data mining
- Scalable frequent pattern mining methods
  - Apriori (candidate generation and test)
  - Projection-based (FP-growth, CLOSET+, …)
  - Vertical format approach (CHARM, …)
- Mining a variety of rules and interesting patterns
- Constraint-based mining
- Mining sequential and structured patterns
- Extensions and applications

Frequent-Pattern Mining: Research Problems
- Mining fault-tolerant frequent, sequential and structured patterns
  - Patterns allowing limited faults (insertion, deletion, mutation)
- Mining truly interesting patterns
  - Surprising, novel, concise, …
- Application exploration
  - E.g., DNA sequence analysis and bio-pattern classification
  - "Invisible" data mining