What Is Frequent Pattern Analysis?
What Is Frequent Pattern Analysis?
- Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a data set
- Motivation: finding inherent regularities in data
  - What products were often purchased together?
  - What are the subsequent purchases after buying a PC?
  - What kinds of DNA are sensitive to this new drug?
  - Can we automatically classify web documents?
- Applications: basket data analysis, cross-marketing, catalog design, sale campaign analysis, Web log (click stream) analysis, and DNA sequence analysis
Frequent Itemsets
- A set of items is called an itemset; an itemset with k items is called a k-itemset
- The occurrence frequency of an itemset is the number of transactions that contain it
- If the support of an itemset satisfies a minimum support threshold, it is called a frequent itemset
- Confidence(A ⇒ B) = P(B|A) = support(A ∪ B) / support(A)
10 March 2021
Basic Concepts: Frequent Patterns and Association Rules
- Itemset X = {x1, …, xk}
- Find all the rules X ⇒ Y with minimum support and confidence
  - support, s: probability that a transaction contains X ∪ Y
  - confidence, c: conditional probability that a transaction having X also contains Y

Transaction-id   Items bought
10               A, B, D
20               A, C, D
30               A, D, E
40               B, E, F
50               B, C, D, E, F

Let min_sup = 50%, min_conf = 50%:
- Frequent patterns: {A: 3, B: 3, D: 4, E: 3, AD: 3}
- Association rules: A ⇒ D (support 60%, confidence 100%), D ⇒ A (support 60%, confidence 75%)
(Figure: Venn diagram of customers who buy beer, buy diapers, or buy both.)
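The support and confidence numbers above can be checked with a short sketch (Python used for illustration; `tdb`, `support_count`, and `confidence` are names chosen here, not from the slides):

```python
# Transaction database from the slide above
tdb = {10: {'A', 'B', 'D'}, 20: {'A', 'C', 'D'}, 30: {'A', 'D', 'E'},
       40: {'B', 'E', 'F'}, 50: {'B', 'C', 'D', 'E', 'F'}}

def support_count(itemset):
    # Number of transactions containing every item of `itemset`
    return sum(itemset <= t for t in tdb.values())

def confidence(x, y):
    # c(X => Y) = support(X U Y) / support(X)
    return support_count(x | y) / support_count(x)

print(support_count({'A', 'D'}) / len(tdb))  # 0.6  (support of A => D)
print(confidence({'A'}, {'D'}))              # 1.0
print(confidence({'D'}, {'A'}))              # 0.75
```

This reproduces the two rules on the slide: A ⇒ D (60%, 100%) and D ⇒ A (60%, 75%).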
Two-Step Process of Association Mining
- Find all frequent itemsets: those with support at least min_support
- Generate strong association rules from the frequent itemsets: rules that satisfy both minimum support and minimum confidence
10 March 2021  Data Mining: Concepts and Techniques
Closed Patterns and Max-Patterns
- A long pattern contains a combinatorial number of sub-patterns; e.g., {a1, …, a100} contains C(100,1) + C(100,2) + … + C(100,100) = 2^100 − 1 ≈ 1.27 × 10^30 sub-patterns!
- Solution: mine closed patterns and max-patterns instead
- An itemset X is closed if X is frequent and there exists no super-pattern Y ⊃ X with the same support as X
- An itemset X is a max-pattern if X is frequent and there exists no frequent super-pattern Y ⊃ X
- Closed patterns are a lossless compression of frequent patterns
  - Reduces the number of patterns and rules
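As a sketch, closed and max patterns can be checked directly from a support table (illustrative Python; `freq` reuses the frequent patterns of the basic-concepts slide: A: 3, B: 3, D: 4, E: 3, AD: 3):

```python
# Frequent patterns with their support counts
freq = {frozenset('A'): 3, frozenset('B'): 3, frozenset('D'): 4,
        frozenset('E'): 3, frozenset({'A', 'D'}): 3}

def is_closed(x):
    # Closed: no proper superset with the same support
    return not any(x < y and freq[y] == freq[x] for y in freq)

def is_max(x):
    # Max-pattern: no frequent proper superset at all
    return not any(x < y for y in freq)

closed = {x for x in freq if is_closed(x)}
maximal = {x for x in freq if is_max(x)}
# A is not closed (its superset AD has the same support 3);
# D is closed (support 4 differs) but not maximal (AD is frequent)
```

Every max-pattern is closed, but not vice versa, which is why closed patterns are lossless while max-patterns lose the exact supports of sub-patterns.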
Scalable Methods for Mining Frequent Patterns
- The downward closure property of frequent patterns
  - Any subset of a frequent itemset must be frequent
  - If {beer, diaper, nuts} is frequent, so is {beer, diaper}
  - i.e., every transaction having {beer, diaper, nuts} also contains {beer, diaper}
- Scalable mining methods: three major approaches
  - Apriori (Agrawal & Srikant @VLDB'94)
  - Frequent-pattern growth (FP-growth: Han, Pei & Yin @SIGMOD'00)
  - Vertical data format approach (CHARM: Zaki & Hsiao @SDM'02)
Frequent Pattern Mining: Classification
Frequent pattern mining can be classified in various ways:
- Based on the completeness of patterns to be mined
- Based on the levels of abstraction
- Based on the number of data dimensions
- Based on the types of values handled
- Based on the kinds of rules to be mined
- Based on the kinds of patterns to be mined
Apriori: A Candidate Generation-and-Test Approach
- Apriori pruning principle: if there is any itemset that is infrequent, its supersets should not be generated/tested! (Agrawal & Srikant @VLDB'94; Mannila et al. @KDD'94)
- Method:
  - Initially, scan the DB once to get the frequent 1-itemsets
  - Generate length-(k+1) candidate itemsets from length-k frequent itemsets
  - Test the candidates against the DB
  - Terminate when no frequent or candidate set can be generated
The Apriori Algorithm—An Example (min_sup = 2)

Database TDB:
Tid   Items
10    A, C, D
20    B, C, E
30    A, B, C, E
40    B, E

1st scan → C1: {A}: 2, {B}: 3, {C}: 3, {D}: 1, {E}: 3
L1 (prune {D}): {A}: 2, {B}: 3, {C}: 3, {E}: 3
C2 (self-join L1): {A,B}, {A,C}, {A,E}, {B,C}, {B,E}, {C,E}
2nd scan → counts: {A,B}: 1, {A,C}: 2, {A,E}: 1, {B,C}: 2, {B,E}: 3, {C,E}: 2
L2: {A,C}: 2, {B,C}: 2, {B,E}: 3, {C,E}: 2
C3: {B,C,E}
3rd scan → L3: {B,C,E}: 2
The Apriori Algorithm
Pseudo-code (Ck: candidate itemsets of size k; Lk: frequent itemsets of size k):

  L1 = {frequent items};
  for (k = 1; Lk != ∅; k++) do begin
      Ck+1 = candidates generated from Lk;
      for each transaction t in database do
          increment the count of all candidates in Ck+1 that are contained in t;
      Lk+1 = candidates in Ck+1 with min_support;
  end
  return ∪k Lk;
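The pseudo-code above translates into a compact runnable sketch (Python used for illustration; `apriori` and its argument names are choices made here):

```python
from itertools import combinations

def apriori(transactions, min_sup):
    """Level-wise Apriori: returns {frequent itemset: support count}."""
    items = {i for t in transactions for i in t}
    # L1: frequent 1-itemsets
    L = {frozenset([i]) for i in items
         if sum(i in t for t in transactions) >= min_sup}
    result = {x: sum(x <= t for t in transactions) for x in L}
    k = 1
    while L:
        # Self-join Lk with itself ...
        cand = {a | b for a in L for b in L if len(a | b) == k + 1}
        # ... then prune candidates having an infrequent k-subset
        # (downward closure)
        cand = {c for c in cand
                if all(frozenset(s) in L for s in combinations(c, k))}
        counts = {c: sum(c <= t for t in transactions) for c in cand}
        L = {c for c, n in counts.items() if n >= min_sup}
        result.update((c, counts[c]) for c in L)
        k += 1
    return result
```

On the TDB of the example slide (min_sup = 2) this yields exactly L1 = {A, B, C, E}, L2 = {AC, BC, BE, CE}, and L3 = {BCE}.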
Important Details of Apriori
- How to generate candidates?
  - Step 1: self-joining Lk
  - Step 2: pruning
- How to count supports of candidates?
- Example of candidate generation:
  - L3 = {abc, abd, acd, ace, bcd}
  - Self-joining: L3 * L3
    - abcd from abc and abd
    - acde from acd and ace
  - Pruning: acde is removed because ade is not in L3
  - C4 = {abcd}
How to Count Supports of Candidates?
- Why is counting supports of candidates a problem?
  - The total number of candidates can be very large
  - One transaction may contain many candidates
- Method:
  - Candidate itemsets are stored in a hash tree
  - A leaf node of the hash tree contains a list of itemsets and counts
  - An interior node contains a hash table
  - Subset function: finds all the candidates contained in a transaction
TID    List of item_IDs
T100   I1, I2, I5
T200   I2, I4
T300   I2, I3
T400   I1, I2, I4
T500   I1, I3
T600   I2, I3
T700   I1, I3
T800   I1, I2, I3, I5
T900   I1, I2, I3
Generating Association Rules from Frequent Itemsets
- Strong association rules satisfy both minimum support and minimum confidence
- For each frequent itemset l, generate all nonempty subsets of l
- For every nonempty subset s of l, output the rule "s ⇒ (l − s)" if support_count(l) / support_count(s) >= min_conf, where min_conf is the minimum confidence threshold
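A direct sketch of this procedure (illustrative Python; it assumes a `{itemset: support_count}` table that, by downward closure, contains every subset of each frequent itemset):

```python
from itertools import chain, combinations

def gen_rules(freq, min_conf):
    """Output all strong rules s => (l - s) from frequent itemsets."""
    rules = []
    for l, sup_l in freq.items():
        if len(l) < 2:
            continue
        # every nonempty proper subset s of l
        subsets = chain.from_iterable(
            combinations(l, r) for r in range(1, len(l)))
        for s in map(frozenset, subsets):
            conf = sup_l / freq[s]          # support(l) / support(s)
            if conf >= min_conf:
                rules.append((s, l - s, conf))
    return rules
```

For the itemset {A, D} of the basic-concepts slide (support 3, with support(A) = 3 and support(D) = 4), min_conf = 70% yields A ⇒ D (confidence 1.0) and D ⇒ A (confidence 0.75).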
Generating Association Rules: Exercise
- Considering the frequent itemset l = {I1, I2, I5} and a minimum confidence of 70%, find the strong association rules
Improving the Efficiency of Apriori
- Hash-based technique: while generating the candidate 1-itemsets, we can also generate all of the 2-itemsets for each transaction, hash them into the buckets of a hash table structure, and increase the corresponding bucket counts
- Transaction reduction: a transaction that does not contain any frequent k-itemsets cannot contain any frequent (k+1)-itemsets, so it can be excluded from further scans
Improving the Efficiency of Apriori
- Partitioning:
  - Requires two database scans
  - Consists of two phases
    - Phase I: divide the transactions into n non-overlapping partitions (with local min_sup = min_sup × # transactions in the partition); local frequent itemsets are found in each partition
    - Any itemset frequent in D must be frequent in at least one partition
    - Phase II: scan D to determine the actual support of each candidate and obtain the global frequent itemsets
Improving the Efficiency of Apriori
- Sampling:
  - Pick a random sample S of D and search for frequent itemsets in S
  - To lessen the possibility of missing some global frequent itemsets, lower the minimum support
  - The rest of the database is then checked to find the actual frequencies of each itemset
  - If the sample's frequent itemsets contain all the frequent itemsets in D, then only one scan of D is required
Dynamic Itemset Counting
- In this technique, candidate itemsets are added at different points during a scan
- The database is partitioned into blocks marked by start points
- New candidate itemsets can be added at any start point
- The algorithm requires fewer database scans than Apriori
Challenges of Frequent Pattern Mining
- Challenges:
  - Multiple scans of the transaction database
  - Huge number of candidates
  - Tedious workload of support counting for candidates
- Improving Apriori: general ideas
  - Reduce the number of transaction-database scans
  - Shrink the number of candidates
  - Facilitate support counting of candidates
Partition: Scan the Database Only Twice
- Any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB
  - Scan 1: partition the database and find local frequent patterns
  - Scan 2: consolidate global frequent patterns
- A. Savasere, E. Omiecinski, and S. Navathe. An efficient algorithm for mining association rules in large databases. In VLDB'95
DHP: Reduce the Number of Candidates
- A k-itemset whose corresponding hash-bucket count is below the threshold cannot be frequent
  - Candidates: a, b, c, d, e
  - Hash entries: {ab, ad, ae}, {bd, be, de}, …
  - Frequent 1-itemsets: a, b, d, e
  - ab is not a candidate 2-itemset if the sum of the counts of {ab, ad, ae} is below the support threshold
- J. Park, M. Chen, and P. Yu. An effective hash-based algorithm for mining association rules. In SIGMOD'95
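A minimal sketch of the hashing idea (illustrative Python; the bucket count and hash function are arbitrary choices here): while doing the first scan, every 2-itemset of each transaction is hashed into a small table; a pair whose bucket total is below min_sup can be excluded from C2. A truly frequent pair can never be filtered out, because its bucket count is at least its own support.

```python
from itertools import combinations

NBUCKETS = 7   # deliberately tiny; collisions only make the filter weaker

def bucket(pair):
    a, b = sorted(pair)
    return (ord(a) * 31 + ord(b)) % NBUCKETS

tdb = [{'A', 'C', 'D'}, {'B', 'C', 'E'}, {'A', 'B', 'C', 'E'}, {'B', 'E'}]
counts = [0] * NBUCKETS
for t in tdb:                                   # piggybacks on the 1st scan
    for pair in combinations(sorted(t), 2):
        counts[bucket(pair)] += 1

min_sup = 2
def may_be_frequent(pair):
    # A bucket count below min_sup definitively rules the pair out
    return counts[bucket(pair)] >= min_sup
```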
Sampling for Frequent Patterns
- Select a sample of the original database; mine frequent patterns within the sample using Apriori
- Scan the database once to verify the frequent itemsets found in the sample; only the borders of the closure of the frequent patterns are checked
  - Example: check abcd instead of ab, ac, …, etc.
- Scan the database again to find missed frequent patterns
- H. Toivonen. Sampling large databases for association rules. In VLDB'96
DIC: Reduce the Number of Scans
- Once both A and D are determined frequent, the counting of AD begins
- Once all length-2 subsets of BCD are determined frequent, the counting of BCD begins
(Figure: itemset lattice over {A, B, C, D}; Apriori counts 1-itemsets, 2-itemsets, … in separate passes, while DIC starts counting longer itemsets mid-scan.)
- S. Brin, R. Motwani, J. Ullman, and S. Tsur. Dynamic itemset counting and implication rules for market basket data. In SIGMOD'97
Bottleneck of Frequent-pattern Mining
- Multiple database scans are costly
- Mining long patterns needs many passes of scanning and generates lots of candidates
  - To find the frequent itemset i1 i2 … i100:
    - # of scans: 100
    - # of candidates: C(100,1) + C(100,2) + … + C(100,100) = 2^100 − 1 ≈ 1.27 × 10^30!
- Bottleneck: candidate generation and test
- Can we avoid candidate generation?
Mining Frequent Patterns Without Candidate Generation
- Grow long patterns from short ones using local frequent items
  - "abc" is a frequent pattern
  - Get all transactions having "abc": DB|abc
  - If "d" is a local frequent item in DB|abc, then abcd is a frequent pattern
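The growth idea can be sketched without the FP-tree itself, by recursing on projected (conditional) databases (illustrative Python; `grow` is a name chosen here, and plain lexicographic order stands in for the f-list):

```python
def grow(db, min_sup, suffix=()):
    """Pattern growth: for each local frequent item, emit the grown
    pattern and recurse on its projected (conditional) database."""
    counts = {}
    for t in db:
        for i in set(t):
            counts[i] = counts.get(i, 0) + 1
    patterns = {}
    for item, n in counts.items():
        if n < min_sup:
            continue
        pat = (item,) + suffix
        patterns[frozenset(pat)] = n
        # DB|pat: transactions containing `item`, restricted to items
        # that sort before it, so each pattern is enumerated only once
        proj = [[j for j in t if j < item] for t in db if item in t]
        patterns.update(grow(proj, min_sup, pat))
    return patterns
```

On the Apriori example TDB (min_sup = 2) this finds the same nine frequent itemsets, including {B, C, E}: 2, without ever generating and testing candidates.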
Construct FP-tree from a Transaction Database (min_support = 3)

TID   Items bought                 (Ordered) frequent items
100   {f, a, c, d, g, i, m, p}     {f, c, a, m, p}
200   {a, b, c, f, l, m, o}        {f, c, a, b, m}
300   {b, f, h, j, o, w}           {f, b}
400   {b, c, k, s, p}              {c, b, p}
500   {a, f, c, e, l, p, m, n}     {f, c, a, m, p}

Steps:
1. Scan the DB once, find the frequent 1-itemsets (single-item patterns)
2. Sort frequent items in descending frequency order, giving the f-list: f-c-a-b-m-p
3. Scan the DB again and construct the FP-tree

Header table: f: 4, c: 4, a: 3, b: 3, m: 3, p: 3
(Figure: the resulting FP-tree rooted at {} with paths f:4–c:3–a:3–m:2–p:2, a branch b:1–m:1 under a:3, f:4–b:1, and c:1–b:1–p:1.)
Benefits of the FP-tree Structure
- Completeness
  - Preserves complete information for frequent pattern mining
  - Never breaks a long pattern of any transaction
- Compactness
  - Reduces irrelevant information: infrequent items are gone
  - Items in descending frequency order: the more frequently an item occurs, the more likely it is to be shared
  - Never larger than the original database (not counting node-links and the count fields)
  - For the Connect-4 DB, the compression ratio can be over 100
Partition Patterns and Databases
- Frequent patterns can be partitioned into subsets according to the f-list
  - F-list = f-c-a-b-m-p
  - Patterns containing p
  - Patterns having m but no p
  - …
  - Patterns having c but none of a, b, m, p
  - Pattern f
- This partitioning is complete and non-redundant
Find Patterns Having p from p's Conditional Pattern Base
- Start at the frequent-item header table of the FP-tree
- Traverse the FP-tree by following the node-links of each frequent item p
- Accumulate all the transformed prefix paths of item p to form p's conditional pattern base

Conditional pattern bases:
item   conditional pattern base
c      f: 3
a      fc: 3
b      fca: 1, f: 1, c: 1
m      fca: 2, fcab: 1
p      fcam: 2, cb: 1
From Conditional Pattern Bases to Conditional FP-trees
- For each pattern base:
  - Accumulate the count for each item in the base
  - Construct the FP-tree for the frequent items of the pattern base
- m's conditional pattern base: fca: 2, fcab: 1
- m's conditional FP-tree: a single path {} – f: 3 – c: 3 – a: 3 (b is infrequent and dropped)
- All frequent patterns relating to m: m, fm, cm, am, fcm, fam, cam, fcam
Recursion: Mining Each Conditional FP-tree
- Conditional pattern base of "am": (fc: 3); am-conditional FP-tree: {} – f: 3 – c: 3
- Conditional pattern base of "cm": (f: 3); cm-conditional FP-tree: {} – f: 3
- Conditional pattern base of "cam": (f: 3); cam-conditional FP-tree: {} – f: 3
A Special Case: Single Prefix Path in FP-tree
- Suppose a (conditional) FP-tree T has a shared single prefix path P
- Mining can be decomposed into two parts:
  - Reduction of the single prefix path into one node
  - Concatenation of the mining results of the two parts
(Figure: a tree whose prefix path a1:n1 – a2:n2 – a3:n3 branches into subtrees b1:m1, C1:k1, C2:k2, C3:k3 is decomposed into the single node r1 = a1:n1 – a2:n2 – a3:n3 plus the branching part rooted at r1.)
Mining Frequent Patterns With FP-trees
- Idea: frequent pattern growth
  - Recursively grow frequent patterns by pattern and database partitioning
- Method:
  - For each frequent item, construct its conditional pattern base, and then its conditional FP-tree
  - Repeat the process on each newly created conditional FP-tree
  - Until the resulting FP-tree is empty, or it contains only one path; a single path will generate all the combinations of its sub-paths, each of which is a frequent pattern
Scaling FP-growth by DB Projection
- What if the FP-tree cannot fit in memory? DB projection
- First, partition the database into a set of projected DBs
- Then construct and mine an FP-tree for each projected DB
- Parallel projection vs. partition projection techniques
  - Parallel projection is space-costly
Partition-based Projection
- Parallel projection needs a lot of disk space
- Partition projection saves it
(Figure: the transaction DB {fcamp, fcabm, fb, cbp, fcamp} is split into per-item projected DBs, e.g. the p-proj DB {fcam, cb, fcam}, m-proj DB {fcab, fca, fca}, b-proj DB {f, cb, …}, a-proj DB {fc, …}, c-proj DB {f, …}, and f-proj DB; each is recursively projected again, e.g. the am-proj DB {fc, fc, fc} and cm-proj DB {f, f, f}.)
FP-Growth vs. Apriori: Scalability With the Support Threshold
(Figure: run time vs. support threshold on data set T25I20D10K.)
FP-Growth vs. Tree-Projection: Scalability with the Support Threshold
(Figure: run time vs. support threshold on data set T25I20D100K.)
Why Is FP-Growth the Winner?
- Divide-and-conquer:
  - Decomposes both the mining task and the DB according to the frequent patterns obtained so far
  - Leads to focused search of smaller databases
- Other factors:
  - No candidate generation, no candidate test
  - Compressed database: the FP-tree structure
  - No repeated scan of the entire database
  - Basic operations are counting local frequent items and building sub-FP-trees; no pattern search and matching
Implications of the Methodology
- Mining closed frequent itemsets and max-patterns
  - CLOSET (DMKD'00)
- Mining sequential patterns
  - FreeSpan (KDD'00), PrefixSpan (ICDE'01)
- Constraint-based mining of frequent patterns
  - Convertible constraints (KDD'00, ICDE'01)
- Computing iceberg data cubes with complex measures
  - H-tree and H-cubing algorithm (SIGMOD'01)
MaxMiner: Mining Max-patterns
- 1st scan: find frequent items
  - A, B, C, D, E
- 2nd scan: find support for the potential max-patterns
  - AB, AC, AD, AE, ABCDE
  - BC, BD, BE, BCDE
  - CD, CE, CDE
- Since BCDE is a max-pattern, there is no need to check BCD, BDE, CDE in a later scan

Tid   Items
10    A, B, C, D, E
20    B, C, D, E
30    A, C, D, F

- R. Bayardo. Efficiently mining long patterns from databases. In SIGMOD'98
Mining Frequent Closed Patterns: CLOSET
- F-list: list of all frequent items in ascending support order
  - F-list: d-a-f-e-c (min_sup = 2)
- Divide the search space:
  - Patterns having d; patterns having a but no d; etc.
- Find frequent closed patterns recursively
  - Every transaction having d also has c, f, a, so cfad is a frequent closed pattern

TID   Items
10    a, c, d, e, f
20    a, b, e
30    c, e, f
40    a, c, d, f
50    c, e, f

- J. Pei, J. Han & R. Mao. CLOSET: An Efficient Algorithm for Mining Frequent Closed Itemsets. DMKD'00
CLOSET+: Mining Closed Itemsets by Pattern-Growth
- Itemset merging: if Y appears in every occurrence of X, then Y is merged with X
- Sub-itemset pruning: if Y ⊃ X and sup(X) = sup(Y), X and all of X's descendants in the set-enumeration tree can be pruned
- Hybrid tree projection:
  - Bottom-up physical tree-projection
  - Top-down pseudo tree-projection
- Item skipping: if a local frequent item has the same support in several header tables at different levels, one can prune it from the header tables at the higher levels
- Efficient subset checking
CHARM: Mining by Exploring the Vertical Data Format
- Vertical format: t(AB) = {T11, T25, …}
  - tid-list: the list of transaction ids containing an itemset
- Deriving closed patterns based on vertical intersections
  - t(X) = t(Y): X and Y always happen together
  - t(X) ⊆ t(Y): a transaction having X always has Y
- Using diffsets to accelerate mining
  - Only keep track of differences of tid-lists
  - t(X) = {T1, T2, T3}, t(XY) = {T1, T3}
  - Diffset(XY, X) = {T2}
- Eclat/MaxEclat (Zaki et al. @KDD'97), VIPER (P. Shenoy et al. @SIGMOD'00), CHARM (Zaki & Hsiao @SDM'02)
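The tid-list and diffset manipulations on this slide are plain set operations (illustrative Python, using the slide's example t(X) = {T1, T2, T3}, t(XY) = {T1, T3}; the tid-list chosen for Y is an assumption consistent with that example):

```python
# Vertical data format: t(X) = set of transaction ids containing X
t_X = {'T1', 'T2', 'T3'}
t_Y = {'T1', 'T3', 'T4'}

t_XY = t_X & t_Y                       # extend to XY by tid-list intersection
diffset = t_X - t_XY                   # Diffset(XY, X): often much smaller
support_XY = len(t_X) - len(diffset)   # support is recoverable from the diffset
```

Storing only the diffset pays off on dense data, where t(XY) is nearly as long as t(X).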
Further Improvements of Mining Methods
- AFOPT (Liu et al. @KDD'03)
  - A "push-right" method for mining condensed frequent pattern (CFP) trees
- Carpenter (Pan et al. @KDD'03)
  - Mines data sets with few rows but numerous columns
  - Constructs a row-enumeration tree for efficient mining
Visualization of Association Rules: Plane Graph
Visualization of Association Rules: Rule Graph
Visualization of Association Rules (SGI/MineSet 3.0)
Chapter 5: Mining Frequent Patterns, Associations and Correlations
- Basic concepts and a road map
- Efficient and scalable frequent itemset mining methods
- Mining various kinds of association rules
- From association mining to correlation analysis
- Constraint-based association mining
- Summary
Mining Various Kinds of Association Rules
- Mining multilevel associations
- Mining multidimensional associations
- Mining quantitative associations
- Mining interesting correlation patterns
Mining Multiple-Level Association Rules
- Items often form hierarchies
- Flexible support settings
  - Items at lower levels are expected to have lower support
- Exploration of shared multi-level mining (Agrawal & Srikant @VLDB'95, Han & Fu @VLDB'95)
- Example hierarchy: Milk [support = 10%], with children 2% Milk [support = 6%] and Skim Milk [support = 4%]
  - Uniform support: Level 1 min_sup = 5%, Level 2 min_sup = 5%
  - Reduced support: Level 1 min_sup = 5%, Level 2 min_sup = 3%
Multi-level Association: Redundancy Filtering
- Some rules may be redundant due to "ancestor" relationships between items
- Example:
  - milk ⇒ wheat bread [support = 8%, confidence = 70%]
  - 2% milk ⇒ wheat bread [support = 2%, confidence = 72%]
- We say the first rule is an ancestor of the second rule
- A rule is redundant if its support is close to the "expected" value, based on the rule's ancestor
Mining Multi-Dimensional Associations
- Single-dimensional rules: buys(X, "milk") ⇒ buys(X, "bread")
- Multi-dimensional rules: 2 or more dimensions or predicates
  - Inter-dimension association rules (no repeated predicates): age(X, "19-25") ∧ occupation(X, "student") ⇒ buys(X, "coke")
  - Hybrid-dimension association rules (repeated predicates): age(X, "19-25") ∧ buys(X, "popcorn") ⇒ buys(X, "coke")
- Categorical attributes: finite number of possible values, no ordering among values; data cube approach
- Quantitative attributes: numeric, implicit ordering among values; discretization, clustering, and gradient approaches
Mining Quantitative Associations
Techniques can be categorized by how numerical attributes, such as age or salary, are treated:
1. Static discretization based on predefined concept hierarchies (data cube methods)
2. Dynamic discretization based on data distribution (quantitative rules, e.g., Agrawal & Srikant @SIGMOD'96)
3. Clustering: distance-based association (e.g., Yang & Miller @SIGMOD'97)
   - One-dimensional clustering, then association
4. Deviation (such as Aumann and Lindell @KDD'99):
   Sex = female ⇒ Wage: mean = $7/hr (overall mean = $9)
Static Discretization of Quantitative Attributes
- Discretized prior to mining using concept hierarchies
- Numeric values are replaced by ranges
- In a relational database, finding all frequent k-predicate sets requires k or k+1 table scans
- A data cube is well suited for mining:
  - The cells of an n-dimensional cuboid correspond to the predicate sets, e.g. the lattice (), (age), (income), (buys), (age, income), (age, buys), (income, buys), (age, income, buys)
  - Mining from data cubes can be much faster
Quantitative Association Rules
- Proposed by Lent, Swami and Widom, ICDE'97
- Numeric attributes are dynamically discretized
  - Such that the confidence or compactness of the rules mined is maximized
- 2-D quantitative association rules: A_quan1 ∧ A_quan2 ⇒ A_cat
- Cluster adjacent association rules to form general rules using a 2-D grid
- Example: age(X, "34-35") ∧ income(X, "30-50K") ⇒ buys(X, "high resolution TV")
Mining Other Interesting Patterns
- Flexible support constraints (Wang et al. @VLDB'02)
  - Some items (e.g., diamonds) may occur rarely but are valuable
  - Customized min_sup specification and application
- Top-k closed frequent patterns (Han et al. @ICDM'02)
  - Hard to specify min_sup, but top-k with a minimum length is more desirable
  - Dynamically raise min_sup during FP-tree construction and mining, and select the most promising paths to mine
Interestingness Measure: Correlations (Lift)
- play basketball ⇒ eat cereal [40%, 66.7%] is misleading
  - The overall percentage of students eating cereal is 75% > 66.7%
- play basketball ⇒ not eat cereal [20%, 33.3%] is more accurate, although with lower support and confidence
- Measure of dependent/correlated events: lift(A, B) = P(A ∪ B) / (P(A) P(B))

             Basketball   Not basketball   Sum (row)
Cereal       2000         1750             3750
Not cereal   1000         250              1250
Sum (col.)   3000         2000             5000
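For the contingency table above, the lift works out below 1, confirming the negative correlation (a quick check, Python used for illustration):

```python
# Contingency table from the slide (5000 students)
n = 5000
both = 2000          # play basketball and eat cereal
basketball = 3000
cereal = 3750

# lift(A, B) = P(A u B) / (P(A) * P(B))
lift = (both / n) / ((basketball / n) * (cereal / n))
print(round(lift, 2))   # 0.89 < 1: negatively correlated, despite the
                        # 66.7% confidence of basketball => cereal
```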
Are Lift and χ² Good Measures of Correlation?
- "Buy walnuts ⇒ buy milk [1%, 80%]" is misleading if 85% of customers buy milk
- Support and confidence are not good at representing correlations
- So many interestingness measures? (Tan, Kumar, Srivastava @KDD'02)

             Milk    No Milk   Sum (row)
Coffee       m, c    ~m, c     c
No Coffee    m, ~c   ~m, ~c    ~c
Sum (col.)   m       ~m

DB   m, c   ~m, c   m, ~c   ~m, ~c    lift   all-conf   coh    χ²
A1   1000   100     100     10,000    9.26   0.91       0.83   9055
A2   100    1000    1000    100,000   8.44   0.09       0.05   670
A3   1000   100     10000   100,000   9.18   0.09       0.09   8172
A4   1000   1000    1000    1000      1      0.5        0.33   0
Which Measures Should Be Used?
- Lift and χ² are not good measures for correlations in large transactional DBs
- all-conf or coherence could be good measures (Omiecinski @TKDE'03)
  - Both all-conf and coherence have the downward closure property
  - Efficient algorithms can be derived for mining them (Lee et al. @ICDM'03 sub)
Constraint-based (Query-Directed) Mining
- Finding all the patterns in a database autonomously? Unrealistic!
  - The patterns could be too many but not focused!
- Data mining should be an interactive process
  - The user directs what is to be mined using a data mining query language (or a graphical user interface)
- Constraint-based mining
  - User flexibility: provides constraints on what is to be mined
  - System optimization: explores such constraints for efficient mining
Constraints in Data Mining
- Knowledge type constraint:
  - classification, association, etc.
- Data constraint (using SQL-like queries):
  - find product pairs sold together in stores in Chicago in Dec. '02
- Dimension/level constraint:
  - in relevance to region, price, brand, customer category
- Rule (or pattern) constraint:
  - small sales (price < $10) triggers big sales (sum > $200)
- Interestingness constraint:
  - strong rules: min_support ≥ 3%, min_confidence ≥ 60%
Constrained Mining vs. Constraint-Based Search
- Constrained mining vs. constraint-based search/reasoning
  - Both are aimed at reducing the search space
  - Finding all patterns satisfying constraints vs. finding some (or one) answer in constraint-based search in AI
  - Constraint-pushing vs. heuristic search
  - It is an interesting research problem how to integrate them
- Constrained mining vs. query processing in a DBMS
  - Database query processing requires finding all answers
  - Constrained pattern mining shares a similar philosophy with pushing selections deeply into query processing
Anti-Monotonicity in Constraint Pushing
- Anti-monotonicity:
  - When an itemset S violates the constraint, so does any of its supersets
  - sum(S.Price) ≤ v is anti-monotone
  - sum(S.Price) ≥ v is not anti-monotone
- Example: C: range(S.profit) ≤ 15 is anti-monotone
  - Itemset ab violates C
  - So does every superset of ab

TDB (min_sup = 2):
TID   Transaction
10    a, b, c, d, f
20    b, c, d, f, g, h
30    a, c, d, e, f
40    c, e, f, g

Item profits: a: 40, b: 0, c: -20, d: 10, e: -30, f: 30, g: 20, h: -10
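The example can be verified mechanically (illustrative Python over the slide's profit table; `range_profit` is a name chosen here):

```python
profit = {'a': 40, 'b': 0, 'c': -20, 'd': 10,
          'e': -30, 'f': 30, 'g': 20, 'h': -10}

def range_profit(s):
    vals = [profit[i] for i in s]
    return max(vals) - min(vals)

# C: range(S.profit) <= 15 is anti-monotone: ab violates it
# (range = 40 - 0 = 40), and so does every superset of ab,
# since adding items can only widen the range.
print(range_profit('ab'))     # 40: ab violates C
print(range_profit('abcd'))   # 60: still violating
```

This is exactly what lets Apriori-style mining prune ab and all of its supersets without counting them.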
Monotonicity for Constraint Pushing
- Monotonicity:
  - When an itemset S satisfies the constraint, so does any of its supersets
  - sum(S.Price) ≥ v is monotone
  - min(S.Price) ≤ v is monotone
- Example: C: range(S.profit) ≥ 15 is monotone
  - Itemset ab satisfies C
  - So does every superset of ab
(Same TDB and profit table as on the previous slide.)
Succinctness
- Succinctness:
  - Given A1, the set of items satisfying a succinctness constraint C, any set S satisfying C is based on A1, i.e., S contains a subset belonging to A1
  - Idea: whether an itemset S satisfies constraint C can be determined based on the selection of items, without looking at the transaction database
  - min(S.Price) ≤ v is succinct
  - sum(S.Price) ≥ v is not succinct
- Optimization: if C is succinct, C is pre-counting pushable
The Apriori Algorithm — Example
(Figure: the Apriori trace on database D: C1 → scan D → L1, C2 → scan D → L2, C3 → scan D → L3.)
Naïve Algorithm: Apriori + Constraint
(Figure: the same Apriori trace, with the constraint Sum{S.price} < 5 applied only to the final result.)
The Constrained Apriori Algorithm: Push an Anti-monotone Constraint Deep
(Figure: the Apriori trace where candidates violating Sum{S.price} < 5 are pruned as soon as they are generated.)
The Constrained Apriori Algorithm: Push a Succinct Constraint Deep
(Figure: the Apriori trace for the succinct constraint min{S.price} <= 1; some candidates are marked "not immediately to be used".)
Converting "Tough" Constraints
- Convert tough constraints into anti-monotone or monotone ones by properly ordering items
- Examine C: avg(S.profit) ≥ 25
  - Order items in value-descending order: <a, f, g, d, b, h, c, e>
  - If an itemset afb violates C, so does afbh, afb* (any extension of afb)
  - It becomes anti-monotone!
(Same TDB and profit table as before.)
Strongly Convertible Constraints
- avg(X) ≥ 25 is convertible anti-monotone w.r.t. the item-value-descending order R: <a, f, g, d, b, h, c, e>
  - If an itemset af violates a constraint C, so does every itemset with af as a prefix, such as afd
- avg(X) ≥ 25 is convertible monotone w.r.t. the item-value-ascending order R⁻¹: <e, c, h, b, d, g, f, a>
  - If an itemset d satisfies a constraint C, so do itemsets df and dfa, which have d as a prefix
- Thus, avg(X) ≥ 25 is strongly convertible
(Profit table as before.)
Can Apriori Handle Convertible Constraints?
- A convertible constraint that is neither monotone, anti-monotone, nor succinct cannot be pushed deep into an Apriori mining algorithm
  - Within the level-wise framework, no direct pruning based on the constraint can be made
  - Itemset df violates constraint C: avg(X) ≥ 25
  - Since adf satisfies C, Apriori needs df to assemble adf, so df cannot be pruned
- But it can be pushed into the frequent-pattern growth framework!
(Item values as in the profit table above.)
Mining With Convertible Constraints
- C: avg(X) ≥ 25, min_sup = 2
- List items in every transaction in value-descending order R: <a, f, g, d, b, h, c, e>
  - C is convertible anti-monotone w.r.t. R
- Scan the TDB once
  - Remove infrequent items
    - Item h is dropped
  - Itemsets a and f are good, …
- Projection-based mining
  - Impose an appropriate order on item projection
  - Many tough constraints can be converted into (anti-)monotone ones

TDB (min_sup = 2), items listed in order R:
TID   Transaction
10    a, f, d, b, c
20    f, g, d, b, c
30    a, f, d, c, e
40    f, g, h, c, e

Item values: a: 40, f: 30, g: 20, d: 10, b: 0, h: -10, c: -20, e: -30
Handling Multiple Constraints

- Different constraints may require different, or even conflicting, item orderings
- If there exists an order R such that both C1 and C2 are convertible w.r.t. R, then there is no conflict between the two convertible constraints
- If there is a conflict between the item orders
  - Try to satisfy one constraint first
  - Then use that order to mine frequent itemsets for the other constraint in the corresponding projected database

84
What Constraints Are Convertible?

Constraint                                       | Convertible anti-monotone | Convertible monotone | Strongly convertible
avg(S) ≤ v, avg(S) ≥ v                           | Yes | Yes | Yes
median(S) ≤ v, median(S) ≥ v                     | Yes | Yes | Yes
sum(S) ≤ v (items could be of any value, v ≥ 0)  | Yes | No  | No
sum(S) ≥ v (items could be of any value, v ≥ 0)  | No  | Yes | No
sum(S) ≥ v (items could be of any value, v ≤ 0)  | Yes | No  | No
……

85
Constraint-Based Mining—A General Picture

Constraint                  | Anti-monotone | Monotone    | Succinct
v ∈ S                       | no            | yes         | yes
S ⊆ V                       | yes           | no          | yes
min(S) ≥ v                  | yes           | no          | yes
max(S) ≥ v                  | no            | yes         | yes
count(S) ≤ v                | yes           | no          | weakly
count(S) ≥ v                | no            | yes         | weakly
sum(S) ≤ v (∀a ∈ S, a ≥ 0)  | yes           | no          | no
sum(S) ≥ v (∀a ∈ S, a ≥ 0)  | no            | yes         | no
range(S) ≤ v                | yes           | no          | no
range(S) ≥ v                | no            | yes         | no
avg(S) θ v, θ ∈ {=, ≤, ≥}   | convertible   | convertible | no
support(S) ≥ ξ              | yes           | no          | no
support(S) ≤ ξ              | no            | yes         | no

86
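Any row of this table can be verified by brute force. For example, the claim that sum(S) ≤ v is anti-monotone when all item values are nonnegative (the values below are hypothetical, chosen only for the check):

```python
from itertools import combinations

values = {'a': 3, 'b': 1, 'c': 4, 'd': 2}   # hypothetical nonnegative values
v = 5

def satisfies(S):
    """Constraint: sum(S) <= v."""
    return sum(values[i] for i in S) <= v

# Anti-monotone: whenever S violates the constraint, every superset of S
# violates it too (adding nonnegative values can never lower the sum).
items = list(values)
for r in range(1, len(items) + 1):
    for S in combinations(items, r):
        if not satisfies(S):
            for rr in range(r + 1, len(items) + 1):
                for T in combinations(items, rr):
                    if set(S) <= set(T):
                        assert not satisfies(T)
```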
A Classification of Constraints

- Anti-monotone
- Monotone
- Succinct
- Convertible anti-monotone
- Convertible monotone
- Strongly convertible
- Inconvertible

87
Chapter 5: Mining Frequent Patterns, Associations, and Correlations

- Basic concepts and a road map
- Efficient and scalable frequent itemset mining methods
- Mining various kinds of association rules
- From association mining to correlation analysis
- Constraint-based association mining
- Summary

88
Frequent-Pattern Mining: Summary

- Frequent pattern mining: an important task in data mining
- Scalable frequent pattern mining methods
  - Apriori (candidate generation & test)
  - Projection-based (FP-growth, CLOSET+, …)
  - Vertical-format approach (CHARM, …)
- Mining a variety of rules and interesting patterns
- Constraint-based mining
- Mining sequential and structured patterns
- Extensions and applications

89
Frequent-Pattern Mining: Research Problems

- Mining fault-tolerant frequent, sequential, and structured patterns
  - Patterns that allow limited faults (insertion, deletion, mutation)
- Mining truly interesting patterns
  - Surprising, novel, concise, …
- Application exploration
  - E.g., DNA sequence analysis and bio-pattern classification
  - “Invisible” data mining

90