CEDI'2005 Taller de Minería de Datos (Data Mining Workshop)
Association Rules: Algorithms, variations, extensions, and applications

Fernando Berzal
fberzal@decsai.ugr.es
Intelligent Databases and Information Systems research group
Department of Computer Science and Artificial Intelligence
E.T.S. Ingeniería Informática – Universidad de Granada (Spain)

Outline: Motivation, Definition, Discovery, Variations, Visualization, Extensions, Applications, ART, ATBAR

Motivation
Association mining searches for interesting relationships among items in a given data set.
EXAMPLES
  • Diapers and six-packs are bought together, especially on Thursday evenings (a myth?)
  • A sequence such as buying a digital camera first and then a memory card is a frequent (sequential) pattern
  • …

Motivation
MARKET BASKET ANALYSIS
The earliest form of association rule mining.
Applications: catalog design, store layout, cross-marketing…

Definition
Item
  • In transactional databases: any of the items included in a transaction.
  • In relational databases: an (attribute, value) pair.
k-itemset: a set of k items.
Itemset support: support(I) = P(I)

Definition
Association rule X ⇒ Y
  • Support: support(X ⇒ Y) = support(X ∪ Y) = P(X ∪ Y)
  • Confidence: confidence(X ⇒ Y) = support(X ∪ Y) / support(X) = P(Y|X)
NOTE: Both support and confidence are relative.
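The two measures above can be computed directly from a list of transactions. The following is a minimal Python sketch; the toy data and the function names `support` and `confidence` are illustrative assumptions, not part of the talk.

```python
from typing import FrozenSet, Iterable, List

# Hypothetical toy database (not from the talk).
TRANSACTIONS: List[FrozenSet[str]] = [
    frozenset({"diapers", "beer", "milk"}),
    frozenset({"diapers", "beer"}),
    frozenset({"diapers", "milk"}),
    frozenset({"beer", "chips"}),
    frozenset({"milk", "bread"}),
]


def support(itemset: Iterable[str], transactions: List[FrozenSet[str]]) -> float:
    """Relative support: fraction of transactions containing every item."""
    items = frozenset(itemset)
    return sum(1 for t in transactions if items <= t) / len(transactions)


def confidence(antecedent: Iterable[str], consequent: Iterable[str],
               transactions: List[FrozenSet[str]]) -> float:
    """confidence(X => Y) = support(X u Y) / support(X), an estimate of P(Y|X).

    Raises ZeroDivisionError when the antecedent never occurs.
    """
    x = frozenset(antecedent)
    y = frozenset(consequent)
    return support(x | y, transactions) / support(x, transactions)
```

For instance, `confidence({"diapers"}, {"beer"}, TRANSACTIONS)` divides the support of {diapers, beer} by the support of {diapers}.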

Discovery
Association rule mining
  1. Find all frequent itemsets.
  2. Generate strong association rules from the frequent itemsets.
Strong association rules are those that satisfy both a minimum support threshold and a minimum confidence threshold.
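Step 2 can be sketched as follows, assuming step 1 has already produced a dictionary that maps every frequent itemset (and therefore every one of its subsets) to its relative support. The name `generate_rules` and the tuple layout are hypothetical choices made for this illustration.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Tuple

Itemset = FrozenSet[str]
Rule = Tuple[Itemset, Itemset, float, float]  # (antecedent, consequent, support, confidence)


def generate_rules(supports: Dict[Itemset, float], min_confidence: float) -> List[Rule]:
    """Split each frequent itemset into antecedent => consequent and keep
    the rules whose confidence reaches min_confidence."""
    rules: List[Rule] = []
    for itemset, supp in supports.items():
        if len(itemset) < 2:
            continue  # a rule needs a non-empty antecedent and consequent
        for size in range(1, len(itemset)):
            for antecedent in combinations(sorted(itemset), size):
                x = frozenset(antecedent)
                conf = supp / supports[x]  # x is frequent because itemset is
                if conf >= min_confidence:
                    rules.append((x, itemset - x, supp, conf))
    return rules
```

Because the itemsets are already frequent, the minimum support threshold is met by construction and only confidence needs to be checked here.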

Discovery
Apriori
Agrawal & Srikant: "Fast Algorithms for Mining Association Rules", VLDB'94
Observation: All non-empty subsets of a frequent itemset must also be frequent.
Algorithm: Frequent k-itemsets are used to explore potentially frequent (k+1)-itemsets (i.e. candidates).
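The level-wise idea can be sketched in a few lines of Python. This is a simplified illustration, not the optimized algorithm from the VLDB'94 paper: candidates are built by joining pairs of frequent k-itemsets and pruned with the observation above, and each candidate is counted with a naive scan. The toy database `EXAMPLE_DB` is a hypothetical example.

```python
from itertools import combinations
from typing import Dict, FrozenSet, Iterable, List

Itemset = FrozenSet[str]

# Hypothetical toy database, used only to exercise the sketch.
EXAMPLE_DB: List[Itemset] = [
    frozenset("ABC"),
    frozenset("AB"),
    frozenset("AC"),
    frozenset("BC"),
    frozenset("ABC"),
]


def apriori(transactions: List[Itemset], min_support: float) -> Dict[Itemset, float]:
    """Return every frequent itemset together with its relative support."""
    n = len(transactions)

    def count(candidates: Iterable[Itemset]) -> Dict[Itemset, float]:
        # Naive counting: each candidate is checked against every transaction.
        # A real implementation would count all candidates in a single scan.
        result = {}
        for c in candidates:
            supp = sum(1 for t in transactions if c <= t) / n
            if supp >= min_support:
                result[c] = supp
        return result

    current = count({frozenset([i]) for t in transactions for i in t})
    frequent = dict(current)
    k = 1
    while current:
        candidates = set()
        for a, b in combinations(current, 2):
            union = a | b
            # Join step: two frequent k-itemsets that differ in one item.
            # Prune step: every k-subset of the candidate must be frequent.
            if len(union) == k + 1 and all(
                frozenset(sub) in current for sub in combinations(union, k)
            ):
                candidates.add(union)
        current = count(candidates)
        frequent.update(current)
        k += 1
    return frequent
```

Calling `apriori(EXAMPLE_DB, 0.5)` keeps the three single items and the three pairs, while {A, B, C} is generated as a candidate and then discarded because it appears in only two of the five transactions.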

Discovery
Apriori improvements (I)
  • Reducing the number of candidates
    Park, Chen & Yu: "An Effective Hash-Based Algorithm for Mining Association Rules", SIGMOD'95
  • Sampling
    Toivonen: "Sampling Large Databases for Association Rules", VLDB'96
    Park, Yu & Chen: "Mining Association Rules with Adjustable Accuracy", CIKM'97
  • Partitioning
    Savasere, Omiecinski & Navathe: "An Efficient Algorithm for Mining Association Rules in Large Databases", VLDB'95

Discovery
Apriori improvements (II)
  • Transaction reduction
    Agrawal & Srikant: "Fast Algorithms for Mining Association Rules", VLDB'94 (AprioriTID)
  • Dynamic itemset counting
    Brin, Motwani, Ullman & Tsur: "Dynamic Itemset Counting and Implication Rules for Market Basket Data", SIGMOD'97 (DIC)
    Hidber: "Online Association Rule Mining", SIGMOD'99 (CARMA)

Discovery
Apriori-like algorithm: TBAR (Tree-based association rule mining)
Berzal, Cubero, Sánchez & Serrano: "TBAR: An efficient method for association rule mining in relational databases", Data & Knowledge Engineering, 2001

Discovery: TBAR
[Figure: TBAR itemset tree]
  L1: A #7, B #9, C #7, D #8 (7 instances with A)
  L2: under A: B #6, D #5; under B: C #6, D #7 (6 instances with AB, 5 instances with AD, 6 instances with BC)
  L3: under AB: D #5 (5 instances with ABD)

Discovery
An alternative to Apriori: compress the database representing frequent items into a frequent-pattern tree (FP-tree)…
Han, Pei & Yin: "Mining Frequent Patterns without Candidate Generation", SIGMOD'2000

Discovery
A challenge
When an itemset is frequent, all its subsets are also frequent, so a single long frequent itemset implies an exponential number of frequent subsets.
  • Closed itemset C: there exists no proper super-itemset S such that support(S) = support(C).
  • Maximal (frequent) itemset M: M is frequent and there exists no super-itemset Y such that M ⊂ Y and Y is frequent.
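Both definitions can be applied as a filter over the set of frequent itemsets. The sketch below assumes a dictionary of absolute support counts; the counts in `COUNTS` are loosely modelled on the TBAR example tree earlier in the talk and should be read as illustrative. Checking only frequent supersets is enough, because a superset with the same support as a frequent itemset is itself frequent.

```python
from typing import Dict, FrozenSet, Set

Itemset = FrozenSet[str]

# Illustrative absolute support counts for the frequent itemsets.
COUNTS: Dict[Itemset, int] = {
    frozenset("A"): 7, frozenset("B"): 9, frozenset("C"): 7, frozenset("D"): 8,
    frozenset("AB"): 6, frozenset("AD"): 5, frozenset("BC"): 6, frozenset("BD"): 7,
    frozenset("ABD"): 5,
}


def closed_itemsets(counts: Dict[Itemset, int]) -> Set[Itemset]:
    """Frequent itemsets with no proper superset of equal support."""
    return {
        s for s, c in counts.items()
        if not any(s < t and counts[t] == c for t in counts)
    }


def maximal_itemsets(counts: Dict[Itemset, int]) -> Set[Itemset]:
    """Frequent itemsets with no frequent proper superset."""
    return {s for s in counts if not any(s < t for t in counts)}
```

With these counts, AD is not closed because ABD has the same support, and only BC and ABD are maximal.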

Variations
Based on the kinds of patterns to be mined:
  • Frequent itemset mining (transactional and relational data)
  • Sequential pattern mining (sequence data sets, e.g. bioinformatics)
  • Structured pattern mining (structured data, e.g. graphs)

Variations
Based on the types of values handled:
  • Boolean association rules
  • Quantitative association rules
  • Fuzzy association rules
    Delgado, Marín, Sánchez & Vila: "Fuzzy association rules: General model and applications", IEEE Transactions on Fuzzy Systems, 2003

Variations
More options:
  • Generalized association rules (a.k.a. multilevel association rules)
  • Constraint-based association rule mining
  • Incremental algorithms
  • Top-k algorithms
  • …
FIMI: ICDM Workshop on Frequent Itemset Mining Implementations

Visualization
Integrated into data mining tools to help users understand data mining results:
  • Table-based approach, e.g. SAS Enterprise Miner, DBMiner…
  • 2D matrix-based approach, e.g. SGI MineSet, DBMiner…
  • Graph-based techniques, e.g. DBMiner ball graphs

Visualization: Tables [figure]

Visualization: Visual aids [figure]

Visualization: 2D Matrix [figure]

Visualization: Graphs [figure]

Visualization: VisAR
Based on parallel coordinates (Techapichetvanich & Datta, ADMA'2005)

Extensions
Confidence is not the best possible interestingness measure for rules.
e.g. A very frequent item will always appear in rule consequents, regardless of its true relationship with the rule antecedent:
  X went to war ⇒ X did not serve in Vietnam (from the US Census)

Extensions
Desirable properties for interestingness measures (Piatetsky-Shapiro, 1991)
  P1: ACC(A⇒C) = 0 when supp(A⇒C) = supp(A)supp(C)
  P2: ACC(A⇒C) monotonically increases with supp(A⇒C)
  P3: ACC(A⇒C) monotonically decreases with supp(A) (or supp(C))

Extensions
Certainty factors…
  • … satisfy Piatetsky-Shapiro's properties
  • … are widely used in expert systems
  • … are not symmetric (unlike interest/lift)
  • … can substitute conviction when CF > 0
Berzal, Blanco, Sánchez & Vila: "Measuring the accuracy and interest of association rules: A new framework", Intelligent Data Analysis, 2002
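The measures mentioned in this part of the talk can be compared on concrete numbers. The sketch below assumes the usual certainty-factor definition in terms of confidence and consequent support; the slide does not spell the formula out, so treat it as an assumption, and the function names as hypothetical.

```python
import math


def lift(supp_a: float, supp_c: float, supp_ac: float) -> float:
    """Interest/lift: symmetric in A and C, equal to 1 under independence."""
    return supp_ac / (supp_a * supp_c)


def conviction(supp_a: float, supp_c: float, supp_ac: float) -> float:
    """(1 - supp(C)) / (1 - conf(A => C)); infinite for rules that always hold."""
    conf = supp_ac / supp_a
    if conf >= 1.0:
        return math.inf
    return (1.0 - supp_c) / (1.0 - conf)


def certainty_factor(supp_a: float, supp_c: float, supp_ac: float) -> float:
    """Assumed definition: how much knowing A changes our belief in C,
    scaled to [-1, 1]; 0 when conf(A => C) equals supp(C)."""
    conf = supp_ac / supp_a
    if conf > supp_c:
        return (conf - supp_c) / (1.0 - supp_c)
    if conf < supp_c:
        return (conf - supp_c) / supp_c
    return 0.0
```

With supp(A) = 0.5, supp(C) = 0.9 and supp(A⇒C) = 0.45, confidence is 0.9 even though A and C are independent, while the certainty factor is 0 (property P1). When the certainty factor is positive, it equals 1 − 1/conviction under these definitions, which is one way to read the claim that it can substitute conviction.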

Extensions
References:
  • Hilderman & Hamilton: "Evaluation of interestingness measures for ranking discovered knowledge", PAKDD, 2001
  • Tan, Kumar & Srivastava: "Selecting the right objective measure for association analysis", Information Systems, vol. 29, pp. 293-313, 2004
  • Berzal, Cubero, Marín, Sánchez, Serrano & Vila: "Association rule evaluation for classification purposes", TAMIDA'2005

Applications
Two sample applications where association rules have been successful:
  • Classification (ART)
    Berzal, Cubero, Sánchez & Serrano: "ART: A hybrid classification model", Machine Learning Journal, 2004
  • Anomaly detection (ATBAR)
    Balderas, Berzal, Cubero, Eisman & Marín: "Discovering Hidden Association Rules", KDD'2005, Chicago, Illinois, USA

Classification
Classification models based on association rules
  • Partial classification models, e.g. Bayardo
  • "Associative" classification models, e.g. CBA (Liu et al.)
  • Bayesian classifiers, e.g. LB (Meretakis et al.)
  • Emerging patterns, e.g. CAEP (Dong et al.)
  • Rule trees, e.g. Wang et al.
  • Rules with exceptions, e.g. Liu et al.

Classification
GOAL: Simple, intelligible, and robust classification models obtained in an efficient and scalable way
MEANS: Decision Tree Induction + Association Rule Mining = ART [Association Rule Trees]

ART Classification Model
IDEA: Make use of efficient association rule mining algorithms to build a decision-tree-shaped classification model.
  ART = Association Rule Tree
KEY: Association rules + "else" branches
  Hybrid between decision trees and decision lists

ART Classification Model
[Figure: SPLICE]

ART classification model: Example
ART vs. TDIDT

ART classification model: Final comments
Classification models
  • Acceptable accuracy
  • Reduced complexity
  • Attribute interactions
  • Robustness (noise & primary keys)
Classifier building method
  • Efficient algorithm
  • Good scalability properties
  • Automatic parameter selection

Anomaly detection
It is often more interesting to find surprising non-frequent events than frequent ones.
EXAMPLES
  • Abnormal network activity patterns in intrusion detection systems.
  • Exceptions to "common" rules in Medicine (useful for diagnosis, drug evaluation, detection of conflicting therapies…)
  • …

Anomaly detection
Anomalous association rule: a confident rule representing homogeneous deviations from common behavior.

Anomaly detection
Anomalous association rule
  • X usually implies Y (dominant rule): X ⇒ Y is frequent and confident
  • When X does not imply Y, then it usually implies A (the anomaly): X ¬Y ⇒ A is confident
  • X Y ⇒ ¬A is confident

Anomaly detection
  X  Y   A1  Z1  …
  X  Y   A1  Z2  …
  X  Y   A2  Z3  …
  X  Y   A2  Z1  …
  X  Y   A3  Z2  …
  X  Y   A3  Z3  …
  X  Y   A   Z   …
  X  Y3  A   Z3  …
  X  Y3  A   Z   …
  X  Y4  A   Z   …
X ⇒ Y is the dominant rule
X ⇒ A when ¬Y is the anomalous rule
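The three conditions of an anomalous association rule can be checked by brute force on a small table shaped like the one above. This is only a sketch of the definition, not the ATBAR algorithm: the rows in `ROWS`, the function name `is_anomalous` and the thresholds are hypothetical, and ¬Y is read as "the row does not contain all of Y".

```python
from typing import FrozenSet, Iterable, List

# Hypothetical rows: X usually comes with Y; when it does not, A shows up.
ROWS: List[FrozenSet[str]] = [
    frozenset({"X", "Y", "A1", "Z1"}), frozenset({"X", "Y", "A1", "Z2"}),
    frozenset({"X", "Y", "A2", "Z3"}), frozenset({"X", "Y", "A2", "Z1"}),
    frozenset({"X", "Y", "A3", "Z2"}), frozenset({"X", "Y", "A3", "Z3"}),
    frozenset({"X", "Y1", "A", "Z1"}), frozenset({"X", "Y2", "A", "Z2"}),
    frozenset({"X", "Y3", "A", "Z3"}), frozenset({"X", "Y4", "A", "Z1"}),
]


def is_anomalous(x: Iterable[str], y: Iterable[str], a: Iterable[str],
                 rows: List[FrozenSet[str]],
                 min_support: float, min_confidence: float) -> bool:
    """Check whether 'X => A when not Y' qualifies as an anomalous rule."""
    x, y, a = frozenset(x), frozenset(y), frozenset(a)
    with_x = [r for r in rows if x <= r]
    with_xy = [r for r in with_x if y <= r]
    with_x_not_y = [r for r in with_x if not y <= r]
    if not with_xy or not with_x_not_y:
        return False
    # 1. Dominant rule X => Y is frequent and confident.
    dominant = (len(with_xy) / len(rows) >= min_support
                and len(with_xy) / len(with_x) >= min_confidence)
    # 2. X not-Y => A is confident.
    anomaly_conf = sum(a <= r for r in with_x_not_y) / len(with_x_not_y)
    # 3. X Y => not-A is confident.
    not_a_conf = sum(not a <= r for r in with_xy) / len(with_xy)
    return (dominant and anomaly_conf >= min_confidence
            and not_a_conf >= min_confidence)
```

On these rows, `is_anomalous({"X"}, {"Y"}, {"A"}, ROWS, 0.5, 0.55)` holds, whereas replacing A by Z1 fails the second condition because Z1 appears in only half of the rows where Y is absent.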

Anomaly detection
Suzuki et al.'s "Exception Rules"
  X ⇒ Y is an association rule
  X I ⇒ ¬Y is the exception rule
  I is the "interacting" itemset
  X ⇒ I is the reference rule
✗ Too many exceptions
✗ The "cause" needs to be present

Anomaly detection: ATBAR
Anomalous association rules
[Figure: ATBAR itemset tree]
  First scan: A #7, B #9, C #7, D #8
  Second scan: node A #7 with its extended node A* (AB #6, AC #4, AD #5, AE #3, AF #3); children B #6 and D #5 are kept in the tree, the rest are non-frequent

Anomaly detection: ATBAR
Anomalous association rules
[Figure: ATBAR itemset tree]
  First scan: A #7, B #9, C #7, D #8
  Second scan: every frequent item gets an extended node (A*, B*, C*, D*); tree nodes A #7 (B #6, D #5), B #9 (C #6, D #7), C #7, D #8; third level D #5

Anomaly detection: ATBAR
Anomalous association rules
Rule generation is immediate from the frequent and extended itemsets obtained by ATBAR.

Anomaly detection: Results
Experiments on health-related datasets from the UCI Machine Learning Repository
  • Relatively small set of anomalous rules (typically, >90% reduction with respect to standard association rules)
  • Reasonable overhead needed to obtain anomalous association rules (about 20% in ATBAR w.r.t. TBAR)

Anomaly detection: Results
An example from the Census dataset:
  if WORKCLASS: Local-gov
  then CAPGAIN: [99999.0, 99999.0] (7 out of 7)    ("Anomaly")
  when not CAPGAIN: [0.0, 20051.0]                 (usual consequent)

Anomaly detection: Results
  • Anomalous association rules (novel characterization of potentially interesting knowledge)
  • An efficient algorithm for discovering anomalous association rules: ATBAR
  • Some heuristics for filtering the discovered anomalous association rules

Questions, comments, and suggestions…
Fernando Berzal
fberzal@decsai.ugr.es