Integrating Classification and Association Rule Mining
Bing Liu, Wynne Hsu, Yiming Ma
Presented by: Salil Kulkarni, Muhammad Talha
Introduction
- Classification rule mining:
  – Discover a small set of rules in the database that form an accurate classifier
  – There is a pre-determined target (the class attribute)
- Association rule mining:
  – Find all rules in the database that satisfy some minimum support and minimum confidence
  – No pre-determined target
Associative Classification Framework
- Aims to integrate the above two data mining techniques efficiently while preserving the accuracy of the classifier
- How?
  – The algorithm focuses only on those association rules whose right-hand side is restricted to the classification class attribute, also referred to as CARs (class association rules)
  – CARs are generated based on the Apriori algorithm
  – Thus data mining in this framework involves three steps:
    • Discretizing continuous attributes
    • Generating all the CARs
    • Building a classifier based on the CARs
Contributions of the Framework
- New way to build accurate classifiers
- Applies association rule mining to classification tasks
- Helps solve the understandability problem
- Helps discover rules not found by existing classification systems
- No need to load the whole database into memory
Definitions
- Dataset D: a normal relational table with N cases described by l distinct attributes
  – Attributes can be continuous or categorical
- Item: an (attribute, integer-value) pair
- Datacase d: a set of items with a class label
- Let I be the set of all items in D
- Let Y be the set of all class labels in D
Definitions [contd.]
- A datacase d contains X (a set of items), where X is a subset of I, if X is also a subset of d
- CAR: a class association rule is an implication of the form X -> y, where X is a subset of I and y belongs to Y
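As a minimal sketch, the definitions above can be modeled directly in Python; the attribute names below are made up for illustration only:

```python
# Sketch of the data model (attribute names are hypothetical).
# An item is an (attribute, integer-value) pair; a datacase is a set of
# items plus a class label.
item1 = ("age_group", 2)    # a continuous attribute 'age' discretized to interval 2
item2 = ("owns_home", 1)    # a categorical attribute mapped to an integer code

datacase = ({item1, item2}, "good_credit")   # (set of items, class label)

# A CAR X -> y applies to a datacase when the condset X is a subset of its items.
car_condset, car_class = frozenset({item1}), "good_credit"
covers = car_condset.issubset(datacase[0])   # the datacase "contains" X
print(covers)  # True
```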
CBA-RG Algorithm (Basic Concepts)
- Finds all ruleitems that satisfy the specified minsup condition
- A ruleitem is represented as <condset, y>, where condset is a set of items and y is a class label
- A k-ruleitem is a ruleitem that has k items in its condset
- The support count of the condset (condsupCount) is the number of cases in the dataset that contain the condset
- The support count of the ruleitem (rulesupCount) is the number of cases that contain the condset and are labeled with class y
CBA-RG Algorithm (Basic Concepts) [contd.]
- Support(ruleitem) = (rulesupCount / |D|) * 100%
- Confidence(ruleitem) = (rulesupCount / condsupCount) * 100%
- Of all ruleitems that have the same condset, the one with the highest confidence is chosen as the possible rule (PR); in case of a tie, a ruleitem is selected at random
- Each frequent ruleitem in the set of frequent ruleitems is of the form <(condset, condsupCount), (y, rulesupCount)>
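The support and confidence formulas above can be sketched over a toy dataset (the items and values here are illustrative, not from the paper):

```python
# Computing support and confidence of a ruleitem <condset, y> over a toy
# dataset D: a list of (item-set, class) pairs. Data is made up for illustration.
D = [
    (frozenset({("a", 1), ("b", 1)}), "yes"),
    (frozenset({("a", 1)}),           "yes"),
    (frozenset({("a", 1), ("b", 1)}), "no"),
    (frozenset({("b", 1)}),           "no"),
]

def condsup_count(condset, D):
    # number of cases whose item set contains the condset
    return sum(1 for items, _ in D if condset <= items)

def rulesup_count(condset, y, D):
    # number of cases that contain the condset AND are labeled with class y
    return sum(1 for items, cls in D if condset <= items and cls == y)

condset, y = frozenset({("a", 1)}), "yes"
support = rulesup_count(condset, y, D) / len(D) * 100            # 2/4 -> 50.0
confidence = rulesup_count(condset, y, D) / condsup_count(condset, D) * 100  # 2/3
print(support, confidence)
```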
CBA-RG Algorithm
- The algorithm first finds all frequent 1-ruleitems, denoted by F1
- From F1, the function genRules(F1) generates CAR1, which is then subjected to (optional) pruning
- While Fk-1 is non-empty, the algorithm generates candidates Ck using candidateGen(Fk-1), i.e., from the frequent ruleitems Fk-1 found in the (k-1)th pass over the data
- It then scans the database and updates the support counts (condsupCount) of the candidate ruleitems; it also increments rulesupCount if the class of the data case matches the class of the ruleitem
CBA-RG Algorithm
1  F1 = {large 1-ruleitems};
2  CAR1 = genRules(F1);
3  prCAR1 = pruneRules(CAR1);
4  for (k = 2; Fk-1 != ∅; k++) do
5      Ck = candidateGen(Fk-1);
6      for each data case d ∈ D do
7          Cd = ruleSubset(Ck, d);
8          for each candidate c ∈ Cd do
9              c.condsupCount++;
10             if d.class = c.class then c.rulesupCount++
11         end
12     end
13     Fk = {c ∈ Ck | c.rulesupCount ≥ minsup};
14     CARk = genRules(Fk);
15     prCARk = pruneRules(CARk);
16 end
17 CARs = ∪k CARk;
18 prCARs = ∪k prCARk;
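A hedged Python sketch of the loop above, with a simplified candidateGen (plain pairwise join, without Apriori's subset-based pruning) and no pruneRules step; the dataset and thresholds are toy values:

```python
from itertools import combinations

# Simplified CBA-RG sketch: Apriori-style search over ruleitems <condset, y>.
# prCARs (pessimistic-error pruning) is omitted. Data is illustrative.
D = [
    (frozenset({("a", 1), ("b", 1)}), "yes"),
    (frozenset({("a", 1), ("c", 1)}), "yes"),
    (frozenset({("b", 1), ("c", 1)}), "no"),
    (frozenset({("a", 1), ("b", 1)}), "yes"),
]
MINSUP, MINCONF = 0.25, 0.5   # fractions here, not percentages

def count(condset, y):
    cond = sum(1 for items, _ in D if condset <= items)
    rule = sum(1 for items, c in D if condset <= items and c == y)
    return cond, rule

def gen_rules(F):
    # Per condset, keep the highest-confidence ruleitem meeting minconf
    # (ties broken by first encountered rather than at random, as a shortcut).
    best = {}
    for (condset, y), (condCnt, ruleCnt) in F.items():
        conf = ruleCnt / condCnt
        if conf >= MINCONF and (condset not in best or conf > best[condset][1]):
            best[condset] = (y, conf)
    return {(c, y) for c, (y, _) in best.items()}

items = {i for case, _ in D for i in case}
classes = {c for _, c in D}

# F1: frequent 1-ruleitems
F = {}
for i in items:
    for y in classes:
        condCnt, ruleCnt = count(frozenset({i}), y)
        if ruleCnt / len(D) >= MINSUP:
            F[(frozenset({i}), y)] = (condCnt, ruleCnt)

CARs = gen_rules(F)
k = 2
while F:
    # candidateGen: join condsets of frequent (k-1)-ruleitems with the same class
    Fk = {}
    for (c1, y1), (c2, y2) in combinations(list(F), 2):
        if y1 == y2 and len(c1 | c2) == k:
            condCnt, ruleCnt = count(c1 | c2, y1)
            if ruleCnt / len(D) >= MINSUP:
                Fk[(c1 | c2, y1)] = (condCnt, ruleCnt)
    F = Fk
    CARs |= gen_rules(F)
    k += 1

print(len(CARs))
```

On this toy dataset the search stops at k = 3 (no frequent 3-ruleitems exist) and yields six CARs.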
Building the Classifier
- Definition: total order on rules
  Given two rules ri and rj, ri ≻ rj (ri has higher precedence than rj) if:
  • ri has higher confidence than rj, or
  • their confidences are the same but the support of ri is greater than that of rj, or
  • their confidences and supports are the same, but ri was generated earlier than rj
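The total order can be sketched as a comparator (the rule representation as a dict with `conf`, `sup`, and `gen_order` fields is an assumption for illustration):

```python
from functools import cmp_to_key

# Precedence on rules: higher confidence first, then higher support,
# then earlier generation order. Rule fields are hypothetical.
def precedes(ri, rj):
    """Return negative if ri ≻ rj, i.e., ri sorts before rj."""
    if ri["conf"] != rj["conf"]:
        return -1 if ri["conf"] > rj["conf"] else 1
    if ri["sup"] != rj["sup"]:
        return -1 if ri["sup"] > rj["sup"] else 1
    return -1 if ri["gen_order"] < rj["gen_order"] else 1

rules = [
    {"id": "r1", "conf": 0.8, "sup": 0.1, "gen_order": 2},
    {"id": "r2", "conf": 0.9, "sup": 0.1, "gen_order": 3},
    {"id": "r3", "conf": 0.9, "sup": 0.2, "gen_order": 1},
]
ranked = sorted(rules, key=cmp_to_key(precedes))
print([r["id"] for r in ranked])  # ['r3', 'r2', 'r1']
```

Since generation order is unique, this comparator defines a strict total order; an equivalent one-liner is `sorted(rules, key=lambda r: (-r["conf"], -r["sup"], r["gen_order"]))`.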
Building the Classifier
- Basic idea:
  – Choose a set of high-precedence rules from the set of all generated rules to cover the dataset D
- Format of the classifier:
  – <r1, r2, …, rn, default_class>, where ri ∈ R and ra ≻ rb if b > a
Building the Classifier
- Stages in the naïve version of the classifier builder (M1):
  – Stage 1: sort the rules in R according to the precedence relation
    Purpose: ensure that rules with the highest precedence are chosen first for the classifier
  – Stage 2: for every rule r in R, in the sorted sequence:
    • Go through every case d in the dataset
    • If r covers d, i.e., d satisfies the conditions of r, then record the case's id (d.id)
Building the Classifier [contd.]
  – Stage 2 contd.:
    • If r correctly classifies at least one case, then r is marked as a potential rule for the final classifier; all datacases covered by r are then removed from the dataset
    • The majority class of the remaining training data is selected as the default class
    • The total number of errors made by the current classifier is computed
    • The stage halts when it runs out of rules or training cases
Building the Classifier [contd.]
- Stage 3:
  – All rules that fail to improve the accuracy of the classifier are discarded
  – The first rule at which the lowest total number of errors is recorded acts as a cut-off rule; the rules after it are deleted from the classifier
  – The set of undiscarded rules, together with the default class associated with the cut-off rule, forms the classifier
1  R = sort(R);
2  for each rule r ∈ R in sequence do
3      temp = ∅;
4      for each case d ∈ D do
5          if d satisfies the conditions of r then
6              store d.id in temp and mark r if it correctly classifies d;
7      if r is marked then
8          insert r at the end of C;
9          delete all the cases with the ids in temp from D;
10         select a default class for the current C;
11         compute the total number of errors of C;
12     end
13 end
14 Find the first rule p in C with the lowest total number of errors and drop all the rules after p in C;
15 Add the default class associated with p to the end of C, and return C (the classifier).
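The M1 pseudocode above can be sketched in Python; the rule and case representations (tuples of frozensets and labels) and the toy data are assumptions for illustration, not the authors' implementation:

```python
from collections import Counter

# Sketch of CBA-CB M1. R is the rule list already sorted by precedence;
# each rule is a (condset, class) pair; D is a list of (id, item-set, class).
def build_classifier_m1(R, D):
    C, snapshots = [], []        # snapshots: (len(C), defaultClass, totalErrors)
    remaining = list(D)
    cum_rule_errors = 0
    for condset, cls in R:
        if not remaining:
            break
        covered = [d for d in remaining if condset <= d[1]]
        if any(d[2] == cls for d in covered):        # rule is "marked"
            C.append((condset, cls))
            cum_rule_errors += sum(1 for d in covered if d[2] != cls)
            remaining = [d for d in remaining if not condset <= d[1]]
            counts = Counter(d[2] for d in remaining)
            default_class = counts.most_common(1)[0][0] if counts else cls
            default_errors = len(remaining) - counts.get(default_class, 0)
            snapshots.append((len(C), default_class,
                              cum_rule_errors + default_errors))
    if not snapshots:                                # no rule was selected
        return [], Counter(d[2] for d in D).most_common(1)[0][0]
    # Stage 3: cut off at the FIRST rule with the lowest total number of errors
    best = min(range(len(snapshots)), key=lambda i: snapshots[i][2])
    k, default_class, _ = snapshots[best]
    return C[:k], default_class

D = [(1, frozenset({("a", 1)}), "yes"),
     (2, frozenset({("a", 1), ("b", 1)}), "yes"),
     (3, frozenset({("b", 1)}), "no"),
     (4, frozenset({("c", 1)}), "no")]
R = [(frozenset({("a", 1)}), "yes"), (frozenset({("b", 1)}), "no")]  # pre-sorted

classifier, default = build_classifier_m1(R, D)
print(classifier, default)
```

Here the first rule alone already achieves zero total errors (the two remaining cases match the default class "no"), so the cut-off drops the second rule.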
Building the Classifier [contd.]
- Two main conditions satisfied by the algorithm:
  – Condition 1: each data case is covered by the rule with the highest precedence among all the rules that can cover the case
  – Condition 2: every rule in C correctly classifies at least one training case
Performance Concern
- M1 is a simple algorithm, but it makes multiple passes over the dataset
- For a large dataset resident on disk, M1 may be very inefficient
- Next, the authors propose a version of the algorithm that makes only slightly more than one pass over the dataset
CBA-CB M2
- An improved version of the M1 algorithm
- M1 makes one pass over the remaining data for each rule
- M2 finds the best rule in R to cover each case
  – Only slightly more than one pass
- M2 consists of three stages
Stage 1
- For each d ∈ D:
  – Find the two highest-precedence rules:
    • cRule: correctly classifies d
    • wRule: wrongly classifies d
  – U: {set of all cRules}
  – classCasesCovered[class] attribute of a rule: the number of cases it covers for each class
Stage 1 [contd.]
- Update cRule.classCasesCovered[d.class]++
- Add cRule to U: {set of all cRules}
- If cRule ≻ wRule:
  • Mark d as covered by cRule (condition 1)
  • Mark cRule to indicate that it classifies d correctly (condition 2)
  • Add cRule to Q: {set of cRules that precede their corresponding wRules}
- Else:
  • Store <d.id, d.class, cRule, wRule>
  • Add the entry to A: {set of such entries}
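The core of Stage 1, finding cRule and wRule for one case, can be sketched as follows (the rule and case representations are illustrative assumptions):

```python
# Sketch of M2 Stage 1 lookup: for one case, find the highest-precedence rule
# that classifies it correctly (cRule) and wrongly (wRule). Rules are
# (condset, class) pairs, already sorted by precedence (highest first).
def c_and_w_rule(case_items, case_class, ranked_rules):
    cRule = wRule = None
    for r in ranked_rules:
        condset, cls = r
        if condset <= case_items:            # r covers the case
            if cls == case_class and cRule is None:
                cRule = r
            elif cls != case_class and wRule is None:
                wRule = r
        if cRule is not None and wRule is not None:
            break                             # both found; stop early
    return cRule, wRule

ranked = [(frozenset({("a", 1)}), "yes"), (frozenset({("b", 1)}), "no")]
case = (frozenset({("a", 1), ("b", 1)}), "yes")
cRule, wRule = c_and_w_rule(case[0], case[1], ranked)
print(cRule, wRule)
```

In this example cRule precedes wRule, so the case would be marked as covered by cRule (condition 1) and cRule would go into Q; otherwise the `<d.id, d.class, cRule, wRule>` entry would go into A for Stage 2.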
Stage 2
- Handle the cases d that were not covered in Stage 1
- Second pass over the database
  – Still only slightly more than one pass overall
- Determine all rules that classify the remaining data cases wrongly and have higher precedence than the cRule of d
Stage 2 [contd.]
- For each <d.id, d.class, cRule, wRule> ∈ A:
  – If wRule is marked (it is the cRule of at least one data case, condition 2):
    • Mark d as covered by wRule (condition 1)
    • wRule.classCasesCovered[d.class]++
    • cRule.classCasesCovered[d.class]--
    • wRule is already in Q because it is the cRule of some case
  – Else find all rules in U that classify d wrongly and have higher precedence than its cRule (scan D):
    • For each such rule w:
      – Store <d.id, d.class, cRule> in w.replace, since w may replace cRule to cover d
      – w.classCasesCovered[d.class]++
      – Add w to Q
Stage 3
- Choose the final set of rules for the classifier C
- Step 1: choose the set of potential rules to form the classifier
  – Sort Q according to precedence (condition 1)
Stage 3: Step 1
- For each r ∈ Q:
  – Discard r if it no longer correctly classifies any case (condition 2)
  – For each entry <cRule, d.id, d.class> in r.replace:
    • If d.id is already covered by a previous (higher-precedence) rule, then r does not replace cRule for d
    • Else r replaces cRule as the rule covering d:
      – r.classCasesCovered[d.class]++
      – cRule.classCasesCovered[d.class]--
Stage 3: Step 1 [contd.]
- For each r ∈ Q (continued):
  – Compute ruleErrors: the number of errors made by the selected rules so far
  – Compute defaultClass: the majority class among the remaining data cases
    • Compute defaultErrors
  – totalErrors = ruleErrors + defaultErrors
  – Insert <r, defaultClass, totalErrors> at the end of C
Stage 3: Step 2
- Discard the rules after the rule p in C with the least totalErrors (they only introduce more errors)
- Add the defaultClass associated with p to the end of C
- Return the final classifier C (without the totalErrors fields)
Empirical Evaluation
- CBA and C4.5 (tree and rules) classifiers were compared
- 26 datasets from the UCI ML Repository were used
- minconf was set to 50%
- minsup has a strong impact on the accuracy of the classifier
  – If set too high, rules with high confidence (but low support) may be discarded, and the CARs may fail to cover all cases
- It was observed from experiments that with a minsup of 1-2%, the classifier built is more accurate than C4.5
Empirical Evaluation [contd.]
- In the reported experiments, minsup was set to 1%
- The number of rules kept in memory was limited to 80,000
- Continuous attributes were discretized using the entropy method
Results (averages over the 26 datasets)
- C4.5 rules error rate: 16.7 (w/o discretization), 17.1 (with discretization)
- CBA error rate (CARs + infrequent rules), w/o and with pruning: 15.6, 15.6
- CBA error rate (CARs only), w/o and with pruning: 15.7, 15.8
- Number of CARs, w/o and with pruning: 35,140 and 2,377
- CBA-RG run time (sec), with and w/o pruning: 6.35, 6.44
- CBA-CB run time (sec), M1 and M2: 0.39, 0.18
- Number of rules in classifier C (with pruning): 69
Observations
- CBA is superior to C4.5 rules on 16 out of the 26 datasets
- There is little difference between the results with and without rule pruning
- M2 is much more efficient than M1
Two Important Results
- In 16 of the 26 datasets, not all rules could be found
  – due to the 80,000-rule limit
  – the classifiers built are still quite accurate
  – when the limit reaches 60,000 on these 26 datasets, the accuracy of the resulting classifiers starts to stabilize
- CBA was also run on disk-resident datasets, with the number of cases scaled up by a factor of up to 32 (e.g., 160,000 cases)
  – Experimental results show that both CBA-RG and CBA-CB (M2) have a linear scale-up
Related Work
- Several researchers have tried to build classifiers with extensive search
  – None use association rule mining
- CBA-CB is related to the method of Michalski:
  – Finds the best rule for each class and removes the covered cases
  – Applied recursively until no cases are left
  – Uses heuristic search
  – Results were not encouraging
Michalski vs. CBA-CB
- Michalski:
  – Best rules are local, because covered cases are removed after each rule is found
  – Results are not as good; local rules overfit the data
- CBA-CB:
  – Best rules are global, because they are generated using all cases
  – Better results
Conclusion
- Presents a new framework for constructing an accurate classifier based on class association rules