CENG 464 Introduction to Data Mining: Supervised vs. Unsupervised Learning and Classification


Supervised vs. Unsupervised Learning
• Supervised learning (classification)
  – Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of each observation
  – New data is classified based on the training set
• Unsupervised learning (clustering)
  – The class labels of the training data are unknown
  – Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data

Classification: Definition
• Given a collection of records (the training set)
  – Each record contains a set of attributes; one of the attributes is the class
• Find a model for the class attribute as a function of the values of the other attributes
• Goal: previously unseen records should be assigned a class as accurately as possible
  – A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it

Classification: Definition (figure: example training data with categorical and continuous descriptive attributes and a class attribute)

Prediction Problems: Classification vs. Numeric Prediction
• Classification
  – predicts categorical class labels (discrete or nominal)
  – constructs a model based on the training set and the values (class labels) of a classifying attribute, and uses it to classify new data
• Numeric prediction
  – models continuous-valued functions, i.e., predicts unknown or missing values
• Typical applications
  – Credit/loan approval
  – Medical diagnosis: is a tumor cancerous or benign?
  – Fraud detection: is a transaction fraudulent?
  – Web page categorization: which category does a page belong to?

Classification—A Two-Step Process
• Model construction: describing a set of predetermined classes
  – Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
  – The set of tuples used for model construction is the training set
  – The model is represented as classification rules, decision trees, or mathematical formulae
• Model usage: classifying future or unknown objects
  – Estimate the accuracy of the model
    • The known label of each test sample is compared with the model's prediction
    • The accuracy rate is the percentage of test set samples that are correctly classified by the model
    • The test set is independent of the training set (otherwise the estimate is optimistic because of overfitting)
  – If the accuracy is acceptable, use the model to classify new data
• Note: if the test set is used to select among models, it is called a validation set
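
As an illustration of the two-step process, here is a minimal sketch using scikit-learn (not part of the original slides); the dataset, split ratio, and classifier choice are illustrative assumptions.

```python
# Sketch of the two-step process with scikit-learn (hypothetical setup).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)            # labeled records: attributes X, class y

# Step 1: model construction on the training set.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
model = DecisionTreeClassifier().fit(X_train, y_train)

# Step 2: model usage -- estimate accuracy on an independent test set.
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```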

Process (1): Model Construction (figure: classification algorithms are applied to the training data to produce a classifier, i.e., the model)
• Example of a learned rule: IF rank = 'professor' OR years > 6 THEN tenured = 'yes'

Process (2): Using the Model in Prediction (figure: the classifier is evaluated on testing data and then applied to unseen data, e.g., (Jeff, Professor, 4) → Tenured?)

Illustrating Classification Task
• Training and test sets are randomly sampled from the labeled data (supervised learning); accuracy is estimated on the test set
• Goal: find a mapping or function that can predict the class label of a given tuple X

Classification Techniques
• Decision tree based methods
• Bayes classification methods
• Rule-based methods
• Nearest-neighbor classifiers
• Artificial neural networks
• Support vector machines
• Memory-based reasoning

Example of a Decision Tree (figure: training data with categorical and continuous attributes and a class label, plus the fitted model)
• Root node and internal nodes: attribute test conditions (splitting attributes); leaf nodes: class labels
• Model: decision tree
  Refund?
    – Yes → NO
    – No → MarSt?
        – Single, Divorced → TaxInc?
            – < 80K → NO
            – > 80K → YES
        – Married → NO

Another Example of Decision Tree (figure: the same training data fitted by a different tree)
  MarSt?
    – Married → NO
    – Single, Divorced → Refund?
        – Yes → NO
        – No → TaxInc?
            – < 80K → NO
            – > 80K → YES
• There could be more than one tree that fits the same data!

Decision Tree Classification Task (figure: a tree-induction algorithm learns a decision tree from the training set, which is then applied to the test set)

Apply Model to Test Data (figure sequence: a test record is routed down the tree, one node per slide)
• Start from the root of the tree and apply the attribute test condition at each internal node
• Follow the branch matching the record's attribute value until a leaf is reached; the leaf's label is assigned to the record
• Example: for a test record with Refund = No and MarSt = Married, the path Refund → MarSt → leaf NO is followed, so Cheat is assigned to 'No'


Decision Tree Induction
• Many algorithms:
  – Hunt's Algorithm
  – ID3, C4.5
  – CART
  – SLIQ, SPRINT

Algorithm for Decision Tree Induction
• Basic algorithm (a greedy algorithm)
  – The tree is constructed in a top-down, recursive, divide-and-conquer manner
  – At the start, all the training examples are at the root
  – Attributes are categorical (continuous-valued attributes are discretized in advance)
  – Examples are partitioned recursively based on selected attributes
  – Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)
• Conditions for stopping the partitioning
  – All samples for a given node belong to the same class
  – There are no remaining attributes for further partitioning (majority voting is used to label the leaf)
  – There are no samples left
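
A minimal sketch of this greedy recursion (not from the slides) for categorical attributes, using entropy-based information gain as the selection measure; all function and variable names are illustrative.

```python
import math
from collections import Counter

def entropy(labels):
    """Expected information (in bits) needed to classify a tuple."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, attr):
    """Reduction in entropy obtained by splitting on a categorical attribute."""
    n = len(labels)
    split = {}
    for row, y in zip(rows, labels):
        split.setdefault(row[attr], []).append(y)
    remainder = sum(len(part) / n * entropy(part) for part in split.values())
    return entropy(labels) - remainder

def build_tree(rows, labels, attrs):
    # Stop: all samples in one class, or no attributes left -> majority vote.
    if len(set(labels)) == 1 or not attrs:
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: info_gain(rows, labels, a))
    node = {"attr": best, "branches": {}}
    for value in set(row[best] for row in rows):
        idx = [i for i, row in enumerate(rows) if row[best] == value]
        node["branches"][value] = build_tree(
            [rows[i] for i in idx], [labels[i] for i in idx],
            [a for a in attrs if a != best])
    return node
```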

Tree Induction
• Greedy strategy
  – Split the records based on an attribute test that optimizes a certain criterion
• Issues
  – Determine how to split the records
    • How to specify the attribute test condition?
    • How to determine the best split?
  – Determine when to stop splitting

How to Specify Test Condition?
• Depends on the attribute type
  – Nominal
  – Ordinal
  – Continuous
• Depends on the number of ways to split
  – 2-way split
  – Multi-way split

Splitting Based on Nominal Attributes
• Multi-way split: use as many partitions as there are distinct values, e.g., CarType → {Family}, {Sports}, {Luxury}
• Binary split: divide the values into two subsets and find the optimal partitioning, e.g., CarType → {Sports, Luxury} vs. {Family}, or {Family, Luxury} vs. {Sports}

Splitting Based on Ordinal Attributes
• Multi-way split: use as many partitions as there are distinct values, e.g., Size → {Small}, {Medium}, {Large}
• Binary split: divide the values into two subsets that respect the order and find the optimal partitioning, e.g., Size → {Small, Medium} vs. {Large}, or {Medium, Large} vs. {Small}
• What about the split {Small, Large} vs. {Medium}? It violates the ordering of the attribute values

Splitting Based on Continuous Attributes
• Different ways of handling
  – Discretization to form an ordinal categorical attribute
    • Static: discretize once at the beginning
    • Dynamic: ranges can be found by equal-interval bucketing, equal-frequency bucketing (percentiles), or clustering
  – Binary decision: (A < v) or (A ≥ v)
    • consider all possible split points and find the best cut
    • can be more compute-intensive

Splitting Based on Continuous Attributes (figure: examples of a binary split and a multi-way split on a continuous attribute)

How to Determine the Best Split (figure: before splitting there are 10 records of class 0 and 10 records of class 1; several candidate test conditions are compared)
• Which test condition is the best?

How to Determine the Best Split
• Greedy approach: nodes with a homogeneous class distribution are preferred
• Need a measure of node impurity (figure: a non-homogeneous node has a high degree of impurity, a homogeneous node a low degree of impurity)

Attribute Selection / Splitting Rules: Measures of Node Impurity
• An attribute selection measure provides a ranking of the attributes describing the given training tuples; the attribute having the best score for the measure is chosen as the splitting attribute for those tuples
• Information gain (entropy)
• Gini index
• Misclassification error

Brief Review of Entropy (figure: entropy of a two-class distribution, m = 2; it reaches its maximum of 1 bit when the two classes are equally likely and is 0 when one class holds all the probability)

Attribute Selection Measure: Information Gain (ID3/C4.5)
• Select the attribute with the highest information gain
• This attribute minimizes the information needed to classify the tuples in the resulting partitions and reflects the least randomness or impurity in these partitions
• Let p_i be the probability that an arbitrary tuple in D belongs to class C_i, estimated by |C_{i,D}| / |D|
• Expected information (entropy) needed to classify a tuple in D:
  Info(D) = −Σ_{i=1..m} p_i log2(p_i)
• Information still needed (after using A to split D into v partitions) to classify D:
  Info_A(D) = Σ_{j=1..v} (|D_j| / |D|) × Info(D_j)
• Information gained by branching on attribute A:
  Gain(A) = Info(D) − Info_A(D)
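
A small numeric check of these formulas (not from the slides), assuming the class counts of the buys_computer data used in the next slides (9 'yes', 5 'no'; age partitions youth 2/3, middle_aged 4/0, senior 3/2):

```python
import math

def info(counts):
    """Expected information Info(D) = -sum p_i log2 p_i for a list of class counts."""
    n = sum(counts)
    return -sum(c / n * math.log2(c / n) for c in counts if c)

info_D = info([9, 5])                                    # ~0.940 bits

# Partitioning D by age: youth (2 yes, 3 no), middle_aged (4, 0), senior (3, 2).
parts = [[2, 3], [4, 0], [3, 2]]
info_age = sum(sum(p) / 14 * info(p) for p in parts)     # ~0.694 bits
print(round(info_D - info_age, 3))                       # Gain(age) ~ 0.246
```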

Attribute Selection: Information Gain
• Class P: buys_computer = 'yes' (9 tuples); class N: buys_computer = 'no' (5 tuples)
  Info(D) = −(9/14) log2(9/14) − (5/14) log2(5/14) = 0.940
• 'youth' has 5 of the 14 samples, with 2 yes's and 3 no's; middle_aged has 4 yes's; senior has 3 yes's and 2 no's. Hence
  Info_age(D) = (5/14)·Info(2,3) + (4/14)·Info(4,0) + (5/14)·Info(3,2) = 0.694
  Gain(age) = Info(D) − Info_age(D) = 0.246
• Similarly, Gain(income) = 0.029, Gain(student) = 0.151, and Gain(credit_rating) = 0.048, so age is selected as the splitting attribute

Computing Information Gain for Continuous-Valued Attributes
• Let A be a continuous-valued attribute
• The best split point for A must be determined
  – Sort the values of A in increasing order
  – Typically, the midpoint between each pair of adjacent values is considered as a possible split point
    • (a_i + a_{i+1}) / 2 is the midpoint between the values a_i and a_{i+1}
  – The point with the minimum expected information requirement for A is selected as the split point for A
• Split: D1 is the set of tuples in D satisfying A ≤ split-point, and D2 is the set of tuples satisfying A > split-point
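
A sketch (not from the slides) of the midpoint search described above; the helper names are illustrative.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def best_split_point(values, labels):
    """Try the midpoint between each pair of adjacent sorted values and
    return the split point minimizing the expected information requirement."""
    pairs = sorted(zip(values, labels))
    best = (float("inf"), None)
    for i in range(len(pairs) - 1):
        if pairs[i][0] == pairs[i + 1][0]:
            continue                               # equal adjacent values: no midpoint
        cut = (pairs[i][0] + pairs[i + 1][0]) / 2  # (a_i + a_{i+1}) / 2
        left = [y for v, y in pairs if v <= cut]
        right = [y for v, y in pairs if v > cut]
        info = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        best = min(best, (info, cut))
    return best[1]
```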

Gain Ratio for Attribute Selection (C4.5)
• The information gain measure is biased towards attributes with a large number of values
• C4.5 (a successor of ID3) uses the gain ratio to overcome this problem (a normalization of information gain):
  GainRatio(A) = Gain(A) / SplitInfo(A)
• Example: gain_ratio(income) = 0.029 / 1.557 = 0.019
• The attribute with the maximum gain ratio is selected as the splitting attribute
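
The SplitInfo term, shown only as an image in the original slide, is the standard C4.5 definition:

```latex
\mathrm{SplitInfo}_A(D) \;=\; -\sum_{j=1}^{v} \frac{|D_j|}{|D|}\,\log_2\!\left(\frac{|D_j|}{|D|}\right)
```

For the running buys_computer example, income splits D into partitions of 4, 6, and 4 tuples (an assumption consistent with the standard dataset), which gives SplitInfo_income(D) = 1.557, the denominator used in the gain ratio above.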

Gini Index (CART, IBM IntelligentMiner)
• If a data set D contains examples from n classes, the gini index gini(D) is defined as
  gini(D) = 1 − Σ_{j=1..n} p_j², where p_j is the relative frequency of class j in D
• If D is split on A into two subsets D1 and D2, the gini index of the split is defined as
  gini_A(D) = (|D1| / |D|)·gini(D1) + (|D2| / |D|)·gini(D2)
• Reduction in impurity: Δgini(A) = gini(D) − gini_A(D)
• The attribute that provides the smallest gini_split(D) (or, equivalently, the largest reduction in impurity) is chosen to split the node (all possible splitting points must be enumerated for each attribute)
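
A small sketch (not from the slides) of both formulas, checked against the class counts used on the next slide:

```python
def gini(labels):
    """gini(D) = 1 - sum_j p_j^2 over the class frequencies in D."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def gini_split(left, right):
    """Weighted Gini index of a binary split of D into D1 and D2."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

D = ["yes"] * 9 + ["no"] * 5          # buys_computer: 9 'yes', 5 'no'
print(round(gini(D), 3))              # 0.459
```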

Computation of Gini Index
• Example: D has 9 tuples with buys_computer = 'yes' and 5 with 'no', so gini(D) = 1 − (9/14)² − (5/14)² = 0.459
• Suppose the attribute income partitions D into 10 tuples in D1: {low, medium} and 4 tuples in D2: {high}; then
  gini_{income ∈ {low, medium}}(D) = (10/14)·gini(D1) + (4/14)·gini(D2)
• Gini_{low, high} is 0.458 and Gini_{medium, high} is 0.450; thus the split on {low, medium} (and {high}) is chosen, since it has the lowest Gini index
• All attributes are assumed continuous-valued
• Other tools, e.g., clustering, may be needed to get the possible split values
• The measure can be modified for categorical attributes

Comparing Attribute Selection Measures
• The three measures, in general, return good results, but
  – Information gain: biased towards multivalued attributes
  – Gain ratio: tends to prefer unbalanced splits in which one partition is much smaller than the others
  – Gini index: biased towards multivalued attributes; has difficulty when the number of classes is large; tends to favor tests that result in equal-sized partitions and purity in both partitions

Other Attribute Selection Measures
• CHAID: a popular decision tree algorithm; its measure is based on the χ² test for independence
• C-SEP: performs better than information gain and the Gini index in certain cases
• G-statistic: has a close approximation to the χ² distribution
• MDL (Minimal Description Length) principle (i.e., the simplest solution is preferred):
  – The best tree is the one that requires the fewest bits to both (1) encode the tree and (2) encode the exceptions to the tree
• Multivariate splits (partition based on combinations of multiple variables)
  – CART: finds multivariate splits based on a linear combination of attributes
• Which attribute selection measure is the best?
  – Most give good results; none is significantly superior to the others

Overfitting and Tree Pruning
• Overfitting: an induced tree may overfit the training data
  – Too many branches, some of which may reflect anomalies due to noise or outliers
  – Poor accuracy on unseen samples
• Two approaches to avoid overfitting
  – Prepruning: halt tree construction early; do not split a node if doing so would cause the goodness measure to fall below a threshold
    • Difficult to choose an appropriate threshold
  – Postpruning: remove branches from a 'fully grown' tree to obtain a sequence of progressively pruned trees
    • Use a set of data different from the training data to decide which is the 'best pruned tree'

Decision Tree Based Classification
• Advantages:
  – Inexpensive to construct
  – Extremely fast at classifying unknown records
  – Easy to interpret for small-sized trees
  – Accuracy is comparable to other classification techniques for many simple data sets

Chapter 8. Classification: Basic Concepts
• Decision Tree Induction
• Bayes Classification Methods
• Rule-Based Classification
• Model Evaluation and Selection
• Techniques to Improve Classification Accuracy: Ensemble Methods
• Summary

Bayesian Classification: Why?
• A statistical classifier: performs probabilistic prediction, i.e., predicts class membership probabilities
• Foundation: based on Bayes' theorem
• Performance: a simple Bayesian classifier, the naïve Bayesian classifier, has performance comparable to decision tree and selected neural network classifiers
• Incremental: each training example can incrementally increase or decrease the probability that a hypothesis is correct; prior knowledge can be combined with observed data
• Standard: even when Bayesian methods are computationally intractable, they can provide a standard of optimal decision making against which other methods can be measured

Bayes' Theorem: Basics
• Total probability theorem: P(B) = Σ_{i=1..M} P(B|A_i) P(A_i)
• Bayes' theorem: P(H|X) = P(X|H) P(H) / P(X)
  – Let X be a data sample ('evidence') whose class label is unknown
  – Let H be the hypothesis that X belongs to class C
  – Classification is to determine P(H|X), the posterior probability that the hypothesis holds given the observed data sample X
  – P(H) (prior probability): the initial probability, e.g., that X will buy a computer, regardless of age, income, ...
  – P(X): the probability that the sample data is observed
  – P(X|H) (likelihood): the probability of observing the sample X given that the hypothesis holds, e.g., given that X will buy a computer, the probability that X is aged 31..40 with medium income

Prediction Based on Bayes' Theorem
• Given training data X, the posterior probability of a hypothesis H, P(H|X), follows Bayes' theorem: P(H|X) = P(X|H) P(H) / P(X)
• Informally, this can be viewed as posterior = likelihood × prior / evidence
• Predict that X belongs to class C_i iff the probability P(C_i|X) is the highest among all the P(C_k|X) for the k classes
• Practical difficulty: it requires initial knowledge of many probabilities, involving significant computational cost

Classification Is to Derive the Maximum Posteriori
• Let D be a training set of tuples and their associated class labels; each tuple is represented by an n-dimensional attribute vector X = (x1, x2, ..., xn)
• Suppose there are m classes C1, C2, ..., Cm
• Classification is to derive the maximum posteriori, i.e., the maximal P(Ci|X)
• This can be derived from Bayes' theorem: P(Ci|X) = P(X|Ci) P(Ci) / P(X)
• Since P(X) is constant for all classes, only P(X|Ci) P(Ci) needs to be maximized

Naïve Bayes Classifier
• A simplifying assumption: attributes are conditionally independent given the class (i.e., no dependence relation between attributes):
  P(X|Ci) = Π_{k=1..n} P(x_k|Ci) = P(x1|Ci) × P(x2|Ci) × ... × P(xn|Ci)
• This greatly reduces the computational cost: only the class distribution needs to be counted
• If A_k is categorical, P(x_k|Ci) is the number of tuples in Ci having value x_k for A_k, divided by |C_{i,D}| (the number of tuples of Ci in D)
• If A_k is continuous-valued, P(x_k|Ci) is usually computed from a Gaussian distribution with mean μ and standard deviation σ:
  g(x, μ, σ) = (1 / (√(2π)·σ)) · exp(−(x − μ)² / (2σ²)), and P(x_k|Ci) = g(x_k, μ_{Ci}, σ_{Ci})
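
A minimal sketch (not from the slides) of a categorical naïve Bayes classifier implementing exactly these counts; class and attribute names are whatever the caller supplies.

```python
from collections import Counter, defaultdict

class NaiveBayes:
    """Categorical naive Bayes: P(Ci|X) is proportional to P(Ci) * prod_k P(xk|Ci)."""
    def fit(self, rows, labels):
        self.class_counts = Counter(labels)                  # |Ci,D|
        self.value_counts = defaultdict(Counter)             # (class, attr) -> value counts
        for row, c in zip(rows, labels):
            for attr, value in row.items():
                self.value_counts[(c, attr)][value] += 1
        return self

    def predict(self, row):
        total = sum(self.class_counts.values())
        scores = {}
        for c, n_c in self.class_counts.items():
            score = n_c / total                               # prior P(Ci)
            for attr, value in row.items():
                score *= self.value_counts[(c, attr)][value] / n_c   # P(xk|Ci)
            scores[c] = score
        return max(scores, key=scores.get)
```

Running this on the 14-tuple buys_computer data should reproduce the probabilities worked out on the next slides.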

Naïve Bayes Classifier: Training Dataset (figure: the 14-tuple buys_computer data)
• Classes: C1: buys_computer = 'yes'; C2: buys_computer = 'no'
• Data to be classified: X = (age = youth, income = medium, student = yes, credit_rating = fair)

Naïve Bayes Classifier: An Example
• Priors P(Ci):
  P(buys_computer = 'yes') = 9/14 = 0.643
  P(buys_computer = 'no') = 5/14 = 0.357
• Compute P(x_k|Ci) for each class:
  P(age = 'youth' | buys_computer = 'yes') = 2/9 = 0.222
  P(age = 'youth' | buys_computer = 'no') = 3/5 = 0.600
  P(income = 'medium' | buys_computer = 'yes') = 4/9 = 0.444
  P(income = 'medium' | buys_computer = 'no') = 2/5 = 0.400
  P(student = 'yes' | buys_computer = 'yes') = 6/9 = 0.667
  P(student = 'yes' | buys_computer = 'no') = 1/5 = 0.200
  P(credit_rating = 'fair' | buys_computer = 'yes') = 6/9 = 0.667
  P(credit_rating = 'fair' | buys_computer = 'no') = 2/5 = 0.400
• For X = (age = youth, income = medium, student = yes, credit_rating = fair):
  P(X | buys_computer = 'yes') = 0.222 × 0.444 × 0.667 × 0.667 = 0.044
  P(X | buys_computer = 'no') = 0.600 × 0.400 × 0.200 × 0.400 = 0.019
  P(X | buys_computer = 'yes') × P(buys_computer = 'yes') = 0.028
  P(X | buys_computer = 'no') × P(buys_computer = 'no') = 0.007
• Therefore, X belongs to the class buys_computer = 'yes'

Avoiding the Zero-Probability Problem
• Naïve Bayesian prediction requires each conditional probability to be non-zero; otherwise, the whole predicted probability will be zero
• Example: suppose a dataset with 1000 tuples for a class has income = low (0 tuples), income = medium (990), and income = high (10)
• Use the Laplacian correction (or Laplace estimator)
  – Add 1 to each case:
    Prob(income = low) = 1/1003
    Prob(income = medium) = 991/1003
    Prob(income = high) = 11/1003
  – The 'corrected' probability estimates are close to their 'uncorrected' counterparts
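
A small sketch (not from the slides) of the add-one correction, reproducing the income example above; the helper name is illustrative.

```python
def laplace(counts):
    """Add-one (Laplace) correction of conditional probability estimates."""
    total = sum(counts.values()) + len(counts)   # add 1 per distinct value
    return {v: (c + 1) / total for v, c in counts.items()}

income = {"low": 0, "medium": 990, "high": 10}
print(laplace(income))   # {'low': 1/1003, 'medium': 991/1003, 'high': 11/1003}
```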

Naïve Bayes Classifier: Comments
• Advantages
  – Easy to implement
  – Robust to noise
  – Can handle null values
  – Good results obtained in most cases
• Disadvantages
  – Assumption of class-conditional independence, which causes a loss of accuracy
  – In practice, dependencies exist among variables
    • E.g., hospital patient data: profile (age, family history, etc.), symptoms (fever, cough, etc.), disease (lung cancer, diabetes, etc.)
    • Dependencies among these cannot be modeled by a naïve Bayes classifier
• How to deal with these dependencies? Bayesian belief networks

Chapter 8. Classification: Basic Concepts
• Decision Tree Induction
• Bayes Classification Methods
• Rule-Based Classification
• Model Evaluation and Selection
• Techniques to Improve Classification Accuracy: Ensemble Methods
• Summary

Using IF-THEN Rules for Classification
• Represent the knowledge in the form of IF-THEN rules
  R: IF age = youth AND student = yes THEN buys_computer = yes
  – Rule antecedent/precondition vs. rule consequent
  – If a rule is satisfied by X, it covers the tuple and the rule is said to be triggered
  – If R is the only rule satisfied, the rule fires by returning its class prediction
• Assessment of a rule: coverage and accuracy
  – n_covers = number of tuples covered by R
  – n_correct = number of tuples correctly classified by R
  – coverage(R) = n_covers / |D|, where D is the training data set
  – accuracy(R) = n_correct / n_covers
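
A sketch (not from the slides) of these two measures; the rule representation, the "class" key, and the attribute names are illustrative assumptions.

```python
def rule_metrics(rule, data):
    """coverage(R) = n_covers / |D|, accuracy(R) = n_correct / n_covers.

    `rule` is (antecedent, predicted_class); the antecedent maps attribute -> value."""
    antecedent, predicted = rule
    covered = [row for row in data
               if all(row.get(a) == v for a, v in antecedent.items())]
    correct = [row for row in covered if row["class"] == predicted]
    coverage = len(covered) / len(data)
    accuracy = len(correct) / len(covered) if covered else 0.0
    return coverage, accuracy

# IF age = youth AND student = yes THEN buys_computer = yes
rule = ({"age": "youth", "student": "yes"}, "yes")
```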

Using IF-THEN Rules for Classification
• If more than one rule is triggered, conflict resolution is needed
  – Size ordering: assign the highest priority to the triggered rule with the 'toughest' requirements (i.e., the most attribute tests)
  – Rule ordering: prioritize the rules beforehand
    • Class-based ordering: classes are sorted in order of decreasing importance, such as order of prevalence or misclassification cost per class; within each class, the rules are not ordered
    • Rule-based ordering (decision list): rules are organized into one long priority list according to some measure of rule quality such as accuracy, coverage, or size; the first rule satisfying X fires its class prediction and any other rule satisfying X is ignored; each rule in the list implies the negation of the rules that come before it, which makes the list difficult to interpret
• What if no rule is fired for X? Use a default rule!

Rule Extraction from a Decision Tree
• Rules are easier to understand than large trees
• One rule is created for each path from the root to a leaf; each attribute-value pair along the path forms a conjunct, and the conjuncts are logically ANDed to form the rule antecedent; the leaf holds the class prediction
• Rules extracted this way are mutually exclusive and exhaustive
  – Mutually exclusive: no two rules will be triggered for the same tuple
  – Exhaustive: there is one rule for each possible attribute-value combination, so no default rule is needed

Rule Extraction from a Decision Tree (figure: the buys_computer tree with root age?, branches youth → student?, middle_aged → yes, senior → credit_rating?)
• Example: rules extracted from our buys_computer decision tree
  IF age = youth AND student = no THEN buys_computer = no
  IF age = youth AND student = yes THEN buys_computer = yes
  IF age = middle_aged THEN buys_computer = yes
  IF age = senior AND credit_rating = excellent THEN buys_computer = no
  IF age = senior AND credit_rating = fair THEN buys_computer = yes

Rule Induction: Sequential Covering Method
• Sequential covering algorithms extract rules directly from the training data
• Typical sequential covering algorithms: FOIL, AQ, CN2, RIPPER
• Rules are learned sequentially; each rule for a given class Ci should cover many tuples of Ci but none (or few) of the tuples of the other classes
• Steps:
  – Rules are learned one at a time
  – Each time a rule is learned, the tuples covered by the rule are removed
  – The process is repeated on the remaining tuples until a termination condition is met, e.g., when no training examples are left or when the quality of the returned rule is below a user-specified threshold
• Compare with decision-tree induction, which learns a set of rules simultaneously

Sequential Covering Algorithm
• When learning a rule for a class C, we would like the rule to cover all or most of the training tuples of class C and none or few of the tuples from other classes
  while (enough target tuples left)
    generate a rule
    remove the positive target tuples satisfying this rule
• (figure: the examples covered by Rule 1, Rule 2, and Rule 3 are successively removed from the set of positive examples)

How to Learn One Rule? Two approaches: specialization and generalization
• Specialization
  – Start with the most general rule possible, the empty rule: IF { } THEN class = y
  – The best attribute-value pair from a list A is added to the antecedent
  – Continue until the rule performance measure cannot be improved further
    • IF income = high THEN loan_decision = accept
    • IF income = high AND credit_rating = excellent THEN loan_decision = accept
  – Greedy algorithm: always add the attribute-value pair that is best at the moment

How to Learn One Rule? Generalization
• Start with a randomly selected positive tuple and convert it to a rule that covers it
  – The tuple (overcast, high, false, P) can be converted to the rule: IF outlook = overcast AND humidity = high AND windy = false THEN class = P
• Choose one attribute-value pair and remove it so that the rule covers more positive examples
• Repeat the process until the rule starts to cover negative examples

How to Learn One Rule? Rule-quality measures
• Used to decide whether appending a test to the rule's condition results in an improved rule: accuracy, coverage
• Consider R1, which covers 40 tuples and correctly classifies 38 of them, versus R2, which covers 2 tuples and correctly classifies all of them: which rule is better? Is accuracy alone enough?
• Different measures: FOIL gain, the likelihood ratio statistic, the chi-square statistic

How to Learn One Rule? Rule-quality measure: FOIL gain
• FOIL gain checks whether ANDing a new condition to a rule results in a better rule; it considers both coverage and accuracy
  – FOIL gain (used in FOIL and RIPPER) assesses the information gained by extending the condition, where pos and neg are the numbers of positive and negative tuples covered by R, and pos' and neg' are the numbers of positive and negative tuples covered by the extended rule R'
• It favors rules that have high accuracy and cover many positive tuples
• There is no separate test set for evaluating rules, but rule pruning is performed by removing a condition; pos and neg are the numbers of positive and negative tuples covered by R, and if FOIL_Prune is higher for the pruned version of R, then R is pruned
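
The two measures referenced above appear only as images in the original slides; the standard FOIL/RIPPER definitions they correspond to are:

```latex
\mathrm{FOIL\_Gain} = pos' \times \left( \log_2 \frac{pos'}{pos' + neg'} - \log_2 \frac{pos}{pos + neg} \right),
\qquad
\mathrm{FOIL\_Prune}(R) = \frac{pos - neg}{pos + neg}
```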

Nearest Neighbour Approach
• General idea
  – The model: a set of training examples stored in memory
  – Lazy learning: the decision is delayed until classification time; in other words, there is no training
  – To classify an unseen record: compute its proximity to all training examples and locate the 1 or k nearest neighbours; the nearest neighbours determine the class of the record (e.g., by majority vote)
  – Rationale: "If it walks like a duck, quacks like a duck, and looks like a duck, it probably is a duck."

Nearest Neighbour Approach
• kNN classification algorithm
  algorithm kNN (Tr: training set; k: integer; r: data record): Class
  begin
    for each training example t in Tr do
      calculate the proximity d(t, r) on the descriptive attributes
    end for;
    select the top k nearest neighbours into set D;
    Class := majority class in D;
    return Class
  end;
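
A runnable Python version of this pseudocode (a sketch, not from the slides), assuming numeric descriptive attributes and Euclidean distance:

```python
import math
from collections import Counter

def knn_classify(train, k, record):
    """train: list of (attribute_vector, class_label); record: attribute vector."""
    def distance(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    neighbours = sorted(train, key=lambda t: distance(t[0], record))[:k]
    return Counter(label for _, label in neighbours).most_common(1)[0][0]

train = [((1.0, 2.0), "P"), ((1.2, 1.8), "P"), ((5.0, 5.5), "N"), ((6.0, 5.0), "N")]
print(knn_classify(train, k=3, record=(1.1, 2.1)))   # majority of 3 nearest -> "P"
```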

Nearest Neighbour Approach
• PEBLS algorithm
  – A class-based similarity measure is used
  – A nearest neighbour algorithm (k = 1)
  – Examples in memory have weights (exemplars)
  – Simple training: assigning and refining the weights
  – A different proximity measure
  – Algorithm outline:
    1. Build value difference tables for the descriptive attributes (in preparation for measuring distances between examples)
    2. For each training example, refine the weight of its nearest neighbour
    3. Refine the weights of some training examples when classifying validation examples

Nearest Neighbour Approach
• PEBLS: value difference table for Outlook
  – d(V1, V2) = Σ_i | C_{iV1}/C_{V1} − C_{iV2}/C_{V2} |^r, with r set to 1
  – C_{V1}, C_{V2}: total number of examples with value V1 (respectively V2); C_{iV1}, C_{iV2}: number of examples with value V1 (respectively V2) and of class i
  – Value difference table for Outlook:
      d        | sunny | overcast | rain
      sunny    | 0     | 1.2      | 0.4
      overcast | 1.2   | 0        | 0.8
      rain     | 0.4   | 0.8      | 0
  – Example: d(sunny, overcast) = |2/5 − 4/4| + |3/5 − 0/4| = 3/5 + 3/5 = 1.2

Nearest Neighbour Approach
• PEBLS: distance function between examples X and Y
  – Δ(X, Y) = w_X · w_Y · Σ_{i=1..m} d(x_i, y_i)², where w_X and w_Y are the weights of X and Y, m is the number of attributes, and x_i, y_i are the values of the i-th attribute for X and Y
  – w_X = T / C, where T is the total number of times that X is selected as the nearest neighbour and C is the total number of times that X correctly classifies examples

Nearest Neighbour Approach
• PEBLS: distance function (example, using the value difference tables)
  – Assuming row1.weight = row2.weight = 1:
    Δ(row1, row2) = d(row1.outlook, row2.outlook)² + d(row1.temperature, row2.temperature)² + d(row1.humidity, row2.humidity)² + d(row1.windy, row2.windy)²
    = d(sunny, sunny)² + d(hot, hot)² + d(high, high)² + d(false, true)²
    = 0 + 0 + 0 + (1/2)² = 1/4

Nearest Neighbour Approach
• PEBLS: example (figure: the 14-example weather training set shown with the exemplar weights obtained after training, and the classification of an unseen record (sunny, hot, high, false, ?) by its nearest neighbour)

Artificial Neural Network Approach
• Our brains are made up of about 100 billion tiny units called neurons
• Each neuron is connected to thousands of other neurons and communicates with them via electrochemical signals
• Signals coming into the neuron are received via junctions called synapses, which are located at the ends of branches of the neuron cell called dendrites
• The neuron continuously receives signals from these inputs
• The neuron sums up the inputs in some way and then, if the result is greater than some threshold value, the neuron fires
• It generates a voltage and outputs a signal along something called an axon

Artificial Neural Network Approach
• General idea
  – The model: a network of connected artificial neurons
  – Training: select a specific network topology and use the training examples to tune the weights attached to the links connecting the neurons
  – To classify an unseen record X, feed the descriptive attribute values of the record into the network as inputs; the network computes an output value that can be converted to a class label

Artificial Neural Network Approach
• Artificial neuron (unit) with inputs i1, i2, i3, weights w1, w2, w3, and output y
  – Sum function: x = w1·i1 + w2·i2 + w3·i3
  – Transformation function: sigmoid(x) = 1 / (1 + e^(−x))
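
A tiny sketch (not from the slides) of this unit's forward computation; the example weights and inputs are arbitrary.

```python
import math

def neuron_output(inputs, weights):
    """Weighted sum followed by the sigmoid transformation function."""
    x = sum(w * i for w, i in zip(weights, inputs))   # x = w1*i1 + w2*i2 + w3*i3
    return 1.0 / (1.0 + math.exp(-x))                 # sigmoid(x)

print(round(neuron_output([1.0, 0.5, -0.5], [0.4, 0.3, 0.9]), 3))   # ~0.525
```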

Artificial Neural Network Approach
• A neural network can have many hidden layers, but one layer is normally considered sufficient
• The more units a hidden layer has, the more capacity it has for pattern recognition
• Constant inputs can be fed into the units in the hidden and output layers as inputs
• A network with links only from lower layers to upper layers is a feed-forward network
• A network with links between nodes of the same layer is a recurrent network

Artificial Neural Network Approach
• Artificial neuron (perceptron): the same structure, with sum function x = w1·i1 + w2·i2 + w3·i3 and transformation function sigmoid(x) = 1 / (1 + e^(−x))

Artificial Neural Network Approach
• General principle for training an ANN
  algorithm trainNetwork (Tr: training set): Network
  begin
    R := initial network with a particular topology;
    initialise the weight vector with random values w(0);
    repeat
      for each training example t = <xi, yi> in Tr do
        compute the predicted class output ŷ(k);
        for each weight wj in the weight vector do
          update the weight: wj(k+1) := wj(k) + λ (yi − ŷ(k)) xij
        end for
      end for
    until stopping criterion is met;
    return R
  end;
  λ: the learning factor; the larger its value, the bigger the weight changes

Artificial Neural Network Approach
• Using ANN for classification
  – Multiple hidden layers:
    • The actual class value is not known for hidden units, so it is difficult to adjust their weights
    • Solution: back-propagation (layer by layer, from the output layer)
  – Model overfitting: use validation examples to further tune the weights in the network
  – Descriptive attributes should be normalized or converted to binary
  – Training examples are used repeatedly, so the training cost is very high
  – Difficulty in explaining classification decisions

Artificial Neural Network Approach
• Network topology
  – Number of nodes in the input layer: determined by the number and data types of the attributes
    • Continuous and binary attributes: 1 node for each attribute
    • Categorical attributes: convert to numeric or binary; an attribute with k labels needs at least log2(k) nodes
  – Number of nodes in the output layer: determined by the number of classes
    • For a 2-class problem, 1 node
    • For a k-class problem, at least log2(k) nodes
  – Number of hidden layers and of nodes in the hidden layers: difficult to decide
  – In networks with hidden layers, the weights are updated using back-propagation

Model Evaluation and Selection
• Evaluation metrics: how can we measure accuracy? What other metrics should be considered?
• Use a validation/test set of class-labeled tuples instead of the training set when assessing accuracy
• Methods for estimating a classifier's accuracy:
  – Holdout method, random subsampling
  – Cross-validation
  – Bootstrap
• Comparing classifiers:
  – Confidence intervals
  – Cost-benefit analysis and ROC curves

Classifier Evaluation Metrics: Confusion Matrix
• Confusion matrix (rows: actual class; columns: predicted class):
    Actual \ Predicted | yes                  | no
    yes                | True Positives (TP)  | False Negatives (FN)
    no                 | False Positives (FP) | True Negatives (TN)
• Example confusion matrix:
    Actual \ Predicted   | buy_computer = yes | buy_computer = no | Total
    buy_computer = yes   | 6954               | 46                | 7000
    buy_computer = no    | 412                | 2588              | 3000
    Total                | 7366               | 2634              | 10000
• TP and TN are the correctly predicted tuples
• The matrix may have extra rows/columns to provide totals

Classifier Evaluation Metrics: Accuracy, Error Rate, Sensitivity and Specificity
• Classifier accuracy (recognition rate): the percentage of test set tuples that are correctly classified
  Accuracy = (TP + TN) / All
• Error rate (misclassification rate) = 1 − accuracy = (FP + FN) / All
• Class imbalance problem:
  – One class may be rare, e.g., fraud or HIV-positive
  – There is a significant majority of the negative class and a minority of the positive class
  – Sensitivity (true positive recognition rate) = TP / P
  – Specificity (true negative recognition rate) = TN / N
  (P = TP + FN is the number of positive tuples and N = FP + TN the number of negative tuples)

Classifier Evaluation Metrics: Precision and Recall, and F-measures
• Precision (exactness): what % of the tuples that the classifier labeled as positive are actually positive
  Precision = TP / (TP + FP)
• Recall (completeness): what % of the positive tuples did the classifier label as positive
  Recall = TP / (TP + FN)
• A perfect score is 1.0
• There is an inverse relationship between precision and recall
• F measure (F1 or F-score): the harmonic mean of precision and recall
  F1 = 2 × precision × recall / (precision + recall)
• Fβ: a weighted measure of precision and recall that assigns β times as much weight to recall as to precision
  Fβ = (1 + β²) × precision × recall / (β² × precision + recall)
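
A small sketch (not from the slides) computing the metrics of this and the previous slide from a confusion matrix; the numbers come from the buy_computer example above.

```python
def metrics(tp, fp, fn, tn):
    """Accuracy, sensitivity (recall), specificity, precision, and F1."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    sensitivity = tp / (tp + fn)          # recall, true positive rate
    specificity = tn / (fp + tn)          # true negative rate
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return accuracy, sensitivity, specificity, precision, f1

# buy_computer confusion matrix: TP=6954, FP=412, FN=46, TN=2588.
print([round(m, 3) for m in metrics(6954, 412, 46, 2588)])
```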

Classifier Evaluation Metrics: Example
    Actual \ Predicted | cancer = yes | cancer = no | Total | Recognition (%)
    cancer = yes       | 90           | 210         | 300   | 30.00 (sensitivity)
    cancer = no        | 140          | 9560        | 9700  | 98.56 (specificity)
    Total              | 230          | 9770        | 10000 | 96.40 (accuracy)
• Precision = ? Recall = ?
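
Working the two questions out from the table above (the answers are not shown on the original slide): precision = TP / (TP + FP) = 90 / 230 ≈ 0.391 (39.1%), and recall = TP / (TP + FN) = 90 / 300 = 0.300 (30%), which matches the sensitivity row.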

Evaluating Classifier Accuracy: Holdout and Cross-Validation Methods
• Holdout method
  – The given data is randomly partitioned into two independent sets
    • Training set (e.g., 2/3) for model construction
    • Test set (e.g., 1/3) for accuracy estimation
  – Random subsampling: a variation of holdout; repeat holdout k times and take the accuracy as the average of the accuracies obtained
• Cross-validation (k-fold, where k = 10 is most popular)
  – Randomly partition the data into k mutually exclusive subsets, each of approximately equal size
  – At the i-th iteration, use D_i as the test set and the remaining subsets as the training set
  – Leave-one-out: k folds where k = the number of tuples; used for small data sets
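
A sketch (not from the slides) of k-fold cross-validation; `train_and_score` stands for any user-supplied function that builds a model on the training fold and returns its accuracy on the test fold.

```python
import random

def k_fold_cross_validation(data, k, train_and_score):
    """Average accuracy over k mutually exclusive, roughly equal-sized folds."""
    data = list(data)
    random.shuffle(data)
    folds = [data[i::k] for i in range(k)]
    scores = []
    for i in range(k):
        test = folds[i]
        train = [row for j, fold in enumerate(folds) if j != i for row in fold]
        scores.append(train_and_score(train, test))
    return sum(scores) / k
```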

Evaluating Classifier Accuracy: Bootstrap
• Bootstrap
  – Works well with small data sets
  – Samples the given training tuples uniformly with replacement
    • i.e., each time a tuple is selected, it is equally likely to be selected again and re-added to the training set
  – The tuples that never make it into the bootstrap training sample can then be used as the test set
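
A sketch (not from the slides) of one bootstrap round: sampling with replacement for the training set, with the never-selected ("out-of-bag") tuples serving as the test set.

```python
import random

def bootstrap_sample(data):
    """Draw |data| tuples uniformly with replacement; the tuples never
    selected form the test set."""
    n = len(data)
    picks = [random.randrange(n) for _ in range(n)]
    train = [data[i] for i in picks]
    test = [row for i, row in enumerate(data) if i not in set(picks)]
    return train, test
```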

Ensemble Methods: Increasing the Accuracy
• Ensemble methods
  – Use a combination of models to increase accuracy
  – Combine a series of k learned models, M1, M2, ..., Mk, with the aim of creating an improved composite model M*
• Popular ensemble methods
  – Bagging, boosting, and ensembles of heterogeneous classifiers

Classification of Class-Imbalanced Data Sets
• Class-imbalance problem: rare positive examples but numerous negative ones, e.g., medical diagnosis, fraud, oil spills, faults, etc.
• Traditional methods assume a balanced distribution of classes and equal error costs, so they are not suitable for class-imbalanced data
• Typical methods for imbalanced data in two-class classification:
  – Oversampling: re-sampling of data from the positive class
  – Undersampling: randomly eliminate tuples from the negative class

Model Selection: ROC Curves
• ROC (Receiver Operating Characteristic) curves: for visual comparison of classification models
• Originated from signal detection theory
• Show the trade-off between the true positive rate and the false positive rate
  – The vertical axis represents the true positive rate; the horizontal axis represents the false positive rate
• The area under the ROC curve is a measure of the accuracy of the model
  – A model with perfect accuracy has an area of 1.0
  – The plot also shows a diagonal line: along it, for every TP the model is equally likely to encounter an FP
  – The closer the curve is to the diagonal line (i.e., the closer the area is to 0.5), the less accurate the model
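
A sketch (not from the slides) of how an ROC curve and its area can be computed from a classifier's positive-class scores; the function names are illustrative and class labels are assumed to be 1 (positive) and 0 (negative).

```python
def roc_points(scores, labels):
    """Sweep the decision threshold over the scores and collect (FPR, TPR) points."""
    pos = sum(1 for y in labels if y == 1)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))
```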

Issues Affecting Model Selection
• Accuracy: classifier accuracy in predicting the class label
• Speed
  – time to construct the model (training time)
  – time to use the model (classification/prediction time)
• Robustness: handling noise and missing values
• Scalability: efficiency for disk-resident databases
• Interpretability: understanding and insight provided by the model
• Other measures, e.g., goodness of rules, such as decision tree size or compactness of classification rules

Comparison of Techniques
• Comparison of approaches (summary tables in the original slides) along four criteria:
  – Model interpretability: ease of understanding classification decisions
  – Model maintainability: ease of modifying the model in the presence of new training examples
  – Training cost: computational cost of building a model
  – Classification cost: computational cost of classifying an unseen record

Decision Tree Induction in Weka
• Overview
  – ID3 (only works for categorical attributes)
  – J48 (Java implementation of C4.5)
  – RandomTree (with K randomly chosen attributes)
  – RandomForest (a forest of random trees)
  – REPTree (regression tree with reduced-error pruning)
  – BFTree (best-first tree, using Gain or Gini)
  – FT (functional tree, logistic regression at split nodes)
  – SimpleCart (CART with cost-complexity pruning)

Decision Tree Induction in Weka
• Preparation
  – Pre-processing attributes if necessary
  – Specifying the class attribute
  – Selecting attributes

Decision Tree Induction in Weka
• Constructing classification models (e.g., ID3) (screenshot)
  1. Choose a method and set its parameters
  2. Set a test option
  3. Start the process
  4. View the model and evaluation results
  5. Select the option to view the tree

Decision Tree Induction in Weka
• J48 (unpruned tree) (screenshot)

Decision Tree Induction in Weka
• RandomTree (screenshot)

Decision Tree Induction in Weka
• Classifying unseen records
  1. Prepare the unseen records in an ARFF file; class values are left as unknown ('?')

Decision Tree Induction in Weka
• Classifying unseen records
  2. Classify the unseen records in the file (screenshot)
     – Select the test-set option and click the Set... button
     – Press the button and load the file
     – Press Start to run the classification

Decision Tree Induction in Weka
• Classifying unseen records
  3. Save the classification results into a file (screenshot)
     – Select the option to pop up the visualisation
     – Set both X and Y to instance_number
     – Save the results into a file

Decision Tree Induction in Weka
• Classifying unseen records
  4. Classification results in an ARFF file: class labels assigned (screenshot)

Comparison of Techniques
• Comparison of performance in Weka
  – A system module known as the Experimenter
  – Designed for comparing the performance of classification techniques over a single data set or a collection of data sets
  – Data miners set up an experiment with:
    • Selected data set(s)
    • Selected algorithm(s) and the number of repeated runs
    • Selected test option (e.g., cross-validation)
    • Selected p value (indicating confidence)
  – Output: accuracy rates of the algorithms
  – Pairwise comparison of algorithms, with significantly better and worse accuracies marked

Comparison of Techniques
• Setting up an experiment in Weka (screenshot): choose a test option; add data sets (the list of selected data sets is shown); create a new or open an existing experiment; name the file to store the experiment results; set the number of times each algorithm is repeated; and add algorithms (the list of selected algorithms is shown)

Comparison of Techniques
• Experiment results in Weka (screenshot): load the experiment data, choose the analysis method and the value of significance, perform the analysis, and view the results of the pairwise comparisons

Classification in Practice
• Process of a classification project
  1. Locate data
  2. Prepare data
  3. Choose a classification method
  4. Construct and tune the model
  5. Measure its accuracy and go back to step 3 or 4 until the accuracy is satisfactory
  6. Further evaluate the model from other aspects such as complexity, comprehensibility, etc.
  7. Deliver the model and test it in a real environment; further modify the model if necessary

Classification in Practice
• Data preparation
  – Identify descriptive features (input attributes)
  – Identify or define the class
  – Determine the sizes of the training, validation and test sets
  – Select examples
    • Spread and coverage of classes
    • Spread and coverage of attribute values
    • Null values
    • Noisy data
  – Prepare the input values (categorical to continuous, continuous to categorical)

References (1)
• C. Apte and S. Weiss. Data mining with decision trees and decision rules. Future Generation Computer Systems, 13, 1997.
• C. M. Bishop. Neural Networks for Pattern Recognition. Oxford University Press, 1995.
• L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees. Wadsworth International Group, 1984.
• C. J. C. Burges. A Tutorial on Support Vector Machines for Pattern Recognition. Data Mining and Knowledge Discovery, 2(2): 121-168, 1998.
• P. K. Chan and S. J. Stolfo. Learning arbiter and combiner trees from partitioned data for scaling machine learning. KDD'95.
• H. Cheng, X. Yan, J. Han, and C.-W. Hsu. Discriminative Frequent Pattern Analysis for Effective Classification. ICDE'07.
• H. Cheng, X. Yan, J. Han, and P. S. Yu. Direct Discriminative Pattern Mining for Effective Classification. ICDE'08.
• W. Cohen. Fast effective rule induction. ICML'95.
• G. Cong, K.-L. Tan, A. K. H. Tung, and X. Xu. Mining top-k covering rule groups for gene expression data. SIGMOD'05.

References (3)
• T.-S. Lim, W.-Y. Loh, and Y.-S. Shih. A comparison of prediction accuracy, complexity, and training time of thirty-three old and new classification algorithms. Machine Learning, 2000.
• J. Magidson. The CHAID approach to segmentation modeling: Chi-squared automatic interaction detection. In R. P. Bagozzi, editor, Advanced Methods of Marketing Research, Blackwell Business, 1994.
• M. Mehta, R. Agrawal, and J. Rissanen. SLIQ: A fast scalable classifier for data mining. EDBT'96.
• T. M. Mitchell. Machine Learning. McGraw-Hill, 1997.
• S. K. Murthy. Automatic Construction of Decision Trees from Data: A Multi-Disciplinary Survey. Data Mining and Knowledge Discovery, 2(4): 345-389, 1998.
• J. R. Quinlan. Induction of decision trees. Machine Learning, 1: 81-106, 1986.
• J. R. Quinlan and R. M. Cameron-Jones. FOIL: A midterm report. ECML'93.
• J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
• J. R. Quinlan. Bagging, boosting, and C4.5. AAAI'96.

References (4)
• R. Rastogi and K. Shim. PUBLIC: A decision tree classifier that integrates building and pruning. VLDB'98.
• J. Shafer, R. Agrawal, and M. Mehta. SPRINT: A scalable parallel classifier for data mining. VLDB'96.
• J. W. Shavlik and T. G. Dietterich. Readings in Machine Learning. Morgan Kaufmann, 1990.
• P. Tan, M. Steinbach, and V. Kumar. Introduction to Data Mining. Addison Wesley, 2005.
• S. M. Weiss and C. A. Kulikowski. Computer Systems that Learn: Classification and Prediction Methods from Statistics, Neural Nets, Machine Learning, and Expert Systems. Morgan Kaufmann, 1991.
• S. M. Weiss and N. Indurkhya. Predictive Data Mining. Morgan Kaufmann, 1997.
• I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques, 2nd ed. Morgan Kaufmann, 2005.
• X. Yin and J. Han. CPAR: Classification based on predictive association rules. SDM'03.
• H. Yu, J. Yang, and J. Han. Classifying large data sets using SVM with hierarchical clusters. KDD'03.