Ch. 6: Decision Trees (Stephen Marsland, Machine Learning)

Ch. 6: Decision Trees. Stephen Marsland, Machine Learning: An Algorithmic Perspective, CRC Press, 2009. Based on slides from Stephen Marsland, Jia Li, and Ruoming Jin. Longin Jan Latecki, Temple University, latecki@temple.edu

Illustrating Classification Task (figure)

Decision Trees: Split classification into a series of choices about features in turn. Lay the choices out as a tree and progress down the tree to the leaves to reach a decision.

Play Tennis Example (decision tree):
Outlook = Sunny: test Humidity (High -> No, Normal -> Yes)
Outlook = Overcast: Yes
Outlook = Rain: test Wind (Strong -> No, Weak -> Yes)

Day  Outlook   Temp  Humid   Wind    Play?
1    Sunny     Hot   High    Weak    No
2    Sunny     Hot   High    Strong  No
3    Overcast  Hot   High    Weak    Yes
4    Rain      Mild  High    Weak    Yes
5    Rain      Cool  Normal  Weak    Yes
6    Rain      Cool  Normal  Strong  No
7    Overcast  Cool  Normal  Strong  Yes
8    Sunny     Mild  High    Weak    No
9    Sunny     Cool  Normal  Weak    Yes
10   Rain      Mild  Normal  Weak    Yes
11   Sunny     Mild  Normal  Strong  Yes
12   Overcast  Mild  High    Strong  Yes
13   Overcast  Hot   Normal  Weak    Yes
14   Rain      Mild  High    Strong  No

Rules and Decision Trees: We can turn the tree into a set of rules for playing tennis: (outlook = sunny & humidity = normal) | (outlook = overcast) | (outlook = rain & wind = weak). How do we generate the trees? We need to choose which features to test, and in which order.
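A minimal sketch (not from the slides, written in Python for illustration): the play-tennis tree above expressed as nested if/else tests, which is logically equivalent to the rule set.

```python
def play_tennis(outlook, humidity, wind):
    """Decide Yes/No by walking the play-tennis tree."""
    if outlook == "Sunny":
        return "Yes" if humidity == "Normal" else "No"
    elif outlook == "Overcast":
        return "Yes"
    else:  # Rain
        return "Yes" if wind == "Weak" else "No"

print(play_tennis("Sunny", "Normal", "Strong"))  # Yes
print(play_tennis("Rain", "High", "Strong"))     # No
```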

A tree-structured classification rule for a medical example (figure)

The construction of a tree involves three elements: 1. The selection of the splits. 2. The decision of when to declare a node terminal or to continue splitting. 3. The assignment of each terminal node to one of the classes.

Goodness of split: The goodness of a split is measured by an impurity function defined for each node. Intuitively, we want each leaf node to be "pure", that is, dominated by one class.

How to determine the best split: Before splitting there are 10 records of class 0 and 10 records of class 1. Which test condition is the best? (figure: candidate test conditions)

How to determine the best split: Greedy approach: nodes with a homogeneous class distribution are preferred, so we need a measure of node impurity. A non-homogeneous node has a high degree of impurity; a homogeneous node has a low degree of impurity.

Measures of Node Impurity: Entropy and the Gini index.

Entropy: Let F be a feature with possible values f1, ..., fn, and let p be the probability distribution of F; usually p is simply given by a histogram (p1, ..., pn), where pi is the proportion of the data with F = fi. The entropy of p tells us how much extra information we get from knowing the value of the feature, i.e., F = fi, for a given data point. It measures the amount of impurity in the data, so it makes sense to pick the feature that provides the most information.

E.g., if F is a feature with two possible values +1 and -1, and p1 = 1 and p2 = 0, then we get no new information from knowing that F = +1 for a given example, so the entropy is zero. If p1 = 0.5 and p2 = 0.5, then the entropy is at its maximum.

Entropy and Decision Trees: Entropy at a given node t is Entropy(t) = - sum_j p(j|t) log2 p(j|t), where p(j|t) is the relative frequency of class j at node t. It measures the homogeneity of a node: it is maximum (log2 nc for nc classes) when records are equally distributed among all classes, implying the least information, and minimum (0.0) when all records belong to one class, implying the most information. Entropy-based computations are similar to the Gini index computations.

Examples for computing Entropy (class counts over 6 records per node):
P(C1) = 0/6 = 0, P(C2) = 6/6 = 1: Entropy = - 0 log2 0 - 1 log2 1 = 0
P(C1) = 1/6, P(C2) = 5/6: Entropy = - (1/6) log2 (1/6) - (5/6) log2 (5/6) = 0.65
P(C1) = 2/6, P(C2) = 4/6: Entropy = - (2/6) log2 (2/6) - (4/6) log2 (4/6) = 0.92
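A quick check of these numbers (a Python sketch, not part of the slides); entropy is computed from the per-class record counts at a node.

```python
from math import log2

def entropy(counts):
    """Entropy of a node given its per-class record counts."""
    n = sum(counts)
    # zero counts are skipped, i.e. 0 log 0 is treated as 0
    return sum(-c / n * log2(c / n) for c in counts if c > 0)

print(round(entropy([0, 6]), 2))  # 0.0
print(round(entropy([1, 5]), 2))  # 0.65
print(round(entropy([2, 4]), 2))  # 0.92
```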

Splitting based on Information Gain: when a parent node p with n records is split into k partitions, where ni is the number of records in partition i, the gain is GAIN_split = Entropy(p) - sum_{i=1..k} (ni/n) Entropy(i). It measures the reduction in entropy achieved because of the split; choose the split that achieves the most reduction (maximizes GAIN). Used in ID3 and C4.5. Disadvantage: it tends to prefer splits that result in a large number of partitions, each being small but pure.

Information Gain: Choose the feature that provides the highest information gain over all examples. That is all there is to ID3: at each stage, pick the feature with the highest information gain.

Example: Values(Wind) = {Weak, Strong}, S = [9+, 5-], S_Weak = [6+, 2-], S_Strong = [3+, 3-].
Gain(S, Wind) = Entropy(S) - (8/14) Entropy(S_Weak) - (6/14) Entropy(S_Strong)
= 0.94 - (8/14)(0.811) - (6/14)(1.00) = 0.048 (approx.)
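The same computation as a short Python sketch (not from the slides), using per-class counts [positive, negative] taken from the table on the following slide.

```python
from math import log2

def entropy(counts):
    """Entropy of a node given its per-class record counts."""
    n = sum(counts)
    return sum(-c / n * log2(c / n) for c in counts if c > 0)

S, S_weak, S_strong = [9, 5], [6, 2], [3, 3]
gain_wind = (entropy(S)
             - sum(S_weak) / sum(S) * entropy(S_weak)
             - sum(S_strong) / sum(S) * entropy(S_strong))
print(round(gain_wind, 3))  # 0.048
```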

Day  Outlook   Temp  Humid   Wind    Play?
1    Sunny     Hot   High    Weak    No
2    Sunny     Hot   High    Strong  No
3    Overcast  Hot   High    Weak    Yes
4    Rain      Mild  High    Weak    Yes
5    Rain      Cool  Normal  Weak    Yes
6    Rain      Cool  Normal  Strong  No
7    Overcast  Cool  Normal  Strong  Yes
8    Sunny     Mild  High    Weak    No
9    Sunny     Cool  Normal  Weak    Yes
10   Rain      Mild  Normal  Weak    Yes
11   Sunny     Mild  Normal  Strong  Yes
12   Overcast  Mild  High    Strong  Yes
13   Overcast  Hot   Normal  Weak    Yes
14   Rain      Mild  High    Strong  No

ID3 (Quinlan): Searches the space of all possible trees with a greedy search and no backtracking, so it is susceptible to local minima. It uses all features and does no pruning. It can deal with noise; leaf labels are the most common value of the examples at that leaf.

ID3 (schematic): the root tests the feature F with the highest Gain(S, F). For a value v of F where the subset Sv contains only examples of category c1, the branch becomes a leaf node labelled c1. For another value w, the branch leads to a node testing the feature G with the highest Gain(Sw, G), and so on for the remaining values down the tree.
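A compact sketch of the ID3 recursion described above (Python, not from the slides; the dataset is assumed to be a list of dicts and the result a nested dict mapping each tested feature to its value branches).

```python
from math import log2
from collections import Counter

def entropy(labels):
    n = len(labels)
    return sum(-c / n * log2(c / n) for c in Counter(labels).values())

def gain(examples, feature, target):
    """Reduction in entropy of `target` achieved by splitting on `feature`."""
    total = entropy([e[target] for e in examples])
    for value in {e[feature] for e in examples}:
        subset = [e[target] for e in examples if e[feature] == value]
        total -= len(subset) / len(examples) * entropy(subset)
    return total

def id3(examples, features, target):
    """Return a tree as {feature: {value: subtree-or-leaf-label}}."""
    labels = [e[target] for e in examples]
    if len(set(labels)) == 1:                 # pure node: make a leaf
        return labels[0]
    if not features:                          # nothing left to test: majority label
        return Counter(labels).most_common(1)[0][0]
    best = max(features, key=lambda f: gain(examples, f, target))
    tree = {best: {}}
    for value in {e[best] for e in examples}:  # one branch per observed value
        subset = [e for e in examples if e[best] == value]
        tree[best][value] = id3(subset, [f for f in features if f != best], target)
    return tree
```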

Search (figure)

Example: the root [9+, 5-] is split on Outlook. Sunny: [2+, 3-], still impure (?). Overcast: [4+, 0-], a Yes leaf. Rain: [3+, 2-], still impure (?).

Inductive Bias: How does the algorithm generalise from the training examples? It chooses features with the highest information gain, minimising the amount of information that is left. This gives a bias towards shorter trees (Occam's Razor) and puts the most useful features near the root.

Missing Data: Suppose that one feature of a test point has no value. We can skip the node that tests it and carry on down the tree, following all paths out of that node, and can therefore still get a classification. This is virtually impossible with neural networks.
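One way to realise this (a hedged sketch, not the book's code): using the nested-dict tree format from the ID3 sketch above, follow every branch of a node whose feature is missing and take a majority vote over the answers.

```python
from collections import Counter

def classify(tree, example):
    """Walk a {feature: {value: subtree}} tree; on a missing feature,
    follow all branches and return the majority of the resulting labels."""
    if not isinstance(tree, dict):                      # reached a leaf label
        return tree
    feature = next(iter(tree))
    branches = tree[feature]
    if feature in example:
        return classify(branches[example[feature]], example)
    votes = [classify(subtree, example) for subtree in branches.values()]
    return Counter(votes).most_common(1)[0][0]
```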

C4.5: An improved version of ID3, also by Quinlan. It uses a validation set to avoid overfitting. One option is simply to stop choosing features (early stopping), but better results come from post-pruning: build the whole tree, then chop off some parts of it afterwards.

Post-Pruning: Run over the tree, pruning each node by replacing the subtree below it with a leaf; evaluate the error and keep the change if the error is the same or better.

Rule Post-Pruning: Turn the tree into a set of if-then rules. Remove the preconditions from each rule in turn and check the accuracy. Sort the rules according to accuracy. Rules are easy to read.

Rule Post-Pruning example: IF ((outlook = sunny) & (humidity = high)) THEN playTennis = no. Remove preconditions: consider IF (outlook = sunny) and IF (humidity = high) separately and test the accuracy; if one of them is better, try removing both.

ID3 Decision Tree:
Outlook = Sunny: test Humidity (High -> No, Normal -> Yes)
Outlook = Overcast: Yes
Outlook = Rain: test Wind (Strong -> No, Weak -> Yes)

Test Case: Outlook = Sunny, Temperature = Cool, Humidity = High, Wind = Strong.
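Walking the tree above on this test case (a small Python sketch, not from the slides; the tree is written as a nested dict):

```python
tree = {"Outlook": {
    "Sunny":    {"Humidity": {"High": "No", "Normal": "Yes"}},
    "Overcast": "Yes",
    "Rain":     {"Wind": {"Strong": "No", "Weak": "Yes"}},
}}
test = {"Outlook": "Sunny", "Temperature": "Cool",
        "Humidity": "High", "Wind": "Strong"}

node = tree
while isinstance(node, dict):          # descend until we reach a leaf label
    feature = next(iter(node))
    node = node[feature][test[feature]]
print(node)  # No: sunny outlook and high humidity mean no tennis
```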

Party Example, Section 6.4, p. 147: Construct a decision tree based on the party data in the book (features Deadline with values Urgent/Near/None, Party with values Yes/No, and Lazy with values Yes/No; the target Activity takes the values Party, Study, Pub, and TV).

Measure of Impurity: Gini. The Gini index for a given node t is GINI(t) = 1 - sum_j [p(j|t)]^2, where p(j|t) is the relative frequency of class j at node t. It is maximum (1 - 1/nc) when records are equally distributed among all classes, implying the least interesting information, and minimum (0.0) when all records belong to one class, implying the most interesting information.

Examples for computing Gini (class counts over 6 records per node):
P(C1) = 0/6 = 0, P(C2) = 6/6 = 1: Gini = 1 - P(C1)^2 - P(C2)^2 = 1 - 0 - 1 = 0
P(C1) = 1/6, P(C2) = 5/6: Gini = 1 - (1/6)^2 - (5/6)^2 = 0.278
P(C1) = 2/6, P(C2) = 4/6: Gini = 1 - (2/6)^2 - (4/6)^2 = 0.444
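A quick check of these values (Python sketch, not part of the slides):

```python
def gini(counts):
    """Gini index of a node given its per-class record counts."""
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

print(round(gini([0, 6]), 3))  # 0.0
print(round(gini([1, 5]), 3))  # 0.278
print(round(gini([2, 4]), 3))  # 0.444
```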

Splitting Based on Gini: Used in CART, SLIQ, and SPRINT. When a node p is split into k partitions (children), the quality of the split is computed as GINI_split = sum_{i=1..k} (ni/n) GINI(i), where ni is the number of records at child i and n is the number of records at node p.

Binary Attributes: Computing the Gini Index. A binary test B? (Yes/No) splits the node into two partitions, N1 and N2; larger and purer partitions are sought. With class counts N1 = (C1: 5, C2: 2) and N2 = (C1: 1, C2: 4):
Gini(N1) = 1 - (5/7)^2 - (2/7)^2 = 0.408
Gini(N2) = 1 - (1/5)^2 - (4/5)^2 = 0.320
Gini(Children) = 7/12 * 0.408 + 5/12 * 0.320 = 0.371
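Verifying the weighted Gini of the two children (Python sketch, not from the slides; the class counts are those used above):

```python
def gini(counts):
    """Gini index of a node given its per-class record counts."""
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

n1, n2 = [5, 2], [1, 4]                    # [C1, C2] counts in each child
total = sum(n1) + sum(n2)
gini_split = sum(n1) / total * gini(n1) + sum(n2) / total * gini(n2)
print(round(gini(n1), 3), round(gini(n2), 3), round(gini_split, 3))
# 0.408 0.32 0.371
```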

Categorical Attributes: Computing the Gini Index. For each distinct value, gather the counts for each class in the dataset and use the count matrix to make decisions, either with a two-way split (find the best partition of values) or a multi-way split.

Homework: Written homework for everyone, due on Oct. 5: Problem 6.3, p. 151. Matlab homework: demonstrate decision tree learning and classification on the party example.