
Decision Trees
- Decision tree representation
- ID3 learning algorithm
- Entropy, information gain
- Overfitting

Introduction
Goal: categorization
- Given an event, predict its category. Examples:
  - Who won a given ball game?
  - How should we file a given email?
  - What word sense was intended for a given occurrence of a word?
- Event = list of features. Examples:
  - Ball game: Which players were on offense?
  - Email: Who sent the email?
  - Disambiguation: What was the preceding word?

Introduction
- Use a decision tree to predict categories for new events.
- Use training data to build the decision tree.
(Diagram: Training Events and Categories → Decision Tree; New Events → Decision Tree → Category)

Decision Tree for PlayTennis
- Each internal node tests an attribute
- Each branch corresponds to an attribute value
- Each leaf node assigns a classification
(Tree: Outlook → Sunny → Humidity {High: No, Normal: Yes}; Overcast → Yes; Rain → Wind {Strong: No, Weak: Yes})
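
Such a tree can be written down as a small nested data structure and walked to classify an example. A minimal Python sketch of the PlayTennis tree above (the representation and names are illustrative, not from the slides):

```python
# Minimal sketch (not from the slides): an internal node is
# ("attribute", {value: subtree, ...}); a leaf is just a class label.
PLAY_TENNIS_TREE = ("Outlook", {
    "Sunny":    ("Humidity", {"High": "No", "Normal": "Yes"}),
    "Overcast": "Yes",
    "Rain":     ("Wind", {"Strong": "No", "Weak": "Yes"}),
})

def classify(tree, example):
    """Follow the branch matching the example's value for each tested attribute
    until a leaf (a class label) is reached."""
    while isinstance(tree, tuple):
        attribute, branches = tree
        tree = branches[example[attribute]]
    return tree

print(classify(PLAY_TENNIS_TREE,
               {"Outlook": "Sunny", "Humidity": "Normal", "Wind": "Weak"}))  # Yes
```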

Word Sense Disambiguation
- Given an occurrence of a word, decide which sense, or meaning, was intended.
- Example: "run"
  - run1: move swiftly (I ran to the store.)
  - run2: operate (I run a store.)
  - run3: flow (Water runs from the spring.)
  - run4: length of torn stitches (Her stockings had a run.)
  - etc.

Word Sense Disambiguation
- Categories
  - Use word sense labels (run1, run2, etc.) to name the possible categories.
- Features
  - Features describe the context of the word we want to disambiguate. Possible features include:
    - near(w): is the given word near an occurrence of word w?
    - pos: the word's part of speech
    - left(w): is the word immediately preceded by the word w?
    - etc.

Word Sense Disambiguation
Example decision tree:
- pos = noun → near(stocking)? yes: run4, no: …
- pos = verb → near(race)? yes: run1, no: near(river)? yes: run3, no: …
(Note: decision trees for WSD tend to be quite large)

WSD: Sample Training Data

| pos  | near(race) | near(river) | near(stockings) | Word Sense |
|------|------------|-------------|-----------------|------------|
| noun | no         | no          | no              | run4       |
| verb | no         | no          | no              | run1       |
| verb | no         | yes         | no              | run3       |
| noun | yes        | yes         |                 | run4       |
| verb | no         | no          | yes             | run1       |
| verb | yes        | no          |                 | run2       |
| verb | no         | yes         |                 | run3       |

Decision Tree for Conjunction: Outlook=Sunny ∧ Wind=Weak
Outlook: Sunny → Wind {Strong: No, Weak: Yes}; Overcast → No; Rain → No

Decision Tree for Disjunction: Outlook=Sunny ∨ Wind=Weak
Outlook: Sunny → Yes; Overcast → Wind {Strong: No, Weak: Yes}; Rain → Wind {Strong: No, Weak: Yes}

Decision Tree for XOR: Outlook=Sunny XOR Wind=Weak
Outlook: Sunny → Wind {Strong: Yes, Weak: No}; Overcast → Wind {Strong: No, Weak: Yes}; Rain → Wind {Strong: No, Weak: Yes}

Decision Tree
- Decision trees represent disjunctions of conjunctions:
  (Outlook=Sunny ∧ Humidity=Normal) ∨ (Outlook=Overcast) ∨ (Outlook=Rain ∧ Wind=Weak)
(Tree: Outlook → Sunny → Humidity {High: No, Normal: Yes}; Overcast → Yes; Rain → Wind {Strong: No, Weak: Yes})

When to consider Decision Trees
- Instances describable by attribute-value pairs
- Target function is discrete valued
- Disjunctive hypothesis may be required
- Possibly noisy training data
- Missing attribute values
Examples:
- Medical diagnosis
- Credit risk analysis
- Object classification for robot manipulator (Tan 1993)

Top-Down Induction of Decision Trees (ID3)
1. A ← the "best" decision attribute for the next node
2. Assign A as the decision attribute for the node
3. For each value of A, create a new descendant
4. Sort training examples to the leaf nodes according to the attribute value of the branch
5. If all training examples are perfectly classified (same value of the target attribute), stop; else iterate over the new leaf nodes
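
The same loop can be written as a short recursive procedure. This is a hedged sketch, not the slides' own code: examples are assumed to be dicts, and the choice of the "best" attribute is passed in as a function (e.g. information gain, defined on the next slides).

```python
from collections import Counter

def id3(examples, attributes, target, choose_attribute):
    """Recursive sketch of the steps above; illustrative, not the slides' code.
    Returns either a class label (leaf) or ("attribute", {value: subtree})."""
    labels = [ex[target] for ex in examples]
    if len(set(labels)) == 1:            # all examples perfectly classified: stop
        return labels[0]
    if not attributes:                   # no attributes left: majority class
        return Counter(labels).most_common(1)[0][0]
    best = choose_attribute(examples, attributes, target)        # step 1: "best" attribute
    branches = {}
    for value in {ex[best] for ex in examples}:                  # step 3: new descendants
        subset = [ex for ex in examples if ex[best] == value]    # step 4: sort examples
        rest = [a for a in attributes if a != best]
        branches[value] = id3(subset, rest, target, choose_attribute)  # step 5: iterate
    return (best, branches)                                      # step 2: node tests `best`
```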
![Which attribute is best?](https://slidetodoc.com/presentation_image/35a64914bbfb93668362762818740eff/image-15.jpg)
Which attribute is best?
S = [29+, 35-]
- A1=? : True → [21+, 5-], False → [8+, 30-]
- A2=? : True → [18+, 33-], False → [11+, 2-]

Entropy
- S is a sample of training examples
- p+ is the proportion of positive examples
- p- is the proportion of negative examples
- Entropy measures the impurity of S:
  Entropy(S) = -p+ log2 p+ - p- log2 p-

Entropy(S) = the expected number of bits needed to encode the class (+ or -) of a randomly drawn member of S (under the optimal, shortest-length code).
Why?
- Information theory: an optimal-length code assigns -log2 p bits to a message having probability p.
- So the expected number of bits to encode the class (+ or -) of a random member of S is:
  -p+ log2 p+ - p- log2 p-
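
As a quick sanity check (a sketch, not from the slides), the formula can be computed directly from the positive/negative counts used throughout these slides:

```python
from math import log2

def entropy(pos, neg):
    """Entropy(S) = -p+ log2 p+ - p- log2 p-, with 0*log2(0) taken as 0."""
    total = pos + neg
    return sum(-c / total * log2(c / total) for c in (pos, neg) if c)

print(round(entropy(29, 35), 2))  # 0.99, the [29+, 35-] collection
print(round(entropy(9, 5), 2))    # 0.94, the PlayTennis collection used later
```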

Information Gain
- Gain(S, A): the expected reduction in entropy due to sorting S on attribute A
Entropy([29+, 35-]) = -29/64 log2(29/64) - 35/64 log2(35/64) = 0.99
- A1=? : True → [21+, 5-], False → [8+, 30-]
- A2=? : True → [18+, 33-], False → [11+, 2-]
![Information Gain](https://slidetodoc.com/presentation_image/35a64914bbfb93668362762818740eff/image-19.jpg)
Information Gain
Entropy([18+, 33-]) = 0.94   Entropy([21+, 5-]) = 0.71
Entropy([11+, 2-]) = 0.62    Entropy([8+, 30-]) = 0.74
Gain(S, A1) = Entropy(S) - 26/64·Entropy([21+, 5-]) - 38/64·Entropy([8+, 30-]) = 0.27
Gain(S, A2) = Entropy(S) - 51/64·Entropy([18+, 33-]) - 13/64·Entropy([11+, 2-]) = 0.12
- A1=? : True → [21+, 5-], False → [8+, 30-]
- A2=? : True → [18+, 33-], False → [11+, 2-]
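
These numbers follow from Gain(S, A) = Entropy(S) - Σv (|Sv|/|S|)·Entropy(Sv). A small check in Python (a sketch; A1/A2 and the counts are taken from the figure):

```python
from math import log2

def entropy(pos, neg):
    total = pos + neg
    return sum(-c / total * log2(c / total) for c in (pos, neg) if c)

def gain(parent, children):
    """parent and children are (pos, neg) counts; children are the subsets Sv."""
    n = sum(parent)
    return entropy(*parent) - sum(sum(c) / n * entropy(*c) for c in children)

# A1 splits [29+, 35-] into [21+, 5-] (True) and [8+, 30-] (False)
print(round(gain((29, 35), [(21, 5), (8, 30)]), 2))   # 0.27
# A2 splits [29+, 35-] into [18+, 33-] (True) and [11+, 2-] (False)
print(round(gain((29, 35), [(18, 33), (11, 2)]), 2))  # 0.12
```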

Training Examples

| Day | Outlook  | Temp. | Humidity | Wind   | PlayTennis |
|-----|----------|-------|----------|--------|------------|
| D1  | Sunny    | Hot   | High     | Weak   | No         |
| D2  | Sunny    | Hot   | High     | Strong | No         |
| D3  | Overcast | Hot   | High     | Weak   | Yes        |
| D4  | Rain     | Mild  | High     | Weak   | Yes        |
| D5  | Rain     | Cool  | Normal   | Weak   | Yes        |
| D6  | Rain     | Cool  | Normal   | Strong | No         |
| D7  | Overcast | Cool  | Normal   | Strong | Yes        |
| D8  | Sunny    | Mild  | High     | Weak   | No         |
| D9  | Sunny    | Cool  | Normal   | Weak   | Yes        |
| D10 | Rain     | Mild  | Normal   | Weak   | Yes        |
| D11 | Sunny    | Mild  | Normal   | Strong | Yes        |
| D12 | Overcast | Mild  | High     | Strong | Yes        |
| D13 | Overcast | Hot   | Normal   | Weak   | Yes        |
| D14 | Rain     | Mild  | High     | Strong | No         |
![Selecting the Next Attribute](https://slidetodoc.com/presentation_image/35a64914bbfb93668362762818740eff/image-21.jpg)
Selecting the Next Attribute
S = [9+, 5-], E = 0.940
Humidity: High → [3+, 4-], E = 0.985; Normal → [6+, 1-], E = 0.592
  Gain(S, Humidity) = 0.940 - (7/14)·0.985 - (7/14)·0.592 = 0.151
Wind: Weak → [6+, 2-], E = 0.811; Strong → [3+, 3-], E = 1.0
  Gain(S, Wind) = 0.940 - (8/14)·0.811 - (6/14)·1.0 = 0.048
Humidity provides greater information gain than Wind, w.r.t. the target classification.
![Selecting the Next Attribute](https://slidetodoc.com/presentation_image/35a64914bbfb93668362762818740eff/image-22.jpg)
Selecting the Next Attribute
S = [9+, 5-], E = 0.940
Outlook: Sunny → [2+, 3-], E = 0.971; Overcast → [4+, 0-], E = 0.0; Rain → [3+, 2-], E = 0.971
Gain(S, Outlook) = 0.940 - (5/14)·0.971 - (4/14)·0.0 - (5/14)·0.971 = 0.247

Selecting the Next Attribute
The information gain values for the 4 attributes are:
- Gain(S, Outlook) = 0.247
- Gain(S, Humidity) = 0.151
- Gain(S, Wind) = 0.048
- Gain(S, Temperature) = 0.029
where S denotes the collection of training examples.
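
These gains can be reproduced from the training table on the previous slide. A sketch (the dataset literal and helper names are mine, not the slides'):

```python
from math import log2

# (Outlook, Temperature, Humidity, Wind, PlayTennis) for D1..D14.
DATA = [
    ("Sunny", "Hot", "High", "Weak", "No"),          ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),      ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),       ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"), ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),      ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"),    ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"),    ("Rain", "Mild", "High", "Strong", "No"),
]
ATTRS = {"Outlook": 0, "Temperature": 1, "Humidity": 2, "Wind": 3}

def entropy(rows):
    n, pos = len(rows), sum(r[-1] == "Yes" for r in rows)
    return sum(-c / n * log2(c / n) for c in (pos, n - pos) if c)

def gain(rows, attr):
    """Entropy(S) minus the weighted entropy of the subsets Sv, one per value of attr."""
    i, remainder = ATTRS[attr], 0.0
    for value in {r[i] for r in rows}:
        subset = [r for r in rows if r[i] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy(rows) - remainder

for a in ATTRS:
    print(a, round(gain(DATA, a), 3))
# Outlook 0.247, Temperature 0.029, Humidity 0.152, Wind 0.048
# (the slide's 0.151 for Humidity comes from rounding the intermediate entropies)
```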
![ID3 Algorithm](https://slidetodoc.com/presentation_image/35a64914bbfb93668362762818740eff/image-24.jpg)
ID3 Algorithm
[D1, D2, …, D14], [9+, 5-], split on Outlook:
- Sunny → Ssunny = [D1, D2, D8, D9, D11], [2+, 3-] → ?
- Overcast → [D3, D7, D12, D13], [4+, 0-] → Yes
- Rain → [D4, D5, D6, D10, D14], [3+, 2-] → ?
Gain(Ssunny, Humidity) = 0.970 - (3/5)·0.0 - (2/5)·0.0 = 0.970
Gain(Ssunny, Temp.) = 0.970 - (2/5)·0.0 - (2/5)·1.0 - (1/5)·0.0 = 0.570
Gain(Ssunny, Wind) = 0.970 - (2/5)·1.0 - (3/5)·0.918 = 0.019
![ID3 Algorithm](https://slidetodoc.com/presentation_image/35a64914bbfb93668362762818740eff/image-25.jpg)
ID3 Algorithm
Final tree:
- Outlook = Sunny → Humidity: High → No [D1, D2], Normal → Yes [D8, D9, D11] [mistake]
- Outlook = Overcast → Yes [D3, D7, D12, D13]
- Outlook = Rain → Wind: Strong → No [D6, D14], Weak → Yes [D4, D5, D10]
(The "[mistake]" marker on the slide presumably flags D8: in the training table D8 has Humidity = High and PlayTennis = No, so it belongs under the High/No branch, i.e. High → No [D1, D2, D8] and Normal → Yes [D9, D11].)

Occam's Razor
"If two theories explain the facts equally well, then the simpler theory is to be preferred."
Arguments in favor:
- There are fewer short hypotheses than long hypotheses
- A short hypothesis that fits the data is unlikely to be a coincidence
- A long hypothesis that fits the data might be a coincidence
Arguments opposed:
- There are many ways to define small sets of hypotheses

Overfitting
- One of the biggest problems with decision trees is overfitting: a tree grown to fit the training data exactly tends to fit noise and coincidental regularities, and then generalizes poorly to unseen examples.

Avoid Overfitting
- Stop growing when a split is not statistically significant, or
- Grow the full tree, then post-prune
Selecting the "best" tree:
- measure performance over the training data
- measure performance over a separate validation data set
- minimize |tree| + |misclassifications(tree)|
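
The post-pruning option is commonly realized as reduced-error pruning (the subject of the next slide): bottom-up, replace a subtree with a majority-class leaf whenever that does not reduce accuracy on a held-out validation set. A rough sketch, reusing the nested-tuple tree representation from the earlier sketches (names are illustrative):

```python
from collections import Counter

def classify(tree, example):
    while isinstance(tree, tuple):
        attribute, branches = tree
        tree = branches[example[attribute]]
    return tree

def accuracy(tree, examples, target):
    if not examples:
        return 0.0
    return sum(classify(tree, ex) == ex[target] for ex in examples) / len(examples)

def reduced_error_prune(tree, train, validation, target):
    """Prune the children first, then replace this node by the majority-class leaf of
    the training examples reaching it, if that does not hurt accuracy on the
    validation examples reaching it."""
    if not isinstance(tree, tuple):
        return tree
    attribute, branches = tree
    new_branches = {}
    for value, subtree in branches.items():
        tr = [ex for ex in train if ex[attribute] == value]
        va = [ex for ex in validation if ex[attribute] == value]
        new_branches[value] = reduced_error_prune(subtree, tr, va, target) if tr else subtree
    pruned = (attribute, new_branches)
    leaf = Counter(ex[target] for ex in train).most_common(1)[0][0]
    if accuracy(leaf, validation, target) >= accuracy(pruned, validation, target):
        return leaf
    return pruned
```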

Effect of Reduced Error Pruning

Converting a Tree to Rules
(Tree: Outlook → Sunny → Humidity {High: No, Normal: Yes}; Overcast → Yes; Rain → Wind {Strong: No, Weak: Yes})
- R1: If (Outlook=Sunny) ∧ (Humidity=High) Then PlayTennis=No
- R2: If (Outlook=Sunny) ∧ (Humidity=Normal) Then PlayTennis=Yes
- R3: If (Outlook=Overcast) Then PlayTennis=Yes
- R4: If (Outlook=Rain) ∧ (Wind=Strong) Then PlayTennis=No
- R5: If (Outlook=Rain) ∧ (Wind=Weak) Then PlayTennis=Yes

Continuous Valued Attributes
Create a discrete attribute to test a continuous one:
- Temperature = 24.5°C
- (Temperature > 20.0°C) = {true, false}
Where to set the threshold?

| Temperature | 15°C | 18°C | 19°C | 22°C | 24°C | 27°C |
|-------------|------|------|------|------|------|------|
| PlayTennis  | No   | No   | Yes  | Yes  | Yes  | No   |
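
A common way to pick candidate thresholds (consistent with this example, though the slide does not spell it out) is to sort by the attribute and take the midpoint between adjacent values where the class label changes; each candidate then becomes a boolean attribute whose gain is evaluated like any other. A sketch with the temperatures above:

```python
def candidate_thresholds(values, labels):
    """Midpoints between adjacent sorted values whose class labels differ."""
    pairs = sorted(zip(values, labels))
    return [(pairs[i][0] + pairs[i + 1][0]) / 2
            for i in range(len(pairs) - 1)
            if pairs[i][1] != pairs[i + 1][1]]

temperature = [15, 18, 19, 22, 24, 27]          # degrees C, from the table above
play_tennis = ["No", "No", "Yes", "Yes", "Yes", "No"]
print(candidate_thresholds(temperature, play_tennis))  # [18.5, 25.5]
```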

Unknown Attribute Values
What if some examples have missing values of attribute A? Use the training example anyway, and sort it through the tree:
- If node n tests A, assign the most common value of A among the other examples sorted to node n
- Or assign the most common value of A among the examples with the same target value
- Or assign a probability pi to each possible value vi of A, and assign fraction pi of the example to each descendant in the tree
Classify new examples in the same fashion.
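
The last option (weighting the example by the value frequencies pi) can be sketched as follows; the function and its arguments are illustrative, not from the slides:

```python
from collections import Counter

def split_missing(example_weight, attribute, examples_at_node):
    """Return {value: fractional weight} for an example whose value of `attribute`
    is missing, proportional to how often each value occurs at this node."""
    counts = Counter(ex[attribute] for ex in examples_at_node if ex.get(attribute) is not None)
    total = sum(counts.values())
    return {value: example_weight * count / total for value, count in counts.items()}

node = [{"Wind": "Weak"}] * 8 + [{"Wind": "Strong"}] * 6   # e.g. the 14 PlayTennis examples
print(split_missing(1.0, "Wind", node))   # {'Weak': 0.571..., 'Strong': 0.428...}
```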

Cross-Validation
- Estimate the accuracy of a hypothesis induced by a supervised learning algorithm
- Predict the accuracy of a hypothesis over future unseen instances
- Select the optimal hypothesis from a given set of alternative hypotheses
  - Pruning decision trees
  - Model selection
  - Feature selection
- Combining multiple classifiers (boosting)
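
A minimal k-fold cross-validation sketch (illustrative; `train_fn` would be something like the ID3/pruning procedures sketched earlier, and `accuracy_fn` a function like the one in the pruning sketch):

```python
import random

def k_fold_cross_validation(examples, k, train_fn, accuracy_fn, seed=0):
    """Average accuracy over k folds: train on k-1 folds, test on the held-out fold."""
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    scores = []
    for i in range(k):
        test = folds[i]
        train = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        model = train_fn(train)
        scores.append(accuracy_fn(model, test))
    return sum(scores) / k
```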