Learning with Identification Trees Artificial Intelligence CMSC 25000

  • Slides: 28
Download presentation
Learning with Identification Trees Artificial Intelligence CMSC 25000 February 7, 2002

Learning with Identification Trees Artificial Intelligence CMSC 25000 February 7, 2002

Agenda • Midterm results • Learning from examples – Nearest Neighbor reminder – Identification

Agenda • Midterm results • Learning from examples – Nearest Neighbor reminder – Identification Trees: • • • Basic characteristics Sunburn example From trees to rules Learning by minimizing heterogeneity Analysis: Pros & Cons

Midterm Results Mean: 62. 5; Std. Dev. : 19. 5

Midterm Results Mean: 62. 5; Std. Dev. : 19. 5

Machine Learning: Review • Learning: – Automatically acquire a function from inputs to output

Machine Learning: Review • Learning: – Automatically acquire a function from inputs to output values, based on previously seen inputs and output values. – Input: Vector of feature values – Output: Value • Examples: Word pronunciation, robot motion, speech recognition

Machine Learning: Review • Key contrasts: – Supervised versus Unsupervised • With or without

Machine Learning: Review • Key contrasts: – Supervised versus Unsupervised • With or without labeled examples (known outputs) – Classification versus Regression • Output values: Discrete versus continuous-valued – Types of functions learned • aka “Inductive Bias” • Learning algorithm restricts things that can be learned

Machine Learning: Review • Key issues: – Feature selection: • What features should be

Machine Learning: Review • Key issues: – Feature selection: • What features should be used? • How do they relate to each other? • How sensitive is the technique to feature selection? – Irrelevant, noisy, absent feature; feature types – Complexity & Generalization • Tension between – Matching training data – Performing well on NEW UNSEEN inputs

Learning: Nearest Neighbor • Supervised, Classification or Regression, Vornoi diagrams • Training: – Record

Learning: Nearest Neighbor • Supervised, Classification or Regression, Vornoi diagrams • Training: – Record input vectors and associated outputs • Prediction: – Find “nearest” training vector to NEW input – Return associated output value • Advantages: Fast training, Very general • Disadvantages: Expensive prediction, definition of distance is complex, sensitive to feature & classification noise

Learning: Identification Trees • • (aka Decision Trees) Supervised learning Primarily classification Rectangular decision

Learning: Identification Trees • • (aka Decision Trees) Supervised learning Primarily classification Rectangular decision boundaries – More restrictive than nearest neighbor • Robust to irrelevant attributes, noise • Fast prediction

Sunburn Example

Sunburn Example

Learning about Sunburn • Goal: – Train on labeled examples – Predict Burn/None for

Learning about Sunburn • Goal: – Train on labeled examples – Predict Burn/None for new instances • Solution? ? – Exact match: same features, same output • Problem: 2*3^3 feature combinations – Could be much worse – Nearest Neighbor style • Problem: What’s close? Which features matter? – Many match on two features but differ on result

Learning about Sunburn • Better Solution: – Identification tree: – Training: • Divide examples

Learning about Sunburn • Better Solution: – Identification tree: – Training: • Divide examples into subsets based on feature tests • Sets of samples at leaves define classification – Prediction: • Route NEW instance through tree to leaf based on feature tests • Assign same value as samples at leaf

Sunburn Identification Tree Hair Color Blonde Brown Red Emily: Burn Lotion Used No Sarah:

Sunburn Identification Tree Hair Color Blonde Brown Red Emily: Burn Lotion Used No Sarah: Burn Annie: Burn Yes Katie: None Dana: None Alex: None John: None Pete: None

Simplicity • Occam’s Razor: – Simplest explanation that covers the data is best •

Simplicity • Occam’s Razor: – Simplest explanation that covers the data is best • Occam’s Razor for ID trees: – Smallest tree consistent with samples will be best predictor for new data • Problem: – Finding all trees & finding smallest: Expensive! • Solution: – Greedily build a small tree

Building ID Trees • Goal: Build a small tree such that all samples at

Building ID Trees • Goal: Build a small tree such that all samples at leaves have same class • Greedy solution: – At each node, pick test such that branches are closest to having same class • Split into subsets with least “disorder” – (Disorder ~ Entropy) – Find test that minimizes disorder

Minimizing Disorder Hair Color Blonde Sarah: B Dana: N Annie: B Katie: N Brown

Minimizing Disorder Hair Color Blonde Sarah: B Dana: N Annie: B Katie: N Brown Red Emily: B Alex: N Pete: N John: N Sarah: B Katie: N Average Dana: N Alex: N Annie: B Short Alex: N Annie: B Katie: N Average Sarah: B Emily: B John: N Tall Dana: N Pete: N Lotion Weight Light Heavy Emily: B Pete: N John: N No Sarah: B Annie: B Emily: B Pete: N John: N Yes Dana: N Alex: N Katie: N

Minimizing Disorder Height Short Annie: B Katie: N Sarah: B Katie: N Average Dana:

Minimizing Disorder Height Short Annie: B Katie: N Sarah: B Katie: N Average Dana: N Annie: B Sarah: B Dana: N Lotion Weight Light Average Tall Heavy No Sarah: B Annie: B Yes Dana: N Katie: N

Measuring Disorder • Problem: – In general, tests on large DB’s don’t yield homogeneous

Measuring Disorder • Problem: – In general, tests on large DB’s don’t yield homogeneous subsets • Solution: – General information theoretic measure of disorder – Desired features: • Homogeneous set: least disorder = 0 • Even split: most disorder = 1

Measuring Entropy • If split m objects into 2 bins size m 1 &

Measuring Entropy • If split m objects into 2 bins size m 1 & m 2, what is the entropy?

Measuring Disorder Entropy the probability of being in bin i Entropy (disorder) of a

Measuring Disorder Entropy the probability of being in bin i Entropy (disorder) of a split Assume p 1 p 2 Entropy 1 0 -1 log 21 - 0 log 20 = 0 - 0 = 0 ½ ½ ¼ ¾ -½ log 2½ - ½ log 2½ = ½ +½ = 1 -¼ log 2¼ - ¾ log 2¾ = 0. 5 + 0. 311 = 0. 811

Computing Disorder N instances Branch 1 N 1 a N 1 b Fraction of

Computing Disorder N instances Branch 1 N 1 a N 1 b Fraction of samples down branch i Branch 2 N 2 a N 2 b Disorder of class distribution on branch i

Entropy in Sunburn Example Hair color = 4/8(-2/4 log 2/4 - 2/4 log 2/4)

Entropy in Sunburn Example Hair color = 4/8(-2/4 log 2/4 - 2/4 log 2/4) + 1/8*0 + 3/8 *0 = 0. 5 Height = 0. 69 Weight = 0. 94 Lotion = 0. 61

Entropy in Sunburn Example Height Weight Lotion = 2/4(-1/2 log 1/2) + 1/4*0+1/4*0 =

Entropy in Sunburn Example Height Weight Lotion = 2/4(-1/2 log 1/2) + 1/4*0+1/4*0 = 0. 5 = 2/4(-1/2 log 1/2) +2/4(-1/2 log 1/2) = 1 =0

Building ID Trees with Disorder • Until each leaf is as homogeneous as possible

Building ID Trees with Disorder • Until each leaf is as homogeneous as possible – Select an inhomogeneous leaf node – Replace that leaf node by a test node creating subsets with least average disorder • Effectively creates set of rectangular regions – Repeatedly draws lines in different axes

Features in ID Trees: Pros • Feature selection: – Tests features that yield low

Features in ID Trees: Pros • Feature selection: – Tests features that yield low disorder • E. g. selects features that are important! – Ignores irrelevant features • Feature type handling: – Discrete type: 1 branch per value – Continuous type: Branch on >= value • Need to search to find best breakpoint • Absent features: Distribute uniformly

Features in ID Trees: Cons • Features – Assumed independent – If want group

Features in ID Trees: Cons • Features – Assumed independent – If want group effect, must model explicitly • E. g. make new feature Aor. B • Feature tests conjunctive

From Trees to Rules • Tree: – Branches from root to leaves = –

From Trees to Rules • Tree: – Branches from root to leaves = – Tests => classifications – Tests = if antecedents; Leaf labels= consequent – All ID trees-> rules; Not all rules as trees

From ID Trees to Rules Hair Color Blonde Brown Red Emily: Burn Lotion Used

From ID Trees to Rules Hair Color Blonde Brown Red Emily: Burn Lotion Used No Sarah: Burn Annie: Burn Yes Alex: None John: None Pete: None Katie: None Dana: None (if (equal haircolor blonde) (equal lotionused yes) (then None)) (if (equal haircolor blonde) (equal lotionused no) (then Burn)) (if (equal haircolor red) (then Burn)) (if (equal haircolor brown) (then None))

Identification Trees • Train: – Build tree by forming subsets of least disorder •

Identification Trees • Train: – Build tree by forming subsets of least disorder • Predict: – Traverse tree based on feature tests – Assign leaf node sample label • Pros: Robust to irrelevant features, some noise, fast prediction, perspicuous rule reading • Cons: Poor feature combination, dependency, optimal tree build intractable