Data Mining Classification Techniques: Decision Trees (BUSINESS INTELLIGENCE)
Slides prepared by Elizabeth Anglo, DISCS ADMU
Example of a Decision Tree
(Figure: a training data set with categorical attributes Refund and Marital Status, a continuous attribute Taxable Income, and a class attribute, shown next to the model induced from it.)
Model (decision tree): Refund = Yes -> NO; Refund = No -> Marital Status; Married -> NO; Single or Divorced -> Taxable Income (< 80K -> NO, > 80K -> YES).
Structure of a Decision Tree
(Figure: an alternative tree for the same data, with the splitting attributes marked. Marital Status at the root: Married -> NO; Single or Divorced -> Refund (Yes -> NO; No -> Taxable Income: < 80K -> NO, > 80K -> YES).)
There could be more than one tree that fits the same data!
Decision Tree Classification Task
(Figure: the classification framework, a tree-induction algorithm learns a decision tree model from a training set; the model is then applied to a test set.)
Apply Model to Test Data
Start from the root of the tree and, at each internal node, follow the branch that matches the test record's attribute value: Refund (Yes -> NO; No -> Marital Status: Married -> NO; Single or Divorced -> Taxable Income: < 80K -> NO, > 80K -> YES).
Apply Model to Test Data
The traversal ends at a leaf node, which supplies the prediction: assign Cheat to "No" for this test record.
Decision Tree Classification Task
(Figure: the same framework revisited, the learned decision tree is used to assign class labels to the test set.)
Decision Tree Induction
Many algorithms:
- Hunt's Algorithm (one of the earliest)
- CART
- ID3, C4.5
- SLIQ, SPRINT
General Structure of Hunt's Algorithm
Let Dt be the set of training records that reach a node t.
General procedure:
- If Dt contains records that all belong to the same class yt, then t is a leaf node labeled as yt.
- If Dt is an empty set, then t is a leaf node labeled by the default class yd.
- If Dt contains records that belong to more than one class, use an attribute test to split the data into smaller subsets, then recursively apply the procedure to each subset.
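To make the three cases concrete, here is a minimal Python sketch of Hunt's recursive procedure; it is illustrative only, and the record format, the choose-first-attribute rule, and the helper names are assumptions rather than part of the original slides.

    from collections import Counter

    def hunt(records, attributes, default_class=None):
        """Grow a decision tree following Hunt's general procedure.
        records: list of dicts, each with a 'class' key plus attribute values.
        Returns a leaf label or a nested dict {attribute: {value: subtree}}."""
        # Case 2: D_t is empty -> leaf labeled with the default class y_d.
        if not records:
            return default_class
        classes = [r["class"] for r in records]
        majority = Counter(classes).most_common(1)[0][0]
        # Case 1: all records belong to the same class y_t -> leaf labeled y_t.
        # (Also fall back to the majority class when no attributes are left.)
        if len(set(classes)) == 1 or not attributes:
            return majority
        # Case 3: more than one class -> apply an attribute test and recurse.
        attr = attributes[0]  # placeholder choice; a real learner optimizes a split criterion
        remaining = [a for a in attributes if a != attr]
        tree = {attr: {}}
        for value in set(r[attr] for r in records):
            subset = [r for r in records if r[attr] == value]
            tree[attr][value] = hunt(subset, remaining, default_class=majority)
        return tree

    # Toy usage with made-up records in the spirit of the running example.
    data = [
        {"Refund": "Yes", "Marital": "Single", "class": "No"},
        {"Refund": "No", "Marital": "Married", "class": "No"},
        {"Refund": "No", "Marital": "Single", "class": "Yes"},
    ]
    print(hunt(data, ["Refund", "Marital"]))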
Hunt's Algorithm (example)
(Figure: the tree grown step by step on the training data. Start with a single leaf "Don't Cheat"; split on Refund (Yes -> Don't Cheat); split the Refund = No branch on Marital Status (Married -> Don't Cheat); split the Single/Divorced branch on Taxable Income (< 80K -> Don't Cheat, >= 80K -> Cheat).)
Tree Induction
Greedy strategy: split the records based on an attribute test that optimizes a certain criterion.
Issues:
- Determine how to split the records
  - How to specify the attribute test condition?
  - How to determine the best split?
- Determine when to stop splitting
How to Specify the Test Condition?
Depends on attribute type:
- Nominal
- Ordinal
- Continuous
Depends on the number of ways to split:
- 2-way split
- Multi-way split
Splitting Based on Nominal Attributes
Multi-way split: use as many partitions as distinct values, e.g. CarType -> {Family, Sports, Luxury}.
Binary split: divide the values into two subsets and find the optimal partitioning, e.g. {Sports, Luxury} vs. {Family}, or {Family, Luxury} vs. {Sports}.
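As a small illustration of the binary-split case, the following Python sketch (the function and variable names are mine) enumerates every way of dividing a nominal attribute's values into two non-empty subsets; for k distinct values there are 2^(k-1) - 1 such partitions to search.

    from itertools import combinations

    def binary_partitions(values):
        """Yield each split of a set of nominal values into two non-empty subsets."""
        values = list(values)
        rest = values[1:]
        # Fix the first value on the left side so each partition is generated once.
        for r in range(len(rest) + 1):
            for combo in combinations(rest, r):
                left = {values[0], *combo}
                right = set(values) - left
                if right:  # skip the trivial split with an empty right side
                    yield left, right

    for left, right in binary_partitions({"Family", "Sports", "Luxury"}):
        print(left, "vs.", right)
    # -> {'Family'} vs. {'Sports', 'Luxury'}, {'Family', 'Sports'} vs. {'Luxury'}, ...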
Splitting Based on Ordinal Attributes
Multi-way split: use as many partitions as distinct values, e.g. Size -> {Small, Medium, Large}.
Binary split: divide the values into two subsets and find the optimal partitioning, e.g. {Small, Medium} vs. {Large}, or {Medium, Large} vs. {Small}.
What about the split {Small, Large} vs. {Medium}? (It does not preserve the order of the values.)
Splitting Based on Continuous Attributes
Different ways of handling:
- Discretization to form an ordinal categorical attribute.
  - Static: discretize once at the beginning.
  - Dynamic: ranges can be found by equal-interval bucketing, equal-frequency bucketing (percentiles), or clustering.
- Binary decision: (A < v) or (A >= v).
  - Consider all possible splits and find the best cut.
  - Can be more compute intensive.
Splitting Based on Continuous Attributes
(Figure: example test conditions on a continuous attribute such as Taxable Income, a binary split at a single threshold versus a multi-way split into several ranges.)
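A brief sketch of the two dynamic bucketing options mentioned above, using NumPy on made-up taxable-income values (the variable names and the data are mine, for illustration):

    import numpy as np

    taxable_income = np.array([60, 70, 75, 85, 90, 95, 100, 120, 125, 220])

    # Equal-interval bucketing: cut the value range into bins of equal width.
    equal_width_edges = np.linspace(taxable_income.min(), taxable_income.max(), num=4)

    # Equal-frequency bucketing: cut at percentiles so each bin holds roughly the same count.
    equal_freq_edges = np.percentile(taxable_income, [0, 25, 50, 75, 100])

    # np.digitize maps each value to the ordinal bucket it falls into.
    print(np.digitize(taxable_income, equal_width_edges[1:-1]))
    print(np.digitize(taxable_income, equal_freq_edges[1:-1]))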
Tree Induction
Greedy strategy: split the records based on an attribute test that optimizes a certain criterion.
Issues:
- Determine how to split the records
  - How to specify the attribute test condition?
  - How to determine the best split?
- Determine when to stop splitting
How to determine the Best Split
Before splitting: 10 records of class 0 and 10 records of class 1.
Candidate test conditions and the resulting class counts in each child node:
- Own Car? Yes: C0: 6, C1: 4; No: C0: 4, C1: 6
- Car Type? Family: C0: 1, C1: 3; Sports: C0: 8, C1: 0; Luxury: C0: 1, C1: 7
Which test condition is the best?
How to determine the Best Split
Greedy approach: nodes with a homogeneous class distribution are preferred.
We therefore need a measure of node impurity: a non-homogeneous node (classes evenly mixed) has a high degree of impurity, while a homogeneous node has a low degree of impurity.
Measures of Node Impurity
- Gini index
- Entropy
- Misclassification error
How to Find the Best Split
Before splitting, the node has impurity M0. Splitting on attribute A produces nodes N1 and N2 with impurities M1 and M2, combined (weighted by node size) into M12; splitting on attribute B produces nodes N3 and N4 with impurities M3 and M4, combined into M34.
Gain = M0 - M12 vs. M0 - M34: choose the test condition with the larger gain.
Measure of Impurity: GINI
Gini index for a given node t:
GINI(t) = 1 - Σ_j [p(j | t)]^2
where p(j | t) is the relative frequency of class j at node t.
- Maximum (1 - 1/nc, with nc the number of classes) when records are equally distributed among all classes, implying the least interesting information.
- Minimum (0.0) when all records belong to one class, implying the most interesting information.
Examples for computing GINI
- P(C1) = 0/6 = 0, P(C2) = 6/6 = 1: Gini = 1 - P(C1)^2 - P(C2)^2 = 1 - 0 - 1 = 0
- P(C1) = 1/6, P(C2) = 5/6: Gini = 1 - (1/6)^2 - (5/6)^2 = 0.278
- P(C1) = 2/6, P(C2) = 4/6: Gini = 1 - (2/6)^2 - (4/6)^2 = 0.444
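A tiny Python helper (a sketch of my own, not from the slides) that reproduces these values directly from class counts:

    def gini(counts):
        """Gini index of a node from its class counts: 1 - sum_j p(j|t)^2."""
        n = sum(counts)
        if n == 0:
            return 0.0
        return 1.0 - sum((c / n) ** 2 for c in counts)

    print(gini([0, 6]))            # 0.0
    print(round(gini([1, 5]), 3))  # 0.278
    print(round(gini([2, 4]), 3))  # 0.444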
Splitting Based on GINI
Used in CART, SLIQ, and SPRINT.
When a node p is split into k partitions (children), the quality of the split is computed as
GINI_split = Σ_{i=1..k} (n_i / n) GINI(i)
where n_i = number of records at child i and n = number of records at node p.
Binary Attributes: Computing GINI Index
Splits into two partitions. Effect of weighting the partitions: larger and purer partitions are sought.
Example: a binary test B? sends the records to N1 (C1: 5, C2: 2) and N2 (C1: 1, C2: 4).
Gini(N1) = 1 - (5/7)^2 - (2/7)^2 = 0.408
Gini(N2) = 1 - (1/5)^2 - (4/5)^2 = 0.320
Gini(children) = 7/12 × 0.408 + 5/12 × 0.320 = 0.371
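The same computation can be checked with a short, self-contained Python sketch (helper names are mine):

    def gini(counts):
        n = sum(counts)
        return 1.0 - sum((c / n) ** 2 for c in counts)

    def gini_split(children):
        """Weighted Gini of a split: sum over children of (n_i / n) * GINI(child)."""
        n = sum(sum(child) for child in children)
        return sum(sum(child) / n * gini(child) for child in children)

    n1, n2 = [5, 2], [1, 4]                 # class counts [C1, C2] in the two partitions
    print(round(gini(n1), 3))               # 0.408
    print(round(gini(n2), 3))               # 0.320
    print(round(gini_split([n1, n2]), 3))   # 0.371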
Categorical Attributes: Computing GINI Index
- For each distinct value, gather the counts for each class in the dataset.
- Use the resulting count matrix to make decisions, either with a multi-way split (one partition per value) or with a two-way split (find the best partition of the values).
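A sketch of this idea in Python (helper names and the toy data are mine): build the count matrix for a categorical attribute, then score the multi-way split with the weighted Gini. The two-way case would additionally enumerate binary partitions of the values, as sketched earlier.

    from collections import Counter

    def gini(counts):
        n = sum(counts)
        return 1.0 - sum((c / n) ** 2 for c in counts) if n else 0.0

    def count_matrix(values, labels):
        """One row of class counts per distinct value of the categorical attribute."""
        matrix = {v: Counter() for v in set(values)}
        for v, y in zip(values, labels):
            matrix[v][y] += 1
        return matrix

    def multiway_gini(matrix):
        """Weighted Gini of the multi-way split: one child node per distinct value."""
        classes = sorted({c for row in matrix.values() for c in row})
        n = sum(sum(row.values()) for row in matrix.values())
        return sum(
            sum(row.values()) / n * gini([row[c] for c in classes])
            for row in matrix.values()
        )

    car_type = ["Family", "Sports", "Sports", "Luxury", "Family", "Luxury"]
    cheat = ["No", "No", "No", "Yes", "Yes", "No"]
    m = count_matrix(car_type, cheat)
    print(m)                            # class counts per CarType value
    print(round(multiway_gini(m), 3))   # 0.333 for this toy data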
Continuous Attributes: Computing GINI Index
- Use binary decisions based on a single value v.
- There are several choices for the splitting value: the number of possible splitting values equals the number of distinct values.
- Each splitting value v has a count matrix associated with it: the class counts in each of the two partitions, A < v and A >= v.
- Simple method to choose the best v: for each candidate v, scan the database to gather the count matrix and compute its Gini index. This is computationally inefficient because it repeats work.
Continuous Attributes: Computing GINI Index
For efficient computation, for each attribute:
- Sort the records on the attribute's values.
- Linearly scan these sorted values, each time updating the count matrix and computing the Gini index of the candidate split position.
- Choose the split position that has the least Gini index.
(Figure: the sorted attribute values with the candidate split positions between them and the Gini index computed at each position.)
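A Python sketch of this sorted linear scan (function and variable names are mine; the data is made up): the class counts are updated incrementally as the scan moves past each record, so the data is read only once after sorting.

    def best_numeric_split(values, labels):
        """Return (threshold, weighted Gini) of the best binary split A < v vs. A >= v."""
        def gini(counts):
            n = sum(counts.values())
            return 1.0 - sum((c / n) ** 2 for c in counts.values()) if n else 0.0

        data = sorted(zip(values, labels))
        left = {c: 0 for c in set(labels)}     # counts for A < v
        right = dict(left)                     # counts for A >= v
        for _, y in data:
            right[y] += 1
        n = len(data)
        best_v, best_gini = None, float("inf")
        for i in range(1, n):
            y_prev = data[i - 1][1]
            left[y_prev] += 1                  # record i-1 moves to the left partition
            right[y_prev] -= 1
            if data[i][0] == data[i - 1][0]:
                continue                       # no valid cut between equal values
            v = (data[i - 1][0] + data[i][0]) / 2
            weighted = (i / n) * gini(left) + ((n - i) / n) * gini(right)
            if weighted < best_gini:
                best_v, best_gini = v, weighted
        return best_v, best_gini

    incomes = [60, 70, 75, 85, 90, 95, 100, 120, 125, 220]
    cheats = ["No", "No", "No", "Yes", "Yes", "Yes", "No", "No", "No", "No"]
    print(best_numeric_split(incomes, cheats))   # (97.5, 0.3) for this toy data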
Tree Induction
Greedy strategy: split the records based on an attribute test that optimizes a certain criterion.
Issues:
- Determine how to split the records
  - How to specify the attribute test condition?
  - How to determine the best split?
- Determine when to stop splitting
Stopping Criteria for Tree Induction
- Stop expanding a node when all of its records belong to the same class.
- Stop expanding a node when all of its records have similar attribute values.
- Early termination: stop when a preset threshold is reached (for example, a minimum number of records per node or a minimum gain in purity).
Decision Tree Based Classification
Advantages:
- Inexpensive to construct.
- Extremely fast at classifying unknown records.
- Easy to interpret for small-sized trees.
- In general, does not require domain knowledge; no parameter setting.
- Useful for all types of data.
- Can be used for high-dimensional data, and may be useful with data sets that contain redundant attributes.
Example: C4.5
- Simple depth-first construction.
- Uses information gain.
- Sorts continuous attributes at each node.
- Needs the entire data set to fit in memory; unsuitable for large datasets (would need out-of-core sorting).
- You can download the software from: http://www.cse.unsw.edu.au/~quinlan/c4.5r8.tar.gz
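Since C4.5 chooses splits by information gain, here is a small illustrative Python sketch of entropy and information gain (helper names are mine; the example counts are the Own Car? split from earlier):

    from math import log2

    def entropy(counts):
        """Entropy of a node from its class counts: -sum_j p_j * log2(p_j)."""
        n = sum(counts)
        return -sum((c / n) * log2(c / n) for c in counts if c)

    def information_gain(parent_counts, children_counts):
        """Parent entropy minus the size-weighted entropy of the children."""
        n = sum(parent_counts)
        weighted = sum(sum(ch) / n * entropy(ch) for ch in children_counts)
        return entropy(parent_counts) - weighted

    # 10 records of class 0 and 10 of class 1, split by Own Car? (Yes: 6/4, No: 4/6)
    print(round(information_gain([10, 10], [[6, 4], [4, 6]]), 3))   # 0.029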