

SLIQ and SPRINT for disk resident data

SLIQ

• SLIQ is a decision tree classifier that can handle both numerical and categorical attributes
• Builds compact and accurate trees
• Uses a pre-sorting technique in the tree-growing phase
• Suitable for classification of large disk-resident datasets

Issues

• There are two major performance-critical issues in the tree-growth phase:
  – How to find split points
  – How to partition the data
• The well-known decision tree classifiers:
  – Grow trees depth-first
  – Repeatedly sort the data at every node
• SLIQ:
  – Replaces this repeated sorting with a one-time sort
  – Uses a new data structure called the class list
  – The class list must remain memory-resident at all times

Some Data

rid  age  salary  marital  car
  1   30      60  single   sports
  2   25      20  single   mini
  3   40      80  married  van
  4   45     100  single   luxury
  5   60     150  married  luxury
  6   35     120  single   sports
  7   50      70  married  van
  8   55      90  single   sports
  9   65      30  married  mini
 10   70     200  single   luxury

SLIQ - Attribute Lists

rid  age    rid  salary    rid  marital
  1   30      1      60      1  single
  2   25      2      20      2  single
  3   40      3      80      3  married
  4   45      4     100      4  single
  5   60      5     150      5  married
  6   35      6     120      6  single
  7   50      7      70      7  married
  8   55      8      90      8  single
  9   65      9      30      9  married
 10   70     10     200     10  single

These are projections on (rid, attribute).

SLIQ - Sort Numeric, Group Categorical

rid  age    rid  salary    rid  marital
  2   25      2      20      3  married
  1   30      9      30      5  married
  6   35      1      60      7  married
  3   40      7      70      9  married
  4   45      3      80      1  single
  7   50      8      90      2  single
  8   55      4     100      4  single
  5   60      6     120      6  single
  9   65      5     150      8  single
 10   70     10     200     10  single
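The one-time pre-sort can be sketched in a few lines of Python. This is only an illustration that assumes the ten example records fit in memory as tuples (SLIQ itself targets disk-resident lists); `RECORDS` and `attribute_list` are hypothetical names introduced here, not part of SLIQ.

```python
# Training records from the "Some Data" slide: (rid, age, salary, marital, car).
RECORDS = [
    (1, 30, 60, "single", "sports"),
    (2, 25, 20, "single", "mini"),
    (3, 40, 80, "married", "van"),
    (4, 45, 100, "single", "luxury"),
    (5, 60, 150, "married", "luxury"),
    (6, 35, 120, "single", "sports"),
    (7, 50, 70, "married", "van"),
    (8, 55, 90, "single", "sports"),
    (9, 65, 30, "married", "mini"),
    (10, 70, 200, "single", "luxury"),
]


def attribute_list(records, column):
    """Project the records onto (rid, value) and sort once by value, then rid.

    Sorting string values also groups equal categories together, which is
    all a categorical attribute needs.
    """
    pairs = [(rec[0], rec[column]) for rec in records]
    return sorted(pairs, key=lambda pair: (pair[1], pair[0]))


age_list = attribute_list(RECORDS, 1)
salary_list = attribute_list(RECORDS, 2)
marital_list = attribute_list(RECORDS, 3)
```

Each list is sorted exactly once, before tree growth starts; the later phases only scan these lists sequentially.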

SLIQ - Class List

rid  car     LEAF
  1  sports  N1
  2  mini    N1
  3  van     N1
  4  luxury  N1
  5  luxury  N1
  6  sports  N1
  7  van     N1
  8  sports  N1
  9  mini    N1
 10  luxury  N1

SLIQ - Histograms

Age attribute list and class list:

rid  age    rid  car     LEAF
  2   25      1  sports  N1
  1   30      2  mini    N1
  6   35      3  van     N1
  3   40      4  luxury  N1
  4   45      5  luxury  N1
  7   50      6  sports  N1
  8   55      7  van     N1
  5   60      8  sports  N1
  9   65      9  mini    N1
 10   70     10  luxury  N1

Evaluate each split, using GINI or Entropy. L counts the records that satisfy the test, R the records that do not.

Before the scan:
     sports  mini  van  luxury
L         0     0    0       0
R         3     2    2       3

age ≤ 25:
     sports  mini  van  luxury
L         0     1    0       0
R         3     1    2       3

age ≤ 30:
     sports  mini  van  luxury
L         1     1    0       0
R         2     1    2       3

. . .

SLIQ - Histograms

Salary attribute list and class list:

rid  salary    rid  car     LEAF
  2      20      1  sports  N1
  9      30      2  mini    N1
  1      60      3  van     N1
  7      70      4  luxury  N1
  3      80      5  luxury  N1
  8      90      6  sports  N1
  4     100      7  van     N1
  6     120      8  sports  N1
  5     150      9  mini    N1
 10     200     10  luxury  N1

Evaluate each split, using GINI or Entropy.

Before the scan:
     sports  mini  van  luxury
L         0     0    0       0
R         3     2    2       3

salary ≤ 20:
     sports  mini  van  luxury
L         0     1    0       0
R         3     1    2       3

salary ≤ 30:
     sports  mini  van  luxury
L         0     2    0       0
R         3     0    2       3

. . .
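The scan over a sorted numeric list can be sketched as follows, using the Gini index as the scoring function. This is a minimal illustration, not the authors' code: it assumes distinct attribute values (true for the example data) and a single leaf, and the names `SALARY`, `gini`, and `numeric_split_scores` are introduced here for the example.

```python
from collections import Counter

# (salary, car) pairs in sorted salary order, taken from the slide above.
SALARY = [(20, "mini"), (30, "mini"), (60, "sports"), (70, "van"), (80, "van"),
          (90, "sports"), (100, "luxury"), (120, "sports"), (150, "luxury"),
          (200, "luxury")]


def gini(hist):
    """Gini index of a class histogram: 1 - sum of squared class fractions."""
    n = sum(hist.values())
    return 1.0 - sum((c / n) ** 2 for c in hist.values()) if n else 0.0


def numeric_split_scores(sorted_pairs):
    """Score every test (value <= v) in one pass over a pre-sorted list.

    Returns a dict v -> size-weighted Gini of the L and R histograms.
    The largest value is skipped because it would leave R empty.
    """
    right = Counter(label for _, label in sorted_pairs)  # everything starts in R
    left = Counter()
    n = len(sorted_pairs)
    scores = {}
    for i, (value, label) in enumerate(sorted_pairs[:-1], start=1):
        left[label] += 1   # the record moves from R to L
        right[label] -= 1
        scores[value] = (i * gini(left) + (n - i) * gini(right)) / n
    return scores


scores = numeric_split_scores(SALARY)
best_value = min(scores, key=scores.get)  # threshold with the lowest weighted Gini
```

Because the list is already sorted, each candidate test costs only one histogram update rather than a fresh pass over the data.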

SLIQ - Histograms

Marital attribute list and class list:

rid  marital    rid  car     LEAF
  3  married      1  sports  N1
  5  married      2  mini    N1
  7  married      3  van     N1
  9  married      4  luxury  N1
  1  single       5  luxury  N1
  2  single       6  sports  N1
  4  single       7  van     N1
  6  single       8  sports  N1
  8  single       9  mini    N1
 10  single      10  luxury  N1

Evaluate each split, using GINI or Entropy.

Married:
     sports  mini  van  luxury
Yes       0     1    2       1
No        3     1    0       2

Single:
     sports  mini  van  luxury
Yes       3     1    0       2
No        0     1    2       1
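For a categorical attribute the candidate tests are subsets of its values. The sketch below enumerates every non-empty proper subset and scores it with the Gini index; exhaustive enumeration is an assumption of this example that is only practical for small domains, and `MARITAL`, `gini`, and `best_subset_split` are illustrative names.

```python
from collections import Counter
from itertools import combinations

# (marital, car) pairs from the slide above.
MARITAL = [("married", "van"), ("married", "luxury"), ("married", "van"),
           ("married", "mini"), ("single", "sports"), ("single", "mini"),
           ("single", "luxury"), ("single", "sports"), ("single", "sports"),
           ("single", "luxury")]


def gini(hist):
    """Gini index of a class histogram: 1 - sum of squared class fractions."""
    n = sum(hist.values())
    return 1.0 - sum((c / n) ** 2 for c in hist.values()) if n else 0.0


def best_subset_split(pairs):
    """Return (score, subset) for the best test 'value in subset'."""
    by_value = {}  # one class histogram per category value
    for value, label in pairs:
        by_value.setdefault(value, Counter())[label] += 1
    total = sum(by_value.values(), Counter())
    n = sum(total.values())
    values = sorted(by_value)
    best = None
    for size in range(1, len(values)):
        for subset in combinations(values, size):
            yes = sum((by_value[v] for v in subset), Counter())
            no = total - yes
            n_yes = sum(yes.values())
            score = (n_yes * gini(yes) + (n - n_yes) * gini(no)) / n
            if best is None or score < best[0]:
                best = (score, frozenset(subset))
    return best


score, subset = best_subset_split(MARITAL)
```

With only two values, "married" and "single" describe the same partition, so both subsets receive the same score and the first one found is kept.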

SLIQ - Perform best split and Update Class List

Leaf N1 is split by the test salary ≤ 60 into N2 (test satisfied) and N3 (test not satisfied). The leaf pointers in the class list are updated accordingly:

rid  salary    rid  car     LEAF (before)  LEAF (after)
  2      20      1  sports  N1             N2
  9      30      2  mini    N1             N2
  1      60      3  van     N1             N3
  7      70      4  luxury  N1             N3
  3      80      5  luxury  N1             N3
  8      90      6  sports  N1             N3
  4     100      7  van     N1             N3
  6     120      8  sports  N1             N3
  5     150      9  mini    N1             N2
 10     200     10  luxury  N1             N3

SLIQ - Histograms

Age attribute list and class list after the split of N1 on salary ≤ 60:

rid  age    rid  car     LEAF
  2   25      1  sports  N2
  1   30      2  mini    N2
  6   35      3  van     N3
  3   40      4  luxury  N3
  4   45      5  luxury  N3
  7   50      6  sports  N3
  8   55      7  van     N3
  5   60      8  sports  N3
  9   65      9  mini    N2
 10   70     10  luxury  N3

Evaluate each split, using GINI or Entropy. A single scan of the attribute list now maintains one histogram per leaf; each record updates only the histogram of the leaf it belongs to.

Before the scan:
N2   sports  mini  van  luxury
L         0     0    0       0
R         1     2    0       0

N3   sports  mini  van  luxury
L         0     0    0       0
R         2     0    2       3

age ≤ 25 (rid 2 belongs to N2, so only the N2 histogram changes):
N2   sports  mini  van  luxury
L         0     1    0       0
R         1     1    0       0

N3   sports  mini  van  luxury
L         0     0    0       0
R         2     0    2       3

. . .

SLIQ - Pseudocode

• Split evaluation:

EvaluateSplits()
    for each numeric attribute A do
        for each value v in the attribute list do
            find the corresponding entry in the class list, and hence the corresponding class and the leaf node Ni
            update the class histogram in leaf Ni
            compute splitting score for test (A ≤ v) for Ni
    for each categorical attribute A do
        for each leaf of the tree do
            find subset of A with best split
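The numeric branch of this pseudocode might look as follows in Python. This is a hedged sketch rather than the original implementation: it assumes distinct attribute values, uses the Gini index as the splitting score, and represents the class list as a dict from rid to (class, leaf). `AGE_LIST`, `CLASS_LIST`, and `evaluate_numeric` are names made up for the example; the data is the state after N1 has been split on salary ≤ 60.

```python
from collections import Counter

# Sorted age attribute list: (rid, age).
AGE_LIST = [(2, 25), (1, 30), (6, 35), (3, 40), (4, 45),
            (7, 50), (8, 55), (5, 60), (9, 65), (10, 70)]

# Class list after the split on salary <= 60: rid -> (class, leaf).
CLASS_LIST = {1: ("sports", "N2"), 2: ("mini", "N2"), 3: ("van", "N3"),
              4: ("luxury", "N3"), 5: ("luxury", "N3"), 6: ("sports", "N3"),
              7: ("van", "N3"), 8: ("sports", "N3"), 9: ("mini", "N2"),
              10: ("luxury", "N3")}


def gini(hist):
    """Gini index of a class histogram: 1 - sum of squared class fractions."""
    n = sum(hist.values())
    return 1.0 - sum((c / n) ** 2 for c in hist.values()) if n else 0.0


def evaluate_numeric(attr_list, class_list):
    """One scan of a sorted attribute list, scoring all leaves at once.

    Returns leaf -> (best score, threshold v) for the test (A <= v).
    """
    right = {}  # per-leaf histogram of records not yet passed by the scan
    for label, leaf in class_list.values():
        right.setdefault(leaf, Counter())[label] += 1
    left = {leaf: Counter() for leaf in right}
    best = {}
    for rid, value in attr_list:
        label, leaf = class_list[rid]      # class-list lookup by rid
        left[leaf][label] += 1             # update the histogram of that leaf
        right[leaf][label] -= 1
        n_left = sum(left[leaf].values())
        n_right = sum(right[leaf].values())
        if n_right == 0:
            continue                       # not a real split for this leaf
        score = (n_left * gini(left[leaf]) + n_right * gini(right[leaf])) / (n_left + n_right)
        if leaf not in best or score < best[leaf][0]:
            best[leaf] = (score, value)
    return best


best = evaluate_numeric(AGE_LIST, CLASS_LIST)
```

The point of the design is that one sequential pass over the attribute list serves every current leaf, which is what lets the tree grow breadth-first without re-sorting.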

SLIQ - Pseudocode

• Updating the class list:

UpdateLabels()
    for each split leaf Ni do
        let A be the split attribute for Ni
        for each (rid, v) in the attribute list for A do
            find the corresponding entry e in the class list (using the rid)
            if the leaf referenced by e is Ni then
                find the new leaf Nj to which (rid, v) belongs (by applying the splitting test)
                update the leaf pointer for e to Nj
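A compact Python rendering of this update, for the case where the leaves being split share one numeric split attribute, is sketched below. The dict-based class list and the `splits` mapping (leaf to threshold and child names) are assumptions of this example, not structures prescribed by SLIQ.

```python
# Sorted salary attribute list: (rid, salary).
SALARY_LIST = [(2, 20), (9, 30), (1, 60), (7, 70), (3, 80),
               (8, 90), (4, 100), (6, 120), (5, 150), (10, 200)]

# Class list before the split: rid -> (class, leaf). Every record is in N1.
class_list = {1: ("sports", "N1"), 2: ("mini", "N1"), 3: ("van", "N1"),
              4: ("luxury", "N1"), 5: ("luxury", "N1"), 6: ("sports", "N1"),
              7: ("van", "N1"), 8: ("sports", "N1"), 9: ("mini", "N1"),
              10: ("luxury", "N1")}


def update_labels(attr_list, class_list, splits):
    """Re-point class-list entries to the child leaves.

    splits maps a leaf being split to (threshold, left_child, right_child),
    where the test is (value <= threshold) on the attribute of attr_list.
    """
    for rid, value in attr_list:
        label, leaf = class_list[rid]
        if leaf in splits:  # only records of a split leaf are touched
            threshold, left_child, right_child = splits[leaf]
            child = left_child if value <= threshold else right_child
            class_list[rid] = (label, child)


update_labels(SALARY_LIST, class_list, {"N1": (60, "N2", "N3")})
n2_rids = sorted(rid for rid, (_, leaf) in class_list.items() if leaf == "N2")
```

Only the class list is modified; the attribute lists themselves stay untouched and are never physically partitioned.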

SLIQ - bottleneck

• The class list must remain memory-resident at all times!
  – With today's memory sizes this is rarely a problem, but because the class list has one entry per training record, it can still become a bottleneck for very large training sets.
• So, what can we do when the class list doesn't fit in main memory?
  – SPRINT is a solution. . .

SPRINT

rid  age  car       rid  salary  car       rid  marital  car
  2   25  mini        2      20  mini        3  married  van
  1   30  sports      9      30  mini        5  married  luxury
  6   35  sports      1      60  sports      7  married  van
  3   40  van         7      70  van         9  married  mini
  4   45  luxury      3      80  van         1  single   sports
  7   50  van         8      90  sports      2  single   mini
  8   55  sports      4     100  luxury      4  single   luxury
  5   60  luxury      6     120  sports      6  single   sports
  9   65  mini        5     150  luxury      8  single   sports
 10   70  luxury     10     200  luxury     10  single   luxury

The main data structures used in SPRINT are attribute lists and class histograms.
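The difference from SLIQ is that every attribute-list entry carries the class label itself, so no separate class list is needed. A small sketch of how such lists could be built in memory is shown below; `RECORDS`, `COLUMNS`, and `sprint_lists` are hypothetical names used only for this example.

```python
# Training records: (rid, age, salary, marital, car).
RECORDS = [
    (1, 30, 60, "single", "sports"), (2, 25, 20, "single", "mini"),
    (3, 40, 80, "married", "van"), (4, 45, 100, "single", "luxury"),
    (5, 60, 150, "married", "luxury"), (6, 35, 120, "single", "sports"),
    (7, 50, 70, "married", "van"), (8, 55, 90, "single", "sports"),
    (9, 65, 30, "married", "mini"), (10, 70, 200, "single", "luxury"),
]

# Column index of each predictor attribute inside a record tuple.
COLUMNS = {"age": 1, "salary": 2, "marital": 3}


def sprint_lists(records):
    """Build one list per attribute; every entry is (rid, value, class label).

    Each list is sorted by value (then rid), so numeric lists are ordered
    and categorical lists are grouped.
    """
    return {
        name: sorted(((rec[0], rec[col], rec[4]) for rec in records),
                     key=lambda entry: (entry[1], entry[0]))
        for name, col in COLUMNS.items()
    }


lists = sprint_lists(RECORDS)
```

The price of dropping the class list is that the class label is replicated in every attribute list.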

SPRINT - Histograms

rid  age  car
  2   25  mini
  1   30  sports
  6   35  sports
  3   40  van
  4   45  luxury
  7   50  van
  8   55  sports
  5   60  luxury
  9   65  mini
 10   70  luxury

Evaluate each split, using GINI or Entropy.

Before the scan:
     sports  mini  van  luxury
L         0     0    0       0
R         3     2    2       3

age ≤ 25:
     sports  mini  van  luxury
L         0     1    0       0
R         3     1    2       3

age ≤ 30:
     sports  mini  van  luxury
L         1     1    0       0
R         2     1    2       3

. . .

SPRINT - Histograms

rid  salary  car
  2      20  mini
  9      30  mini
  1      60  sports
  7      70  van
  3      80  van
  8      90  sports
  4     100  luxury
  6     120  sports
  5     150  luxury
 10     200  luxury

Evaluate each split, using GINI or Entropy.

Before the scan:
     sports  mini  van  luxury
L         0     0    0       0
R         3     2    2       3

salary ≤ 20:
     sports  mini  van  luxury
L         0     1    0       0
R         3     1    2       3

salary ≤ 30:
     sports  mini  van  luxury
L         0     2    0       0
R         3     0    2       3

. . .

SPRINT - Histograms

rid  marital  car
  3  married  van
  5  married  luxury
  7  married  van
  9  married  mini
  1  single   sports
  2  single   mini
  4  single   luxury
  6  single   sports
  8  single   sports
 10  single   luxury

Evaluate each split, using GINI or Entropy.

Married:
     sports  mini  van  luxury
Yes       0     1    2       1
No        3     1    0       2

Single:
     sports  mini  van  luxury
Yes       3     1    0       2
No        0     1    2       1

SPRINT - Performing Best Split

• Once the best split point has been found for a node, we execute the split by creating child nodes.
• This requires splitting each of the node's attribute lists into two.
• Partitioning the attribute list of the winning attribute (salary) is easy.
  – We scan the list, apply the split test, and move the records to two new attribute lists, one for each new child.

SPRINT - Performing Best Split

• Unfortunately, for the remaining attribute lists of the node (age and marital), we have no test that we can apply to the attribute values to decide how to divide the records.
• Solution: use the rids.
  – As we partition the list of the splitting attribute (i.e., salary), we insert the rid of each record into a probe structure (a hash table), noting to which child the record was moved.
• Once we have collected all the rids, we scan the lists of the remaining attributes and probe the hash table with the rid of each record.
  – The retrieved information tells us in which child to place the record.
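The rid-based partitioning can be sketched like this, with a Python dict standing in for the hash table. It is an in-memory illustration under the assumption of a numeric split test (value ≤ threshold); `LISTS` and `split_node` are names introduced for the example.

```python
# Training records: (rid, age, salary, marital, car).
RECORDS = [
    (1, 30, 60, "single", "sports"), (2, 25, 20, "single", "mini"),
    (3, 40, 80, "married", "van"), (4, 45, 100, "single", "luxury"),
    (5, 60, 150, "married", "luxury"), (6, 35, 120, "single", "sports"),
    (7, 50, 70, "married", "van"), (8, 55, 90, "single", "sports"),
    (9, 65, 30, "married", "mini"), (10, 70, 200, "single", "luxury"),
]

# SPRINT-style lists: name -> [(rid, value, class)], sorted by value then rid.
LISTS = {
    name: sorted(((r[0], r[col], r[4]) for r in RECORDS), key=lambda e: (e[1], e[0]))
    for name, col in {"age": 1, "salary": 2, "marital": 3}.items()
}


def split_node(attr_lists, split_attr, threshold):
    """Partition every attribute list of a node into left and right children."""
    left = {name: [] for name in attr_lists}
    right = {name: [] for name in attr_lists}
    goes_left = {}  # probe structure: rid -> True if the record moved left

    # 1. The splitting attribute: apply the test directly, remember each rid.
    for entry in attr_lists[split_attr]:
        rid, value, _label = entry
        goes_left[rid] = value <= threshold
        (left if goes_left[rid] else right)[split_attr].append(entry)

    # 2. The other attributes: probe the hash table with each rid.
    for name, entries in attr_lists.items():
        if name == split_attr:
            continue
        for entry in entries:
            (left if goes_left[entry[0]] else right)[name].append(entry)
    return left, right


left, right = split_node(LISTS, "salary", 60)
```

Since every list is scanned in order and entries are appended in that order, the child lists stay sorted and never need to be sorted again.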

SPRINT - Performing Best Split

• If the hash table is too large to fit in memory, splitting is done in more than one step.
  – The attribute list of the splitting attribute is partitioned only up to the last record for which the hash table still fits in memory.
  – The corresponding portions of the attribute lists of the non-splitting attributes are partitioned.
  – The process is repeated for the remainder of the attribute list of the splitting attribute.
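The multi-step variant can be sketched by bounding the number of rids held in the probe structure at any time. Everything here is an assumption made for illustration: `max_rids` models the memory limit, each pass rescans the non-splitting lists in full, and the final re-sort of the child lists is this sketch's way of restoring sorted order after several passes (merging the per-pass runs would serve the same purpose).

```python
# Training records: (rid, age, salary, marital, car).
RECORDS = [
    (1, 30, 60, "single", "sports"), (2, 25, 20, "single", "mini"),
    (3, 40, 80, "married", "van"), (4, 45, 100, "single", "luxury"),
    (5, 60, 150, "married", "luxury"), (6, 35, 120, "single", "sports"),
    (7, 50, 70, "married", "van"), (8, 55, 90, "single", "sports"),
    (9, 65, 30, "married", "mini"), (10, 70, 200, "single", "luxury"),
]

# SPRINT-style lists: name -> [(rid, value, class)], sorted by value then rid.
LISTS = {
    name: sorted(((r[0], r[col], r[4]) for r in RECORDS), key=lambda e: (e[1], e[0]))
    for name, col in {"age": 1, "salary": 2, "marital": 3}.items()
}


def split_node_chunked(attr_lists, split_attr, threshold, max_rids):
    """Split a node while keeping at most max_rids entries in the hash table."""
    left = {name: [] for name in attr_lists}
    right = {name: [] for name in attr_lists}
    split_entries = attr_lists[split_attr]

    for start in range(0, len(split_entries), max_rids):
        probe = {}  # rebuilt for every portion, so it never exceeds max_rids
        for entry in split_entries[start:start + max_rids]:
            rid, value, _label = entry
            probe[rid] = value <= threshold
            (left if probe[rid] else right)[split_attr].append(entry)

        # Place only those records whose rid is known in this pass.
        for name, entries in attr_lists.items():
            if name == split_attr:
                continue
            for entry in entries:
                side = probe.get(entry[0])
                if side is not None:
                    (left if side else right)[name].append(entry)

    # Several passes interleave the output, so restore (value, rid) order.
    for child in (left, right):
        for name, entries in child.items():
            if name != split_attr:
                entries.sort(key=lambda e: (e[1], e[0]))
    return left, right


left, right = split_node_chunked(LISTS, "salary", 60, max_rids=4)
```

With `max_rids=4` the ten salary entries are handled in three passes, and the resulting child lists match what a single pass with an unbounded hash table would produce.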