Modeling Additive Structure and Detecting Interactions with Additive Groves of Regression Trees

Daria Sorokina
Joint work with: Rich Caruana, Mirek Riedewald, Artur Dubrawski, Jeff Schneider

Motivation: Cornell Lab of O
Domain scientists want:
1. Good models
2. Domain knowledge
Can they get both?
Daria Sorokina Additive Groves: Modeling Additive Structure and Detecting Statistical Interactions

Which models are the best?
o Recent major comparison of classification algorithms (Caruana & Niculescu-Mizil, ICML’06)

  Algorithm              Score
  Boosted Trees          0.899
  Random Forest          0.896
  Bagged Trees           0.885
  SVMs                   0.869
  Neural Networks        0.844
  K-Nearest Neighbors    0.811
  Boosted Stumps         0.792
  Decision Trees         0.698
  Logistic Regression    0.697
  Naïve Bayes            0.664

o Trees! The top three methods are all tree ensembles.
o Random Forest: average many large independent trees
o Boosting: small trees, based on additive models

Trees in real-world models
o Tree ensembles are hard to interpret
  - This is 1/100 of a real decision tree
  - There can be ~500 trees in the ensemble
o Separate techniques are needed to infer domain knowledge

Additive Groves
o High predictive performance
o Domain knowledge extraction tools
[Figure labels: Additive Groves, Boosted Trees, Random Forest, Bagged Trees]

Introduction: Domain Knowledge
o Which features are important?
  - Feature selection techniques
o What effects do they have on the response variable?
  - Effect visualization techniques
  - Toy example: seasonal effect on bird abundance [Plot: # Birds vs. Season]
  - Is it always possible to visualize an effect of a single variable?

Visualizing effects of features
o Toy example 1: # Birds = F(season, #trees)
  [Plot: # Birds vs. Season, with curves for "Many trees" and "Few trees" and the averaged seasonal effect]
o Toy example 2: # Birds = F(season, latitude)
  [Plot: # Birds vs. Season, with curves for "South" and "North"; this is an interaction]
  - Averaged seasonal effect?

! Statistical interactions are NOT correlations !

Statistical Interaction
o F(x_1, …, x_n) has an interaction between x_i and x_j when ∂F/∂x_i depends on x_j (equivalently, ∂F/∂x_j depends on x_i)
o or, for nominal and ordinal attributes, when the difference in the value of F(x_1, …, x_n) for different values of x_i depends on the value of x_j

Statistical Interactions
o Statistical interactions ≡ non-additive effects among two or more variables in a function
o F(x_1, …, x_n) shows no interaction between x_i and x_j when
    F(x_1, x_2, …, x_n) = G(x_1, …, x_{i-1}, x_{i+1}, …, x_n) + H(x_1, …, x_{j-1}, x_{j+1}, …, x_n),
  i.e., G does not depend on x_i, H does not depend on x_j
o Example: F(x_1, x_2, x_3) = sin(x_1 + x_2) + x_2·x_3
  - x_1, x_2 interact
  - x_2, x_3 interact
  - x_1, x_3 do not interact
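
The example can be probed numerically. This is a minimal sketch, assuming a finite-difference check that is not the detection method of the talk: for a pair of variables with no interaction, the mixed difference F(a+h, b+h) - F(a+h, b) - F(a, b+h) + F(a, b) is zero. The helper name `mixed_difference`, the probe point, and the step `h` are illustrative choices.

```python
import math

def F(x1, x2, x3):
    # Example function from the slide.
    return math.sin(x1 + x2) + x2 * x3

def mixed_difference(f, point, i, j, h=0.5):
    """Second-order mixed difference of f in coordinates i and j at `point`."""
    def at(di, dj):
        p = list(point)
        p[i] += di
        p[j] += dj
        return f(*p)
    return at(h, h) - at(h, 0.0) - at(0.0, h) + at(0.0, 0.0)

point = (0.3, 0.7, 1.1)  # arbitrary probe point
for i, j in [(0, 1), (1, 2), (0, 2)]:
    print(f"x{i + 1}, x{j + 1}: {mixed_difference(F, point, i, j):+.4f}")
```

A single probe point can miss an interaction that vanishes there, so a value near zero is only suggestive; a clearly non-zero value does show non-additivity in that pair.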

How to test for an interaction (Sorokina, Caruana, Riedewald, Fink; ICML’08)
1. Build a model from the data.
2. Build a restricted model: do not allow the interaction of interest.
3. Compare their predictive performance.
  - If the restricted model is as good as the unrestricted one, there is no interaction.
  - If it fails to represent the data with the same quality, there is an interaction.

Learning Method Requirements
1. Non-linearity
  - If the unrestricted model does not capture interactions, there is no chance to detect them
2. Restriction capability (additive structure)
  - The performance should not decrease after restriction when there are no interactions
o Most existing prediction models do not fit both requirements at the same time
  - We had to invent our own algorithm that does

Additive Groves

Additive Groves of Regression Trees (Sorokina, Caruana, Riedewald; Best Student Paper ECML’07)
o New regression algorithm
  - Ensemble of regression trees
o Based on
  - Bagging
  - Additive models
  - Combination of large trees and additive structure
o Useful properties
  - High predictive performance
  - Captures interactions
  - Easy to restrict specific interactions

Additive Models
[Diagram: Input X feeds Model 1, Model 2 and Model 3, which output P1, P2 and P3]
Prediction = P1 + P2 + P3

Classical Training of Additive Models
o Training Set: {(X, Y)}
o Goal: M(X) = P1 + P2 + P3 ≈ Y
o First pass: each model is trained on what the earlier models left unexplained
  - Model 1 on {(X, Y)}, producing {P1}
  - Model 2 on {(X, Y - P1)}, producing {P2}
  - Model 3 on {(X, Y - P1 - P2)}, producing {P3}
o Later passes: each model is retrained on the residuals of all the other models
  - Model 1 on {(X, Y - P2 - P3)}, producing {P1’}
  - Model 2 on {(X, Y - P1’ - P3)}, producing {P2’}
  - …
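
The retraining loop above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the component models are one-split stumps on a single numeric input, and the names `fit_stump`, `backfit` and `sse` and the number of rounds are hypothetical.

```python
from statistics import mean

def fit_stump(xs, ys):
    """One-split regression stump on a single numeric feature (illustrative base model)."""
    best = None
    for t in sorted(set(xs))[:-1]:
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        lm, rm = mean(left), mean(right)
        cost = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if best is None or cost < best[0]:
            best = (cost, t, lm, rm)
    if best is None:  # constant feature: fall back to the mean
        const = mean(ys)
        return lambda x: const
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm

def backfit(xs, ys, n_models=3, n_rounds=5):
    """Train each model on the residuals of all the others, cycling several times."""
    preds = [[0.0] * len(xs) for _ in range(n_models)]
    models = [None] * n_models
    for _ in range(n_rounds):
        for k in range(n_models):
            residual = [y - sum(preds[j][i] for j in range(n_models) if j != k)
                        for i, y in enumerate(ys)]
            models[k] = fit_stump(xs, residual)
            preds[k] = [models[k](x) for x in xs]
    return lambda x: sum(m(x) for m in models)

def sse(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys))

xs = list(range(8))
ys = [0, 0, 1, 1, 2, 2, 3, 3]  # toy staircase target
print(sse(backfit(xs, ys, n_models=1), xs, ys), sse(backfit(xs, ys, n_models=3), xs, ys))
```

In the first round the other predictions are still zero, so Model 1 sees Y, Model 2 sees Y - P1 and Model 3 sees Y - P1 - P2, matching the first pass on the slide.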

Additive Groves
o Additive models fit additive components of the response function
o A Grove is an additive model where every single model is a tree
o Additive Groves applies bagging on top of single Groves
[Diagram: the ensemble is an average of N Groves, each weighted by 1/N and each a sum of trees]

Training Grove of Trees
o Big trees can fit the whole training set before we are able to build all trees in a grove
  - The first tree, trained on {(X, Y)}, predicts {P1 = Y}
  - The second tree, trained on {(X, Y - P1 = 0)}, is an empty tree with {P2 = 0}
o Oops! We wanted several trees in our grove!

Additive Groves: Layered Training
o Solution: build a Grove of small trees and gradually increase their size

Training an Additive Grove
o Consider two ways to create a larger grove from a smaller one
  - “Vertical”
  - “Horizontal”
o Test on a validation set which one is better
  - We use out-of-bag data as the validation set
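
The out-of-bag validation set mentioned above comes for free with bagging. A minimal sketch, assuming standard bootstrap sampling with replacement; the function name `bootstrap_split` and the seed are illustrative.

```python
import random

def bootstrap_split(n_cases, rng):
    """Return (in-bag indices drawn with replacement, out-of-bag indices)."""
    in_bag = [rng.randrange(n_cases) for _ in range(n_cases)]
    drawn = set(in_bag)
    out_of_bag = [i for i in range(n_cases) if i not in drawn]
    return in_bag, out_of_bag

rng = random.Random(0)  # fixed seed so the sketch is repeatable
in_bag, out_of_bag = bootstrap_split(20, rng)
print(len(in_bag), len(out_of_bag))
```

Each grove would be trained on its in-bag cases, and the cases it never saw serve as its validation set when comparing the two candidate groves.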

Experiments: Synthetic Data Set
o X axis: size of leaves (~inverse of size of trees)
o Y axis: number of trees in a grove
[Figure: four contour plots over this grid, one per training method: dynamic programming, layered training, bagged Groves trained as classical additive models, randomized dynamic programming]

Comparison on Regression Data Sets
10-fold cross validation, RMSE

                      California Housing  Elevators      Kinematics     Computer Activity  Stock
  Additive Groves     0.380 (0.015)       0.309 (0.028)  0.364 (0.013)  0.117 (0.009)      0.097 (0.029)
  Gradient boosting   0.403 (0.014)       0.327 (0.035)  0.457 (0.012)  0.121 (0.01)       0.118 (0.05)
  Random Forests      0.420 (0.013)       0.427 (0.058)  0.532 (0.013)  0.131 (0.012)      0.098 (0.026)
  Improvement vs. GB  6%                  6%             20%            3%                 18%
  Improvement vs. RF  10%                 28%            32%            11%                1%
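
The improvement rows can be recomputed from the RMSE values. A small worked check, assuming the improvement is the relative RMSE reduction (baseline - Additive Groves) / baseline, rounded to a whole percent; the slide does not state the formula, but this definition reproduces its percentages.

```python
# RMSE per data set, in the order: California Housing, Elevators,
# Kinematics, Computer Activity, Stock (values from the table).
rmse = {
    "Additive Groves":   [0.380, 0.309, 0.364, 0.117, 0.097],
    "Gradient boosting": [0.403, 0.327, 0.457, 0.121, 0.118],
    "Random Forests":    [0.420, 0.427, 0.532, 0.131, 0.098],
}

def improvement(baseline, ours):
    """Relative RMSE reduction in whole percent (assumed definition)."""
    return [round(100 * (b - o) / b) for b, o in zip(baseline, ours)]

vs_gb = improvement(rmse["Gradient boosting"], rmse["Additive Groves"])
vs_rf = improvement(rmse["Random Forests"], rmse["Additive Groves"])
print(vs_gb, vs_rf)
```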

Additive Groves outperform…
o …Gradient Boosting
  - because of large trees: up to thousands of nodes (complex non-linear structure)
o …Random Forests
  - because of modeling additive structure
o Most existing algorithms do not combine these two properties

…and now back to interaction detection

Interaction detection: Learning Method Requirements
1. Non-linearity
2. Restriction capability (additive structure)

How to test for an interaction:
1. Build a model from the data (no restrictions).
2. Build a restricted model: do not allow the interaction of interest.
3. Compare their predictive performance.
  - If the restricted model is as good as the unrestricted one, there is no interaction.
  - If it fails to represent the data with the same quality, there is an interaction.

Training Restricted Grove of Trees
o The model is not allowed to have interactions between features A and B
o Every single tree in the model should either not use A or not use B
o For each tree, two candidates are trained ("no A" vs. "no B") and the better one is chosen by evaluation on the separate validation set
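
The per-tree choice between a "no A" and a "no B" candidate can be sketched as follows. This is an illustrative toy, not the TreeExtra code: the tree learner is a small greedy depth-limited regression tree, the names `build_tree`, `predict`, `features_used` and `fit_restricted_tree` are hypothetical, and the example reuses the training data as validation data purely to keep it short.

```python
def sse(values):
    """Sum of squared errors around the mean."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)

def build_tree(X, y, allowed, depth):
    """Greedy regression tree that may split only on features in `allowed`."""
    best = None
    if depth > 0:
        for f in allowed:
            for t in sorted({row[f] for row in X})[:-1]:
                left = [i for i, row in enumerate(X) if row[f] <= t]
                right = [i for i, row in enumerate(X) if row[f] > t]
                cost = sse([y[i] for i in left]) + sse([y[i] for i in right])
                if best is None or cost < best[0]:
                    best = (cost, f, t, left, right)
    if best is None:
        return ("leaf", sum(y) / len(y))
    _, f, t, left, right = best
    return ("split", f, t,
            build_tree([X[i] for i in left], [y[i] for i in left], allowed, depth - 1),
            build_tree([X[i] for i in right], [y[i] for i in right], allowed, depth - 1))

def predict(tree, row):
    while tree[0] == "split":
        _, f, t, left, right = tree
        tree = left if row[f] <= t else right
    return tree[1]

def features_used(tree):
    if tree[0] == "leaf":
        return set()
    return {tree[1]} | features_used(tree[3]) | features_used(tree[4])

def fit_restricted_tree(X, y, X_val, y_val, a, b, depth=3):
    """Train a 'no a' and a 'no b' candidate; keep the one with lower validation error."""
    candidates = []
    for banned in (a, b):
        allowed = [f for f in range(len(X[0])) if f != banned]
        tree = build_tree(X, y, allowed, depth)
        error = sum((predict(tree, r) - v) ** 2 for r, v in zip(X_val, y_val))
        candidates.append((error, tree))
    return min(candidates, key=lambda c: c[0])[1]

# Toy data: features 0 and 1 interact (product term), feature 2 is additive.
X = [(i, j, k) for i in (0, 1) for j in (0, 1) for k in (0, 1)]
y = [i * j + 2 * k for i, j, k in X]
tree = fit_restricted_tree(X, y, X, y, a=0, b=1)
print(features_used(tree))
```

Whichever candidate wins, the resulting tree never contains both restricted features, so it cannot represent their interaction on its own.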

Experiments: Synthetic Data
o Interactions (variable indices): {1, 2}, {2, 3}, {1, 2, 3}, {2, 7}, {7, 9}
o X4 is not involved in any interactions

Birds Ecology Application
o Data: Rocky Mountains Bird Observatory Data Set
  - 30 species of birds inhabiting shortgrass prairies
  - 700 features describing the habitat
o Goal: describe how environment influences bird abundance
o Problems: really noisy real-world data

Problems of Analyzing Real-World Data
1. Too many features
  - Most of them useless
  - Wrapper feature selection methods are too slow
  - Solution: fast feature ranking method

“Multiple Counting”: feature importance ranking for ensembles of bagged trees (Caruana et al; KDD’06)
o How many times is each feature used per data point per tree?
o Imp(A) = 1.6, Imp(B) = 0.8, Imp(C) = 0.2
o 500 times faster than sensitivity analysis!
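
One way to read the counting rule is sketched below, as an assumption about the method rather than the published implementation: send every data point down every tree, count each split it passes through under that split's feature, and divide by (number of trees × number of points). The tuple-based tree format and the name `multiple_counting` are illustrative.

```python
from collections import Counter

def multiple_counting(trees, X):
    """Average number of times each feature is tested per data point per tree."""
    counts = Counter()
    for tree in trees:
        for row in X:
            node = tree
            while node[0] == "split":
                _, feature, threshold, left, right = node
                counts[feature] += 1
                node = left if row[feature] <= threshold else right
    norm = len(trees) * len(X)
    return {feature: c / norm for feature, c in counts.items()}

# Toy tree: the root splits on A; only the right branch goes on to split on B.
toy_tree = ("split", "A", 0.5,
            ("leaf", 0.0),
            ("split", "B", 0.5, ("leaf", 1.0), ("leaf", 2.0)))
points = [{"A": a, "B": b} for a in (0, 1) for b in (0, 1)]
importance = multiple_counting([toy_tree], points)
print(importance)
```

Features near the root are counted for many points, so they score higher than features used only deep in the tree, and a feature can score above 1 when it appears several times on one path.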

Problems of Analyzing Real-World Data
2. Correlations between the variables hurt interaction detection quality
  - Need a small set of truly important features: performance drops significantly if you remove any one of them
  - Solution: 2nd round of feature selection by backward elimination
    o Eliminate least useful features one-by-one
    o Correlations will be removed
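
Backward elimination can be sketched as a simple loop. This is a minimal illustration with assumed details: `score` is a hypothetical callable returning a validation error for a feature subset (lower is better), `tolerance` is an assumed stopping threshold, and the feature names and usefulness values are made up for the toy run.

```python
def backward_elimination(features, score, tolerance):
    """Drop the least useful feature one at a time until every removal hurts."""
    selected = list(features)
    current = score(selected)
    while len(selected) > 1:
        trials = [(score([g for g in selected if g != f]), f) for f in selected]
        best_score, victim = min(trials)
        if best_score > current + tolerance:
            break  # removing any remaining feature degrades performance
        selected.remove(victim)
        current = best_score
    return selected

# Hypothetical usefulness of each feature; the error shrinks as useful ones are kept.
usefulness = {"elevation": 0.5, "roads": 0.3, "noise1": 0.0, "noise2": 0.0}

def toy_error(subset):
    return 1.0 - sum(usefulness[f] for f in subset)

kept = backward_elimination(list(usefulness), toy_error, tolerance=0.01)
print(kept)
```

In practice each call to `score` means retraining a model, which is why this round is run only on the short list left after the fast ranking step.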

Problems of Analyzing Real-World Data
3. Parameter values for best performance ≠ best parameter values for interaction detection
(Additive Groves have two parameters controlling the complexity of the model: size of trees and number of trees)

Choosing parameters for interaction detection
o Need many additive components (N ≥ 6)
o Predictive performance close to the best model (~8σ difference)
o Better to underfit than to overfit (favor left and lower grid points)
[Figure: parameter grid marking the point with the best predictive performance and our choice for interaction detection]

RMBO data. Lark Bunting. Interaction: Elevation & Scrub/Shrubs Habitat
o Fewer birds when there are more shrubs at high elevation, but more birds when there are more shrubs at low elevation
o Scrub/shrub habitat contains different plant species in different regions of the Rocky Mountains

RMBO data. Horned Lark. Interaction: Density of Roads & Wooded Wetland Habitat
o More horned larks around roads
  - Previous knowledge
o Fewer horned larks in woods
  - Previous knowledge
o The effect of woods is diminished by presence of roads
  - New knowledge!

Food Safety Application
o USDA data: inspections conducted at meat processing plants
o Goals:
  - Predict risk of Salmonella contamination
  - Identify most important factors
o Constraint:
  - White-box models only
o Model:
  - Logistic regression with built-in interactions

Interaction Detection Results
o Detected 5 interactions
o 4 of them included the slaughter_chicken variable
o Decision: split the data based on the slaughter_chicken value
  - Build two LR models: one for plants that slaughter chickens and one for plants that do not

Different Sets of Features
Chicken slaughter present / Chicken slaughter absent:
past_Salmonella_w84, past_Salmonella_w168, Meat_Processing, slaughter_Cattle, Citation_xxx_w56, aggr. Citation_xxx_w84, region_Mid_Atlantic, slaughter_Turkey, past_Salmonella_w28, Citation_xxx_w168, past_Salmonella_w14, region_West_North_Central, Citation_xxx_w168, region_West_South_Central, aggr. Citation_xxx_w84, Citation_xxx_w28, Meat_Slaughter, Citation_xxx_w7, Citation_xxx_w56

Competitions
o KDD Cup’09 “Small” data set:
  - 3 CRM problems: churn, appetency, upselling
  - Fast feature selection
  - Additive Groves
  - Best result on appetency
o ICDM’09 Data Mining Contest
  - Brain fibers classification
  - 9 Additive Groves models
  - Third place in the supervised challenge

TreeExtra package
o A set of machine learning tools
  - Additive Groves ensemble
  - Bagged trees with fast feature ranking
  - Descriptive analysis
    o Feature selection (backward elimination)
    o Interaction detection
    o Effect visualization
o www.cs.cmu.edu/~daria/TreeExtra.htm

Contributions
o A new ensemble, Additive Groves of Regression Trees, combines additive structure and large trees (Sorokina et al, ECML’07)
o Novel interaction detection technique based on comparing restricted and unrestricted Additive Groves models (Sorokina et al, ICML’08)
o Fast feature selection methods (Caruana et al, KDD’06)
o Contribution to bird ecology (Sorokina et al, DDDM workshop at ICDM’09) (Hochachka et al, Journal of Wildlife Management, 2007)
o Contribution to food safety (Dubrawski et al, ISDS’09)
o Data mining competitions (Sorokina, KDD Cup’09 workshop)
o Software package www.cs.cmu.edu/~daria/TreeExtra.htm

Acknowledgements
o Rich Caruana
o Mirek Riedewald
o Giles Hooker
o Daniel Fink
o Steve Kelling
o Wes Hochachka
o Art Munson
o Alex Niculescu-Mizil
o Artur Dubrawski
o Jeff Schneider
o Karen Chen

Appendix
o Statistical interaction: alternative definition
o Higher-order interactions
  - Definition
  - Restriction algorithm
  - Reducing number of tests
o Quantifying interaction size
o Regression trees
o Gradient Groves for binary classification

Statistical Interaction
o F(x_1, …, x_n) has an interaction between x_i and x_j when ∂F/∂x_i depends on x_j (equivalently, ∂F/∂x_j depends on x_i)
o or, for nominal and ordinal attributes, when the difference in the value of F(x_1, …, x_n) for different values of x_i depends on the value of x_j

Higher-Order Interactions
o F(x) shows no K-way interaction between x_1, x_2, …, x_K when
    F(x) = F_{\1}(x_{\1}) + F_{\2}(x_{\2}) + … + F_{\K}(x_{\K}),
  where x_{\i} denotes all variables except x_i, so each F_{\i} does not depend on x_i
o Examples:
  - (x_1 + x_2 + x_3)^{-1} has a 3-way interaction
  - x_1 + x_2 + x_3 has no interactions (neither 2-way nor 3-way)
  - x_1 x_2 + x_2 x_3 + x_1 x_3 has all 2-way interactions, but no 3-way interaction
o K-way restricted Grove: K candidates for each tree ("no x_1" vs. "no x_2" vs. … vs. "no x_K")
o A K-way interaction may exist only if all corresponding (K-1)-way interactions exist
o Very few higher-order interactions need to be tested in practice
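
The pruning rule in the last two points can be sketched directly: a K-way candidate is worth testing only if every (K-1)-subset of it was already detected. The function name `candidate_interactions` and the detected pairs below are hypothetical.

```python
from itertools import combinations

def candidate_interactions(lower_order, variables, k):
    """K-way sets whose every (K-1)-way subset is among the detected interactions."""
    detected = {frozenset(s) for s in lower_order}
    return [c for c in combinations(sorted(variables), k)
            if all(frozenset(sub) in detected for sub in combinations(c, k - 1))]

# Hypothetical detected 2-way interactions among variables 1..9.
pairs = [(1, 2), (2, 3), (1, 3), (2, 7), (7, 9)]
three_way = candidate_interactions(pairs, range(1, 10), 3)
print(three_way)
```

With 9 variables there are 84 possible triples, but only the triples forming a "triangle" of detected pairs need a restricted-model test.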

Quantifying Interaction Strength
o Performance measure: standardized root mean squared error
o Interaction strength: difference in performances of restricted and unrestricted models
o Significance threshold: 3 standard deviations of unrestricted performance
  - Randomization comes from different data samples (folds, bootstraps…)
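
The strength and threshold can be computed from per-sample performance values. A minimal sketch with assumptions labeled: strength is taken as mean restricted error minus mean unrestricted error, the sample standard deviation is used, and the per-fold numbers are made up for illustration.

```python
from statistics import mean, stdev

def interaction_strength(restricted, unrestricted):
    """Return (strength, threshold, significant) from per-fold standardized RMSE."""
    strength = mean(restricted) - mean(unrestricted)   # how much the restriction hurts
    threshold = 3 * stdev(unrestricted)                # 3 standard deviations
    return strength, threshold, strength > threshold

# Hypothetical per-fold standardized RMSE values.
unrestricted = [0.30, 0.31, 0.29, 0.30]
hurt_by_restriction = [0.40, 0.41, 0.39, 0.40]
barely_changed = [0.30, 0.32, 0.29, 0.30]

print(interaction_strength(hurt_by_restriction, unrestricted))
print(interaction_strength(barely_changed, unrestricted))
```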

Regression trees used in Groves
o Each split optimizes RMSE
o Parameter α controls the size of the tree
  - A node becomes a leaf if it contains ≤ α·|trainset| cases
  - 0 ≤ α ≤ 1; the smaller α, the larger the tree
(Any other type of regression tree could be used.)
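
The effect of α on tree size can be seen with a toy tree builder. This is an illustrative sketch, not the TreeExtra implementation: splits minimize the summed squared error of the two children (equivalent to minimizing RMSE for a fixed node), and the names `build_tree` and `count_leaves` are hypothetical.

```python
def sse(values):
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)

def build_tree(X, y, alpha, n_train=None):
    """Regression tree; a node becomes a leaf once it holds <= alpha * |trainset| cases."""
    n_train = len(X) if n_train is None else n_train
    best = None
    if len(X) > alpha * n_train:
        for f in range(len(X[0])):
            for t in sorted({row[f] for row in X})[:-1]:
                left = [i for i, row in enumerate(X) if row[f] <= t]
                right = [i for i, row in enumerate(X) if row[f] > t]
                cost = sse([y[i] for i in left]) + sse([y[i] for i in right])
                if best is None or cost < best[0]:
                    best = (cost, f, t, left, right)
    if best is None:  # size limit reached, or no split possible
        return ("leaf", sum(y) / len(y))
    _, f, t, left, right = best
    return ("split", f, t,
            build_tree([X[i] for i in left], [y[i] for i in left], alpha, n_train),
            build_tree([X[i] for i in right], [y[i] for i in right], alpha, n_train))

def count_leaves(tree):
    return 1 if tree[0] == "leaf" else count_leaves(tree[3]) + count_leaves(tree[4])

X = [(float(i),) for i in range(16)]   # one feature, 16 training cases
y = [float(i) for i in range(16)]
for alpha in (1.0, 0.5, 0.25, 0.0):
    print(alpha, count_leaves(build_tree(X, y, alpha)))
```

With α = 0 the tree grows until every leaf holds a single distinct case, which is the "big tree" regime where a single tree can fit the whole training set.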

Gradient Groves: Merging Additive Groves with Gradient Boosting
o From Gradient Boosting (Friedman, 2001)
  - Training each tree as a step of a gradient descent in a functional space
  - Optimizing log-likelihood loss
o From Additive Groves
  - Retraining trees
  - Stepwise increase of grove complexity
  - Bagging of (generalized) additive models
  - Benefits from large trees

Gradient Groves: Modifications after Merging Groves with Gradient Boosting
o Large trees can have pure nodes with predictions (log odds of 1) equal to ∞
  - Special case, extra math
o With infinite predictions, variance is too high
  - Threshold on max prediction, new parameter Γ
[Figure: a large tree whose pure leaves predict +Inf and -Inf]
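
The role of the threshold Γ can be illustrated with a simplified leaf value. This sketch assumes the leaf prediction is the empirical log odds of the positive class, which is a simplification of the actual gradient-boosting leaf update; the function name `leaf_log_odds` and the value of `gamma` are illustrative.

```python
import math

def leaf_log_odds(n_pos, n_neg, gamma):
    """Empirical log odds of the positive class in a leaf, clipped to [-gamma, gamma]."""
    if n_neg == 0:          # pure positive leaf: unclipped value would be +inf
        return gamma
    if n_pos == 0:          # pure negative leaf: unclipped value would be -inf
        return -gamma
    return max(-gamma, min(gamma, math.log(n_pos / n_neg)))

gamma = 4.0  # hypothetical cap on the absolute prediction
for n_pos, n_neg in [(5, 0), (0, 3), (1, 1), (1000, 1)]:
    print(n_pos, n_neg, leaf_log_odds(n_pos, n_neg, gamma))
```

Capping the prediction keeps a single pure leaf from dominating the sum over trees, at the cost of one more parameter to tune.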

Empirical comparison on real data

  Algorithm              Score
  Gradient Groves        0.909
  Boosted Trees          0.899
  Random Forest          0.896
  Bagged Trees           0.885
  SVMs                   0.869
  Neural Networks        0.844
  K-Nearest Neighbors    0.811
  Boosted Stumps         0.792
  Decision Trees         0.698
  Logistic Regression    0.697
  Naïve Bayes            0.664

o Recent major comparison of classification algorithms (Caruana & Niculescu-Mizil, ICML’06)
o Results averaged over 8 performance measures and 11 data sets.
o Gradient Groves were not always best, but never much worse than top algorithms.