Bagging and Boosting in Data Mining
Carolina Ruiz
ruiz@cs.wpi.edu
http://www.cs.wpi.edu/~ruiz
2. Motivation and Background
- Problem Definition:
  - Given: a dataset of instances and a target concept.
  - Find: a model (e.g., a set of association rules, a decision tree, or a neural network) that helps predict the classification of unseen instances.
- Difficulties:
  - The model should be stable (i.e., it should not depend too heavily on the particular input data used to construct it).
  - The model should be a good predictor (difficult to achieve when the input dataset is small).
3. Two Approaches
- Bagging (Bootstrap Aggregating)
  - Leo Breiman, UC Berkeley
- Boosting
  - Rob Schapire, AT&T Research
  - Jerry Friedman, Stanford U.
4. Bagging
- Model creation:
  - Create bootstrap replicates of the dataset and fit a model to each one.
- Prediction:
  - Average/vote the predictions of the individual models.
- Advantages:
  - Stabilizes "unstable" methods.
  - Easy to implement and to parallelize.
5. Bagging Algorithm
1. Create k bootstrap replicates of the dataset.
2. Fit a model to each of the replicates.
3. Average/vote the predictions of the k models (sketched below).
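A minimal sketch of this three-step algorithm in Python, assuming a scikit-learn decision tree as the base learner and integer class labels; the function names and defaults are illustrative, not from the original slides:

```python
# Sketch of the bagging algorithm above. The base learner
# (DecisionTreeClassifier) is an assumed, illustrative choice.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, k=25, seed=0):
    """Steps 1-2: create k bootstrap replicates and fit a model to each."""
    rng = np.random.default_rng(seed)
    n = len(X)
    models = []
    for _ in range(k):
        idx = rng.integers(0, n, size=n)  # sample n instances with replacement
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    """Step 3: majority vote over the k models (assumes integer labels)."""
    votes = np.stack([m.predict(X) for m in models]).astype(int)
    return np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, votes)
```

Voting suits classification; for regression, step 3 would instead average the k predictions.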
6. Boosting
- Model creation:
  - Construct a sequence of datasets and models such that each dataset in the sequence weights an instance heavily when the previous model misclassified it.
- Prediction:
  - "Merge" the models in the sequence.
- Advantages:
  - Improves classification accuracy.
7. Generic Boosting Algorithm
1. Equally weight all instances in the dataset.
2. For i = 1 to T:
   2.1. Fit a model to the current (weighted) dataset.
   2.2. Upweight poorly predicted instances.
   2.3. Downweight well-predicted instances.
3. Merge the models in the sequence to obtain the final model (see the sketch below).
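A sketch of this generic loop, instantiated with the AdaBoost weighting and merging rule; this choice, the decision-stump base learner, and the -1/+1 label coding are assumptions, since the slide leaves the exact up/down-weighting and merging steps open:

```python
# Sketch of the generic boosting loop above, with AdaBoost-style
# weight updates and a weighted-vote merge (an assumed instantiation).
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def boosting_fit(X, y, T=50):
    """Fit a sequence of weighted models; y must be coded as -1/+1."""
    n = len(X)
    w = np.full(n, 1.0 / n)                    # step 1: equal weights
    models, alphas = [], []
    for _ in range(T):                         # step 2
        m = DecisionTreeClassifier(max_depth=1)  # a decision stump
        m.fit(X, y, sample_weight=w)           # 2.1: fit to weighted dataset
        pred = m.predict(X)
        err = w[pred != y].sum()
        if err >= 0.5:                         # no better than chance: stop
            break
        alpha = 0.5 * np.log((1.0 - err) / max(err, 1e-10))
        w *= np.exp(-alpha * y * pred)         # 2.2-2.3: up/downweight instances
        w /= w.sum()                           # renormalize weights
        models.append(m)
        alphas.append(alpha)
    return models, alphas

def boosting_predict(models, alphas, X):
    """Step 3: merge the sequence via a weighted majority vote."""
    score = sum(a * m.predict(X) for m, a in zip(models, alphas))
    return np.sign(score)
```

The weight update increases the weight of instances the current model got wrong and decreases the weight of those it got right, so the next model in the sequence concentrates on the hard cases.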
8. Conclusions and References
- A boosted naïve Bayes classifier tied for first place in the 1997 KDD Cup.
- Reference:
  - John F. Elder and Greg Ridgeway, "Combining Estimators to Improve Performance", KDD-99 tutorial notes.