Bagging & Boosting, HITSCIR-TM, zkli-李泽魁, March 24, 2015
Outline: Ensemble Methods; Bagging; Boosting; Random Forests; Gradient Boosting Decision Trees
Ensemble learning: Train. Basic idea: if one classifier works well, why not use multiple classifiers! [Diagram: the training data is fed to m learning algorithms, producing model 1 through model m.]
Ensemble learning: Test. Basic idea: if one classifier works well, why not use multiple classifiers! [Diagram: a test example to label is passed to model 1 through model m, producing prediction 1 through prediction m.] Take the majority vote; if the models output probabilities, take a weighted vote.
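A minimal sketch of these two combination rules (majority vote for hard labels, a weighted average for probabilistic outputs); the model outputs and weights below are hypothetical placeholders, not values from the slides:

```python
from collections import Counter

def majority_vote(predictions):
    """Return the label predicted by the most models (ties broken arbitrarily)."""
    return Counter(predictions).most_common(1)[0][0]

def weighted_vote(probabilities, weights):
    """Weighted average of each model's predicted probability of the positive class."""
    return sum(w * p for w, p in zip(weights, probabilities)) / sum(weights)

# Hypothetical outputs from m = 3 models for one test example
print(majority_vote([1, 1, -1]))                        # -> 1
print(weighted_vote([0.9, 0.6, 0.2], [1.0, 1.0, 0.5]))  # -> 0.64
```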
Benefits of ensemble learning. Assume each classifier makes a mistake independently with some probability (e.g. 0.4, that is, a 40% error rate). With 3 classifiers (C = correct, I = incorrect):

model 1  model 2  model 3   probability
C        C        C         0.6 * 0.6 * 0.6 = 0.216
C        C        I         0.6 * 0.6 * 0.4 = 0.144
C        I        C         0.6 * 0.4 * 0.6 = 0.144
C        I        I         0.6 * 0.4 * 0.4 = 0.096
I        C        C         0.4 * 0.6 * 0.6 = 0.144
I        C        I         0.4 * 0.6 * 0.4 = 0.096
I        I        C         0.4 * 0.4 * 0.6 = 0.096
I        I        I         0.4 * 0.4 * 0.4 = 0.064

The majority vote is wrong only when at least two classifiers are wrong: 0.096 + 0.096 + 0.096 + 0.064 = 0.352, i.e. about 35% error instead of 40%.
Given enough classifiers… [Plot: majority-vote error rate as the number of independent classifiers grows from 1 to about 200, each individual classifier having error rate r = 0.4; the ensemble error falls from 0.4 toward 0.]
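The curve above (and the 35% figure on the previous slide) comes from a binomial calculation: the majority vote is wrong only when more than half of the classifiers are wrong. A small sketch of that calculation, assuming fully independent classifiers:

```python
from math import comb

def majority_error(m, r):
    """P(majority of m independent classifiers is wrong), each with error rate r."""
    return sum(comb(m, k) * r**k * (1 - r)**(m - k) for k in range(m // 2 + 1, m + 1))

print(round(majority_error(3, 0.4), 3))    # 0.352, the ~35% from the 3-classifier table
print(round(majority_error(101, 0.4), 3))  # far smaller with enough classifiers
```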
Obtaining independent classifiers. [Diagram: the training data is fed to m learning algorithms, producing model 1 through model m.] Where do we get m independent classifiers?
Idea 1: different learning methods. [Diagram: the same training data is fed to several different learning algorithms, e.g. decision tree, k-NN, perceptron, naïve Bayes, gradient descent variant 1, gradient descent variant 2, …, each producing its own model.] Pros/cons?
Idea 1: different learning methods.
Pros:
- Lots of existing classifiers already
- Can work well for some problems
Cons/concerns:
- Often the classifiers are not independent, that is, they make the same mistakes!
  - e.g. many of these classifiers are linear models
  - voting won't help us if they're making the same mistakes
Idea 2: split up the training data. [Diagram: the training data is split into part 1 through part m; the same learning algorithm is trained on each part, producing model 1 through model m.] Use the same learning algorithm, but train on different parts of the training data.
Idea 2: split up the training data.
Pros:
- Learning from different data, so the models can't overfit to the same examples
- Easy to implement
- Fast
Cons/concerns:
- Each classifier is only trained on a small amount of data
- Not clear why this would do any better than training on the full data and using good regularization
Outline: Ensemble Methods; Bagging; Boosting; Random Forests; Gradient Boosting Decision Trees
Idea 3: bagging. [Diagram: from the original training data, build training data 1 through training data m; the same learning algorithm is trained on each, producing model 1 through model m.]
Bagging: sampling with replacement. [Diagram: "training" data 1, "training" data 2, … are each drawn from the original training data by sampling with replacement.] Use the training data as a proxy for the data-generating distribution.
Bagging: create m "new" training data sets by sampling with replacement from the original training data set (these are called m "bootstrap" samples); train a classifier on each of these data sets; to classify, take the majority vote from the m classifiers. A sketch of the procedure follows.
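A minimal sketch of this bagging procedure, using decision trees as the base learner (numpy and scikit-learn are assumed to be available; the base learner, m = 25, and the {-1, +1} label convention are illustrative choices, not taken from the slides):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, m=25, seed=0):
    """Train m classifiers, each on a bootstrap sample of (X, y)."""
    rng = np.random.default_rng(seed)
    models = []
    for _ in range(m):
        idx = rng.integers(0, len(X), size=len(X))   # sample with replacement
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    """Majority vote over the m classifiers (labels assumed to be -1/+1)."""
    votes = np.stack([model.predict(X) for model in models])
    return np.sign(votes.sum(axis=0))
```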
Bagging overlap. [Diagram: training data 1 through training data m, all drawn from the same original data.] Won't these all be basically the same? On average, a randomly sampled data set will only contain about 63% of the examples in the original.
Probability of overlap. [Plot: probability that a given example is left out of a bootstrap sample, as a function of the sample size; it converges very quickly to 1/e, so the fraction of original examples that do appear converges to 1 - 1/e ≈ 63%.]
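The 63% figure can be checked directly: a single draw misses a given example with probability 1 - 1/n, so a whole bootstrap sample of size n misses it with probability (1 - 1/n)^n, which approaches 1/e. A quick numerical check:

```python
from math import e

for n in (10, 100, 1000):
    included = 1 - (1 - 1 / n) ** n   # expected fraction of distinct originals in the sample
    print(n, round(included, 4))       # approaches 1 - 1/e as n grows
print(round(1 - 1 / e, 4))             # limit: 0.6321
```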
When does bagging work? Let's say 10% of our examples are noisy (i.e. they don't provide good information). When bagging samples the data, about a third of those noisy examples are not used to train any given model. For classifiers that have trouble with noisy examples, this can help.
When does bagging work? Bagging tends to reduce the variance of the classifier; by voting, the classifiers become more robust to noisy examples. Bagging is most useful for classifiers that are:
- Unstable: small changes in the training set produce very different models
- Prone to overfitting
Review. Idea 1: different learning methods. [Diagram: the same training data fed to decision tree, k-NN, perceptron, naïve Bayes, gradient descent variants, …] Concern: the classifiers are not independent.
Review. Idea 2: split up the training data. [Diagram: the training data is split into parts, and the same learning algorithm is trained on each part.] Use the same learning algorithm, but train on different parts of the training data. Concern: each classifier is only trained on a small amount of data.
Review. Idea 3: bagging. [Diagram: the same learning algorithm trained on bootstrap samples training data 1 through training data m.] More robust to noisy examples.
Outline: Ensemble Methods; Bagging; Boosting; Random Forests; Gradient Boosting Decision Trees
Boosting basics. Start with equally weighted examples. [Diagram: examples E1 through E5, each with the same weight.] Learn a weak classifier: weak 1.
Boosting. [Diagram: weak 1 classifies some of the examples correctly and others incorrectly.] We want to reweight the examples and then learn another weak classifier. How should we change the example weights?
Boosting. [Diagram: examples E1 through E5 with updated weights.] Learn another weak classifier: weak 2. [Diagram: weak 2's correct and incorrect classifications on the reweighted examples.]
Boosting: decrease the weight of the examples we're getting correct; increase the weight of the examples we're getting incorrect.
Classifying: each weak classifier (weak 1, weak 2, …) makes a prediction, and we take a weighted vote based on how well each classifier does on the training data; here weak 2's vote counts more than weak 1's, since it got more examples right.
AdaBoost: train. For k = 1 to the number of iterations:
- classifier_k = learn a weak classifier based on the current example weights
- calculate the weighted classification error ε_k of this classifier
- calculate the "score" (coefficient) of this classifier: α_k = ½ ln((1 - ε_k) / ε_k)
- update the example weights
AdaBoost: train. Update the example weights: w_i ← w_i · exp(-α_k · y_i · h_k(x_i)) / Z_k. Remember, we want to enforce that the weights still sum to 1, so Z_k is the normalizing constant (i.e. the sum of the "new" w_i before normalization): Z_k = Σ_i w_i · exp(-α_k · y_i · h_k(x_i)).
AdaBoost: train. In this update, when example i is classified correctly, y_i and h_k(x_i) agree (their product is positive), the exponent is negative, and the weight is multiplied by a small value; when it is classified incorrectly, the product is negative, the exponent is positive, and the weight is multiplied by a large value. Note: weights are changed based only on the current classifier (not all previous classifiers).
AdaBoost worked example. Training data:
index:  1   2   3   4   5   6   7   8   9   10
x:      0   1   2   3   4   5   6   7   8   9
y:      1   1   1  -1  -1  -1   1   1   1  -1
The weak classifier is a decision stump, which classifies according to x > v versus x < v.
Round 1. Initial weights w1_i = 0.1 for every example:
index:  1    2    3    4    5    6    7    8    9    10
x:      0    1    2    3    4    5    6    7    8    9
y:      1    1    1   -1   -1   -1    1    1    1   -1
w1_i:   0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1
- Compute the coefficient of Classifier 1.
- Update the weight distribution of the training data:
  D2 = (0.0715, 0.0715, 0.0715, 0.0715, 0.0715, 0.0715, 0.1666, 0.1666, 0.1666, 0.0715)
- f1(x) = 0.4236 * Classifier1(x)
- The classifier sign(f1(x)) misclassifies 3 points of the training set.
Round 2. Weights:
index:  1       2       3       4       5       6       7       8       9       10
x:      0       1       2       3       4       5       6       7       8       9
y:      1       1       1      -1      -1      -1       1       1       1      -1
w2_i:   0.0715  0.0715  0.0715  0.0715  0.0715  0.0715  0.1666  0.1666  0.1666  0.0715
- Learn Classifier 2, compute its coefficient (0.6496), and update the weight distribution;
  f2(x) = 0.4236 * Classifier1(x) + 0.6496 * Classifier2(x), and sign(f2(x)) still misclassifies 3 points.
Round 3. Weights:
index:  1       2       3       4       5       6       7       8       9       10
x:      0       1       2       3       4       5       6       7       8       9
y:      1       1       1      -1      -1      -1       1       1       1      -1
w3_i:   0.0455  0.0455  0.0455  0.1667  0.1667  0.1667  0.1060  0.1060  0.1060  0.0455
- M = 3: keep iterating until the combined classifier classifies every training sample correctly.
- f3(x) = 0.4236 * Classifier1(x) + 0.6496 * Classifier2(x) + 0.7514 * Classifier3(x)
- The classifier sign(f3(x)) misclassifies 0 points of the training set.
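A minimal AdaBoost sketch with decision-stump weak learners, following the update rules above; running it on this ten-point data set reproduces the example. The stump search over thresholds and directions is an implementation choice of this sketch, not something specified on the slides, and exact arithmetic gives the third coefficient as about 0.7520 rather than 0.7514 (the slide's value comes from rounded intermediate weights):

```python
import numpy as np

def fit_stump(x, y, w):
    """Best decision stump on weighted data: predict s if x > v, else -s."""
    best = None
    for v in np.arange(x.min() - 0.5, x.max() + 1.0, 1.0):
        for s in (1, -1):
            pred = np.where(x > v, s, -s)
            err = w[pred != y].sum()             # weighted classification error
            if best is None or err < best[0]:
                best = (err, v, s)
    return best

def adaboost(x, y, rounds=3):
    w = np.full(len(x), 1.0 / len(x))            # start with equal weights
    ensemble = []
    for _ in range(rounds):
        err, v, s = fit_stump(x, y, w)
        alpha = 0.5 * np.log((1 - err) / err)    # classifier coefficient
        pred = np.where(x > v, s, -s)
        w = w * np.exp(-alpha * y * pred)        # up-weight mistakes, down-weight correct ones
        w /= w.sum()                             # normalize by Z_k
        ensemble.append((alpha, v, s))
    return ensemble

x = np.arange(10)
y = np.array([1, 1, 1, -1, -1, -1, 1, 1, 1, -1])
for alpha, v, s in adaboost(x, y):
    print(round(alpha, 4), v, s)   # coefficients 0.4236, 0.6496, ~0.752
```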
Boosting tips:
- The performance of AdaBoost depends on the data and on the weak learner.
- Consistent with theory, AdaBoost can fail if:
  - the weak classifier is too complex (overfitting)
  - the weak classifier is too weak (underfitting)
- Empirically, AdaBoost seems especially susceptible to uniform noise.
Other boosting methods: the choice of loss function, and of the method used to minimize it, determines the final behavior of boosting. (Figure from Machine Learning: A Probabilistic Perspective.)
Outline: Ensemble Methods; Bagging; Boosting; Random Forests; Gradient Boosting Decision Trees
Applications of ensemble methods: Random Forests; Gradient Boosting Decision Trees.
Random Forests. The term comes from "random decision forests", first proposed by Tin Kam Ho of Bell Labs in 1995. The method combines Breiman's "bagging" idea with the random selection of features.
Random Forests: two random sampling processes. The first is the row sampling of bagging (each tree is trained on a bootstrap sample of the examples); the second is column sampling (random selection of features): at every node of a tree, choose m of the M features (m << M) and find the best split point among them. Splits are chosen according to a purity measure:
- squared error (regression)
- Gini index (classification)
- …
A small sketch of the column-sampling split search follows.
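A sketch of the per-node column sampling described above: only a random subset of m out of M features is considered, and candidate splits are scored with the Gini index. The function names and the m = sqrt(M) default are illustrative assumptions, not taken from the slides:

```python
import numpy as np

def gini(y):
    """Gini impurity of a set of class labels."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split_on_random_features(X, y, m=None, rng=None):
    """Pick the best (feature, threshold) among a random subset of m features."""
    rng = rng or np.random.default_rng()
    n, M = X.shape
    m = m or max(1, int(np.sqrt(M)))                  # typical choice: m = sqrt(M) << M
    best = None
    for j in rng.choice(M, size=m, replace=False):    # column sampling
        for t in np.unique(X[:, j]):
            left, right = y[X[:, j] <= t], y[X[:, j] > t]
            if len(left) == 0 or len(right) == 0:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / n  # weighted Gini
            if best is None or score < best[0]:
                best = (score, j, t)
    return best  # (weighted Gini after split, feature index, threshold)
```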
Compared with boosting.
Pros:
- It is more robust.
- It is faster to train (no reweighting; each split considers only a small subset of the data and features).
- It can handle missing/partial data.
- It is easier to extend to an online version.
Cons:
- The feature selection process is not explicit.
- Feature fusion is also less obvious.
- It has weaker performance on small training data sets.
RF vs. GBDT: tests with MLlib in Spark. [Plot: model size comparison.]
RF vs. GBDT: tests with MLlib in Spark. [Plot: effect of training data size.]
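The original MLlib plots are not reproduced here. As a rough illustration only, a similar comparison could be run locally with scikit-learn instead of Spark MLlib; the data set, tree counts, and depths below are arbitrary placeholder choices and will not match the MLlib experiments:

```python
import time
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier

X, y = make_classification(n_samples=20000, n_features=20, random_state=0)

for name, model in [
    ("RF",   RandomForestClassifier(n_estimators=100, max_depth=8, random_state=0)),
    ("GBDT", GradientBoostingClassifier(n_estimators=100, max_depth=4, random_state=0)),
]:
    start = time.time()
    model.fit(X, y)
    print(name, "train time: %.1fs" % (time.time() - start),
          "train accuracy: %.3f" % model.score(X, y))
```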
References:
- A Course in Machine Learning, Hal Daumé III
- 统计学习方法 (Statistical Learning Methods), 李航 (Li Hang)
- 机器学习_任彬_20150323.PPT
- bootstrap, boosting, bagging, leijuan_apple
- Ensemble learning, GJS
- CART, Bagging, Random Forest, Boosting, Rachel-Zhang
- AdaBoost: 从原理到实现 (from principles to implementation), Dark_Scope
- Treelink模型测试报告 (Treelink model test report), wujun
- GBDT 迭代决策树入门教程 (introductory tutorial on gradient boosting decision trees), king
- Random Forests & Boosting in MLlib, Databricks Blog