Bagging & Boosting, HITSCIR-TM, zkli-李泽魁, March 24, 2015
Outline: Ensemble Methods; Bagging; Boosting; Random Forests; Gradient Boosting Decision Trees
Ensemble learning: Train. Basic idea: if one classifier works well, why not use multiple classifiers! [Diagram: the training data is fed to m learning algorithms, producing model 1 through model m.]
Ensemble learning: Test. Basic idea: if one classifier works well, why not use multiple classifiers! [Diagram: a test example to label is passed to model 1 through model m, producing prediction 1 through prediction m.] Take the majority vote; if the models output probabilities, take a weighted vote.
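A minimal sketch of these two combination rules (majority vote for hard labels, a weighted average for probabilistic outputs); the model outputs and weights below are hypothetical placeholders, not values from the slides:

```python
from collections import Counter

def majority_vote(predictions):
    """Return the label predicted by the most models (ties broken arbitrarily)."""
    return Counter(predictions).most_common(1)[0][0]

def weighted_vote(probabilities, weights):
    """Weighted average of each model's predicted probability of the positive class."""
    return sum(w * p for w, p in zip(weights, probabilities)) / sum(weights)

# Hypothetical outputs from m = 3 models for one test example
print(majority_vote([1, 1, -1]))                        # -> 1
print(weighted_vote([0.9, 0.6, 0.2], [1.0, 1.0, 0.5]))  # -> 0.64
```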
Benefits of ensemble learning. Assume each classifier makes a mistake independently with some probability (e.g. 0.4, that is, a 40% error rate). With 3 classifiers (C = correct, I = incorrect):

model 1  model 2  model 3   probability
C        C        C         0.6 * 0.6 * 0.6 = 0.216
C        C        I         0.6 * 0.6 * 0.4 = 0.144
C        I        C         0.6 * 0.4 * 0.6 = 0.144
C        I        I         0.6 * 0.4 * 0.4 = 0.096
I        C        C         0.4 * 0.6 * 0.6 = 0.144
I        C        I         0.4 * 0.6 * 0.4 = 0.096
I        I        C         0.4 * 0.4 * 0.6 = 0.096
I        I        I         0.4 * 0.4 * 0.4 = 0.064

The majority vote is wrong only when at least two classifiers are wrong: 0.096 + 0.096 + 0.096 + 0.064 = 0.352, i.e. about 35% error instead of 40%.
Given enough classifiers… [Plot: majority-vote error rate as the number of independent classifiers grows from 1 to about 200, each individual classifier having error rate r = 0.4; the ensemble error falls from 0.4 toward 0.]
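The curve above (and the 35% figure on the previous slide) comes from a binomial calculation: the majority vote is wrong only when more than half of the classifiers are wrong. A small sketch of that calculation, assuming fully independent classifiers:

```python
from math import comb

def majority_error(m, r):
    """P(majority of m independent classifiers is wrong), each with error rate r."""
    return sum(comb(m, k) * r**k * (1 - r)**(m - k) for k in range(m // 2 + 1, m + 1))

print(round(majority_error(3, 0.4), 3))    # 0.352, the ~35% from the 3-classifier table
print(round(majority_error(101, 0.4), 3))  # far smaller with enough classifiers
```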
Obtaining independent classifiers. [Diagram: the training data is fed to m learning algorithms, producing model 1 through model m.] Where do we get m independent classifiers?
Idea 1: different learning methods. [Diagram: the same training data is fed to several different learning algorithms, e.g. decision tree, k-NN, perceptron, naïve Bayes, gradient descent variant 1, gradient descent variant 2, …, each producing its own model.] Pros/cons?
Idea 1: different learning methods.
Pros:
- Lots of existing classifiers already
- Can work well for some problems
Cons/concerns:
- Often the classifiers are not independent, that is, they make the same mistakes!
  - e.g. many of these classifiers are linear models
  - voting won't help us if they're making the same mistakes
Idea 2: split up the training data. [Diagram: the training data is split into part 1 through part m; the same learning algorithm is trained on each part, producing model 1 through model m.] Use the same learning algorithm, but train on different parts of the training data.
Idea 2: split up the training data.
Pros:
- Learning from different data, so the models can't overfit to the same examples
- Easy to implement
- Fast
Cons/concerns:
- Each classifier is only trained on a small amount of data
- Not clear why this would do any better than training on the full data and using good regularization
Outline: Ensemble Methods; Bagging; Boosting; Random Forests; Gradient Boosting Decision Trees
Idea 3: bagging. [Diagram: from the original training data, build training data 1 through training data m; the same learning algorithm is trained on each, producing model 1 through model m.]
Bagging: sampling with replacement. [Diagram: "training" data 1, "training" data 2, … are each drawn from the original training data by sampling with replacement.] Use the training data as a proxy for the data-generating distribution.
Bagging: create m "new" training data sets by sampling with replacement from the original training data set (these are called m "bootstrap" samples); train a classifier on each of these data sets; to classify, take the majority vote from the m classifiers. A sketch of the procedure follows.
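A minimal sketch of this bagging procedure, using decision trees as the base learner (numpy and scikit-learn are assumed to be available; the base learner, m = 25, and the {-1, +1} label convention are illustrative choices, not taken from the slides):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, m=25, seed=0):
    """Train m classifiers, each on a bootstrap sample of (X, y)."""
    rng = np.random.default_rng(seed)
    models = []
    for _ in range(m):
        idx = rng.integers(0, len(X), size=len(X))   # sample with replacement
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    """Majority vote over the m classifiers (labels assumed to be -1/+1)."""
    votes = np.stack([model.predict(X) for model in models])
    return np.sign(votes.sum(axis=0))
```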
Bagging overlap. [Diagram: training data 1 through training data m, all drawn from the same original data.] Won't these all be basically the same? On average, a randomly sampled data set will only contain about 63% of the examples in the original.
Probability of overlap. [Plot: probability that a given example is left out of a bootstrap sample, as a function of the sample size; it converges very quickly to 1/e, so the fraction of original examples that do appear converges to 1 - 1/e ≈ 63%.]
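The 63% figure can be checked directly: a single draw misses a given example with probability 1 - 1/n, so a whole bootstrap sample of size n misses it with probability (1 - 1/n)^n, which approaches 1/e. A quick numerical check:

```python
from math import e

for n in (10, 100, 1000):
    included = 1 - (1 - 1 / n) ** n   # expected fraction of distinct originals in the sample
    print(n, round(included, 4))       # approaches 1 - 1/e as n grows
print(round(1 - 1 / e, 4))             # limit: 0.6321
```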
When does bagging work? Let's say 10% of our examples are noisy (i.e. they don't provide good information). When bagging samples the data, about a third of those noisy examples are not used to train any given model. For classifiers that have trouble with noisy examples, this can help.
When does bagging work? Bagging tends to reduce the variance of the classifier; by voting, the classifiers become more robust to noisy examples. Bagging is most useful for classifiers that are:
- Unstable: small changes in the training set produce very different models
- Prone to overfitting
Review. Idea 1: different learning methods. [Diagram: the same training data fed to decision tree, k-NN, perceptron, naïve Bayes, gradient descent variants, …] Concern: the classifiers are not independent.
Review. Idea 2: split up the training data. [Diagram: the training data is split into parts, and the same learning algorithm is trained on each part.] Use the same learning algorithm, but train on different parts of the training data. Concern: each classifier is only trained on a small amount of data.
Review. Idea 3: bagging. [Diagram: the same learning algorithm trained on bootstrap samples training data 1 through training data m.] More robust to noisy examples.
Outline: Ensemble Methods; Bagging; Boosting; Random Forests; Gradient Boosting Decision Trees
Boosting basics. Start with equally weighted examples. [Diagram: examples E1 through E5, each with the same weight.] Learn a weak classifier: weak 1.
Boosting. [Diagram: weak 1 classifies some of the examples correctly and others incorrectly.] We want to reweight the examples and then learn another weak classifier. How should we change the example weights?
Boosting. [Diagram: examples E1 through E5 with updated weights.] Learn another weak classifier: weak 2. [Diagram: weak 2's correct and incorrect classifications on the reweighted examples.]
Boosting: decrease the weight of the examples we're getting correct; increase the weight of the examples we're getting incorrect.
Classifying: each weak classifier (weak 1, weak 2, …) makes a prediction, and we take a weighted vote based on how well each classifier does on the training data; here weak 2's vote counts more than weak 1's, since it got more examples right.
AdaBoost: train. For k = 1 to the number of iterations:
- classifier_k = learn a weak classifier based on the current example weights
- calculate the weighted classification error ε_k of this classifier
- calculate the "score" (coefficient) of this classifier: α_k = ½ ln((1 - ε_k) / ε_k)
- update the example weights
AdaBoost: train. Update the example weights: w_i ← w_i · exp(-α_k · y_i · h_k(x_i)) / Z_k. Remember, we want to enforce that the weights still sum to 1, so Z_k is the normalizing constant (i.e. the sum of the "new" w_i before normalization): Z_k = Σ_i w_i · exp(-α_k · y_i · h_k(x_i)).
AdaBoost: train. In this update, when example i is classified correctly, y_i and h_k(x_i) agree (their product is positive), the exponent is negative, and the weight is multiplied by a small value; when it is classified incorrectly, the product is negative, the exponent is positive, and the weight is multiplied by a large value. Note: weights are changed based only on the current classifier (not all previous classifiers).
AdaBoost worked example. Training data:
index:  1   2   3   4   5   6   7   8   9   10
x:      0   1   2   3   4   5   6   7   8   9
y:      1   1   1  -1  -1  -1   1   1   1  -1
The weak classifier is a decision stump, which classifies according to x > v versus x < v.
Round 1. Initial weights w1_i = 0.1 for every example:
index:  1    2    3    4    5    6    7    8    9    10
x:      0    1    2    3    4    5    6    7    8    9
y:      1    1    1   -1   -1   -1    1    1    1   -1
w1_i:   0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1
- Compute the coefficient of Classifier 1.
- Update the weight distribution of the training data:
  D2 = (0.0715, 0.0715, 0.0715, 0.0715, 0.0715, 0.0715, 0.1666, 0.1666, 0.1666, 0.0715)
- f1(x) = 0.4236 * Classifier1(x)
- The classifier sign(f1(x)) misclassifies 3 points of the training set.
Round 2. Weights:
index:  1       2       3       4       5       6       7       8       9       10
x:      0       1       2       3       4       5       6       7       8       9
y:      1       1       1      -1      -1      -1       1       1       1      -1
w2_i:   0.0715  0.0715  0.0715  0.0715  0.0715  0.0715  0.1666  0.1666  0.1666  0.0715
- Learn Classifier 2, compute its coefficient (0.6496), and update the weight distribution;
  f2(x) = 0.4236 * Classifier1(x) + 0.6496 * Classifier2(x), and sign(f2(x)) still misclassifies 3 points.
Round 3. Weights:
index:  1       2       3       4       5       6       7       8       9       10
x:      0       1       2       3       4       5       6       7       8       9
y:      1       1       1      -1      -1      -1       1       1       1      -1
w3_i:   0.0455  0.0455  0.0455  0.1667  0.1667  0.1667  0.1060  0.1060  0.1060  0.0455
- M = 3: keep iterating until the combined classifier classifies every training sample correctly.
- f3(x) = 0.4236 * Classifier1(x) + 0.6496 * Classifier2(x) + 0.7514 * Classifier3(x)
- The classifier sign(f3(x)) misclassifies 0 points of the training set.
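A minimal AdaBoost sketch with decision-stump weak learners, following the update rules above; running it on this ten-point data set reproduces the example. The stump search over thresholds and directions is an implementation choice of this sketch, not something specified on the slides, and exact arithmetic gives the third coefficient as about 0.7520 rather than 0.7514 (the slide's value comes from rounded intermediate weights):

```python
import numpy as np

def fit_stump(x, y, w):
    """Best decision stump on weighted data: predict s if x > v, else -s."""
    best = None
    for v in np.arange(x.min() - 0.5, x.max() + 1.0, 1.0):
        for s in (1, -1):
            pred = np.where(x > v, s, -s)
            err = w[pred != y].sum()             # weighted classification error
            if best is None or err < best[0]:
                best = (err, v, s)
    return best

def adaboost(x, y, rounds=3):
    w = np.full(len(x), 1.0 / len(x))            # start with equal weights
    ensemble = []
    for _ in range(rounds):
        err, v, s = fit_stump(x, y, w)
        alpha = 0.5 * np.log((1 - err) / err)    # classifier coefficient
        pred = np.where(x > v, s, -s)
        w = w * np.exp(-alpha * y * pred)        # up-weight mistakes, down-weight correct ones
        w /= w.sum()                             # normalize by Z_k
        ensemble.append((alpha, v, s))
    return ensemble

x = np.arange(10)
y = np.array([1, 1, 1, -1, -1, -1, 1, 1, 1, -1])
for alpha, v, s in adaboost(x, y):
    print(round(alpha, 4), v, s)   # coefficients 0.4236, 0.6496, ~0.752
```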
Boosting tips:
- The performance of AdaBoost depends on the data and on the weak learner.
- Consistent with theory, AdaBoost can fail if:
  - the weak classifier is too complex (overfitting)
  - the weak classifier is too weak (underfitting)
- Empirically, AdaBoost seems especially susceptible to uniform noise.
Other boosting methods: the choice of loss function, and of the method used to minimize it, determines the final behavior of boosting. (Figure from Machine Learning: A Probabilistic Perspective.)
Outline: Ensemble Methods; Bagging; Boosting; Random Forests; Gradient Boosting Decision Trees
Applications of ensemble methods: Random Forests; Gradient Boosting Decision Trees.
Random Forests. The term comes from "random decision forests", first proposed by Tin Kam Ho of Bell Labs in 1995. The method combines Breiman's "bagging" idea with the random selection of features.
Random Forests: two random sampling processes. The first is the row sampling of bagging (each tree is trained on a bootstrap sample of the examples); the second is column sampling (random selection of features): at every node of a tree, choose m of the M features (m << M) and find the best split point among them. Splits are chosen according to a purity measure:
- squared error (regression)
- Gini index (classification)
- …
A small sketch of the column-sampling split search follows.
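A sketch of the per-node column sampling described above: only a random subset of m out of M features is considered, and candidate splits are scored with the Gini index. The function names and the m = sqrt(M) default are illustrative assumptions, not taken from the slides:

```python
import numpy as np

def gini(y):
    """Gini impurity of a set of class labels."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split_on_random_features(X, y, m=None, rng=None):
    """Pick the best (feature, threshold) among a random subset of m features."""
    rng = rng or np.random.default_rng()
    n, M = X.shape
    m = m or max(1, int(np.sqrt(M)))                  # typical choice: m = sqrt(M) << M
    best = None
    for j in rng.choice(M, size=m, replace=False):    # column sampling
        for t in np.unique(X[:, j]):
            left, right = y[X[:, j] <= t], y[X[:, j] > t]
            if len(left) == 0 or len(right) == 0:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / n  # weighted Gini
            if best is None or score < best[0]:
                best = (score, j, t)
    return best  # (weighted Gini after split, feature index, threshold)
```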
Compared with boosting.
Pros:
- It is more robust.
- It is faster to train (no reweighting; each split considers only a small subset of the data and features).
- It can handle missing/partial data.
- It is easier to extend to an online version.
Cons:
- The feature selection process is not explicit.
- Feature fusion is also less obvious.
- It has weaker performance on small training data sets.
RF vs. GBDT: tests with MLlib in Spark. [Plot: model size comparison.]
RF vs. GBDT: tests with MLlib in Spark. [Plot: effect of training data size.]
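The original MLlib plots are not reproduced here. As a rough illustration only, a similar comparison could be run locally with scikit-learn instead of Spark MLlib; the data set, tree counts, and depths below are arbitrary placeholder choices and will not match the MLlib experiments:

```python
import time
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier

X, y = make_classification(n_samples=20000, n_features=20, random_state=0)

for name, model in [
    ("RF",   RandomForestClassifier(n_estimators=100, max_depth=8, random_state=0)),
    ("GBDT", GradientBoostingClassifier(n_estimators=100, max_depth=4, random_state=0)),
]:
    start = time.time()
    model.fit(X, y)
    print(name, "train time: %.1fs" % (time.time() - start),
          "train accuracy: %.3f" % model.score(X, y))
```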
References:
- A Course in Machine Learning, Hal Daumé III
- 统计学习方法 (Statistical Learning Methods), 李航 (Li Hang)
- 机器学习_任彬_20150323.PPT
- bootstrap, boosting, bagging, leijuan_apple
- Ensemble learning, GJS
- CART, Bagging, Random Forest, Boosting, Rachel-Zhang
- AdaBoost: 从原理到实现 (from principles to implementation), Dark_Scope
- Treelink模型测试报告 (Treelink model test report), wujun
- GBDT 迭代决策树入门教程 (introductory tutorial on gradient boosting decision trees), king
- Random Forests & Boosting in MLlib, Databricks Blog