Sampling Approaches to Learning from Imbalanced Datasets

Naoki Abe
IBM T. J. Watson Research Center
10/15/2021

Based on joint work with:
- Bianca Zadrozny, University of California, San Diego
- Hiroshi Mamitsuka, Kyoto University
- John Langford, Edwin Pednault, Chid Apte, et al., IBM T. J. Watson Research Center

Outline
- Introduction
  - Industrial applications and learning from imbalanced datasets
  - Review: Past approaches to learning from imbalanced datasets
- Sampling Approaches to Learning from Imbalanced Datasets
  - Selective sampling based on query learning
  - Sampling for cost-sensitive learning
- Discussion

Industrial Applications and the Issue of Imbalanced Datasets
- Industrial applications
  - Hardware Fault Detection (e.g. Apte, Weiss, Grout '93)
    - Faults are rare but very costly
  - Insurance Risk Modeling (e.g. Pednault, Rosen, Apte '00)
    - Claims are rare but very costly
  - Churn Analysis (e.g. Mamitsuka and Abe '00)
    - Churn is typically rare but quite costly
  - Targeted Marketing (e.g. Zadrozny, Elkan '01)
    - Response is typically rare but can be profitable
  - Airline No-show Prediction (e.g. Lawrence, Hong, et al. '03)
  - Intrusion Detection (e.g. Chan et al.)
    - Intrusion is rare but can be very costly
  - Fraud Detection (e.g. Fawcett and Provost '97)
    - Fraud is rare but very costly
  - Disease is typically rare but can be deadly
- The bottom line: all involve imbalanced data sets

Review: Past Approaches
- Algorithm-Specific Approach
  - Modifying specific learning algorithms to handle imbalanced data sets
  - Using a model class appropriate for modeling imbalanced data
- Cost-sensitive Learning Approach
  - An imbalanced dataset is a problem because the rare class tends to be more costly
  - Is learning from imbalanced datasets an instance of cost-sensitive learning?
- Query/Active Learning Approach
  - An imbalanced dataset is a problem because it provides little information about the decision boundary
  - Is learning from imbalanced datasets an instance of query/active learning?

Algorithm-Specific Approach: A Case Study from Underwriting Profitability Analysis (UPA)
- Partnership between IBM and Farmers Insurance Group
- UPA predicts the expected claim amount paid per unit time (i.e. the pure premium) as a function of risk factors
- The project led to the creation of the IBM ProbE data mining engine:
  - General framework for tree-based modeling
  - Allows an arbitrary model class at the leaves
  - Allows modification of the splitting/pruning rule

Model Class for Imbalanced Datasets
- Claim amounts are modeled by a lognormal distribution
- Claim frequency is modeled by Poisson distributions
- Node splitting is allowed only if the standard error is small
- This approach makes sense if a complete model is desired
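As a rough illustration of how a frequency model and a severity model combine into a pure premium at one leaf, the sketch below multiplies the Poisson claim rate by the lognormal mean. It assumes claim counts and claim amounts are independent, which the slides do not state; the function name `pure_premium` and its parameters are hypothetical and are not part of ProbE.

```python
import math

def pure_premium(claim_rate, mu, sigma):
    """Expected claim amount per unit time for one leaf (illustrative only).

    claim_rate: Poisson rate, i.e. expected number of claims per unit time.
    mu, sigma:  parameters of the lognormal claim-amount distribution.
    Assumes frequency and severity are independent (an assumption of this sketch).
    """
    mean_severity = math.exp(mu + 0.5 * sigma ** 2)  # mean of a lognormal
    return claim_rate * mean_severity

if __name__ == "__main__":
    # A rare-claim leaf: about 2 claims per 100 policy-years.
    print(pure_premium(0.02, mu=8.0, sigma=1.2))
```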


Selective Sampling based on Query Learning
- Active/Query Learning (e.g. Angluin '88)
  - The learner gets to choose the examples on which to request labels
- Uncertainty Sampling (e.g. Lewis and Gale '94)
  - The learner queries examples for which its prediction so far is uncertain, to maximize information gain
  - A prime example is Query by Committee (Seung et al. '92), which queries examples on which the models obtained so far disagree
  - Successfully applied to obtaining labeled data for text classification
- Selective sampling by query learning (e.g. Freund et al. '93)
  - Given a large amount of (possibly unlabeled) data, uses only a small subset of labeled data for learning
  - Successfully applied to mining very large data sets (e.g. Mamitsuka and Abe '00), even when labeled data are abundant

Why query learning for imbalanced data sets?
- Use query learning to get more data for rare classes
- Use selective sampling by query learning to get data near the decision boundary
[Figure: three panels (Case 1, Case 2, Case 3) showing + and - examples with their labels]

Selective Sampling by Query Learning: Overview
- Calculate the "uncertainty" of examples based on the results of the previous iterations
- In the next iteration, select the examples which are most uncertain
- A commonly used measure of uncertainty for classification is the margin
[Figure: data, sample i, model, margin as the uncertainty measure, sample i+1]

QbagS: Query by Bagging (Mamitsuka and Abe '00)

QbagS (Learner A, Sample Set T, sample size s, count t)
(1) For i = 1 to t do
    (a) T' = minimum-margin sub-sample of size s from T
    (b) Let h_i = A(T')
(2) Output h(x) = sign(\sum_{i=1}^{t} h_i(x))

- It belongs to "sequential multi-subset learning with model-guided instance selection methods" (Provost and Kolluri '99)
- What about "boosting" and other related approaches?
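The QbagS loop can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' implementation: labels are taken to be in {-1, +1}, the margin of an example is taken to be the absolute vote total of the hypotheses built so far, ties are broken by a shuffle, and `train_stump` is a hypothetical 1-D base learner standing in for Learner A.

```python
import random

def train_stump(sample):
    """Hypothetical base learner A: best threshold stump on 1-D inputs."""
    best = None
    for thr, _ in sample:
        for sign in (1, -1):
            errors = sum(1 for x, y in sample
                         if (sign if x >= thr else -sign) != y)
            if best is None or errors < best[0]:
                best = (errors, thr, sign)
    _, best_thr, best_sign = best
    return lambda x: best_sign if x >= best_thr else -best_sign

def committee_margin(hyps, x):
    """Assumed margin: |sum of votes|; 0 means the committee is split."""
    return abs(sum(h(x) for h in hyps))

def qbag_s(learner, T, s, t, shuffle=random.shuffle):
    """Sketch of QbagS: train t hypotheses on minimum-margin sub-samples."""
    hyps = []
    for _ in range(t):
        pool = list(T)
        shuffle(pool)  # random tie-breaking; first round is a random sample
        pool.sort(key=lambda ex: committee_margin(hyps, ex[0]))  # stable sort
        hyps.append(learner(pool[:s]))
    # Majority vote; a tied vote is resolved to +1 in this sketch.
    return lambda x: 1 if sum(h(x) for h in hyps) >= 0 else -1

if __name__ == "__main__":
    rng = random.Random(0)
    xs = [rng.random() for _ in range(500)]
    T = [(x, 1 if x >= 0.9 else -1) for x in xs]  # roughly 10% positives
    h = qbag_s(train_stump, T, s=20, t=11, shuffle=rng.shuffle)
    print(sum(1 for x, y in T if h(x) == y) / len(T))
```

Sorting by margin concentrates later sub-samples near the current decision boundary, which is where the rare class tends to be under-represented in a uniform sample.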

Ivotes: Importance Sampling (Breiman '00)

Ivotes (Learner A, Sample Set T, sample size s, count t)
(1) For i = 1 to t do
    (a) T' = importance sample of size s from T
        (an example is accepted with probability 1 if the current hypothesis predicts it wrongly, and with probability e/(1 - e) otherwise*)
    (b) Let h_i = A(T')
(2) Output h(x) = sign(\sum_{i=1}^{t} h_i(x))

* e = error rate of the current hypothesis
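The acceptance rule in step (1)(a) can be sketched as follows. This covers only the sampling step, assuming examples are drawn uniformly at random from T and that the current hypothesis and its error rate e are supplied by the caller; the name `ivotes_sample` and the `max_draws` safety cap are hypothetical additions for this sketch.

```python
import random

def ivotes_sample(T, predict, error_rate, s, rng, max_draws=100000):
    """Draw an importance sample of size s from T (illustrative sketch).

    Misclassified examples are always accepted; correctly classified ones
    are accepted with probability e / (1 - e), capped at 1.
    """
    if error_rate >= 1.0:
        accept_prob = 1.0
    else:
        accept_prob = min(1.0, error_rate / (1.0 - error_rate))
    sample = []
    for _ in range(max_draws):  # cap added so the loop always terminates
        if len(sample) == s:
            break
        x, y = rng.choice(T)
        if predict(x) != y or rng.random() < accept_prob:
            sample.append((x, y))
    return sample
```

When e is small, the sample is dominated by examples the current hypothesis gets wrong, which is the sense in which the sub-sample is importance weighted.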

Empirical Comparison between QbagS and Ivotes
- Medium-sized data sets (Generator from Agrawal '93)
- Large real-world data set (Churn from NEC)

Empirical Comparison between QbagS and Ivotes (II)
- Medium-sized data sets (Generator from Agrawal '93)
- Large real-world data set (Churn from NEC)
  - This is an imbalanced data set (class 1 = churn is roughly 10%)
  - Cost of retention (C_0) is smaller than cost of churn (C_1)
  - Performance is measured using Precision and Recall

The Precision-Recall Measure
- Precision-Recall is often used as an evaluation metric for learning algorithms
  - Precision = P(correct | pred = 1)
  - Recall = P(correct | true = 1)
- Provides a measure for a whole range of relative costs of false positives and false negatives
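The two definitions above can be computed directly from confidence scores. The sketch below is illustrative: it assumes class labels in {0, 1} and a real-valued score per example, and the function name `precision_recall` is hypothetical. Sweeping the threshold traces out the precision-recall trade-off.

```python
def precision_recall(y_true, scores, threshold):
    """Precision and recall for class 1 when predicting 1 iff score >= threshold."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for y, p in zip(y_true, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(y_true, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(y_true, preds) if y == 1 and p == 0)
    # Conventions for empty denominators are a choice made in this sketch.
    precision = tp / (tp + fp) if tp + fp > 0 else 1.0
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    return precision, recall

if __name__ == "__main__":
    y_true = [1, 1, 0, 0, 1]
    scores = [0.9, 0.4, 0.6, 0.1, 0.8]
    for thr in (0.3, 0.5, 0.7):
        print(thr, precision_recall(y_true, scores, thr))
```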

Precision-Recall and Cost Minimization
- Let
  - F_0, F_1 = frequency of class 0 (1)
  - C_0, C_1 = cost when true class is 0 (1)
  - P_0, R = precision when true class is 0 (1)
- Then
  - Expected cost = F_1 (1 - R) C_1 + F_0 (1 - P_0) C_0 = K_1 + F_1 (C_1 - C_0) R + C_0 P
  - Assuming slope of PR-curve = -1, cost is decreased by
    - increasing R if F_1 C_1 > F_0 C_0
    - increasing P if F_1 C_1 < F_0 C_0
- Query learning, or more generally ensemble learning, provides a solution for a whole range of cost landscapes, by virtue of its ranking with respect to confidence of prediction

Outline
- Introduction
  - Industrial applications and learning from imbalanced datasets
  - Review: Past approaches to learning from imbalanced datasets
- Sampling Approach to Learning from Imbalanced Datasets
  - Selective sampling based on query learning
  - Sampling for cost-sensitive learning
  - Cost-sensitive query learning?

Cost-sensitive Learning
- Traditionally assumed a cost matrix of the form:

                True = 0    True = 1
  Predict = 0   C(0, 0)     C(0, 1)
  Predict = 1   C(1, 0)     C(1, 1)

- Zadrozny and Elkan '01 introduced a cost that depends on the particular example x:

                True = 0      True = 1
  Predict = 0   C(0, 0, x)    C(0, 1, x)
  Predict = 1   C(1, 0, x)    C(1, 1, x)

Cost-sensitive learning by cost-proportionate weighted sampling (Zadrozny, Langford, Abe '03)
- Presents a reduction of cost-sensitive learning to classification
  - With a theoretical performance guarantee
- Uses cost-proportionate rejection sampling
- Proposes Costing (cost-sensitive ensemble learning)
- Empirical evaluation using benchmark data sets from the targeted marketing domain
  - Costing achieves excellent predictive performance (w.r.t. cost minimization)
  - Costing is computationally efficient

Translation Theorem
- Assume examples (x, y, c) are drawn i.i.d. from some distribution D over X x Y x R
  - where c = C(1 - y, y, x) - C(y, y, x), i.e. the opportunity cost of misclassifying x
- Let \hat{D}(x, y, c) = \frac{c}{E_{(x,y,c) \sim D}[c]} D(x, y, c); then
  - h minimizing the expected classification error rate for \hat{D} will minimize the expected cost with respect to D
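The reasoning behind the theorem fits in one line. The derivation below is a sketch that assumes the reweighted distribution \hat{D}(x, y, c) = (c / E_D[c]) D(x, y, c) used in the cited Zadrozny, Langford and Abe paper, with I(.) denoting the indicator function and sums standing in for integrals where c is continuous.

```latex
\begin{align*}
E_{(x,y,c)\sim \hat{D}}\big[ I(h(x) \neq y) \big]
  &= \sum_{x,y,c} \hat{D}(x,y,c)\, I(h(x) \neq y) \\
  &= \frac{1}{E_{(x,y,c)\sim D}[c]} \sum_{x,y,c} c\, D(x,y,c)\, I(h(x) \neq y) \\
  &= \frac{1}{E_{(x,y,c)\sim D}[c]}\; E_{(x,y,c)\sim D}\big[ c\, I(h(x) \neq y) \big]
\end{align*}
```

Since E_D[c] does not depend on h, the error rate under \hat{D} and the expected cost under D differ only by a constant factor, so they share the same minimizer.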

Cost-proportionate sampling
- Two methods for weighted sampling
  - Sampling with replacement (for some chosen sample size) from T with probabilities p(x, y, c) proportional to the cost c
  - Rejection sampling from T with the same probabilities, i.e.
    - With probability p(x, y, c), accept the example
    - Otherwise reject the example
    - Continue sampling from T
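The two sampling schemes can be contrasted in a short sketch. It assumes examples are (x, y, c) triples with non-negative costs and that rejection sampling makes a single pass over T, accepting each example with probability c / Z for a constant Z no smaller than the largest cost, as in the cited paper; the function names are hypothetical.

```python
import random

def resample_with_replacement(T, size, rng):
    """Draw `size` examples with probability proportional to cost c.

    Duplicates are possible, especially for high-cost examples.
    """
    costs = [c for _, _, c in T]
    return [(x, y) for x, y, _ in rng.choices(T, weights=costs, k=size)]

def rejection_sample(T, Z, rng):
    """Keep each example independently with probability c / Z (Z >= max cost).

    Every accepted example appears at most once, so no duplicates arise.
    """
    return [(x, y) for x, y, c in T if rng.random() < c / Z]

if __name__ == "__main__":
    rng = random.Random(0)
    T = [("a", 1, 20.0), ("b", 0, 1.0), ("c", 0, 1.0), ("d", 0, 1.0)]
    print(resample_with_replacement(T, 6, rng))
    print(rejection_sample(T, 20.0, rng))
```

The returned examples carry no cost, because the cost information is now encoded in how often each example is selected.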

Sample complexity of cost-proportionate rejection sampling
- Consider the worst-case sample complexity for achieving approximately optimal cost with high probability
- Compare the sample complexity of using the original sample with the sample complexity of using cost-proportionate rejection sampling
- A bound relating the two holds [inequality not recoverable from the slide]
- Cost-proportionate rejection sampling distills the cost-sensitive information in the original sample into a much smaller one

Costing: Cost-based bagging

Costing (Learner A, Sample Set T, count t)
(1) For i = 1 to t do
    (a) T' = cost-proportionate rejection sample from T
    (b) Let h_i = A(T')
(2) Output h(x) = sign(\sum_{i=1}^{t} h_i(x))
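A minimal sketch of the Costing loop follows. It assumes labels in {-1, +1}, strictly positive maximum cost, and a rejection-sampling constant Z equal to the largest cost in T unless one is supplied; `majority_learner` is a hypothetical stand-in for Learner A, chosen only to keep the example short.

```python
import random

def majority_learner(sample):
    """Hypothetical base learner: always predicts the sample's majority label."""
    label = 1 if sum(y for _, y in sample) >= 0 else -1
    return lambda x: label

def costing(learner, T, t, rng, Z=None):
    """Sketch of Costing: bagging over cost-proportionate rejection samples.

    T is a list of (x, y, c) triples with y in {-1, +1} and cost c >= 0.
    """
    if Z is None:
        Z = max(c for _, _, c in T)  # assumed choice of the constant Z
    hyps = []
    for _ in range(t):
        # Accept each example with probability c / Z, then drop the cost.
        sub = [(x, y) for x, y, c in T if rng.random() < c / Z]
        hyps.append(learner(sub))
    # Majority vote; a tied vote is resolved to +1 in this sketch.
    return lambda x: 1 if sum(h(x) for h in hyps) >= 0 else -1

if __name__ == "__main__":
    rng = random.Random(0)
    # Many cheap negatives and a few expensive positives.
    T = [(i, -1, 1.0) for i in range(95)] + [(i, 1, 50.0) for i in range(95, 100)]
    h = costing(majority_learner, T, t=11, rng=rng)
    print(h(0))
```

Because each rejection sample is much smaller than T, the base learner runs on small inputs, and the ensemble vote recovers stability across the t samples.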

Costing results: KDD-98
- Each set has ~600 examples, of which ~55% are positive
- Costing with C4.5 achieves state-of-the-art profit
- Similar, though less impressive, behavior is observed for SVM and Naïve Bayes
[Figure: test set net profit for Costing with C4.5]

Experimental results: Summary
- Experiments using 2 targeted marketing datasets
- Generally, resampling performs poorly and costing performs well
- The extremely poor performance of resampling with C4.5 is thought to be caused by overfitting, due to duplicate examples hindering the complexity control mechanism of C4.5

KDD-98:
  Method        Costing (200)   Resampling (100k)
  NB            $13163          $12026
  Boosted NB    $14714          $13135
  C4.5          $15016          $2259
  SVMLight      $13152          $12808

DMEF-2:
  Method        Costing (200)   Resampling (100k)
  NB            $37629          $34506
  Boosted NB    $37891          $31889
  C4.5          $37500          $3149
  SVMLight      $35290          $33674


Learning from Imbalanced Datasets as Cost-sensitive Learning
- Cost-proportionate weighted sampling solves cost-sensitive learning, and hence learning from imbalanced datasets
- Cost-proportionate rejection sampling and resampling with replacement correspond to under-sampling and over-sampling
- Under-sampling and over-sampling are the special case in which F_1 C_1 = F_0 C_0 and where P = R is optimal (assuming slope of PR-curve = -1)
- "Rejection sampling > Resampling" is consistent with and generalizes "Under-sampling > Over-sampling"

Imbalanced Datasets, Cost-sensitivity, Query Learning
- Generalizing learning from imbalanced datasets to cost-sensitive learning leads to better understanding and a more general solution
- Generalizing learning from imbalanced datasets to query learning offers an alternative solution, which is valid for a whole range of cost landscapes
- The sampling approach (derived from cost-sensitive and query learning) addresses the issue of imbalanced datasets, and also provides a solution with improved computational efficiency
[Figure: diagram relating Cost-sensitive Learning, Query Learning, Cost-proportionate sampling, Uncertainty sampling, and Learning from Imbalanced Dataset]

References
- "Handling imbalanced datasets in insurance risk modeling", E. Pednault, B. Rosen, C. Apte, Learning from Imbalanced Datasets: Papers from the AAAI Workshop, The AAAI Press, 2000. (Also available as IBM Research Report RC-21731.)
- "Efficient mining from large databases by query learning", H. Mamitsuka and N. Abe, Proc. of the Seventeenth Int'l Conf. on Machine Learning (ICML'00).
- "Cost-sensitive learning by cost-proportionate example weighting", B. Zadrozny, J. Langford, N. Abe, Proc. of the Third IEEE Int'l Conf. on Data Mining (ICDM'03), to appear.