TAMING THE LEARNING ZOO
SUPERVISED LEARNING ZOO
Bayesian learning (find parameters of a probabilistic model)
• Maximum likelihood
• Maximum a posteriori
Classification
• Decision trees (discrete attributes, few relevant)
• Support vector machines (continuous attributes)
Regression
• Least squares (known structure, easy to interpret)
• Neural nets (unknown structure, hard to interpret)
Nonparametric approaches
• k-Nearest-Neighbors
• Locally-weighted averaging / regression
VERY APPROXIMATE “CHEAT-SHEET” FOR TECHNIQUES DISCUSSED IN CLASS

| Technique                  | Task | Attributes | N scalability                          | D scalability | Capacity   |
| Bayes nets                 | C    | D          | Good                                   |               |            |
| Naïve Bayes                | C    | D          | Excellent                              |               | Low        |
| Decision trees             | C    | D, C       | Excellent                              |               | Fair       |
| Linear least squares       | R    | C          | Excellent                              |               | Low        |
| Nonlinear LS               | R    | C          | Poor                                   |               | Good       |
| Neural nets                | R    | C          | Poor                                   |               | Good       |
| SVMs                       | C    | C          | Good                                   |               |            |
| Nearest neighbors          | C    | D, C       | Learning: Excellent, Evaluation: Poor  | Poor          | Excellent* |
| Locally-weighted averaging | R    | C          | Learning: Excellent, Evaluation: Poor  | Poor          | Excellent* |
| Boosting                   | C    | D, C       | ?                                      | ?             | Excellent* |

* With “sufficiently large” data sets (nearest neighbors, locally-weighted averaging), or with “sufficiently diverse” weak learners (boosting)

Note: we have looked at a limited subset of existing techniques in this class (typically, the “classical” versions). Most techniques extend to:
• Both C/R tasks (e.g., support vector regression)
• Both continuous and discrete attributes
• Better scalability for certain types of problems
AGENDA
Quantifying learner performance
• Cross validation
• Error vs. loss
• Precision & recall
Model selection
CROSS-VALIDATION
ASSESSING PERFORMANCE OF A LEARNING ALGORITHM
Samples from X are typically unavailable
Take out some of the training set
• Train on the remaining training set
• Test on the excluded instances
• Cross-validation
CROSS-VALIDATION
Split original set of examples; train on the training portion
[Figure: training examples labeled + and −, and a hypothesis learned from hypothesis space H]

CROSS-VALIDATION
Evaluate hypothesis on testing set
[Figure: held-out testing examples labeled + and −, classified by the learned hypothesis]

CROSS-VALIDATION
Compare true concept against prediction: 9/13 correct
[Figure: predicted vs. true labels on the testing set]
COMMON SPLITTING STRATEGIES
k-fold cross-validation
[Figure: dataset divided into k folds; each fold serves once as the test set, the rest as the training set]

COMMON SPLITTING STRATEGIES
Leave-one-out (n-fold cross-validation)
[Figure: a single example held out as the test set on each of n rounds]
COMPUTATIONAL COMPLEXITY
k-fold cross-validation requires
• k training steps on n(k-1)/k datapoints
• k testing steps on n/k datapoints
• (There are efficient ways of computing leave-one-out estimates for some nonparametric techniques, e.g. nearest neighbors)
Average results reported
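The two costs above can be seen in a minimal sketch of k-fold cross-validation; the toy model and the names `k_fold_cv`, `train_mean`, `mse` are illustrative, not from the slides:

```python
import statistics

def k_fold_cv(data, k, train_fn, error_fn):
    """Average held-out error over k folds."""
    n = len(data)
    errors = []
    for i in range(k):
        lo, hi = i * n // k, (i + 1) * n // k
        test = data[lo:hi]             # ~n/k datapoints per testing step
        train = data[:lo] + data[hi:]  # ~n(k-1)/k datapoints per training step
        h = train_fn(train)
        errors.append(error_fn(h, test))
    return statistics.mean(errors)

# Toy model: the hypothesis is the training mean; error is mean squared error.
train_mean = lambda d: statistics.mean(d)
mse = lambda h, test: statistics.mean((x - h) ** 2 for x in test)

data = [0.9, 1.1, 1.0, 0.8, 1.2, 1.05, 0.95, 1.0]
print(k_fold_cv(data, k=4, train_fn=train_mean, error_fn=mse))
```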
BOOTSTRAPPING
Similar technique for estimating the confidence in the model parameters θ
Procedure:
1. Draw k hypothetical datasets from original data, either via cross-validation or sampling with replacement
2. Fit the model for each dataset to compute parameters θ1, …, θk
3. Return the standard deviation of θ1, …, θk (or a confidence interval)
Can also estimate confidence in a prediction y = f(x)
SIMPLE EXAMPLE: AVERAGE OF N NUMBERS
Data D = {x(1), …, x(N)}, model is a constant θ
Learning: minimize E(θ) = Σ_i (x(i) − θ)² => compute average
Repeat for j = 1, …, k:
• Randomly sample subset x(1)’, …, x(N)’ from D
• Learn θ_j = 1/N Σ_i x(i)’
Return histogram of θ1, …, θk
[Figure: bootstrap average with lower/upper range vs. |Data set| from 10 to 10000]
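The procedure above can be sketched as follows; the data values and k are arbitrary, and the hypothetical datasets are drawn by sampling with replacement:

```python
import random
import statistics

def bootstrap_means(data, k, seed=0):
    """Draw k hypothetical datasets by sampling with replacement,
    fit the constant model (the mean) on each, return the k parameters."""
    rng = random.Random(seed)
    return [statistics.mean(rng.choice(data) for _ in data) for _ in range(k)]

data = [0.47, 0.51, 0.49, 0.55, 0.53, 0.50, 0.48, 0.52]
thetas = bootstrap_means(data, k=200)
print("estimate:", statistics.mean(thetas))
print("std dev :", statistics.stdev(thetas))  # confidence in the parameter
```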
BEYOND ERROR RATES
BEYOND ERROR RATE
Predicting security risk
• Predicting “low risk” for a terrorist is far worse than predicting “high risk” for an innocent bystander (but maybe not 5 million of them)
Searching for images
• Returning irrelevant images is worse than omitting relevant ones
BIASED SAMPLE SETS
Often there are orders of magnitude more negative examples than positive
E.g., all images of Kris on Facebook
If I classify all images as “not Kris” I’ll have >99.99% accuracy
Examples of Kris should count much more than non-Kris!
FALSE POSITIVES
[Figure: true concept vs. learned concept in the (x1, x2) plane]

FALSE POSITIVES
An example incorrectly predicted to be positive
[Figure: new query inside the learned concept but outside the true concept]

FALSE NEGATIVES
An example incorrectly predicted to be negative
[Figure: new query outside the learned concept but inside the true concept]
PRECISION VS. RECALL
Precision
• # of relevant documents retrieved / # of total documents retrieved
Recall
• # of relevant documents retrieved / # of total relevant documents
Numbers between 0 and 1

PRECISION VS. RECALL
Precision
• # of true positives / (# true positives + # false positives)
Recall
• # of true positives / (# true positives + # false negatives)
A precise classifier is selective
A classifier with high recall is inclusive
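These definitions translate directly into code; the labels below are made up for illustration:

```python
def precision_recall(y_true, y_pred):
    """y_true, y_pred: lists of booleans, True = positive class."""
    tp = sum(t and p for t, p in zip(y_true, y_pred))
    fp = sum((not t) and p for t, p in zip(y_true, y_pred))
    fn = sum(t and (not p) for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall

# 3 true positives, 1 false positive, 2 false negatives:
y_true = [True, True, True, True, True, False, False, False]
y_pred = [True, True, True, False, False, True, False, False]
print(precision_recall(y_true, y_pred))  # (0.75, 0.6)
```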
REDUCING FALSE POSITIVE RATE
[Figure: learned concept shrunk so it covers fewer points outside the true concept]

REDUCING FALSE NEGATIVE RATE
[Figure: learned concept enlarged so it covers more of the true concept]
PRECISION-RECALL CURVES
Measure precision vs. recall as the classification boundary is tuned
[Figure: precision-recall plane; a perfect classifier sits at precision = recall = 1, actual performance traces a curve below it]

PRECISION-RECALL CURVES
Points along the curve trade off error types: penalize false negatives (high recall), equal weight, or penalize false positives (high precision)

PRECISION-RECALL CURVES
[Figure: two curves compared; the curve closer to the top-right corner indicates better learning performance]
OPTION 1: CLASSIFICATION THRESHOLDS
Many learning algorithms (e.g., linear models, NNets, BNs, SVMs) give real-valued output v(x) that needs thresholding for classification
• v(x) > t => positive label given to x
• v(x) < t => negative label given to x
May want to tune threshold to get fewer false positives or false negatives
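Sweeping the threshold t traces out a precision-recall curve; a minimal sketch, with invented scores and labels:

```python
def pr_curve(scores, labels, thresholds):
    """For each threshold t, label v(x) > t positive; record (t, precision, recall)."""
    curve = []
    for t in thresholds:
        preds = [v > t for v in scores]
        tp = sum(p and y for p, y in zip(preds, labels))
        fp = sum(p and not y for p, y in zip(preds, labels))
        fn = sum((not p) and y for p, y in zip(preds, labels))
        prec = tp / (tp + fp) if tp + fp else 1.0
        rec = tp / (tp + fn) if tp + fn else 1.0
        curve.append((t, prec, rec))
    return curve

scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]   # real-valued outputs v(x)
labels = [True, True, False, True, False, False]
for t, p, r in pr_curve(scores, labels, [0.2, 0.5, 0.85]):
    print(f"t={t}: precision={p:.2f} recall={r:.2f}")
```

Raising t yields fewer false positives (higher precision); lowering t yields fewer false negatives (higher recall).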
OPTION 2: LOSS FUNCTIONS & WEIGHTED DATASETS
General learning problem: “Given data D and loss function L, find the hypothesis from hypothesis class H that minimizes L”
Loss functions: L may contain weights to favor accuracy on positive or negative examples
• E.g., L = 10 E+ + 1 E−
Weighted datasets: attach a weight w to each example to indicate how important it is
• Or construct a resampled dataset D’ where each example is duplicated proportionally to its w
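A weighted 0-1 loss such as L = 10 E+ + 1 E− can be written as follows; the weights are the slide's example values and the labels are made up:

```python
def weighted_error(y_true, y_pred, w_pos=10.0, w_neg=1.0):
    """Errors on positive examples cost w_pos; errors on negatives cost w_neg."""
    return sum((w_pos if t else w_neg)
               for t, p in zip(y_true, y_pred) if t != p)

# One false negative (cost 10) and one false positive (cost 1):
print(weighted_error([True, False, True], [False, True, True]))  # 11.0
```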
MODEL SELECTION AND REGULARIZATION
COMPLEXITY VS. GOODNESS OF FIT
More complex models can fit the data better, but can overfit
Model selection: enumerate several possible hypothesis classes of increasing complexity, stop when cross-validated error levels off
Regularization: explicitly define a metric of complexity and penalize it in addition to loss
MODEL SELECTION WITH K-FOLD CROSS-VALIDATION
Parameterize learner by a complexity level C
Model selection pseudocode:
• For increasing levels of complexity C:
  errT[C], errV[C] = Cross-Validate(Learner, C, examples)
  If errT has converged, stop
• Find value Cbest that minimizes errV[C]
• Return Learner(Cbest, examples)
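The pseudocode above can be sketched in Python; `cross_validate` and `learner` below are stand-in stubs with made-up error values, not a real training procedure:

```python
def model_selection(learner, cross_validate, complexities, examples, tol=1e-3):
    """For increasing complexity C, record training/validation error;
    stop once training error converges, then pick C minimizing validation error."""
    errT, errV = {}, {}
    prev = None
    for C in complexities:
        errT[C], errV[C] = cross_validate(learner, C, examples)
        if prev is not None and abs(prev - errT[C]) < tol:
            break  # training error has converged
        prev = errT[C]
    C_best = min(errV, key=errV.get)
    return learner(C_best, examples)

# Stub errors: training error falls then flattens; validation error dips at C=3.
fake = {1: (0.30, 0.32), 2: (0.15, 0.18), 3: (0.10, 0.12),
        4: (0.10, 0.15), 5: (0.10, 0.20)}
cv = lambda learner, C, ex: fake[C]
learner = lambda C, ex: f"model(C={C})"
print(model_selection(learner, cv, [1, 2, 3, 4, 5], examples=None))
```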
REGULARIZATION
Minimize:
• Cost(h) = Loss(h) + Complexity(h)
Example with linear models y = θᵀx:
• Error: Loss(θ) = Σ_i (y(i) − θᵀx(i))²
• Lq regularization: Complexity(θ) = Σ_j |θ_j|^q
• L2 and L1 are most popular in linear regularization
• L2 regularization leads to simple computation of the optimal θ
• L1 is more complex to optimize, but produces sparse models in which many coefficients are 0!
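The simple computation for L2 can be shown in the one-feature special case (data values invented): minimizing Σ_i (y(i) − θx(i))² + λθ² and setting the derivative to zero gives θ = Σ xy / (Σ x² + λ).

```python
def ridge_1d(xs, ys, lam):
    """L2-regularized least squares for y = theta * x (single feature):
    the derivative of sum (y - theta*x)^2 + lam*theta^2 vanishes at
    theta = sum(x*y) / (sum(x^2) + lam)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 8.0]
print(ridge_1d(xs, ys, lam=0.0))   # plain least squares
print(ridge_1d(xs, ys, lam=10.0))  # coefficient shrunk toward zero
```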
DATA DREDGING
As the number of attributes increases, the likelihood of a learner picking up on patterns that arise purely from chance increases
In the extreme case where there are more attributes than datapoints (e.g., pixels in a video), even very simple hypothesis classes can overfit
• E.g., linear classifiers
Many opportunities for charlatans in the big data age!
OTHER TOPICS IN MACHINE LEARNING
Unsupervised learning
• Dimensionality reduction
• Clustering
Reinforcement learning
• Agent that acts and learns how to act in an environment by observing rewards
Learning from demonstration
• Agent that learns how to act in an environment by observing demonstrations from an expert
ISSUES IN PRACTICE
The distinctions between learning algorithms diminish when you have a lot of data
The web has made it much easier to gather large-scale datasets than in the early days of ML
Understanding data with many more attributes than examples is still a major challenge!
• Do humans just have really great priors?
NEXT LECTURES Temporal sequence models (R&N 15) Decision-theoretic planning Reinforcement learning Applications of AI