Goodfellow: Chapter 5 Machine Learning Basics
Dr. Charles Tappert
The information here, although greatly condensed, comes almost entirely from the chapter content.
Chapter Sections
- Introduction
1. Learning Algorithms
2. Capacity, Overfitting and Underfitting
3. Hyperparameters and Validation Sets
4. Estimators, Bias and Variance
5. Maximum Likelihood Estimation
6. Bayesian Statistics (not covered in this course)
7. Supervised Learning Algorithms
8. Unsupervised Learning Algorithms (later in course)
9. Stochastic Gradient Descent
10. Building a Machine Learning Algorithm
11. Challenges Motivating Deep Learning
Introduction
- Machine Learning
  - A form of applied statistics with extensive use of computers to estimate complicated functions
- Two central approaches
  - Frequentist estimators and Bayesian inference
- Two categories of machine learning
  - Supervised learning and unsupervised learning
- Most machine learning algorithms are based on stochastic gradient descent optimization
- Focus on building machine learning algorithms
1 Learning Algorithms
- A machine learning algorithm learns from data
- A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance P on tasks T improves with experience E
1.1 The Task, T
- Machine learning tasks are too difficult to solve with fixed programs designed by humans
- Machine learning tasks are usually described in terms of how the system should process an example, where
  - An example is a collection of features measured from an object or event
- Many kinds of tasks
  - Classification, regression, transcription, anomaly detection, imputation of missing values, etc.
1.2 The Performance Measure, P
- Performance P is usually specific to the task T
- For classification tasks, P is usually accuracy
  - The proportion of examples correctly classified
- Here we are interested in the accuracy on data not seen before – a test set
1.3 The Experience, E
- Most learning algorithms in this book are allowed to experience an entire dataset
- Unsupervised learning algorithms
  - Learn useful structural properties of the dataset
- Supervised learning algorithms
  - Experience a dataset containing features, with each example associated with a label or target
  - For example, a target provided by a teacher
- The line between supervised and unsupervised learning is blurred
1.4 Example: Linear Regression
- Linear regression takes a vector x as input and predicts the value of a scalar y as its output
- Task: predict scalar y from vector x
- Experience: set of vectors X and vector of targets y
- Performance: mean squared error => normal equations
- The solution of the normal equations is w = (X^T X)^(-1) X^T y
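As a concrete illustration (not from the slides), the normal-equations solution can be computed directly with NumPy. The toy data below is invented for this sketch:

```python
import numpy as np

# Toy data lying exactly on y = 1 + 2x (invented for this sketch)
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])   # leading column of ones absorbs the bias term
y = np.array([1.0, 3.0, 5.0])

# Normal equations: w = (X^T X)^(-1) X^T y
w = np.linalg.solve(X.T @ X, X.T @ y)

mse = np.mean((X @ w - y) ** 2)   # essentially zero on this exact data
```

Solving the linear system with `np.linalg.solve` is numerically preferable to forming the matrix inverse explicitly.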
Pseudoinverse Method (Intro to Algorithms by Cormen, et al.)
- General linear regression
- Assume polynomial functions
- Minimizing the mean squared error => normal equations
- The solution of the normal equations is c = (A^T A)^(-1) A^T y = A+ y, where A+ is the pseudoinverse of the design matrix A
Example: Pseudoinverse Method
- Simple linear regression homework example
- Find
- Given points
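A sketch of the pseudoinverse method on hypothetical sample points (the homework's actual points are not reproduced here):

```python
import numpy as np

# Hypothetical sample points lying on a quadratic (not the homework data)
x = np.array([0.0, 1.0, 2.0, 3.0])
y = x ** 2 - x + 1.0

# Design matrix for a degree-2 polynomial: columns [1, x, x^2]
A = np.vander(x, N=3, increasing=True)

# Pseudoinverse solution c = A+ y minimizes ||A c - y||^2
c = np.linalg.pinv(A) @ y   # recovers coefficients [1, -1, 1]
```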
Figure 5.1: Linear regression example – left: data points and the fitted line (y vs. x); right: optimization of w, plotting MSE(train) against w1. (Goodfellow 2016)
2 Capacity, Overfitting and Underfitting
- The central challenge in machine learning is to perform well on new, previously unseen inputs, not just on the training inputs
- We want the generalization error (test error) to be low as well as the training error
- This is what separates machine learning from optimization
2 Capacity, Overfitting and Underfitting
- How can we affect performance on the test set when we get to observe only the training set?
- Statistical learning theory provides answers
- We assume the train and test sets are generated independent and identically distributed – i.i.d.
- Thus, for a fixed model chosen independently of the data, the expected training and test errors are equal
2 Capacity, Overfitting and Underfitting
- Because the model is developed from the training set, the expected test error is usually greater than the expected training error
- The factors determining how well a machine learning algorithm performs are its ability to
  - Make the training error small
  - Make the gap between training and test error small
- These two factors correspond to the two central challenges in machine learning
  - Underfitting and overfitting
2 Capacity, Overfitting and Underfitting
- Underfitting and overfitting can be controlled somewhat by altering the model's capacity
  - Capacity = the model's ability to fit a variety of functions
  - Low-capacity models struggle to fit the training data
  - High-capacity models can overfit the training data
- One way to control the capacity of a model is by choosing its hypothesis space
  - For example, simple linear vs. quadratic regression
Figure 5.2: Underfitting, appropriate capacity, and overfitting in polynomial estimation (y vs. x in each panel). (Goodfellow 2016)
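The capacity comparison can be sketched numerically: fit polynomials of too-low, appropriate, and too-high degree to data from an assumed quadratic ground truth (all data here is invented for the sketch):

```python
import numpy as np

rng = np.random.default_rng(0)

def true_f(x):
    return 0.5 * x ** 2          # assumed quadratic ground truth

x_train = rng.uniform(-1, 1, 10)
y_train = true_f(x_train) + rng.normal(0.0, 0.05, 10)
x_test = rng.uniform(-1, 1, 200)
y_test = true_f(x_test) + rng.normal(0.0, 0.05, 200)

def errors(degree):
    """Train/test MSE for a polynomial hypothesis space of given degree."""
    c = np.polyfit(x_train, y_train, degree)
    train = np.mean((np.polyval(c, x_train) - y_train) ** 2)
    test = np.mean((np.polyval(c, x_test) - y_test) ** 2)
    return train, test

tr1, te1 = errors(1)   # underfits: a line cannot represent the curve
tr2, te2 = errors(2)   # appropriate capacity
tr9, te9 = errors(9)   # interpolates the 10 training points exactly
```

The degree-9 fit drives training error to essentially zero, but its test error on a particular draw of the data is typically worse than the degree-2 fit's.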
2 Capacity, Overfitting and Underfitting
- Model capacity can be changed in many ways – above we changed the number of parameters
- We can also change the family of functions – called the model's representational capacity
- Statistical learning theory provides various means of quantifying model capacity
  - The most well-known is the Vapnik-Chervonenkis dimension
  - The older, non-statistical principle of Occam's razor says to prefer the simplest of competing hypotheses that explain the data equally well
2 Capacity, Overfitting and Underfitting
- While simpler functions are more likely to generalize (have a small gap between training and test error), we still want low training error
- Training error usually decreases until it asymptotes to the minimum possible error
- Generalization error usually has a U-shaped curve as a function of model capacity
Figure 5.3: Generalization and capacity – training error decreases with capacity while generalization error is U-shaped; the generalization gap widens past the optimal capacity, separating the underfitting zone from the overfitting zone. (Goodfellow 2016)
2 Capacity, Overfitting and Underfitting
- Non-parametric models can reach arbitrarily high capacities
  - For example, the capacity of the nearest neighbor algorithm grows with the training set size
- Training and generalization error vary as the size of the training set varies
  - Test error decreases with increased training set size
Figure 5.4: Effect of training set size – test error (MSE) decreases toward the Bayes error as the number of training examples grows, and the optimal capacity (polynomial degree) increases with training set size. (Goodfellow 2016)
2.1 No Free Lunch Theorem
- Averaged over all possible data-generating distributions, every classification algorithm has the same error rate when classifying previously unobserved points
- In other words, no machine learning algorithm is universally better than any other
- However, by making good assumptions about the probability distributions we encounter, we can design learning algorithms that perform well on those distributions
  - This is the goal of machine learning research
2.2 Regularization
- The no free lunch theorem implies that we must design our learning algorithm to perform well on a specific task
- Thus, we build preferences into the algorithm
- Regularization is a way to do this, and weight decay is one method of regularization
2.2 Regularization
- Applying weight decay to linear regression adds a penalty λwᵀw to the training criterion
- This expresses a preference for smaller weights
- λ controls the strength of the preference, with λ = 0 imposing no preference
- The next figure shows a 9th-degree polynomial trained with various values of λ
  - The true function is quadratic
Figure 5.5: Weight decay applied to a 9th-degree polynomial – underfitting with excessive λ, an appropriate fit with medium λ, and overfitting as λ → 0 (y vs. x in each panel). (Goodfellow 2016)
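A minimal sketch of weight decay's closed-form solution, on invented data; `lam` plays the role of λ:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(0.0, 0.1, 20)

def weight_decay_fit(X, y, lam):
    """Minimize MSE + lam * w^T w; closed form w = (X^T X + lam*I)^(-1) X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

w_ols = weight_decay_fit(X, y, 0.0)     # lam = 0: no preference
w_reg = weight_decay_fit(X, y, 100.0)   # large lam: strong shrinkage
```

Increasing `lam` shrinks the weight vector toward zero, trading a little extra training error for smaller weights.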
2.2 Regularization
- More generally, we can add a penalty called a regularizer to the cost function
- Expressing preferences for one function over another is a more general way of controlling a model's capacity than including or excluding members from the hypothesis space
- Regularization is any modification we make to a learning algorithm that is intended to reduce its generalization error but not its training error
3 Hyperparameters and Validation Sets
- Most machine learning algorithms have several settings, called hyperparameters, that control the behavior of the algorithm
- Examples
  - In polynomial regression, the degree of the polynomial is a capacity hyperparameter
  - In weight decay, λ is a hyperparameter
3 Hyperparameters and Validation Sets
- To avoid overfitting during training, the training set is often split into two sets
  - Called the training set and the validation set
  - Typically, 80% of the training data is used for training and 20% for validation
- Cross-validation
  - For small datasets, k-fold cross-validation partitions the data into k non-overlapping subsets
  - k trials are run, each training on (k-1)/k of the data and testing on the remaining 1/k
  - Test error is estimated by averaging the error across the k runs
4 Estimators, Bias and Variance
- 4.1 Point Estimation
  - Prediction of a quantity – e.g., maximum likelihood
- 4.2 Bias
- 4.3 Variance and Standard Error
4 Estimators, Bias and Variance
- 4.4 Trading off Bias and Variance to Minimize Mean Squared Error
  - The most common way to negotiate this trade-off is to use cross-validation
  - Can also compare the MSE of the estimates, since MSE = bias² + variance
  - The relationship between bias and variance is tightly linked to capacity, underfitting, and overfitting
Figure 5.6: Bias and variance as a function of capacity – bias decreases and variance increases with capacity, giving a U-shaped generalization error with an optimal capacity between the underfitting and overfitting zones. (Goodfellow 2016)
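The decomposition MSE = bias² + variance can be checked numerically. The estimator below (half the sample mean, which is deliberately biased) is invented for this sketch:

```python
import numpy as np

rng = np.random.default_rng(3)
mu, n, trials = 2.0, 10, 100_000

# Estimator under study: half the sample mean of n Gaussian draws
samples = rng.normal(mu, 1.0, size=(trials, n))
estimates = 0.5 * samples.mean(axis=1)

bias = estimates.mean() - mu            # E[theta_hat] - theta, approx -1.0
variance = estimates.var()              # Var[theta_hat], approx 0.25 / n
mse = np.mean((estimates - mu) ** 2)    # matches bias**2 + variance
```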
5 Maximum Likelihood Estimation
- Covered in the Duda textbook
6 Bayesian Statistics
- Not covered in this course
7 Supervised Learning Algorithms
- 7.1 Probabilistic Supervised Learning
  - Bayes decision theory
- 7.2 Support Vector Machines
  - Also covered by Duda – maybe later in the semester
- 7.3 Other Simple Supervised Learning Algorithms
  - k-nearest neighbor algorithm
  - Decision trees – break the input space into regions
Figure 5.7: A decision tree breaks the input space into regions, each identified by the binary string of decisions taken along the path from the root. (Goodfellow 2016)
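The k-nearest neighbor rule mentioned under 7.3 fits in a few lines; the two clusters below are invented for the sketch:

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=3):
    """Classify x by majority vote among its k nearest training points."""
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest_labels = y_train[np.argsort(dists)[:k]]
    labels, counts = np.unique(nearest_labels, return_counts=True)
    return labels[np.argmax(counts)]

# Two invented clusters with labels 0 and 1
X = np.array([[0.0, 0.0], [0.1, 0.1], [0.0, 0.2],
              [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]])
y = np.array([0, 0, 0, 1, 1, 1])
```

Note there is no training step at all: the training set itself is the model, which is why this algorithm's capacity grows with the training set size.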
8 Unsupervised Learning Algorithms
- Algorithms that learn from unlabeled examples
- 8.1 Principal Component Analysis (PCA)
  - Covered by Duda – later in the semester
- 8.2 k-means Clustering
  - Covered by Duda – later in the semester
Figure 5.8: Principal components analysis – the original data x (left) is rotated into a representation z (right) whose coordinates are decorrelated and aligned with the directions of greatest variance. (Goodfellow 2016)
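The rotation in Figure 5.8 can be sketched via the eigendecomposition of the data covariance matrix; the correlated 2-D data below is invented for the sketch:

```python
import numpy as np

rng = np.random.default_rng(4)
# Invented correlated 2-D data, in the spirit of Figure 5.8's left panel
X = rng.normal(size=(500, 2)) @ np.array([[3.0, 1.0], [1.0, 2.0]])
X = X - X.mean(axis=0)

# Principal components are the eigenvectors of the covariance matrix
cov = X.T @ X / len(X)
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]      # largest-variance direction first
W = eigvecs[:, order]

Z = X @ W   # rotated representation with decorrelated coordinates
```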
9 Stochastic Gradient Descent
- Nearly all of deep learning is powered by stochastic gradient descent (SGD)
- SGD is an extension of gradient descent, introduced in Section 4.3
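A minimal single-example SGD loop on invented linear-regression data (the learning rate and epoch count are hand-chosen for the sketch):

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(200, 2))
y = X @ np.array([2.0, -1.0]) + rng.normal(0.0, 0.01, 200)

w = np.zeros(2)
lr = 0.02                                     # hand-chosen learning rate
for epoch in range(100):
    for i in rng.permutation(len(y)):         # visit examples in random order
        grad = 2.0 * (X[i] @ w - y[i]) * X[i] # gradient of one squared error
        w -= lr * grad
# w approaches the generating weights [2.0, -1.0]
```

Each update uses the gradient of a single example's loss rather than the full training loss, which is what makes the method "stochastic" and cheap per step.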
10 Building a Machine Learning Algorithm
- Deep learning algorithms use a simple recipe
  - Combine a specification of a dataset, a cost function, an optimization procedure, and a model
11 Challenges Motivating Deep Learning
- The simple machine learning algorithms work well on many important problems
- But they don't solve the central problems of AI
  - Recognizing objects, etc.
- Deep learning was motivated by this failure
11.1 Curse of Dimensionality
- Machine learning problems become exceedingly difficult when the number of dimensions in the data is high
- This phenomenon is known as the curse of dimensionality
- The next figure shows
  - 10 regions of interest in 1-D
  - 100 regions in 2-D
  - 1000 regions in 3-D
Figure 5.9: Curse of dimensionality. (Goodfellow 2016)
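The region counts above follow from simple exponentiation – with v distinguishable values per axis, covering d dimensions takes v^d cells:

```python
def cells_to_cover(values_per_axis, d):
    """Number of grid cells needed to distinguish configurations in d dims."""
    return values_per_axis ** d

# 10 regions in 1-D, 100 in 2-D, 1000 in 3-D, as in Figure 5.9
counts = [cells_to_cover(10, d) for d in (1, 2, 3)]
```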
11.2 Local Constancy and Smoothness Regularization
- To generalize well, machine learning algorithms need to be guided by prior beliefs
- One of the most widely used "priors" is the smoothness prior, or local constancy prior
  - The function learned should not change very much within a small region
- To distinguish O(k) regions in input space requires O(k) examples
  - For the k-NN algorithm, each training example defines at most one region
Figure 5.10: Nearest neighbor. (Goodfellow 2016)
11.3 Manifold Learning
- A manifold is a connected region of points
- Manifold learning algorithms reduce the space of interest
  - For example, variation across n-dimensional Euclidean space is reduced by assuming that most of the space consists of invalid inputs
- Probability distributions over images, text strings, and sounds that occur in real life are highly concentrated
- Figure 5.11 shows data concentrated near a manifold, Figure 5.12 shows uniformly sampled (noise-like) images, and Figure 5.13 shows the manifold structure of a dataset of faces
Figure 5.11: Manifold learning – data sampled in two dimensions that is concentrated near a one-dimensional manifold. (Goodfellow 2016)
Figure 5.12: Uniformly sampled images. (Goodfellow 2016)
Figure 5.13: QMUL dataset. (Goodfellow 2016)