CS 6501: Deep Learning for Visual Recognition – Stochastic Gradient Descent (SGD)

Today’s Class: Stochastic Gradient Descent (SGD)
• SGD Recap
• Regression vs Classification
• Generalization / Overfitting / Underfitting
• Regularization
• Momentum Updates / ADAM Updates

Our function L(w)

Our function L(w): an easy way to find the minimum (or maximum) is to find where the derivative is zero, i.e. solve dL/dw = 0 for w. For example, if L(w) = (w - 3)^2 then dL/dw = 2(w - 3), which is zero at w = 3.

Our function L(w): but solving dL/dw = 0 analytically is not easy for complex functions.

Our function L(w): even for simpler functions, setting the derivative to zero can give an equation with no closed-form solution. How do you find x?

Gradient Descent (GD) (idea)
1. Start with a random value of w (e.g. w = 12).
2. Compute the gradient (derivative) of L(w) at the current point (e.g. dL/dw = 6 at w = 12).
3. Recompute w as: w = w - lambda * (dL/dw).
Repeating steps 2 and 3 moves w toward the minimum (in the slides' example, w goes from 12 to 10 to 8).
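
A minimal Python sketch of these three steps, assuming a toy quadratic loss L(w) = (w - 3)^2 and an illustrative learning rate (neither comes from the slides):

    # Gradient descent on a toy loss L(w) = (w - 3)^2, whose minimum is at w = 3.
    # The loss, starting point, and learning rate are illustrative choices.

    def loss(w):
        return (w - 3.0) ** 2

    def grad(w):
        # dL/dw for L(w) = (w - 3)^2
        return 2.0 * (w - 3.0)

    w = 12.0    # step 1: start from some value of w
    lam = 0.1   # learning rate ("lambda" in the slides)

    for step in range(50):
        g = grad(w)          # step 2: compute the gradient at the current w
        w = w - lam * g      # step 3: move w against the gradient
        if step % 10 == 0:
            print(f"step {step:2d}  w = {w:.4f}  L(w) = {loss(w):.6f}")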

Gradient Descent (GD)
Initialize w and b randomly
for e = 0, num_epochs do
    Compute: dL/dw and dL/db over the full training set   // expensive!
    Update w: w = w - lambda * dL/dw
    Update b: b = b - lambda * dL/db
    Print: L(w, b)   // Useful to see if this is becoming smaller or not.
end

(mini-batch) Stochastic Gradient Descent (SGD)
Initialize w and b randomly
for e = 0, num_epochs do
    for b = 0, num_batches do
        Compute: dL/dw and dL/db on the current mini-batch B
        Update w: w = w - lambda * dL/dw
        Update b: b = b - lambda * dL/db
        Print: L(w, b)   // Useful to see if this is becoming smaller or not.
    end
end
For |B| = 1 this reduces to classic (single-example) stochastic gradient descent.
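
A minimal Python sketch of this mini-batch loop on a toy 1-D linear regression problem; the synthetic data, batch size, and learning rate are illustrative assumptions:

    import numpy as np

    # Toy data: y = 2x + 1 plus noise (synthetic, for illustration only).
    rng = np.random.default_rng(0)
    x = rng.uniform(-1, 1, size=200)
    y = 2.0 * x + 1.0 + 0.1 * rng.normal(size=200)

    w, b = 0.0, 0.0          # initialize w and b (here: zeros instead of random)
    lam = 0.1                # learning rate
    batch_size = 16
    num_batches = len(x) // batch_size

    for e in range(20):                        # epochs
        perm = rng.permutation(len(x))         # reshuffle each epoch
        for t in range(num_batches):           # mini-batches
            idx = perm[t * batch_size:(t + 1) * batch_size]
            pred = w * x[idx] + b
            err = pred - y[idx]
            dw = 2.0 * np.mean(err * x[idx])   # dL/dw on this mini-batch
            db = 2.0 * np.mean(err)            # dL/db on this mini-batch
            w -= lam * dw
            b -= lam * db
        print(f"epoch {e:2d}  loss = {np.mean((w * x + b - y) ** 2):.5f}")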

Regression vs Classification
Regression
• Labels are continuous variables, e.g. a distance.
• Losses: distance-based losses, e.g. the sum of distances to the true values.
• Evaluation: mean distances, correlation coefficients, etc.
Classification
• Labels are discrete variables (1 out of K categories).
• Losses: cross-entropy loss, margin losses, logistic regression (binary cross-entropy).
• Evaluation: classification accuracy, etc.
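
To make the two loss families concrete, here is a small Python sketch (with made-up predictions and labels, purely illustrative) computing a squared-error regression loss and a cross-entropy classification loss:

    import numpy as np

    # Regression: squared distance between predictions and continuous targets.
    pred = np.array([2.5, 0.7, 1.9])        # made-up predicted distances
    target = np.array([3.0, 1.0, 2.0])      # made-up true distances
    mse = np.mean((pred - target) ** 2)
    print("regression (mean squared error):", mse)

    # Classification: cross-entropy between predicted class probabilities
    # (softmax over made-up scores) and the true class index.
    scores = np.array([1.2, 0.3, -0.8])     # scores for K = 3 categories
    probs = np.exp(scores) / np.sum(np.exp(scores))
    true_class = 0
    cross_entropy = -np.log(probs[true_class])
    print("classification (cross-entropy):", cross_entropy)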

Linear Regression – 1 output, 1 input
Model: y_hat = w * x + b
Loss: L(w, b) = sum_i (y_i - (w * x_i + b))^2

Quadratic Regression
Model: y_hat = w_2 * x^2 + w_1 * x + b
Loss: L(w, b) = sum_i (y_i - y_hat_i)^2

n-polynomial Regression
Model: y_hat = w_n * x^n + ... + w_2 * x^2 + w_1 * x + b
Loss: L(w, b) = sum_i (y_i - y_hat_i)^2
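
As an illustration of the polynomial model and its squared-error loss, here is a minimal Python sketch; the degree, data points, and weight values are made up for the example:

    import numpy as np

    def poly_predict(x, weights, bias):
        """Model: y_hat = w_n x^n + ... + w_1 x + bias, with weights = [w_1, ..., w_n]."""
        powers = np.stack([x ** (k + 1) for k in range(len(weights))], axis=0)
        return weights @ powers + bias

    def squared_error_loss(x, y, weights, bias):
        """Loss: sum over training points of (y_i - y_hat_i)^2."""
        return np.sum((y - poly_predict(x, weights, bias)) ** 2)

    # Illustrative values: a degree-3 polynomial evaluated on a few points.
    x = np.array([-1.0, 0.0, 0.5, 1.0])
    y = np.array([0.2, 1.0, 1.4, 2.1])
    weights = np.array([0.5, -0.3, 0.1])   # w_1, w_2, w_3
    bias = 1.0
    print("loss:", squared_error_loss(x, y, weights, bias))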

Overfitting vs Underfitting
• Underfitting: high bias.
• Overfitting: high variance.
(Figure: Christopher M. Bishop – Pattern Recognition and Machine Learning.)

Regularization
• Large weights lead to large variance, i.e. the model fits the training data too strongly.
• Solution: minimize the loss, but also try to keep the weight values small:
      minimize  L(w, b) + alpha * R(w)
  where R(w) is a regularizer term, e.g. the L2-regularizer R(w) = sum_k w_k^2, and alpha controls its strength.

SGD with Regularization (L2)
Initialize w and b randomly
for e = 0, num_epochs do
    for b = 0, num_batches do
        Compute: dL/dw and dL/db on the current mini-batch
        Update w: w = w - lambda * (dL/dw + 2 * alpha * w)   // extra L2 term
        Update b: b = b - lambda * dL/db
        Print: L(w, b)   // Useful to see if this is becoming smaller or not.
    end
end
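
In code, the only change relative to the earlier SGD sketch is the extra L2 term in the weight gradient. A minimal sketch, assuming the same toy linear-regression data and an illustrative regularization strength alpha:

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.uniform(-1, 1, size=200)
    y = 2.0 * x + 1.0 + 0.1 * rng.normal(size=200)

    w, b = 0.0, 0.0
    lam = 0.1          # learning rate
    alpha = 1e-3       # regularization strength (assumed value)
    batch_size = 16
    num_batches = len(x) // batch_size

    for e in range(20):
        perm = rng.permutation(len(x))
        for t in range(num_batches):
            idx = perm[t * batch_size:(t + 1) * batch_size]
            err = w * x[idx] + b - y[idx]
            dw = 2.0 * np.mean(err * x[idx]) + 2.0 * alpha * w   # data term + L2 term
            db = 2.0 * np.mean(err)                              # bias left unregularized
            w -= lam * dw
            b -= lam * db
    print("learned w, b:", w, b)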

Revisiting Another Problem with SGD
In the SGD loop above, each update uses the gradient of a single (noisy) mini-batch, so successive updates can point in conflicting directions. This could lead to "unlearning" what has been learned in some previous steps of training.

Solution: Momentum Updates
Keep track of previous gradients in an accumulator variable, and use a weighted average with the current gradient.
Initialize w and b randomly; initialize the accumulators v_w = 0 and v_b = 0
for e = 0, num_epochs do
    for b = 0, num_batches do
        Compute: dL/dw and dL/db on the current mini-batch
        Update accumulators: v_w = beta * v_w + (1 - beta) * dL/dw   (and likewise for v_b)
        Update w: w = w - lambda * v_w
        Update b: b = b - lambda * v_b
        Print: L(w, b)   // Useful to see if this is becoming smaller or not.
    end
end
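
A minimal Python sketch of momentum updates on the same toy problem; the accumulator form below (an exponential moving average with coefficient beta) is one common choice consistent with the "weighted average" described above, and beta = 0.9 is an assumed value:

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.uniform(-1, 1, size=200)
    y = 2.0 * x + 1.0 + 0.1 * rng.normal(size=200)

    w, b = 0.0, 0.0
    vw, vb = 0.0, 0.0      # accumulator variables for past gradients
    lam = 0.1              # learning rate
    beta = 0.9             # weight given to the accumulated (past) gradients
    batch_size = 16
    num_batches = len(x) // batch_size

    for e in range(20):
        perm = rng.permutation(len(x))
        for t in range(num_batches):
            idx = perm[t * batch_size:(t + 1) * batch_size]
            err = w * x[idx] + b - y[idx]
            dw = 2.0 * np.mean(err * x[idx])
            db = 2.0 * np.mean(err)
            # Weighted average of the accumulated gradients and the current one.
            vw = beta * vw + (1.0 - beta) * dw
            vb = beta * vb + (1.0 - beta) * db
            w -= lam * vw
            b -= lam * vb
    print("learned w, b:", w, b)

With beta = 0 this reduces to plain SGD; larger beta averages over more past mini-batches, smoothing out updates that would otherwise conflict.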

More on Momentum: https://distill.pub/2017/momentum/

Questions?