Gradient Descent

Review: Gradient Descent
• In step 3, we have to solve the following optimization problem: θ* = arg min_θ L(θ), where L is the loss function and θ is the set of parameters.
• Suppose that θ has two variables {θ1, θ2}.
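As a concrete illustration (not from the slides), here is a minimal NumPy sketch of the update rule θ^{t+1} = θ^t − η∇L(θ^t) on a toy two-variable quadratic loss; the loss function, learning rate, and iteration count are all illustrative assumptions.

```python
import numpy as np

# Toy loss L(theta) = (theta1 - 3)^2 + 10 * (theta2 + 1)^2 (illustrative, not from the slides)
def grad_L(theta):
    return np.array([2 * (theta[0] - 3), 20 * (theta[1] + 1)])

eta = 0.05                    # learning rate
theta = np.array([0.0, 0.0])  # initial parameters {theta1, theta2}
for t in range(100):
    theta = theta - eta * grad_L(theta)   # theta^{t+1} = theta^t - eta * gradient
print(theta)                  # approaches the minimum at (3, -1)
```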

Review: Gradient Descent
• Gradient: the direction normal to the contour lines of the loss.
• Movement: each update moves the parameters in the direction opposite to the gradient.

Gradient Descent Tip 1: Tuning your learning rates

Learning Rate
• Set the learning rate η carefully.
[Figure: loss against the number of parameter updates for different learning rates: very large (loss blows up), large (loss gets stuck high), small (loss decreases very slowly), just right (loss decreases to a good value).]
• If there are more than three parameters, you cannot visualize the loss surface, but you can always visualize the loss against the number of parameter updates.
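A minimal sketch of this diagnostic, assuming a toy one-dimensional loss L(w) = w²: track the loss after every update and plot it against the number of updates for several learning rates. The specific η values and curve labels are illustrative.

```python
import matplotlib.pyplot as plt

# Toy 1-D loss L(w) = w^2; compare learning rates by plotting
# loss vs. number of updates (the curve you can always visualize).
for eta, label in [(0.01, "small"), (0.4, "just right"), (1.05, "too large")]:
    w, history = 5.0, []
    for _ in range(30):
        w = w - eta * 2 * w        # gradient of w^2 is 2w
        history.append(w ** 2)     # record the loss after each update
    plt.plot(history, label=f"eta={eta} ({label})")
plt.xlabel("No. of parameter updates")
plt.ylabel("Loss")
plt.legend()
plt.show()
```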

Adaptive Learning Rates
• Reduce the learning rate over time, e.g. 1/t decay: η^t = η/√(t+1).
• A single learning rate does not fit every parameter: give different parameters different learning rates.

Adagrad
• Divide the learning rate of each parameter by the root mean square of its previous derivatives (w below denotes a single parameter).
• Vanilla gradient descent: w^{t+1} ← w^t − η^t g^t
• Adagrad: w^{t+1} ← w^t − (η^t / σ^t) g^t, where σ^t is the root mean square of the previous derivatives of w, so the effective learning rate is parameter dependent.
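A minimal sketch of the Adagrad update for a single parameter, assuming the simplified form w^{t+1} = w^t − η g^t / √(Σ_i (g^i)²); the toy loss, learning rate, and the small ε added for numerical stability are my assumptions, not from the slides.

```python
import numpy as np

# Adagrad for a single parameter w: divide the learning rate by the root
# of the accumulated squared past gradients.
def adagrad_step(w, grad, grad_sq_sum, eta=0.1, eps=1e-8):
    grad_sq_sum += grad ** 2                            # accumulate (g^i)^2
    w_new = w - eta / (np.sqrt(grad_sq_sum) + eps) * grad
    return w_new, grad_sq_sum

# Toy usage on L(w) = w^2 (illustrative): w slowly moves toward the minimum at 0
w, acc = 5.0, 0.0
for t in range(50):
    g = 2 * w                                           # dL/dw
    w, acc = adagrad_step(w, g, acc)
print(w)
```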

Adagrad
• Divide the learning rate of each parameter by the root mean square of its previous derivatives.
• Combined with a 1/t-decayed learning rate η^t, the √(t+1) factors in η^t and σ^t cancel; see the reconstruction below.
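The slide's equations did not survive extraction; the following is a reconstruction of the standard Adagrad formulation with a 1/t-decayed learning rate, showing how η^t and σ^t combine:

```latex
% Adagrad with a 1/t-decayed learning rate (reconstruction of the slide's equations)
\[
\eta^{t} = \frac{\eta}{\sqrt{t+1}}, \qquad
\sigma^{t} = \sqrt{\frac{1}{t+1}\sum_{i=0}^{t}\left(g^{i}\right)^{2}}, \qquad
w^{t+1} \leftarrow w^{t} - \frac{\eta^{t}}{\sigma^{t}}\,g^{t}
        = w^{t} - \frac{\eta}{\sqrt{\sum_{i=0}^{t}\left(g^{i}\right)^{2}}}\,g^{t}
\]
```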

Contradiction?
• Vanilla gradient descent: larger gradient, larger step.
• Adagrad: larger gradient, smaller step (through the denominator).

Intuitive Reason
• Adagrad measures how surprising a gradient is compared with the previous ones (a contrast effect).
• Parameter with small past gradients: g^0 = 0.001, g^1 = 0.003, g^2 = 0.002, g^3 = 0.1, …… here 0.1 is especially large, so it produces a relatively large step.
• Parameter with large past gradients: g^0 = 10.8, g^1 = 20.9, g^2 = 31.7, g^3 = 12.1, g^4 = 0.1, …… here 0.1 is especially small, so it produces a relatively small step.
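A small numerical check of this contrast effect, using the two gradient sequences above; the effective step size g^t / √(Σ(g^i)²) (Adagrad with η = 1) is large for the "surprisingly large" 0.1 and tiny for the "surprisingly small" 0.1. The helper name is hypothetical.

```python
import numpy as np

# "How surprising" the latest gradient is: the same gradient 0.1 produces a much
# bigger Adagrad step when past gradients were tiny than when they were huge.
def adagrad_step_size(grads, eta=1.0):
    g = np.asarray(grads, dtype=float)
    return eta * abs(g[-1]) / np.sqrt(np.sum(g ** 2))   # |g^t| / sqrt(sum of (g^i)^2)

print(adagrad_step_size([0.001, 0.003, 0.002, 0.1]))     # 0.1 is unusually large -> ~1.0
print(adagrad_step_size([10.8, 20.9, 31.7, 12.1, 0.1]))  # 0.1 is unusually small -> ~0.002
```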

Larger gradient, larger steps?
• A larger first-order derivative means the point is farther from the minimum, so intuitively it should take a larger step.
• Best step: the distance from the current point to the minimum.

Larger 1st order derivative means far from the minima?
• This only holds within a single parameter; do not compare across parameters.
[Figure: two loss curves for different parameters. On the first curve, point a has a larger gradient than b and is farther from the minimum (a > b); on the second, c has a larger gradient than d (c > d). Comparing a with c across parameters, the point with the larger gradient is not necessarily farther from its minimum.]

Second Derivative
• Best step: the best step is |First derivative| / Second derivative.
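A short reconstruction of the argument behind this ratio, assuming the usual quadratic example y = ax² + bx + c used at this point in the lecture:

```latex
% Best step for a quadratic y = ax^2 + bx + c (a > 0), starting from a point x_0
\[
x_{\min} = -\frac{b}{2a}, \qquad
\bigl|x_0 - x_{\min}\bigr| = \Bigl|x_0 + \frac{b}{2a}\Bigr|
  = \frac{\bigl|2a x_0 + b\bigr|}{2a}
  = \frac{\bigl|y'(x_0)\bigr|}{y''(x_0)}
\]
```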

Larger 1st order derivative means far from the minima? (continued)
• Taking the second derivative into account fixes the cross-parameter comparison: the best step is |First derivative| / Second derivative.
[Figure: the curve for the first parameter (points a, b; a > b) has a smaller second derivative; the curve for the second parameter (points c, d; c > d) has a larger second derivative. Dividing by the second derivative makes step sizes comparable across parameters.]

The best step is |First derivative| / Second derivative?
• Computing the second derivative directly adds cost, so use the first derivatives to estimate it: in a region with a larger second derivative, the sampled first derivatives tend to be larger as well.
• This is what Adagrad's denominator, the root of the summed squared gradients, approximates.

Gradient Descent Tip 2: Stochastic Gradient Descent (make the training faster)

Stochastic Gradient Descent
• Gradient descent: the loss is the summation over all training examples, and the parameters are updated with the gradient of this total loss.
• Stochastic gradient descent: pick one example x^n, use the loss for only that example, and update immediately. Faster!
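A minimal sketch contrasting the two update schemes on a toy linear-regression problem; the data, model, and learning rate are illustrative assumptions, not from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))                    # 20 examples, 2 features (toy data)
y = X @ np.array([1.0, -2.0]) + 0.1 * rng.normal(size=20)

def grad_one(w, x_n, y_n):
    # gradient of the squared loss for a single example (w.x_n - y_n)^2
    return 2 * (x_n @ w - y_n) * x_n

w, eta = np.zeros(2), 0.05

# Gradient descent: ONE update using the gradient summed over ALL examples
w_gd = w - eta * sum(grad_one(w, X[n], y[n]) for n in range(len(X)))

# Stochastic gradient descent: update after EACH example (20 updates per pass)
w_sgd = w.copy()
for n in range(len(X)):
    w_sgd = w_sgd - eta * grad_one(w_sgd, X[n], y[n])
```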

• Demo

Stochastic Gradient Descent
• Gradient descent: see all examples, then update once.
• Stochastic gradient descent: see only one example and update for each example. If there are 20 examples, that is 20 updates per pass over the data instead of one, i.e. 20 times faster.

Gradient Descent Tip 3: Feature Scaling

Feature Scaling
• Make different features have the same scaling. (Source of figure: http://cs231n.github.io/neural-networks-2/)

Feature Scaling
[Figure: one input feature takes values like 1, 2, …… while another takes values like 100, 200, ……, giving elongated contours of the loss L; after scaling the second feature to 1, 2, ……, the contours become more circular and easier for gradient descent.]

Feature Scaling
• For each dimension i: compute the mean m_i and the standard deviation σ_i over all examples, then set x_i^r ← (x_i^r − m_i) / σ_i.
• Afterwards the means of all dimensions are 0, and the variances are all 1.
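A minimal sketch of this standardization step, assuming the data are stored as a NumPy array with one row per example; function and variable names are mine.

```python
import numpy as np

def feature_scale(X):
    """Standardize each dimension i: x_i <- (x_i - m_i) / sigma_i."""
    m = X.mean(axis=0)       # mean of each dimension over all examples
    sigma = X.std(axis=0)    # standard deviation of each dimension
    return (X - m) / sigma

# Toy data with mismatched scales (illustrative)
X = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
X_scaled = feature_scale(X)
print(X_scaled.mean(axis=0), X_scaled.var(axis=0))  # ~[0, 0] and [1, 1]
```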

Gradient Descent Theory

Question
• When solving θ* = arg min_θ L(θ) by gradient descent, each time we update the parameters we obtain a θ that makes L(θ) smaller. Is this statement correct?

Warning of Math

Formal Derivation
• Suppose that θ has two variables {θ1, θ2}.
• Given a point, we can easily find the point with the smallest value of L(θ) nearby. How?

Taylor Series
• Taylor series: let h(x) be any function infinitely differentiable around x = x0; it can be written as the series below.
• When x is close to x0, the series can be truncated after the first-order term.
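The series itself did not survive extraction; for reference, the standard single-variable Taylor series and its first-order truncation are:

```latex
% Taylor series of h(x) around x_0, and its first-order truncation
\[
h(x) = \sum_{k=0}^{\infty}\frac{h^{(k)}(x_0)}{k!}(x-x_0)^{k}
     = h(x_0) + h'(x_0)(x-x_0) + \frac{h''(x_0)}{2!}(x-x_0)^{2} + \cdots
\]
\[
\text{When } x \text{ is close to } x_0:\quad h(x) \approx h(x_0) + h'(x_0)(x-x_0)
\]
```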

E.g. Taylor series for h(x) = sin(x) around x0 = π/4
• sin(x) = …… (reconstructed below). The approximation is good around π/4.
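A reconstruction of the elided expansion, assuming the standard Taylor series of sin(x) around x0 = π/4:

```latex
% Taylor series of sin(x) around x_0 = pi/4
\[
\sin(x) = \sin\tfrac{\pi}{4}
        + \cos\tfrac{\pi}{4}\,\bigl(x-\tfrac{\pi}{4}\bigr)
        - \frac{\sin\frac{\pi}{4}}{2!}\bigl(x-\tfrac{\pi}{4}\bigr)^{2}
        - \frac{\cos\frac{\pi}{4}}{3!}\bigl(x-\tfrac{\pi}{4}\bigr)^{3} + \cdots
\]
```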

Multivariable Taylor Series
• h(x, y) = h(x0, y0) + ∂h/∂x|_(x0, y0) (x − x0) + ∂h/∂y|_(x0, y0) (y − y0) + something related to (x − x0)² and (y − y0)² + ……
• When x and y are close to x0 and y0, the terms beyond first order can be dropped.

Back to Formal Derivation
• Based on the Taylor series: if the red circle centered at the current point (a, b) is small enough, then inside the red circle
  L(θ) ≈ L(a, b) + ∂L/∂θ1(a, b)·(θ1 − a) + ∂L/∂θ2(a, b)·(θ2 − b).

Back to Formal Derivation (continued)
• Write s = L(a, b) (a constant), u = ∂L/∂θ1(a, b), v = ∂L/∂θ2(a, b), so that inside the red circle L(θ) ≈ s + u(θ1 − a) + v(θ2 − b).
• Find θ1 and θ2 in the red circle of radius d that minimize L(θ). Simple, right?

Gradient descent – two variables
• Red circle (if the radius d is small): (θ1 − a)² + (θ2 − b)² ≤ d².
• Find θ1 and θ2 in the red circle minimizing L(θ): to minimize s + u(θ1 − a) + v(θ2 − b), choose (θ1 − a, θ2 − b) opposite to the vector (u, v), with a length reaching the circle's boundary.

Back to Formal Derivation
• Based on the Taylor series: if the red circle is small enough, minimizing the approximation (a constant plus linear terms) inside the circle gives exactly the gradient descent update. This is gradient descent.
• The approximation, and hence the guarantee that each update decreases L(θ), is not satisfied if the red circle (i.e. the learning rate) is not small enough.
• You can also consider the second-order term, e.g. Newton's method.
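A reconstruction of the derivation sketched on these slides, using the notation s = L(a, b), u = ∂L/∂θ1(a, b), v = ∂L/∂θ2(a, b) introduced above:

```latex
% With s = L(a,b), u = \partial L/\partial\theta_1 |_{(a,b)}, v = \partial L/\partial\theta_2 |_{(a,b)}:
\[
L(\theta) \approx s + u\,(\theta_1 - a) + v\,(\theta_2 - b)
\]
% Minimizing this inside the red circle (\theta_1-a)^2 + (\theta_2-b)^2 \le d^2 means pointing
% (\theta_1-a,\ \theta_2-b) opposite to (u,v) and stepping to the boundary, i.e. for some \eta \propto d:
\[
\begin{bmatrix}\theta_1\\ \theta_2\end{bmatrix}
= \begin{bmatrix}a\\ b\end{bmatrix} - \eta\begin{bmatrix}u\\ v\end{bmatrix}
= \begin{bmatrix}a\\ b\end{bmatrix}
  - \eta\begin{bmatrix}\partial L/\partial\theta_1\\ \partial L/\partial\theta_2\end{bmatrix}
\]
```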

End of Warning

More Limitations of Gradient Descent
[Figure: loss against the value of the parameter w, showing that gradient descent can be very slow at a plateau, get stuck at a saddle point, or get stuck at a local minimum.]