Gradient Descent
Review: Gradient Descent
• In step 3, we have to solve the following optimization problem:
  $\theta^* = \arg\min_\theta L(\theta)$, where $L$ is the loss function.
• Suppose that $\theta$ has two variables $\{\theta_1, \theta_2\}$. Starting from $\theta^0$, repeatedly update
  $\theta^{t+1} = \theta^t - \eta \nabla L(\theta^t)$, where $\nabla L(\theta) = \begin{bmatrix} \partial L/\partial\theta_1 \\ \partial L/\partial\theta_2 \end{bmatrix}$ and $\eta$ is the learning rate.
Review: Gradient Descent
• Gradient: the direction normal to the contour lines of the loss (the gradient is perpendicular to the loss contours, pointing uphill).
• Movement: each update moves in the direction opposite to the gradient.
[Figure: loss contours with gradient arrows normal to the contours and movement arrows opposite to them.]
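A minimal Python sketch of the update rule reviewed above. The quadratic loss, starting point, and learning rate are illustrative assumptions, not values from the lecture:

```python
import numpy as np

def grad_L(theta):
    # Gradient of a toy loss L(θ) = (θ1 - 1)^2 + (θ2 + 2)^2 (assumed for illustration).
    return np.array([2 * (theta[0] - 1), 2 * (theta[1] + 2)])

theta = np.array([0.0, 0.0])  # initial point θ^0
eta = 0.1                     # learning rate η

for t in range(100):
    theta = theta - eta * grad_L(theta)  # θ^{t+1} = θ^t - η ∇L(θ^t)

print(theta)  # approaches the minimum at (1, -2)
```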
Gradient Descent Tip 1: Tuning your learning rates
Learning Rate
• Set the learning rate $\eta$ carefully:
  – If $\eta$ is small, the loss decreases, but very slowly.
  – If $\eta$ is large, the loss oscillates and may fail to decrease.
  – If $\eta$ is very large, the loss can blow up.
• If there are more than three parameters, you cannot visualize the loss surface itself. But you can always visualize the loss as a function of the number of parameter updates, and use that curve to judge whether $\eta$ is set well.
[Figure: loss vs. number of updates for very large, large, small, and well-chosen learning rates.]
Adaptive Learning Rates
• A popular and simple idea: reduce the learning rate over time, e.g. $1/t$ decay $\eta^t = \eta / \sqrt{t+1}$. At the start we are far from the destination, so larger steps help; after several epochs we are close, so the learning rate should shrink.
• One global learning rate cannot suit every parameter: give each parameter its own learning rate.
Adagrad
• Divide the learning rate of each parameter by the root mean square of its previous derivatives.
• Vanilla gradient descent ($w$ is one parameter):
  $w^{t+1} = w^t - \eta^t g^t$, where $g^t = \partial L(\theta^t)/\partial w$ and $\eta^t = \eta/\sqrt{t+1}$ ($1/t$ decay).
• Adagrad (parameter dependent):
  $w^{t+1} = w^t - \dfrac{\eta^t}{\sigma^t} g^t$,
  where $\sigma^t$ is the root mean square of the previous derivatives of $w$:
  $\sigma^t = \sqrt{\dfrac{1}{t+1} \sum_{i=0}^{t} (g^i)^2}$.
• With the $1/t$ decay, the $\sqrt{t+1}$ factors in $\eta^t$ and $\sigma^t$ cancel, giving the simplified update
  $w^{t+1} = w^t - \dfrac{\eta}{\sqrt{\sum_{i=0}^{t} (g^i)^2}}\, g^t$.
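A minimal sketch of the simplified Adagrad update above for a single parameter. The toy loss $(w - 3)^2$, initial value, and step count are illustrative assumptions:

```python
import numpy as np

def grad(w):
    return 2 * (w - 3)  # derivative of the toy loss (w - 3)^2

w = 0.0
eta = 1.0
sum_sq_grad = 0.0  # accumulates (g^0)^2 + ... + (g^t)^2

for t in range(100):
    g = grad(w)
    sum_sq_grad += g ** 2
    w -= eta / np.sqrt(sum_sq_grad) * g  # w^{t+1} = w^t - η / sqrt(Σ (g^i)^2) · g^t

print(w)  # approaches the minimum at w = 3
```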
Contradiction?
• Vanilla gradient descent: larger gradient, larger step.
• Adagrad: the gradient $g^t$ in the numerator favors a larger step, but larger gradients accumulated in the denominator $\sigma^t$ favor a smaller step. Larger gradient, smaller step? How can both be right?
Intuitive Reason
• Adagrad emphasizes how surprising a gradient is: the contrast between the current gradient and the magnitude of past gradients.
• One parameter sees gradients $g^0 = 0.001$, $g^1 = 0.003$, $g^2 = 0.002$, $g^3 = 0.1$, …: here $g^3$ is especially large relative to its history, so dividing by the small root mean square amplifies the step.
• Another parameter sees $g^0 = 10.8$, $g^1 = 20.9$, $g^2 = 31.7$, $g^3 = 12.1$, $g^4 = 0.1$, …: here $g^4$ is especially small relative to its history, and the same division produces the contrast effect in the other direction.
Larger gradient, larger steps?
• Within a single parameter, a larger first-order derivative does mean the point is farther from the minimum.
• For a quadratic $y = ax^2 + bx + c$, the minimum is at $x_{\min} = -\dfrac{b}{2a}$, so the best step from a point $x_0$ is the distance $\left|x_0 + \dfrac{b}{2a}\right|$.
Larger 1st-order derivative means far from the minima?
• Only within one parameter. Do not compare across parameters.
• Comparison between different parameters: along $w_1$, point $a$ has a larger gradient than point $b$ ($a > b$) and is indeed farther from the minimum; likewise along $w_2$, $c > d$. But comparing $a$ (on $w_1$) with $c$ (on $w_2$), the larger first derivative does not imply the larger distance to the minimum.
[Figure: loss contours with cross-sections along $w_1$ (points $a$, $b$) and $w_2$ (points $c$, $d$).]
Second Derivative
• The best step $\left|x_0 + \dfrac{b}{2a}\right|$ can be rewritten as $\dfrac{|2a x_0 + b|}{2a}$, i.e.
  best step $= \dfrac{|\text{first derivative}|}{\text{second derivative}}$.
• The numerator is the first derivative at $x_0$ and the denominator is the second derivative, so the right step size also accounts for curvature.
• Taking the second derivative into account repairs the cross-parameter comparison: the best step is $|\text{first derivative}| / \text{second derivative}$.
• Along $w_1$ ($a > b$) the second derivative is smaller (a flatter direction); along $w_2$ ($c > d$) it is larger (a sharper direction). Dividing the first derivative by the second makes steps on different parameters comparable.
[Figure: the same contours, with the $w_1$ cross-section showing a smaller second derivative and the $w_2$ cross-section a larger one.]
• The best step is $\dfrac{|\text{first derivative}|}{\text{second derivative}}$, but computing second derivatives can be costly.
• Adagrad uses first derivatives to estimate the second derivative: in a direction with a larger second derivative, the first derivatives sampled over many updates are larger on average, so $\sqrt{\sum_i (g^i)^2}$ reflects the curvature. The Adagrad denominator therefore plays the role of the second derivative in the best-step formula.
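A quick numeric check of the best-step formula: for a quadratic, a downhill step of size $|f'(x_0)| / f''(x_0)$ lands exactly on the minimum in one move. The coefficients and starting point below are arbitrary illustrative values:

```python
# Quadratic f(x) = a x^2 + b x + c with minimum at -b / (2a).
a, b, c = 2.0, -8.0, 1.0
x0 = 5.0

first = 2 * a * x0 + b    # f'(x0) = 12.0
second = 2 * a            # f''(x0) = 4.0 (constant for a quadratic)

x1 = x0 - first / second  # signed step of size |f'| / f'' in the downhill direction
print(x1, -b / (2 * a))   # both print 2.0: one step reaches the exact minimum
```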
Gradient Descent Tip 2: Stochastic Gradient Descent Make the training faster
Stochastic Gradient Descent
• The loss is a summation over all training examples, e.g. for regression:
  $L = \sum_n \left(\hat{y}^n - \left(b + \sum_i w_i x_i^n\right)\right)^2$.
• Gradient descent: $\theta^i = \theta^{i-1} - \eta \nabla L(\theta^{i-1})$, using the loss over all examples.
• Stochastic gradient descent: pick a single example $x^n$ and take the loss for only that example,
  $L^n = \left(\hat{y}^n - \left(b + \sum_i w_i x_i^n\right)\right)^2$,
  then update $\theta^i = \theta^{i-1} - \eta \nabla L^n(\theta^{i-1})$. Faster!
• Demo
Stochastic Gradient Descent
• Gradient descent: update after seeing all examples, so one update per pass through the data.
• Stochastic gradient descent: update for each example, after seeing only that one example. If there are 20 examples, SGD makes 20 updates in the time gradient descent makes one.
[Figure: GD takes one large, stable step per pass; SGD takes many smaller, noisier steps over the same data.]
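A sketch contrasting the two update schemes on a one-parameter linear model with squared loss. The data, learning rate, and epoch count are made-up assumptions for illustration:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 6.0, 8.0])  # generated with true w = 2
eta = 0.01

# (Batch) gradient descent: one update per pass over all examples.
w_gd = 0.0
for epoch in range(100):
    grad = np.mean(2 * (w_gd * x - y) * x)   # gradient of the loss, averaged over examples
    w_gd -= eta * grad

# Stochastic gradient descent: one update per example.
w_sgd = 0.0
for epoch in range(100):
    for xn, yn in zip(x, y):
        grad_n = 2 * (w_sgd * xn - yn) * xn  # gradient of the single-example loss L^n
        w_sgd -= eta * grad_n

print(w_gd, w_sgd)  # both approach 2.0; SGD made 4x as many updates per epoch
```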
Gradient Descent Tip 3: Feature Scaling
Feature Scaling
• Make different features have the same scaling.
(Source of figure: http://cs231n.github.io/neural-networks-2/)
Feature Scaling
• Example: $y = b + w_1 x_1 + w_2 x_2$, where $x_1$ takes values like 1, 2, … and $x_2$ takes values like 100, 200, …. Without scaling, changes in $w_2$ affect the loss far more than changes in $w_1$, so the contours of $L$ are elongated ellipses and gradient descent zig-zags. After rescaling $x_2$ to the same range as $x_1$ (1, 2, …), the contours are closer to circles and the updates head toward the minimum.
[Figure: loss contours before scaling (elliptical) and after scaling (roughly circular).]
Feature Scaling
• For each dimension $i$: given $R$ examples $x^1, \dots, x^R$, compute the mean $m_i$ and the standard deviation $\sigma_i$ of component $i$ over all examples, then replace
  $x_i^r \leftarrow \dfrac{x_i^r - m_i}{\sigma_i}$ for $r = 1, \dots, R$.
• After this, the means of all dimensions are 0, and the variances are all 1.
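A minimal NumPy sketch of this standardization. The design matrix below is a made-up example; rows are examples, columns are feature dimensions:

```python
import numpy as np

X = np.array([[1.0, 100.0],
              [2.0, 200.0],
              [3.0, 300.0]])  # 3 examples, 2 feature dimensions

means = X.mean(axis=0)  # m_i for each dimension i
stds = X.std(axis=0)    # σ_i for each dimension i
X_scaled = (X - means) / stds

print(X_scaled.mean(axis=0))  # ~[0, 0]: every dimension now has mean 0
print(X_scaled.std(axis=0))   # [1, 1]: and standard deviation 1
```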
Gradient Descent Theory
Question
• When solving $\theta^* = \arg\min_\theta L(\theta)$ by gradient descent, each parameter update yields a smaller loss: $L(\theta^0) > L(\theta^1) > L(\theta^2) > \cdots$. Is this statement correct?
Warning of Math
Formal Derivation
• Suppose that $\theta$ has two variables $\{\theta_1, \theta_2\}$.
• Given a point, we can easily find the point with the smallest value of $L(\theta)$ nearby, inside a small circle around it. How?
[Figure: contours of $L(\theta)$ with a small red circle around the current point.]
Taylor Series
• Taylor series: let $h(x)$ be any function infinitely differentiable around $x = x_0$. Then
  $h(x) = \sum_{k=0}^{\infty} \dfrac{h^{(k)}(x_0)}{k!}(x - x_0)^k = h(x_0) + h'(x_0)(x - x_0) + \dfrac{h''(x_0)}{2!}(x - x_0)^2 + \cdots$
• When $x$ is close to $x_0$: $h(x) \approx h(x_0) + h'(x_0)(x - x_0)$.
• E.g. the Taylor series for $h(x) = \sin(x)$ around $x_0 = \pi/4$:
  $\sin(x) = \sin\frac{\pi}{4} + \cos\frac{\pi}{4}\left(x - \frac{\pi}{4}\right) - \frac{\sin(\pi/4)}{2!}\left(x - \frac{\pi}{4}\right)^2 - \frac{\cos(\pi/4)}{3!}\left(x - \frac{\pi}{4}\right)^3 + \cdots$
  The approximation is good around $\pi/4$.
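A small sketch comparing $\sin(x)$ with its first-order Taylor approximation around $\pi/4$; the sample points are arbitrary, chosen only to show that the approximation degrades away from the expansion point:

```python
import numpy as np

x0 = np.pi / 4
xs = np.linspace(0.0, np.pi / 2, 5)  # a few points around x0

# First-order Taylor approximation: sin(x0) + cos(x0) * (x - x0)
approx = np.sin(x0) + np.cos(x0) * (xs - x0)

for xv, a in zip(xs, approx):
    print(f"x={xv:.3f}  sin(x)={np.sin(xv):.4f}  taylor={a:.4f}")
# The two columns nearly agree near x = π/4 and drift apart farther away.
```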
Multivariable Taylor Series
• $h(x, y) = h(x_0, y_0) + \dfrac{\partial h(x_0, y_0)}{\partial x}(x - x_0) + \dfrac{\partial h(x_0, y_0)}{\partial y}(y - y_0)$ + something related to $(x - x_0)^2$ and $(y - y_0)^2$ + …
• When $x$ and $y$ are close to $x_0$ and $y_0$:
  $h(x, y) \approx h(x_0, y_0) + \dfrac{\partial h(x_0, y_0)}{\partial x}(x - x_0) + \dfrac{\partial h(x_0, y_0)}{\partial y}(y - y_0)$.
Back to Formal Derivation
• Based on the Taylor series: if the red circle centered at $(a, b)$ is small enough, then inside the red circle
  $L(\theta) \approx s + u(\theta_1 - a) + v(\theta_2 - b)$,
  where $s = L(a, b)$ is a constant, $u = \dfrac{\partial L(a, b)}{\partial \theta_1}$, and $v = \dfrac{\partial L(a, b)}{\partial \theta_2}$.
• Find $\theta_1$ and $\theta_2$ in the red circle (radius $d$) minimizing $L(\theta)$. Simple, right?
Gradient descent – two variables
• Red circle: $(\theta_1 - a)^2 + (\theta_2 - b)^2 \le d^2$ (the approximation holds if the radius $d$ is small).
• To minimize $L(\theta) \approx s + u(\theta_1 - a) + v(\theta_2 - b)$ inside the circle, choose $(\theta_1 - a, \theta_2 - b)$ to be the vector of length $d$ pointing opposite to $(u, v)$:
  $\begin{bmatrix}\theta_1 - a \\ \theta_2 - b\end{bmatrix} = -\eta \begin{bmatrix}u \\ v\end{bmatrix}$, i.e. $\theta_1 = a - \eta u$, $\theta_2 = b - \eta v$.
Back to Formal Derivation
• Substituting $u = \partial L(a, b)/\partial\theta_1$ and $v = \partial L(a, b)/\partial\theta_2$:
  $\theta_1 = a - \eta \dfrac{\partial L(a, b)}{\partial \theta_1}$, $\theta_2 = b - \eta \dfrac{\partial L(a, b)}{\partial \theta_2}$.
  This is gradient descent.
• The first-order Taylor approximation, and with it the guarantee that each update lowers the loss, is not satisfied if the red circle (i.e. the learning rate) is not small enough. This answers the earlier question: the loss does not always decrease.
• You can also consider the second-order term, e.g. Newton's method.
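The slide only names Newton's method, so as a hedged illustration here is a minimal one-variable sketch of its update $x \leftarrow x - f'(x)/f''(x)$, applied to a toy function chosen for this example:

```python
# Toy function f(x) = x^4 - 4x^2, with minima at x = ±sqrt(2).
def f_prime(x):
    return 4 * x ** 3 - 8 * x       # f'(x)

def f_double_prime(x):
    return 12 * x ** 2 - 8          # f''(x)

x = 3.0  # arbitrary starting point
for _ in range(10):
    x -= f_prime(x) / f_double_prime(x)  # Newton step toward f'(x) = 0

print(x)  # converges to sqrt(2) ≈ 1.4142, where f'(x) = 0 and f''(x) > 0
```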
End of Warning
More Limitations of Gradient Descent
• Very slow at plateaus: the gradient is close to zero even though we are far from a minimum.
• Can get stuck at saddle points, where the gradient is exactly zero.
• Can get stuck at local minima, where the gradient is also zero.
[Figure: loss as a function of a parameter $w$, showing a plateau, a saddle point, and a local minimum.]