Semi-Stochastic Gradient Descent Methods Jakub Konečný (joint work with Peter Richtárik) University of Edinburgh
Introduction
Large scale problem setting
- Problems are often structured
- Structure – sum of functions: minimize F(x) = (1/n) Σ_{i=1}^{n} f_i(x)
- Frequently n is BIG
- Arising in machine learning
Examples
- Linear regression (least squares): f_i(x) = ½ (a_i^T x − b_i)^2
- Logistic regression (classification): f_i(x) = log(1 + exp(−b_i a_i^T x))
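A minimal sketch of these two component functions and their gradients; the data names a_i (feature row) and b_i (label) are assumptions for illustration, not taken from the slides.

```python
import numpy as np

def least_squares_fi(x, a_i, b_i):
    """f_i(x) = 1/2 (a_i^T x - b_i)^2"""
    r = a_i @ x - b_i
    return 0.5 * r ** 2

def least_squares_grad_fi(x, a_i, b_i):
    return (a_i @ x - b_i) * a_i

def logistic_fi(x, a_i, b_i):
    """f_i(x) = log(1 + exp(-b_i a_i^T x)), labels b_i in {-1, +1}"""
    return np.log1p(np.exp(-b_i * (a_i @ x)))

def logistic_grad_fi(x, a_i, b_i):
    z = -b_i * (a_i @ x)
    return (-b_i * a_i) / (1.0 + np.exp(-z))
```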
Assumptions
- Lipschitz continuity of the derivative of each f_i: ‖∇f_i(x) − ∇f_i(y)‖ ≤ L‖x − y‖
- Strong convexity of F: F(y) ≥ F(x) + ∇F(x)^T (y − x) + (μ/2)‖y − x‖^2
Gradient Descent (GD)
- Update rule: x_{k+1} = x_k − h ∇F(x_k)
- Fast convergence rate
- Alternatively, for accuracy ε we need O(κ log(1/ε)) iterations, where κ = L/μ
- Complexity of a single iteration – n (measured in gradient evaluations)
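A minimal sketch of the GD update above; the name full_grad is an assumption and is meant to return ∇F(x), i.e. the average of all n component gradients.

```python
# Gradient descent: x_{k+1} = x_k - h * grad F(x_k).
# Each call to full_grad costs n component-gradient evaluations.
def gradient_descent(x0, full_grad, h, num_iters):
    x = x0.copy()
    for _ in range(num_iters):
        x = x - h * full_grad(x)  # one full pass over the data per iteration
    return x
```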
Stochastic Gradient Descent (SGD)
- Update rule: x_{k+1} = x_k − h_k ∇f_i(x_k), with i picked uniformly at random and h_k a step-size parameter
- Why it works: E[∇f_i(x)] = ∇F(x), so each step is an unbiased estimate of the GD step
- Slow convergence: O(1/ε) iterations for accuracy ε
- Complexity of a single iteration – 1 (measured in gradient evaluations)
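A minimal sketch of the SGD loop; grad_fi(x, i) is assumed to return ∇f_i(x), and the decreasing schedule h_k = h0 / (k + 1) is one common choice rather than the one from the slides.

```python
import numpy as np

def sgd(x0, grad_fi, n, h0, num_iters, seed=0):
    rng = np.random.default_rng(seed)
    x = x0.copy()
    for k in range(num_iters):
        i = rng.integers(n)            # pick a component uniformly at random
        h_k = h0 / (k + 1)             # step-size parameter
        x = x - h_k * grad_fi(x, i)    # a single gradient evaluation per step
    return x
```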
Goal
- GD: fast convergence, but n gradient evaluations in each iteration
- SGD: slow convergence, but complexity of each iteration independent of n
- Combine the strengths of both in a single algorithm
Semi-Stochastic Gradient Descent (S2GD)
Intuition
- The gradient does not change drastically between nearby points
- We could reuse the information from an “old” gradient
Modifying the “old” gradient
- Imagine someone gives us a “good” point x̃ and the gradient ∇F(x̃)
- The gradient at a point x, near x̃, can be expressed as
  ∇F(x) = ∇F(x̃) + (∇F(x) − ∇F(x̃))
  (already computed gradient + gradient change, which we can try to estimate)
- Approximation of the gradient using a single component i:
  ∇F(x) ≈ ∇F(x̃) + ∇f_i(x) − ∇f_i(x̃)
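A one-line sketch of this estimator; x_tilde, g_tilde and grad_fi are assumed names for the “good” point, its stored full gradient and the component gradients.

```python
# Variance-reduced direction: already computed full gradient at x_tilde,
# plus an estimate of the gradient change based on a single component i.
def s2gd_direction(x, x_tilde, g_tilde, grad_fi, i):
    return g_tilde + grad_fi(x, i) - grad_fi(x_tilde, i)
```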
The S2GD Algorithm
- Simplification: the size of the inner loop is random, following a geometric rule
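A hedged sketch of the outer/inner loop structure described on this slide, assuming component gradients grad_fi(x, i); the distribution of the inner-loop size and the parameter names (h, m, nu) only approximate the rule in the S2GD paper.

```python
import numpy as np

def draw_inner_length(m, h, nu, rng):
    # Geometric-style rule for the random inner-loop size: probabilities
    # proportional to (1 - nu*h)^(m - t) for t = 1..m (an approximation;
    # nu is a lower bound on the strong convexity parameter, possibly 0).
    w = (1.0 - nu * h) ** (m - np.arange(1, m + 1))
    return int(rng.choice(np.arange(1, m + 1), p=w / w.sum()))

def s2gd(x0, grad_fi, n, h, m, num_epochs, nu=0.0, seed=0):
    rng = np.random.default_rng(seed)
    x_tilde = np.asarray(x0, dtype=float).copy()
    for _ in range(num_epochs):
        # Full gradient at the reference point: n gradient evaluations.
        g_tilde = sum(grad_fi(x_tilde, i) for i in range(n)) / n
        t = draw_inner_length(m, h, nu, rng)
        x = x_tilde.copy()
        for _ in range(t):
            i = rng.integers(n)
            # Cheap inner step using the variance-reduced direction.
            x = x - h * (g_tilde + grad_fi(x, i) - grad_fi(x_tilde, i))
        x_tilde = x  # the next epoch starts from the last inner iterate
    return x_tilde
```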
Theorem – S2GD converges linearly in expectation; the convergence factor consists of two terms (next slide)
Convergence rate
- For any fixed stepsize h, the first term can be made arbitrarily small by increasing the inner loop size m
- The second term can be made arbitrarily small by decreasing h
- How to set the parameters h, m and the number of epochs?
Setting the parameters
- Fix target accuracy ε
- The accuracy is achieved by setting: # of epochs ≈ log(1/ε); stepsize h ∝ 1/L; # of inner iterations m ∝ κ = L/μ
- Total complexity (in gradient evaluations): (# of epochs) × (full gradient evaluation, n + cheap iterations, m) = O((n + κ) log(1/ε))
Complexity
- S2GD complexity: O((n + κ) log(1/ε)) gradient evaluations in total
- GD complexity: total iterations O(κ log(1/ε)) × complexity of a single iteration n = O(nκ log(1/ε))
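An illustrative back-of-the-envelope comparison of the two total-work estimates, ignoring constants; the values of n, κ and ε are made up for the example.

```python
import math

n, kappa, eps = 10**6, 10**3, 1e-6
log_factor = math.log(1 / eps)

gd_work = n * kappa * log_factor       # O(n * kappa * log(1/eps))
s2gd_work = (n + kappa) * log_factor   # O((n + kappa) * log(1/eps))

print(f"GD   ~ {gd_work:.2e} gradient evaluations")
print(f"S2GD ~ {s2gd_work:.2e} gradient evaluations")
```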
Related Methods
- SAG – Stochastic Average Gradient (Mark Schmidt, Nicolas Le Roux, Francis Bach, 2013)
  - Refreshes a single stochastic gradient in each iteration
  - Needs to store n gradients
  - Similar convergence rate
  - Cumbersome analysis
- MISO – Minimization by Incremental Surrogate Optimization (Julien Mairal, 2014)
  - Similar to SAG, slightly worse performance
  - Elegant analysis
Related Methods
- SVRG – Stochastic Variance Reduced Gradient (Rie Johnson, Tong Zhang, 2013)
  - Arises as a special case of S2GD
- Prox-SVRG (Tong Zhang, Lin Xiao, 2014)
  - Extended to the proximal setting
- EMGD – Epoch Mixed Gradient Descent (Lijun Zhang, Mehrdad Mahdavi, Rong Jin, 2013)
  - Handles simple constraints
  - Worse convergence rate
Experiment
- Example problem, with …