Semi-Stochastic Gradient Descent Methods Jakub Konečný (joint work with Peter Richtárik) University of Edinburgh
Introduction
Large scale problem setting
- Problems are often structured
- Structure – sum of functions: minimize F(x) = (1/n) Σ_{i=1}^{n} f_i(x)
- Frequently n is BIG
- Arising in machine learning
Examples
- Linear regression (least squares): f_i(x) = ½ (a_i^T x − b_i)^2
- Logistic regression (classification): f_i(x) = log(1 + exp(−b_i a_i^T x))
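A minimal sketch of these two component functions and their gradients; the data names a_i (feature row) and b_i (label) are assumptions for illustration, not taken from the slides.

```python
import numpy as np

def least_squares_fi(x, a_i, b_i):
    """f_i(x) = 1/2 (a_i^T x - b_i)^2"""
    r = a_i @ x - b_i
    return 0.5 * r ** 2

def least_squares_grad_fi(x, a_i, b_i):
    return (a_i @ x - b_i) * a_i

def logistic_fi(x, a_i, b_i):
    """f_i(x) = log(1 + exp(-b_i a_i^T x)), labels b_i in {-1, +1}"""
    return np.log1p(np.exp(-b_i * (a_i @ x)))

def logistic_grad_fi(x, a_i, b_i):
    z = -b_i * (a_i @ x)
    return (-b_i * a_i) / (1.0 + np.exp(-z))
```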
Assumptions
- Lipschitz continuity of the derivative of each f_i: ‖∇f_i(x) − ∇f_i(y)‖ ≤ L‖x − y‖
- Strong convexity of F: F(y) ≥ F(x) + ∇F(x)^T (y − x) + (μ/2)‖y − x‖^2
Gradient Descent (GD)
- Update rule: x_{k+1} = x_k − h ∇F(x_k)
- Fast convergence rate
- Alternatively, for accuracy ε we need O(κ log(1/ε)) iterations, where κ = L/μ
- Complexity of a single iteration – n (measured in gradient evaluations)
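A minimal sketch of the GD update above; the name full_grad is an assumption and is meant to return ∇F(x), i.e. the average of all n component gradients.

```python
# Gradient descent: x_{k+1} = x_k - h * grad F(x_k).
# Each call to full_grad costs n component-gradient evaluations.
def gradient_descent(x0, full_grad, h, num_iters):
    x = x0.copy()
    for _ in range(num_iters):
        x = x - h * full_grad(x)  # one full pass over the data per iteration
    return x
```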
Stochastic Gradient Descent (SGD)
- Update rule: x_{k+1} = x_k − h_k ∇f_i(x_k), with i picked uniformly at random and h_k a step-size parameter
- Why it works: E[∇f_i(x)] = ∇F(x), so each step is an unbiased estimate of the GD step
- Slow convergence: O(1/ε) iterations for accuracy ε
- Complexity of a single iteration – 1 (measured in gradient evaluations)
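A minimal sketch of the SGD loop; grad_fi(x, i) is assumed to return ∇f_i(x), and the decreasing schedule h_k = h0 / (k + 1) is one common choice rather than the one from the slides.

```python
import numpy as np

def sgd(x0, grad_fi, n, h0, num_iters, seed=0):
    rng = np.random.default_rng(seed)
    x = x0.copy()
    for k in range(num_iters):
        i = rng.integers(n)            # pick a component uniformly at random
        h_k = h0 / (k + 1)             # step-size parameter
        x = x - h_k * grad_fi(x, i)    # a single gradient evaluation per step
    return x
```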
Goal
- GD: fast convergence, but n gradient evaluations in each iteration
- SGD: slow convergence, but complexity of each iteration independent of n
- Combine the strengths of both in a single algorithm
Semi-Stochastic Gradient Descent (S2GD)
Intuition
- The gradient does not change drastically between nearby points
- We could reuse the information from an “old” gradient
Modifying the “old” gradient
- Imagine someone gives us a “good” point x̃ and the gradient ∇F(x̃)
- The gradient at a point x, near x̃, can be expressed as
  ∇F(x) = ∇F(x̃) + (∇F(x) − ∇F(x̃))
  (already computed gradient + gradient change, which we can try to estimate)
- Approximation of the gradient using a single component i:
  ∇F(x) ≈ ∇F(x̃) + ∇f_i(x) − ∇f_i(x̃)
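A one-line sketch of this estimator; x_tilde, g_tilde and grad_fi are assumed names for the “good” point, its stored full gradient and the component gradients.

```python
# Variance-reduced direction: already computed full gradient at x_tilde,
# plus an estimate of the gradient change based on a single component i.
def s2gd_direction(x, x_tilde, g_tilde, grad_fi, i):
    return g_tilde + grad_fi(x, i) - grad_fi(x_tilde, i)
```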
The S2GD Algorithm
- Simplification: the size of the inner loop is random, following a geometric rule
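A hedged sketch of the outer/inner loop structure described on this slide, assuming component gradients grad_fi(x, i); the distribution of the inner-loop size and the parameter names (h, m, nu) only approximate the rule in the S2GD paper.

```python
import numpy as np

def draw_inner_length(m, h, nu, rng):
    # Geometric-style rule for the random inner-loop size: probabilities
    # proportional to (1 - nu*h)^(m - t) for t = 1..m (an approximation;
    # nu is a lower bound on the strong convexity parameter, possibly 0).
    w = (1.0 - nu * h) ** (m - np.arange(1, m + 1))
    return int(rng.choice(np.arange(1, m + 1), p=w / w.sum()))

def s2gd(x0, grad_fi, n, h, m, num_epochs, nu=0.0, seed=0):
    rng = np.random.default_rng(seed)
    x_tilde = np.asarray(x0, dtype=float).copy()
    for _ in range(num_epochs):
        # Full gradient at the reference point: n gradient evaluations.
        g_tilde = sum(grad_fi(x_tilde, i) for i in range(n)) / n
        t = draw_inner_length(m, h, nu, rng)
        x = x_tilde.copy()
        for _ in range(t):
            i = rng.integers(n)
            # Cheap inner step using the variance-reduced direction.
            x = x - h * (g_tilde + grad_fi(x, i) - grad_fi(x_tilde, i))
        x_tilde = x  # the next epoch starts from the last inner iterate
    return x_tilde
```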
Theorem – S2GD converges linearly in expectation; the convergence factor consists of two terms (next slide)
Convergence rate
- For any fixed stepsize h, the first term can be made arbitrarily small by increasing the inner loop size m
- The second term can be made arbitrarily small by decreasing h
- How to set the parameters h, m and the number of epochs?
Setting the parameters
- Fix target accuracy ε
- The accuracy is achieved by setting: # of epochs ≈ log(1/ε); stepsize h ∝ 1/L; # of inner iterations m ∝ κ = L/μ
- Total complexity (in gradient evaluations): (# of epochs) × (full gradient evaluation, n + cheap iterations, m) = O((n + κ) log(1/ε))
Complexity
- S2GD complexity: O((n + κ) log(1/ε)) gradient evaluations in total
- GD complexity: total iterations O(κ log(1/ε)) × complexity of a single iteration n = O(nκ log(1/ε))
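An illustrative back-of-the-envelope comparison of the two total-work estimates, ignoring constants; the values of n, κ and ε are made up for the example.

```python
import math

n, kappa, eps = 10**6, 10**3, 1e-6
log_factor = math.log(1 / eps)

gd_work = n * kappa * log_factor       # O(n * kappa * log(1/eps))
s2gd_work = (n + kappa) * log_factor   # O((n + kappa) * log(1/eps))

print(f"GD   ~ {gd_work:.2e} gradient evaluations")
print(f"S2GD ~ {s2gd_work:.2e} gradient evaluations")
```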
Related Methods
- SAG – Stochastic Average Gradient (Mark Schmidt, Nicolas Le Roux, Francis Bach, 2013)
  - Refreshes a single stochastic gradient in each iteration
  - Needs to store n gradients
  - Similar convergence rate
  - Cumbersome analysis
- MISO – Minimization by Incremental Surrogate Optimization (Julien Mairal, 2014)
  - Similar to SAG, slightly worse performance
  - Elegant analysis
Related Methods
- SVRG – Stochastic Variance Reduced Gradient (Rie Johnson, Tong Zhang, 2013)
  - Arises as a special case of S2GD
- Prox-SVRG (Tong Zhang, Lin Xiao, 2014)
  - Extended to the proximal setting
- EMGD – Epoch Mixed Gradient Descent (Lijun Zhang, Mehrdad Mahdavi, Rong Jin, 2013)
  - Handles simple constraints
  - Worse convergence rate
Experiment
- Example problem, with …