Semi-Stochastic Gradient Descent Methods
Jakub Konečný, University of Edinburgh
ETH Zurich, November 3, 2014
Introduction
Large scale problem setting
- Problems arising in machine learning are often structured
- Structure: sum of functions, $\min_{x \in \mathbb{R}^d} f(x) = \frac{1}{n} \sum_{i=1}^{n} f_i(x)$
- Frequently $n$ is BIG
Examples
- Linear regression (least squares): $f_i(x) = \tfrac{1}{2}\left(a_i^\top x - b_i\right)^2$
- Logistic regression (classification): $f_i(x) = \log\left(1 + \exp\left(-b_i\, a_i^\top x\right)\right)$
Assumptions
- Lipschitz continuity of the derivative of each $f_i$ (with constant $L$)
- Strong convexity of $f$ (with parameter $\mu$)
Gradient Descent (GD)
- Update rule: $x_{k+1} = x_k - h\,\nabla f(x_k)$
- Fast convergence rate: linear, i.e. $f(x_k) - f(x_*) \le c^k \left(f(x_0) - f(x_*)\right)$ for some $c < 1$
- Alternatively, for accuracy $\epsilon$ we need $O(\log(1/\epsilon))$ iterations
- Complexity of a single iteration: $n$ (measured in gradient evaluations)
Stochastic Gradient Descent (SGD)
- Update rule: $x_{k+1} = x_k - h_k\,\nabla f_i(x_k)$, with $i$ sampled uniformly at random and $h_k$ a step-size parameter
- Why it works: $\mathbb{E}\left[\nabla f_i(x_k)\right] = \nabla f(x_k)$, an unbiased estimate of the gradient
- Slow convergence (sublinear)
- Complexity of a single iteration: $1$ (measured in gradient evaluations)
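As a concrete, illustrative contrast, here is a minimal numpy sketch of the two update rules on the finite-sum least-squares objective above; the data, step sizes, and iteration counts are assumptions made for the example, not values from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 5
A = rng.standard_normal((n, d))            # one example a_i per row (assumed data)
b = rng.standard_normal(n)

def grad_i(x, i):
    """Gradient of a single term f_i(x) = 0.5 * (a_i^T x - b_i)^2."""
    return (A[i] @ x - b[i]) * A[i]

def full_grad(x):
    """Gradient of f(x) = (1/n) * sum_i f_i(x); costs n component gradients."""
    return A.T @ (A @ x - b) / n

h = 0.01                                   # step size (illustrative)
x_gd = np.zeros(d)
x_sgd = np.zeros(d)

for k in range(1000):
    # GD: one step costs n gradient evaluations, converges linearly
    x_gd -= h * full_grad(x_gd)
    # SGD: one step costs a single gradient evaluation, unbiased but noisy
    i = rng.integers(n)
    x_sgd -= h * grad_i(x_sgd, i)
```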
Goal
- GD: fast convergence, but $n$ gradient evaluations in each iteration
- SGD: complexity of each iteration independent of $n$, but slow convergence
- Combine the strengths of both in a single algorithm
Semi-Stochastic Gradient Descent (S2GD)
Intuition
- The gradient does not change drastically
- We could reuse the information from an “old” gradient
Modifying the “old” gradient
- Imagine someone gives us a “good” point $y$ and the gradient $\nabla f(y)$
- The gradient at a point $x$, near $y$, can be expressed as
  $\nabla f(x) = \underbrace{\nabla f(y)}_{\text{already computed gradient}} + \underbrace{\left(\nabla f(x) - \nabla f(y)\right)}_{\text{gradient change}}$
- We can try to estimate the gradient change, giving the approximation
  $\nabla f(x) \approx \nabla f(y) + \nabla f_i(x) - \nabla f_i(y)$ for a randomly sampled $i$
The S2GD Algorithm
- Outer loop: compute the full gradient at a reference point; inner loop: cheap stochastic steps using the variance-reduced estimate above
- Simplification: the size of the inner loop is random, following a geometric rule
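A minimal sketch of this structure: an outer loop computing the full gradient at the reference point, and an inner loop of cheap steps using the variance-reduced estimate, with the inner-loop length drawn at random. The least-squares setup, the constants, and the exact form of the geometric rule are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 5
A = rng.standard_normal((n, d))            # illustrative data
b = rng.standard_normal(n)

def grad_i(x, i):
    """Gradient of f_i(x) = 0.5 * (a_i^T x - b_i)^2."""
    return (A[i] @ x - b[i]) * A[i]

def full_grad(x):
    """Gradient of f(x) = (1/n) * sum_i f_i(x)."""
    return A.T @ (A @ x - b) / n

def s2gd(y, epochs=10, h=0.05, m=200, nu=0.0):
    """S2GD sketch: one full gradient per epoch, then a random number of cheap inner steps."""
    for _ in range(epochs):
        g = full_grad(y)                   # expensive part: full gradient at the reference point y
        # inner-loop length drawn from a geometric rule; this particular form is an assumption
        t_vals = np.arange(1, m + 1)
        w = (1 - nu * h) ** (m - t_vals)
        t = rng.choice(t_vals, p=w / w.sum())
        x = y.copy()
        for _ in range(t):
            i = rng.integers(n)
            # variance-reduced estimate: old full gradient + change of one component gradient
            x -= h * (g + grad_i(x, i) - grad_i(y, i))
        y = x
    return y

x_opt = s2gd(np.zeros(d))
```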
Theorem
Convergence rate
- For any fixed stepsize $h$, the first term can be made arbitrarily small by increasing the number of inner iterations $m$
- The second term can be made arbitrarily small by decreasing $h$
- How to set the parameters?
Setting the parameters
- Fix a target accuracy $\epsilon$
- The accuracy is achieved by a particular setting of the # of epochs, the stepsize, and the # of inner iterations
- Total complexity (in gradient evaluations) = (# of epochs) $\times$ (full gradient evaluation $+$ cheap inner iterations)
Complexity
- S2GD vs. GD: total complexity = (total iterations) $\times$ (complexity of a single iteration)
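For reference, the totals behind this comparison, in gradient evaluations and up to constants (with condition number $\kappa = L/\mu$):

```latex
% total work = (number of iterations) x (gradient evaluations per iteration)
\text{GD:}\quad O\!\big(\kappa \log(1/\epsilon)\big) \times n \;=\; O\!\big(n\,\kappa \log(1/\epsilon)\big),
\qquad
\text{S2GD:}\quad O\!\big((n + \kappa)\log(1/\epsilon)\big).
```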
Related Methods
- SAG – Stochastic Average Gradient (Mark Schmidt, Nicolas Le Roux, Francis Bach, 2013)
  - Refreshes a single stochastic gradient in each iteration
  - Needs to store $n$ gradients
  - Similar convergence rate
  - Cumbersome analysis
- SAGA (Aaron Defazio, Francis Bach, Simon Lacoste-Julien, 2014)
  - Refined analysis
- MISO – Minimization by Incremental Surrogate Optimization (Julien Mairal, 2014)
  - Similar to SAG, slightly worse performance
  - Elegant analysis
Related Methods
- SVRG – Stochastic Variance Reduced Gradient (Rie Johnson, Tong Zhang, 2013)
  - Arises as a special case of S2GD
- Prox-SVRG (Tong Zhang, Lin Xiao, 2014)
  - Extended to the proximal setting
- EMGD – Epoch Mixed Gradient Descent (Lijun Zhang, Mehrdad Mahdavi, Rong Jin, 2013)
  - Handles simple constraints
  - Worse convergence rate
Experiment (logistic regression on: ijcnn, rcv, real-sim, url)
Extensions
Sparse data
- For linear/logistic regression, the gradient $\nabla f_i(x)$ copies the sparsity pattern of the example $a_i$ (SPARSE)
- But the update direction is fully DENSE
- Can we do something about it?
Sparse data
- Yes we can!
- To compute $\nabla f_i(x)$, we only need the coordinates of $x$ corresponding to the nonzero elements of $a_i$
- For each coordinate $s$, remember when it was updated last time – $\chi_s$
- Before computing $\nabla f_i(x_k)$ in inner iteration number $k$, update the required coordinates, the step being the number of iterations for which the coordinate was not updated, times the “old gradient” term
- Then compute the direction and make a single sparse update
Sparse data implementation
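A minimal sketch of this lazy-update trick for sparse least-squares data; the bookkeeping array `chi`, the data, and the constants are illustrative assumptions, not the talk's exact implementation.

```python
import numpy as np
from scipy.sparse import random as sparse_random

rng = np.random.default_rng(0)
n, d = 100, 50
A = sparse_random(n, d, density=0.05, random_state=0, format="csr")  # sparse examples a_i
b = rng.standard_normal(n)

def lazy_inner_loop(y, g, h=0.05, t=200):
    """One S2GD inner loop applying the dense 'old gradient' part lazily, per coordinate."""
    x = y.copy()
    chi = np.zeros(d, dtype=int)          # last iteration each coordinate was brought up to date
    for k in range(1, t + 1):
        i = rng.integers(n)
        idx = A.indices[A.indptr[i]:A.indptr[i + 1]]   # nonzero coordinates of example a_i
        ai = A.data[A.indptr[i]:A.indptr[i + 1]]       # corresponding nonzero values
        # catch up the needed coordinates: apply the skipped "old gradient" steps
        x[idx] -= (k - 1 - chi[idx]) * h * g[idx]
        chi[idx] = k
        # sparse variance-reduced step: grad f_i(x) - grad f_i(y) is supported on idx
        grad_change = (ai @ x[idx] - ai @ y[idx]) * ai
        x[idx] -= h * (g[idx] + grad_change)
    # final catch-up so every coordinate has received all t "old gradient" steps
    x -= (t - chi) * h * g
    return x

y = np.zeros(d)
g = A.T @ (A @ y - b) / n                  # full gradient at the reference point y
x_new = lazy_inner_loop(y, g)
```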
S2GD+
- Observing that SGD can make reasonable progress while S2GD is still computing its first full gradient (in case we are starting from an arbitrary point), we can formulate the following algorithm (S2GD+)
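A sketch of this idea, under the assumption that S2GD+ amounts to one cheap pass of SGD over the data followed by S2GD started from the resulting point; it reuses the `s2gd` sketch from earlier, and the step size is illustrative.

```python
import numpy as np

def s2gd_plus(A, b, s2gd, h=0.01, rng=np.random.default_rng(0)):
    """One SGD pass to get a reasonable starting point, then hand it to S2GD."""
    n, d = A.shape
    x = np.zeros(d)
    for i in rng.permutation(n):                 # single pass over the data
        x -= h * (A[i] @ x - b[i]) * A[i]        # SGD step on the least-squares component f_i
    return s2gd(x)                               # continue with S2GD from the warm start
```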
S2GD+ Experiment
High Probability Result
- The result above holds only in expectation
- Can we say anything about the concentration of the result in practice?
- For any $\rho \in (0,1)$ we have a bound that holds with probability at least $1-\rho$, paying just the logarithm of the probability, $\log(1/\rho)$, independent of the other parameters
Code
- Efficient implementation for logistic regression available at MLOSS: http://mloss.org/software/view/556/
mS2GD (mini-batch S2GD)
- How does mini-batching influence the algorithm?
- Replace the single stochastic gradient by a mini-batch average
- Provides a two-fold speedup:
  - Provably fewer gradient evaluations are needed (up to a certain number of mini-batches)
  - Easy possibility of parallelism
- Still preliminary work
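A sketch of how the inner step changes under mini-batching, assuming the single-component gradient change is replaced by an average over a sampled mini-batch of least-squares components; the batch size and step size are illustrative.

```python
import numpy as np

def minibatch_inner_step(x, y, g, A, b, h=0.05, batch=8, rng=np.random.default_rng(0)):
    """One mS2GD-style inner step: average the gradient change over a mini-batch."""
    n = A.shape[0]
    idx = rng.choice(n, size=batch, replace=False)
    # average of (grad f_i(x) - grad f_i(y)) over the mini-batch, for f_i = 0.5*(a_i^T x - b_i)^2
    change = ((A[idx] @ x - A[idx] @ y)[:, None] * A[idx]).mean(axis=0)
    return x - h * (g + change)
```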
S2CD (Semi-Stochastic Coordinate Descent)
- Coordinate updates?
- Sampling coordinates non-uniformly and scaling the updates accordingly works
- Needs more, but cheaper, iterations
- Still preliminary work
S2GD as a Learning Algorithm
Machine Learning Setting
- Space of input-output pairs $(a, b) \in \mathcal{X} \times \mathcal{Y}$
- Unknown distribution $P(a, b)$: a relationship between inputs and outputs
- Loss function $\ell(\hat{b}, b)$ to measure the discrepancy between predicted and real output
- Define the Expected Risk: $E(f) = \int \ell\big(f(a), b\big)\, dP(a, b)$
Machine Learning Setting
- Ideal goal: find $f^* = \arg\min_f E(f)$
- But you cannot even evaluate the Expected Risk $E(f)$, since the distribution $P$ is unknown
Machine Learning Setting
- We at least have $n$ i.i.d. samples $(a_1, b_1), \dots, (a_n, b_n)$ drawn from $P$
- Define the Empirical Risk: $E_n(f) = \frac{1}{n}\sum_{i=1}^{n} \ell\big(f(a_i), b_i\big)$
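Written out with the logistic-regression loss from the examples slide (and a linear predictor $f_x(a) = a^\top x$, an assumption of this note), the Empirical Risk is exactly the finite-sum objective that S2GD minimizes:

```latex
E_n(x) \;=\; \frac{1}{n}\sum_{i=1}^{n} \ell\big(f_x(a_i),\, b_i\big)
       \;=\; \frac{1}{n}\sum_{i=1}^{n} \log\!\big(1 + \exp(-b_i\, a_i^{\top} x)\big)
       \;=\; \frac{1}{n}\sum_{i=1}^{n} f_i(x).
```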
Machine Learning Setting
- First learning principle: fix a family $\mathcal{F}$ of candidate prediction functions
- Find the Empirical Minimizer: $f_n = \arg\min_{f \in \mathcal{F}} E_n(f)$
Machine Learning Setting
- Since the optimal $f^*$ is unlikely to belong to $\mathcal{F}$, we also define $f^*_{\mathcal{F}} = \arg\min_{f \in \mathcal{F}} E(f)$
Machine Learning Setting
- Finding $f_n$ by minimizing the Empirical Risk exactly is often computationally expensive
- Instead, run an optimization algorithm that returns $\tilde{f}_n$ such that $E_n(\tilde{f}_n) \le E_n(f_n) + \rho$
Recapitulation
- $f^*$ – ideal optimum
- $f^*_{\mathcal{F}}$ – “best” from our family
- $f_n$ – Empirical Minimizer
- $\tilde{f}_n$ – from approximate optimization
Machine Learning Goal
- Big goal is to minimize the Excess Risk, which splits into:
  - Approximation error
  - Estimation error
  - Optimization error
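Spelled out with the quantities from the recapitulation slide, the excess risk decomposes as:

```latex
\mathcal{E} \;=\; \mathbb{E}\big[E(\tilde f_n) - E(f^*)\big]
 \;=\; \underbrace{\mathbb{E}\big[E(f^*_{\mathcal F}) - E(f^*)\big]}_{\text{approximation error}}
 \;+\; \underbrace{\mathbb{E}\big[E(f_n) - E(f^*_{\mathcal F})\big]}_{\text{estimation error}}
 \;+\; \underbrace{\mathbb{E}\big[E(\tilde f_n) - E(f_n)\big]}_{\text{optimization error}}.
```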
Generic Machine Learning Problem
- All this leads to a complicated compromise
- Three variables:
  - Family of functions $\mathcal{F}$
  - Number of examples $n$
  - Optimization accuracy $\rho$
- Two constraints:
  - Maximal number of examples available
  - Maximal computational time available
Generic Machine Learning Problem
- Small scale learning problem: the first constraint (number of examples) is tight
  - Can reduce $\rho$ to insignificant levels and recover the approximation-estimation tradeoff (well studied)
- Large scale learning problem: the second constraint (computational time) is tight
  - More complicated compromise
Solving the Large Scale ML Problem
- Several simplifications are needed
- Do not carefully balance the three terms; instead, only ensure that they vanish asymptotically
- Consider a fixed family of functions $\mathcal{F}$, linearly parameterized by a vector
- Effectively sets the approximation error to be a constant
- Simplifies the setting to an Estimation–Optimization tradeoff
Estimation–Optimization tradeoff
- Using uniform convergence bounds, one can obtain a bound on the estimation plus optimization error
- Often considered weak
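For context, a sketch of the standard argument behind such bounds, using only that $\tilde f_n$ is a $\rho$-approximate empirical minimizer:

```latex
E(\tilde f_n) - E(f^*_{\mathcal F})
 \;=\; \big[E(\tilde f_n) - E_n(\tilde f_n)\big]
 \;+\; \big[E_n(\tilde f_n) - E_n(f^*_{\mathcal F})\big]
 \;+\; \big[E_n(f^*_{\mathcal F}) - E(f^*_{\mathcal F})\big]
 \;\le\; 2\,\sup_{f \in \mathcal F}\big|E(f) - E_n(f)\big| \;+\; \rho,
```
since $E_n(\tilde f_n) \le E_n(f_n) + \rho \le E_n(f^*_{\mathcal F}) + \rho$.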
Estimation–Optimization tradeoff
- Using Localized Bounds (Bousquet, PhD thesis, 2004) or Isomorphic Coordinate Projections (Bartlett and Mendelson, 2006), we get a tighter bound … if we can establish the following variance condition
- It often holds, for example under strong convexity, or when making assumptions on the data distribution
Estimation–Optimization tradeoff
- Using the previous bounds yields a bound on the excess risk, where the leading factor is an absolute constant
- We want to push this term below the target accuracy
- Choosing the parameters accordingly, and using the complexities of the algorithms, we get the following table