Peter Richtárik, Parallel Coordinate Descent Methods. Simons Institute for the Theory of Computing, Berkeley. Parallel and Distributed Algorithms for Inference and Optimization, October 23, 2013.
Randomized Coordinate Descent in 2D

2D optimization: contour plot of a function F. Goal: find the minimizer of F.

[Animation over several slides: starting from an initial point, the method repeatedly minimizes F along a single randomly chosen axis, so every move is a step in one of the compass directions N, E, W, S; after seven such coordinate steps the problem is SOLVED!]

A minimal sketch of this serial method follows.
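To make the picture concrete, here is a minimal sketch of serial randomized coordinate descent on a 2D quadratic; the objective and all names are illustrative, not taken from the talk.

```python
# Randomized coordinate descent on F(x) = 0.5*x^T A x - b^T x (illustrative).
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[3.0, 1.0], [1.0, 2.0]])   # positive definite => unique minimizer
b = np.array([1.0, 1.0])

x = np.zeros(2)
for k in range(50):
    i = rng.integers(2)              # pick one coordinate uniformly at random
    grad_i = A[i] @ x - b[i]         # i-th partial derivative of F at x
    x[i] -= grad_i / A[i, i]         # exact minimization along coordinate i

print(x, np.linalg.solve(A, b))      # iterate should match the true minimizer
```

Each iteration changes a single coordinate, which is exactly the N/E/W/S movement in the contour-plot animation.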
Convergence of Randomized Coordinate Descent

Focus on the dependence on n (big data = big n). Three regimes, from fastest to slowest: strongly convex F; smooth or 'simple' nonsmooth F; 'difficult' nonsmooth F.
Parallelization Dream

Serial: update one coordinate per iteration. Parallel: update many coordinates per iteration. We WANT the parallel method to be proportionally faster, but what do we actually get? It depends on the extent to which individual updates can be added up, which in turn depends on the properties of F and on how the coordinates are chosen at each iteration.
How (not) to Parallelize Coordinate Descent
"Naive" parallelization: do the same thing as before, but for MORE (or ALL) coordinates, and ADD UP the updates.
Failure of naive parallelization

[Animation over several slides: on a 2D problem with coupled coordinates, each coordinate update is individually sensible, but ADDING them up makes the iterates jump past the minimizer and drift farther away with each iteration. OOPS!]
Idea: averaging updates may help

[Animation: on the same coupled 2D problem, AVERAGING the coordinate updates instead of adding them makes the iterates converge. SOLVED!]

Averaging may be too conservative

[Animation: on a different problem, averaged steps move only a fraction of the way toward the minimizer at each iteration, and so on... We WANTED the full combined step; getting a fraction of it is BAD!]
What to do?

Let $h^{(i)}$ denote the update to coordinate i computed in isolation, and let $e_i$ be the i-th unit coordinate vector. The two combination rules are

Averaging:
$$x_{k+1} = x_k + \frac{1}{|\hat{S}_k|} \sum_{i \in \hat{S}_k} h^{(i)} e_i$$

Summation:
$$x_{k+1} = x_k + \sum_{i \in \hat{S}_k} h^{(i)} e_i$$

Goal: figure out when one can safely use summation. The toy example below shows both regimes.
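A toy comparison, not from the slides, contrasting the two rules on the extreme cases: a fully separable quadratic (summation is ideal) and a fully coupled one (summation overshoots badly).

```python
import numpy as np

n = 10
x = np.ones(n)

# Case 1: separable F(x) = 0.5*||x||^2. Coordinates are independent, so the
# best single-coordinate updates are h_i = -x_i.
h = -x
print(np.linalg.norm(x + h))      # summation: 0.0 -- solved in one step
print(np.linalg.norm(x + h / n))  # averaging: norm shrinks only by factor 1 - 1/n

# Case 2: fully coupled F(x) = 0.5*(x_1 + ... + x_n)^2. Each coordinate alone
# wants to cancel the entire residual: h_i = -(x_1 + ... + x_n).
h = np.full(n, -x.sum())
print((x + h).sum())              # summation: residual 10 -> -90 (overshoot!)
print((x + h / n).sum())          # averaging: residual 10 -> 0 (safe)
```

Real problems sit between these extremes; the ESO machinery below quantifies where.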
Optimization Problems
Problem

Minimize $F(x) = f(x) + \Omega(x)$ over $x \in \mathbb{R}^n$, where the loss f is convex (smooth or nonsmooth) and the regularizer Ω is convex (smooth or nonsmooth), separable, and allowed to take the value $+\infty$.
Regularizer: examples
• No regularizer
• Weighted L1 norm (e.g., LASSO)
• Weighted L2 norm
• Box constraints (e.g., SVM dual)
Loss: examples
• Quadratic loss
• Logistic loss
• Square hinge loss
• L-infinity regression
• L1 regression
• Exponential loss
(References for these settings: BKBG'11, RT'11b, TBRS'13, RT'13a, FR'13.)
Three models for f with small β (the ESO stepsize parameter defined below):
1. Smooth partially separable f [RT'11b]
2. Nonsmooth max-type f [FR'13]
3. f with 'bounded Hessian' [BKBG'11, RT'13a]
General Theory
Randomized Parallel Coordinate Descent Method

$$x_{k+1} = x_k + \sum_{i \in \hat{S}_k} h_k^{(i)} e_i$$

Here $x_{k+1}$ is the new iterate, $x_k$ the current iterate, $e_i$ the i-th unit coordinate vector, $\hat{S}_k$ a random set of coordinates (a 'sampling'), and $h_k^{(i)}$ the update to the i-th coordinate. A sketch of one iteration appears below.
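As a concrete (illustrative) instance, here is one iteration of the method for the smooth quadratic loss $f(x) = \frac{1}{2}\|Ax - b\|^2$ with no regularizer, a τ-nice sampling, and updates damped by an ESO parameter β; all names are assumptions made for this sketch, not the talk's code.

```python
import numpy as np

def pcdm_step(x, A, b, tau, beta, rng):
    """One parallel coordinate descent step on f(x) = 0.5*||Ax - b||^2."""
    n = x.size
    L = (A ** 2).sum(axis=0)                    # coordinate-wise Lipschitz constants
    S = rng.choice(n, size=tau, replace=False)  # tau-nice sampling
    grad = A.T @ (A @ x - b)                    # full gradient for clarity; real code
                                                # would maintain residuals incrementally
    h = np.zeros(n)
    h[S] = -grad[S] / (beta * L[S])             # damped per-coordinate updates
    return x + h                                # updates are ADDED, not averaged
```

The damping by β is what makes summation safe; β comes from the ESO, defined next.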
ESO: Expected Separable Overapproximation

Definition [RT'11b]: $(f, \hat{S})$ admits an ESO with parameters β and w, written (shorthand) $(f, \hat{S}) \sim ESO(\beta, w)$, if for all x and h
$$\mathbf{E}\left[f\big(x + h_{[\hat{S}]}\big)\right] \;\le\; f(x) + \frac{\mathbf{E}[|\hat{S}|]}{n}\left(\langle \nabla f(x), h \rangle + \frac{\beta}{2}\|h\|_w^2\right),$$
where $h_{[\hat{S}]} = \sum_{i \in \hat{S}} h^{(i)} e_i$. Minimize the right-hand side in h:
1. It is separable in h.
2. It can be minimized in parallel.
3. Updates need to be computed only for the coordinates $i \in \hat{S}$.
Convergence rate: convex f

Theorem [RT'11b]: let β be the ESO stepsize parameter, n the number of coordinates, $\tau = \mathbf{E}[|\hat{S}|]$ the average number of coordinates updated per iteration, and ε the error tolerance. Then a number of iterations on the order of
$$k \;\approx\; \frac{2 n \beta}{\tau} \cdot \frac{C}{\varepsilon}$$
(with C a constant depending on the initial point and the confidence level) implies that $F(x_k) - F^* \le \varepsilon$ holds with high probability.
Convergence rate: strongly convex f

Theorem [RT'11b]: let $\mu_f$ and $\mu_\Omega$ be the strong convexity constants of the loss f and of the regularizer Ω, respectively. Then
$$k \;\ge\; \frac{n}{\tau} \cdot \frac{\beta + \mu_\Omega}{\mu_f + \mu_\Omega} \log\left(\frac{F(x_0) - F^*}{\varepsilon \rho}\right)$$
implies $\mathbf{P}\left(F(x_k) - F^* \le \varepsilon\right) \ge 1 - \rho$.
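To get a feel for the convex-case bound, a worked instance with assumed, purely illustrative numbers: n = 10^6 coordinates, τ = 10^3 parallel updates per iteration, and β = 2 (nearly separable data).

```latex
% Leading factor of the convex rate (illustrative numbers, constant C suppressed):
\[
  k \;\approx\; \frac{2 n \beta}{\tau \varepsilon}
    \;=\; \frac{2 \cdot 10^{6} \cdot 2}{10^{3}\,\varepsilon}
    \;=\; \frac{4000}{\varepsilon},
\]
% versus 2n/\varepsilon = 2 \cdot 10^{6}/\varepsilon for the serial method
% (\tau = 1, \beta = 1): a parallelization speedup of \tau/\beta = 500.
```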
Partial Separability and Doubly Uniform Samplings
Serial uniform sampling

Picks a single coordinate uniformly at random. Probability law: $\mathbf{P}(\hat{S} = \{i\}) = \frac{1}{n}$ for each coordinate i.
τ-nice sampling

Good for shared-memory systems. Picks a subset of τ coordinates uniformly at random. Probability law: $\mathbf{P}(\hat{S} = S) = 1 / \binom{n}{\tau}$ for every subset S with $|S| = \tau$.
Doubly uniform sampling

Can model unreliable processors/machines: the cardinality of $\hat{S}$ may be random. Probability law: $\mathbf{P}(\hat{S} = S) = \frac{q_{|S|}}{\binom{n}{|S|}}$, where $q_k = \mathbf{P}(|\hat{S}| = k)$ can be any distribution over cardinalities. Sketches of all three samplings follow.
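Minimal sketches of the three samplings, assuming numpy; the distributions follow the probability laws stated above.

```python
import numpy as np

rng = np.random.default_rng(0)

def serial_uniform(n):
    # P(S = {i}) = 1/n for each coordinate i
    return {rng.integers(n)}

def tau_nice(n, tau):
    # uniform over all subsets of size tau: P(S) = 1 / C(n, tau)
    return set(rng.choice(n, size=tau, replace=False))

def doubly_uniform(n, size_probs):
    # first draw the cardinality from q = size_probs (q_k = P(|S| = k)),
    # then a uniform subset of that size, so P(S) = q_{|S|} / C(n, |S|)
    k = rng.choice(len(size_probs), p=size_probs)
    return set(rng.choice(n, size=k, replace=False))
```

Note that the serial uniform and τ-nice samplings are special cases of the doubly uniform one (put all mass of q on cardinality 1 or τ, respectively).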
ESO for partially separable functions and doubly uniform samplings

Model 1: smooth partially separable f [RT'11b].

Theorem [RT'11b]: if f is partially separable of degree ω and $\hat{S}$ is doubly uniform, then $(f, \hat{S})$ admits an ESO; for the τ-nice sampling the parameter is
$$\beta = 1 + \frac{(\omega - 1)(\tau - 1)}{\max(1, n - 1)}.$$
Theoretical speedup

With n coordinates, ω the degree of partial separability, and τ coordinate updates per iteration, the predicted speedup factor over the serial method is $\tau / \beta$.

LINEAR OR GOOD SPEEDUP: nearly separable (sparse) problems, i.e., small ω. Much of Big Data is here!
WEAK OR NO SPEEDUP: non-separable (dense) problems, i.e., ω close to n.

A quick numerical evaluation follows.
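Evaluating the speedup factor τ/β with the τ-nice β from the theorem above (illustrative numbers):

```python
# Speedup tau/beta with beta = 1 + (omega-1)(tau-1)/(n-1) (tau-nice ESO).
n, tau = 1000, 100
for omega in (2, 10, 100, 1000):    # degree of partial separability
    beta = 1 + (omega - 1) * (tau - 1) / (n - 1)
    print(f"omega={omega:4d}  beta={beta:7.2f}  speedup={tau / beta:6.1f}")
```

For sparse problems (ω = 2) the speedup is near-linear (about 91 of a possible 100); for fully dense problems (ω = n = 1000) it collapses to 1.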
[Plots, 'Theory' and 'Practice': predicted vs. empirically observed speedup, n = 1000 coordinates.]
Experiment with a 1 billion × 2 billion LASSO problem
Optimization with Big Data = Extreme* Mountain Climbing
(* in a billion-dimensional space on a foggy day)
[Plots: progress on the LASSO experiment measured in coordinate updates, iterations, and wall time.]
Distributed-Memory Coordinate Descent
Distributed τ-nice sampling

Good for a distributed version of coordinate descent. The coordinates are partitioned across machines (Machine 1, Machine 2, Machine 3, ...), and each machine independently picks τ of its own coordinates uniformly at random; the probability law is the product of the per-machine τ-nice laws. A sketch follows.
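A sketch of the distributed τ-nice sampling, assuming the n coordinates are partitioned evenly across c machines and each machine draws τ of its own coordinates; in practice the per-machine draws run in parallel, and all names here are illustrative.

```python
import numpy as np

def distributed_tau_nice(n, c, tau, rng):
    parts = np.array_split(np.arange(n), c)   # static partition of coordinates
    S = []
    for part in parts:                        # each machine samples locally
        S.extend(rng.choice(part, size=tau, replace=False))
    return set(S)

rng = np.random.default_rng(0)
print(sorted(distributed_tau_nice(12, 3, 2, rng)))  # e.g. 2 coords per machine
```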
ESO: distributed setting

Model 3: f with 'bounded Hessian' [BKBG'11, RT'13a].

Theorem [RT'13b]: the distributed sampling admits an ESO whose parameter β is governed by the spectral norm of the data.
Bad partitioning at most doubles the number of iterations

Theorem [RT'13b]: the iteration bound involves the spectral norm of the partitioning, and even the worst assignment of coordinates to nodes increases the implied number of iterations (and hence the number of updates per node, for a fixed number of nodes) by at most a factor of two.
LASSO with a 3 TB data matrix

n = # coordinates. Hardware: 128 Cray XE6 nodes with 4 MPI processes each (c = 512); each node has 2 × 16 cores and 32 GB RAM.
Conclusions

• Coordinate descent methods scale very well to big data problems of special structure:
  – partial separability (sparsity)
  – small spectral norm of the data
  – Nesterov separability, ...
• Care is needed when combining updates (add them up? average?).
References: serial coordinate descent

• Shai Shalev-Shwartz and Ambuj Tewari. Stochastic methods for L1-regularized loss minimization. JMLR, 2011.
• Yurii Nesterov. Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM Journal on Optimization, 22(2):341-362, 2012.
• [RT'11b] P. R. and Martin Takáč. Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function. Mathematical Programming, 2012.
• Rachael Tappenden, P. R., and Jacek Gondzio. Inexact coordinate descent: complexity and preconditioning. arXiv:1304.5530, 2013.
• Ion Necoara, Yurii Nesterov, and Francois Glineur. Efficiency of randomized coordinate descent methods on optimization problems with linearly coupled constraints. Technical report, Politehnica University of Bucharest, 2012.
• Zhaosong Lu and Lin Xiao. On the complexity analysis of randomized block-coordinate descent methods. Technical report, Microsoft Research, 2013.
References: parallel coordinate descent

Good entry point to the topic (4-page paper).

• [BKBG'11] Joseph Bradley, Aapo Kyrola, Danny Bickson, and Carlos Guestrin. Parallel coordinate descent for L1-regularized loss minimization. ICML 2011.
• [RT'12] P. R. and Martin Takáč. Parallel coordinate descent methods for big data optimization. arXiv:1212.0873, 2012.
• Martin Takáč, Avleen Bijral, P. R., and Nathan Srebro. Mini-batch primal and dual methods for SVMs. ICML 2013.
• [FR'13] Olivier Fercoq and P. R. Smooth minimization of nonsmooth functions with parallel coordinate descent methods. arXiv:1309.5885, 2013.
• [RT'13a] P. R. and Martin Takáč. Distributed coordinate descent method for big data learning. arXiv:1310.2059, 2013.
• [RT'13b] P. R. and Martin Takáč. On optimal probabilities in stochastic coordinate descent methods. arXiv:1310.3438, 2013.
References: parallel coordinate descent (continued)

• P. R. and Martin Takáč. Efficient serial and parallel coordinate descent methods for huge-scale truss topology design. Operations Research Proceedings, 2012.
• Rachael Tappenden, P. R., and Burak Buke. Separable approximations and decomposition methods for the augmented Lagrangian. arXiv:1308.6774, 2013.
• Indranil Palit and Chandan K. Reddy. Scalable and parallel boosting with MapReduce. IEEE Transactions on Knowledge and Data Engineering, 24(10):1904-1916, 2012.
• Shai Shalev-Shwartz and Tong Zhang. Accelerated mini-batch stochastic dual coordinate ascent. NIPS 2013. (TALK TOMORROW)