Communication Efficient Stochastic Gradient Descent Milan Vojnovi Department
- Slides: 25
Communication Efficient Stochastic Gradient Descent Milan Vojnović Department of Statistics Joint work with Dan Alistarh, Jerry Li and Ryota Tomioka Huawei: Innovative Algorithmic and Mathematic Workshop for NG Network Technology, Paris, November 7 th, 2016
To Probe Further • QSGD: Randomized Quantization for Communication-Optimal Stochastic Gradient Descent, D. Alistarh, J. Li, R. Tomioka, M. Vojnovic https: //arxiv. org/abs/1610. 02132 • Workshop OPTIM 2016, co-located with NIPS 2016, Barcelona, December 10, 2016 Work performed in part while all authors worked at Microsoft Research 2
Motivating Applications Amazon’s Alexa 3
Big data & big models • Vision • Image. Net: 1. 6 million images • Res. Net-152: 60 million parameters [He et al, Deep Residual Learning for Image Recognition, 2015] • Speech • NIST 2000 Switchboard dataset: 2, 000 hours • LACEA: 65 million parameters (w/o language model) [Yu et al, Deep Convolutional Neural Networks with Layer-wise Context Expansion and Attention, 2016]
Stochastic Gradient Descent (SGD) • projection step size stochastic gradient vector 5
Scaling Up SGD: Data-Parallel Computing 1 Gradient updates communicated to all nodes 2 k 6
Data Parallel SGD Compute gradient Mini-batch 1 Exchange gradient Update params Mini-batch 2
Mini-batch 1 Co gr mp ad re ien ss t Ex gr cha ad ng ien e De t gr com ad ien pre ss Up t pa dat ra e ms Co gra mpu die te nt With compressed gradients… Mini-batch 2
Two Examples of Lossy Gradient Compressions … 9
10
11
1 -bit SGD [Seide et al 2014] • float v 1 float v 2 float vp float vn float v 3 float v 4 float vn n bits sign Biased stochastic gradient updates, no theoretical guarantees
Questions • Can we compress the stochastic gradient updates to fewer bits while still having a convergence guarantee ? • Can this be achieved with a sublinear number of bits with respect to the parameter vector dimension ? 13
A Gadget • 14
Our Quantized SGD Scheme: Simple Version • 15
Simple Quantized SGD Scheme (Cont’d) • 16
Proof Sketch • 17
Our Quantized SGD Scheme: Full Version • 18
Generalized Quantized SGD Scheme (Cont’d) • 19
Smooth Trade-off Curve iterations bits per iteration 20
Experimental Results MNIST: handwritten digits CIFAR-10: object classification Training set: 60, 000 28 x 28 single digit images, test set: 10, 000 Model: two-layer perceptron Mini-batch size 256, Step size 0. 1 Number of parameters = 3. 3 M Training set: 50, 000 32 x 32 colour images; 1. 8 M images by translating, cropping and horizontal flipping Model: 9 2 D convolution layers and 3 fully connected layer Mini-batch size 256, Step size 0. 1 Number of parameters = 22 M 21
MNIST 22
CIFAR-10 23
Multiple GPU Nodes 24
Thanks 25
- Logistic regression stochastic gradient descent
- Stochastic gradient descent
- Stochastic gradient langevin dynamics
- Gradient descent equation
- Linear regression with multiple variables machine learning
- Linear regression gradient descent
- Batch gradient descent
- Batch gradient descent
- Gradient descent python implementation
- Gradient descent rule
- Productively efficient vs allocatively efficient
- Productively efficient vs allocatively efficient
- Productively efficient vs allocatively efficient
- Allocative efficiency vs productive efficiency
- Productively efficient vs allocatively efficient
- Stochastic rounding
- Stochastic programming
- Asynchronnous
- Wan optimization tutorial
- Deterministic and stochastic inventory models
- Put call parity formula
- Stochastic vs dynamic
- Stochastic matrix
- Stochastic regressors
- Non stochastic theory of aging
- A first course in stochastic processes