PipeDream: Pipeline Parallelism for DNN Training
Aaron Harlap, Deepak Narayanan, Amar Phanishayee*, Vivek Seshadri*, Greg Ganger, Phil Gibbons
Parallel Data Laboratory, Carnegie Mellon University; Microsoft Research*

Example DNN
[Figure: an example DNN]

DNN - How They Work
[Figure: an input is mapped through the network to an output prediction for "Good NFL Coach?"; the output probabilities shown are Yes: 12% / No: 88% and Yes: 99% / No: 1%]

DNN Training - How They Learn
• Forward pass: make a prediction (input → output)
• Backward pass: update the solution depending on the error
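To make the two passes concrete, here is a minimal sketch of one training step in PyTorch-style Python; the model shape, batch size, and learning rate are illustrative assumptions, not details from the talk.

```python
import torch
import torch.nn as nn

# Illustrative model and hyperparameters (not from the talk)
model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 2))
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(64, 784)          # a batch of inputs
y = torch.randint(0, 2, (64,))    # labels

pred = model(x)                   # forward pass: make a prediction
loss = loss_fn(pred, y)           # measure the prediction error
opt.zero_grad()
loss.backward()                   # backward pass: compute gradients from the error
opt.step()                        # update the solution
```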

Data-Parallel Training
• Separate copy of the model on each machine
• Communicate updates to model parameters; various synchronization models can be used
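As a hedged illustration of the synchronous variant, the sketch below simulates data parallelism in a single process: each "machine" holds a full model replica, computes gradients locally, and the replicas then average their gradients (the communicated parameter update). Replica count, model, and learning rate are assumptions for the example.

```python
import torch
import torch.nn as nn

def make_model():
    return nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 2))

replicas = [make_model() for _ in range(3)]     # one model copy per "machine"
for r in replicas[1:]:                          # start from identical weights
    r.load_state_dict(replicas[0].state_dict())

loss_fn = nn.CrossEntropyLoss()
for r in replicas:                              # each machine: local forward/backward
    x, y = torch.randn(64, 784), torch.randint(0, 2, (64,))
    loss_fn(r(x), y).backward()

with torch.no_grad():                           # "communication": average gradients
    for params in zip(*(r.parameters() for r in replicas)):
        g = sum(p.grad for p in params) / len(params)
        for p in params:
            p -= 0.1 * g                        # synchronous SGD step on every replica
```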

Problem w/ Data-Parallel
• Communication overhead can be high
• Communicating less often lowers the quality of the work performed
[Figure: per-network breakdown of time spent on communication vs. computation for VGG, OverFeat, Inception-BN, and GoogLeNet]

New Approach: Pipeline Parallel
• Assign layers to machines; communicate inter-layer activations
[Figure: the network's layers partitioned across Machine 1, Machine 2, and Machine 3]
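A minimal sketch of the partitioning idea, assuming a three-stage split of an illustrative model: only the activation tensor crosses each stage boundary, which in a real deployment would be a network transfer between machines.

```python
import torch
import torch.nn as nn

stages = [                                            # one stage per machine (illustrative split)
    nn.Sequential(nn.Linear(784, 512), nn.ReLU()),    # Machine 1
    nn.Sequential(nn.Linear(512, 256), nn.ReLU()),    # Machine 2
    nn.Sequential(nn.Linear(256, 2)),                 # Machine 3
]

x = torch.randn(64, 784)
act = x
for stage in stages:
    act = stage(act)    # in a real system, this activation tensor is what
                        # gets sent over the network to the next machine
print(act.shape)        # final output from the last stage: [64, 2]
```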

Alternate Forward / Backward Work
• Each machine does forward work, then backward work, then forward work, etc.
[Figure: steady-state schedule.
Machine 1: forward compute on work #5, backward compute on work #1, Fw.Send #5.
Machine 2: forward compute on work #4, backward compute on work #2, Fw.Send #4, Bw.Send #1.
Machine 3: forward compute on work #3, backward compute on work #3, Bw.Send #2.]
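The queue-based toy simulation below illustrates how this alternation emerges; it is a sketch of the scheduling idea, not PipeDream's actual scheduler. Each machine does one unit of work per step, preferring backward work when any is queued; the last stage turns forward work around into backward work, so in steady state every machine alternates between the two, as in the slide's snapshot.

```python
from collections import deque

num_machines = 3
forward_q = [deque() for _ in range(num_machines)]
backward_q = [deque() for _ in range(num_machines)]
forward_q[0].extend(range(1, 7))            # minibatches entering the pipeline

for step in range(12):
    sends = []                              # deliveries take effect after the step
    for m in range(num_machines):
        if backward_q[m]:                   # backward work takes priority, so each
            mb = backward_q[m].popleft()    # machine alternates forward/backward
            print(f"step {step}: machine {m + 1} backward #{mb}")
            if m > 0:
                sends.append((backward_q[m - 1], mb))   # Bw.Send upstream
        elif forward_q[m]:
            mb = forward_q[m].popleft()
            print(f"step {step}: machine {m + 1} forward #{mb}")
            if m < num_machines - 1:
                sends.append((forward_q[m + 1], mb))    # Fw.Send downstream
            else:
                sends.append((backward_q[m], mb))       # last stage turns around
    for q, mb in sends:
        q.append(mb)
```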

One Quick Result - Pipelining 2.5x Better
• VGG network w/ the ImageNet dataset

Many More Interesting Details
• Staleness vs. speed trade-off
• How staleness in pipelining differs from data-parallel staleness
• How to divide work among machines
• GPU memory management
• This and more at our poster!

Summary
• New way to parallelize DNN training
• Communicate layer activations instead of model parameters
• Improves over data-parallel for networks w/ large solution state