Flexpoint: Predictive Numerics for Deep Learning
Valentina Popescu, Marcel Nassar, Xin Wang, Evren Tumer, Tristan Webb
Artificial Intelligence Products Group, Intel Corporation

• Branch of machine learning inspired by the human brain;
• Layered models containing millions of parameters;
• Vast amounts of data (organized into tensors) needed for training; a full pass over the data set comprises an epoch;
• Training is done using stochastic gradient descent with backpropagation (a minimal sketch follows below);
• The major computational workload is in the convolutional layers (additions and multiplications).
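
A minimal sketch of the SGD update, for concreteness. Names such as `sgd_step` are illustrative, not from the slides; in a real framework this step is fused with backpropagation, which supplies the gradients layer by layer:

```python
# Minimal sketch of one stochastic gradient descent (SGD) update step.
# In practice the gradients come from backpropagation through the layers.
import numpy as np

def sgd_step(params, grads, lr=0.01):
    """Move every parameter tensor a small step against its gradient."""
    return [p - lr * g for p, g in zip(params, grads)]

# Toy usage: one 3x3 weight matrix and a dummy gradient for it.
weights = [np.random.randn(3, 3)]
gradients = [np.ones((3, 3))]
weights = sgd_step(weights, gradients, lr=0.1)
```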

Candidate numeric formats: binary64, binary32, binary16, int32, int16, int8, binary. Workloads: inference vs. training.
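
To make the trade-off concrete, a small NumPy query (the standard `finfo`/`iinfo` calls, nothing Flexpoint-specific) prints the dynamic range and precision of each of these formats:

```python
# Print range and precision of the floating-point and integer formats above.
import numpy as np

for ft in (np.float64, np.float32, np.float16):   # binary64 / binary32 / binary16
    f = np.finfo(ft)
    print(f"{f.dtype}: max={f.max:.3g}, smallest normal={f.tiny:.3g}, "
          f"mantissa bits={f.nmant}")

for it in (np.int32, np.int16, np.int8):
    i = np.iinfo(it)
    print(f"{i.dtype}: range [{i.min}, {i.max}]")
```

The narrower formats are attractive for inference, but their reduced range and precision are exactly what makes training with them risky.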

• Activations can be normalized to have a narrow range of values across the entire set;
• Layer parameters change slowly in order of magnitude during the course of training;
• Gradients may become very small compared to the parameter values, so they are discarded during the update as rounding error (demonstrated below);
• Tensor operations may involve a large number of multiply–accumulate (MAC) operations, which can lead to overflow during accumulation.
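
The last two bullets are easy to reproduce. A small demonstration, assuming plain IEEE binary16 (`np.float16`) arithmetic:

```python
# Two low-precision failure modes from the list above, in IEEE binary16.
import numpy as np

# 1. Update swamping: a small gradient is discarded as rounding error,
#    because 1e-4 is below half the spacing of binary16 around 1.0.
param = np.float16(1.0)
grad = np.float16(1e-4)
print(param + grad == param)   # True: the parameter did not move at all

# 2. Accumulation overflow: binary16 tops out near 65504, so a long
#    multiply-accumulate chain can blow up (NumPy warns, then yields inf).
acc = np.float16(0.0)
for _ in range(1000):
    acc = acc + np.float16(100.0)
print(acc)                     # inf
```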

A deep ResNet trained on the CIFAR-10 dataset for 165 epochs.

A two-layer perceptron trained for 400 iterations on the CIFAR-10 dataset.

• Overflow is detrimental to deep neural network training;
• To prevent overflow, one could store intermediate results in higher precision and then truncate, but this annuls all potential savings.
Our solution (sketched below):
• monitor a recent history of the absolute scale of each tensor,
• use a sound statistical model to predict its trend,
• estimate the probability of overflow,
• preemptively adjust the scale to prevent overflow.
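
A minimal sketch of this idea in NumPy. This is not Intel's actual implementation (the Autoflex algorithm of Köster et al. 2017 uses a more elaborate statistical model); all names below (`ScaleManager`, `decay`, `margin`) are illustrative, and the "prediction" here is simply an exponentially weighted mean of recent tensor maxima plus a standard-deviation safety margin:

```python
# Illustrative predictive scaling for a Flexpoint-style tensor: int16
# mantissas sharing one exponent, so each value = mantissa * 2**exp.
import numpy as np

class ScaleManager:
    def __init__(self, bits=16, decay=0.9, margin=3.0):
        self.limit = 2 ** (bits - 1) - 1  # largest representable mantissa
        self.decay = decay                # weight of history vs. new data
        self.margin = margin              # std-devs of overflow headroom
        self.mean = 0.0                   # running mean of observed maxima
        self.var = 0.0                    # running variance of the maxima
        self.exp = 0                      # current shared exponent

    def observe(self, tensor):
        """Fold this step's max |value| into the running statistics."""
        delta = float(np.max(np.abs(tensor))) - self.mean
        self.mean += (1 - self.decay) * delta
        self.var = self.decay * (self.var + (1 - self.decay) * delta ** 2)

    def next_exponent(self):
        """Smallest shared exponent whose range covers the predicted max."""
        predicted = self.mean + self.margin * np.sqrt(self.var)
        self.exp = int(np.ceil(np.log2(max(predicted, 1e-30) / self.limit)))
        return self.exp
```

In use, each training step would call `observe` on a tensor after it is produced and requantize it with scale `2**next_exponent()` before the next MAC-heavy operation; because the exponent is raised before the predicted maximum actually occurs, overflow is avoided without ever storing the tensor in higher precision.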

Results from Köster et al. 2017.

Wasserstein GAN (LSUN bedroom dataset), Köster et al. 2017.
