Flexpoint: Predictive Numerics for Deep Learning
Valentina Popescu, Marcel Nassar, Xin Wang, Evren Tumer, Tristan Webb
ARTIFICIAL INTELLIGENCE PRODUCTS GROUP, INTEL CORPORATION

• Branch of machine learning inspired by the human brain;
• Layered models containing millions of parameters;
• Vast amounts of data (organized into tensors) needed for training; one full pass over the data set is an epoch;
• Training done using stochastic gradient descent with backpropagation;
• Major computational workload is in the convolutional layers (additions and multiplications).

[Figure: Workloads: Inference vs. Training. Numeric formats shown: binary64, binary32, binary16, int32, int16, int8, binary]

• Activations can be normalized to have a narrow range of values across the entire set;
• Layer parameters change slowly, in terms of order of magnitude, during the course of training;
• Gradients may become very small compared to the parameter values, so they are discarded as rounding error during the update;
• Tensor operations may involve a large number of multiply–accumulate (MAC) operations, which may lead to overflow during accumulation.
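
The last two points can be reproduced in a few lines of NumPy. This is a minimal illustration with made-up values (the weight, update size, and vector length are assumptions chosen for the demonstration, not numbers from the talk): a small update vanishes when added to a binary16 weight, and a sum of products wraps around when accumulated in int16.

```python
import numpy as np

# Effect 1: a small gradient update is lost to rounding in binary16.
# The spacing of binary16 values just above 1.0 is 2**-10 (about 9.8e-4),
# so an update of 1e-4 is less than half a spacing and rounds away.
w = np.float16(1.0)
update = np.float16(1e-4)
w_new = np.float16(w + update)
update_lost = bool(w_new == w)

# Effect 2: accumulating many products in a narrow integer type overflows.
a = np.full(64, 100, dtype=np.int16)
b = np.full(64, 100, dtype=np.int16)

# Each product (10000) fits in int16, but the sum of 64 of them does not.
exact = int(np.sum(a.astype(np.int64) * b.astype(np.int64)))
narrow = int(np.sum(a * b, dtype=np.int16))  # int16 accumulator wraps around

print("update lost in binary16:", update_lost)
print("exact sum:", exact, "| int16 accumulator:", narrow)
```

Accumulating in a wider type (as in the `exact` line) avoids the wraparound, at the cost of wider storage for intermediate results.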


A deep ResNet trained on the CIFAR-10 dataset for 165 epochs


A two-layer perceptron trained for 400 iterations on the CIFAR-10 dataset


• Overflow is detrimental to deep neural network training;
• To prevent overflow, one could store intermediate results in higher precision and then truncate them, but this cancels all potential savings.

Our solution:
• monitor a recent history of the absolute scale of each tensor,
• use a sound statistical model to predict its trend,
• estimate the probability of overflow,
• preemptively adjust the scale to prevent overflow.
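
The four steps above can be sketched in Python as follows. This is a simplified illustration, not the authors' algorithm: the class name `ScalePredictor`, the linear trend fit on log2 of the maximum absolute value, the residual-based margin standing in for an overflow probability estimate, the one-bit headroom, and the 16-bit mantissa width are all assumptions made for the example.

```python
from collections import deque
import math

import numpy as np


def quantize(tensor, exponent, limit):
    """Scale by 2**-exponent, round to integers, and report overflow."""
    scaled = np.round(tensor / 2.0 ** exponent)
    overflowed = bool(np.any(np.abs(scaled) > limit))
    mantissas = np.clip(scaled, -limit, limit).astype(np.int16)
    return mantissas, overflowed


class ScalePredictor:
    """Hypothetical helper: picks a shared exponent for one tensor."""

    def __init__(self, mantissa_bits=16, history=8,
                 margin_sigmas=3.0, headroom_bits=1.0):
        self.limit = 2 ** (mantissa_bits - 1) - 1
        self.log_maxes = deque(maxlen=history)  # recent log2(max |x|)
        self.margin_sigmas = margin_sigmas
        self.headroom_bits = headroom_bits
        self.exponent = 0

    def observe(self, tensor):
        """Step 1: record the absolute scale of the latest tensor."""
        max_abs = float(np.max(np.abs(tensor)))
        if max_abs > 0:
            self.log_maxes.append(math.log2(max_abs))

    def predict_exponent(self):
        """Steps 2-4: forecast the next scale and set the exponent."""
        if not self.log_maxes:
            return self.exponent
        logs = np.array(self.log_maxes)
        if len(logs) >= 2:
            steps = np.arange(len(logs))
            slope, intercept = np.polyfit(steps, logs, 1)  # linear trend
            forecast = slope * len(logs) + intercept
            spread = float(np.std(logs - (slope * steps + intercept)))
        else:
            forecast, spread = float(logs[0]), 0.0
        # Margin above the forecast: a crude proxy for overflow probability.
        upper = forecast + self.margin_sigmas * spread + self.headroom_bits
        # Smallest exponent such that 2**upper / 2**exponent <= limit.
        self.exponent = math.ceil(upper - math.log2(self.limit))
        return self.exponent


# Usage: a tensor whose scale grows by 1.5x per step (synthetic data).
predictor = ScalePredictor()
base = np.array([1.0, -0.5, 0.25])
fixed_exponent = -13  # a scale chosen once and never adjusted
overflows = 0
fixed_overflows = 0
for step in range(20):
    tensor = base * 1.5 ** step
    exponent = predictor.predict_exponent()  # chosen before seeing the data
    overflows += quantize(tensor, exponent, predictor.limit)[1]
    fixed_overflows += quantize(tensor, fixed_exponent, predictor.limit)[1]
    predictor.observe(tensor)

print("overflows with predicted scale:", overflows)
print("overflows with fixed scale:", fixed_overflows)
```

The exponent is chosen before each tensor is seen, so the adjustment is preemptive rather than a reaction to an overflow that has already happened. On real training data the trend is noisier than in this synthetic run, which is where the margin term would matter.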

Köster et al. 2017

Wasserstein GAN (LSUN bedroom dataset), Köster et al. 2017
