Network Compression and Speedup
SHUOCHAO YAO, YIWEN XU, DANIEL CALZADA


Source: http://isca2016.eecs.umich.edu/wp-content/uploads/2016/07/4A-1.pdf

Why smaller models?

Operation              Energy [pJ]   Relative Cost
32-bit int ADD         0.1           1
32-bit float ADD       0.9           9
32-bit register file   1             10
32-bit int MULT        3.1           31
32-bit float MULT      3.7           37
32-bit SRAM cache      5             50
32-bit DRAM memory     640           6400

Source: http://isca2016.eecs.umich.edu/wp-content/uploads/2016/07/4A-1.pdf

Outline
Matrix Factorization
Weight Pruning
Quantization method
Pruning + Quantization + Encoding
Design small architecture: SqueezeNet

Outline
Matrix Factorization
◦ Singular Value Decomposition (SVD)
◦ Flattened Convolutions
Weight Pruning
Quantization method
Pruning + Quantization + Encoding
Design small architecture: SqueezeNet

Fully Connected Layers: Singular Value Decomposition
http://www.alglib.net/matrixops/general/i/svd1.gif

Singular Value Decomposition
http://www.alglib.net/matrixops/general/i/svd1.gif
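Since the decomposition slide itself is a figure, here is a minimal NumPy sketch (not from the slides) of how truncated SVD compresses a fully connected layer: keep only the top k singular values and store the layer as two thin factors.

```python
import numpy as np

def svd_compress(W, k):
    """Rank-k approximation of an m x n weight matrix W.

    Storage drops from m*n to k*(m + n) values, and the layer
    y = W @ x becomes two cheaper matmuls: y = U_k @ (V_k @ x).
    """
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    U_k = U[:, :k] * s[:k]   # fold singular values into the left factor
    V_k = Vt[:k, :]
    return U_k, V_k

rng = np.random.default_rng(0)
W = rng.standard_normal((1024, 512))
U_k, V_k = svd_compress(W, k=64)
x = rng.standard_normal(512)
err = np.linalg.norm(W @ x - U_k @ (V_k @ x)) / np.linalg.norm(W @ x)
print(f"rank-64 relative error: {err:.3f}")
```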

SVD: Compression
Gong, Yunchao, et al. "Compressing deep convolutional networks using vector quantization." arXiv preprint arXiv:1412.6115 (2014).

SVD: Compression Results
Trained on the ImageNet 2012 database, then compressed. Architecture: 5 convolutional layers, 3 fully connected layers, and a softmax output layer.
Denton, Emily L., et al. "Exploiting linear structure within convolutional networks for efficient evaluation." Advances in Neural Information Processing Systems. 2014.

SVD: Side Benefits
Denton, Emily L., et al. "Exploiting linear structure within convolutional networks for efficient evaluation." Advances in Neural Information Processing Systems. 2014.

Convolutions: Matrix Multiplication
Most time is spent in the convolutional layers.
http://stackoverflow.com/questions/15356153/how-do-convolution-matrices-work

Flattened Convolutions
Jin, Jonghoon, Aysegul Dundar, and Eugenio Culurciello. "Flattened convolutional neural networks for feedforward acceleration." arXiv preprint arXiv:1412.5474 (2014).
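A hedged NumPy/SciPy sketch of the core idea: a rank-1 filter separates into an outer product of 1D filters, so one K x K convolution becomes two 1D passes (roughly K*K vs. 2K multiplies per output pixel). The paper's 3D (channel) separation works the same way with a third 1D pass.

```python
import numpy as np
from scipy.signal import convolve2d

rng = np.random.default_rng(0)
v = rng.standard_normal((5, 1))   # vertical 1D filter
h = rng.standard_normal((1, 5))   # horizontal 1D filter
f = v @ h                         # the equivalent 5x5 rank-1 filter

img = rng.standard_normal((32, 32))
full = convolve2d(img, f, mode="valid")
flat = convolve2d(convolve2d(img, v, mode="valid"), h, mode="valid")
print(np.allclose(full, flat))    # True: same output, fewer multiplies
```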

Flattening = Matrix Factorization (MF)
Denton, Emily L., et al. "Exploiting linear structure within convolutional networks for efficient evaluation." Advances in Neural Information Processing Systems. 2014.

Flattening: Speedup Results
3 convolutional layers (5 x 5 filters) with 96, 128, and 256 channels; used stacks of 2 rank-1 convolutions.
Jin, Jonghoon, Aysegul Dundar, and Eugenio Culurciello. "Flattened convolutional neural networks for feedforward acceleration." arXiv preprint arXiv:1412.5474 (2014).

Outline
Matrix Factorization
Weight Pruning
◦ Magnitude-based method
  ◦ Iterative pruning + Retraining
  ◦ Pruning with rehabilitation
◦ Hessian-based method
Quantization method
Pruning + Quantization + Encoding
Design small architecture: SqueezeNet

Magnitude-based method: Iterative Pruning + Retraining
Han, Song, et al. "Learning both weights and connections for efficient neural network." NIPS. 2015.

Magnitude-based method: Iterative Pruning + Retraining (Algorithm)
Han, Song, et al. "Learning both weights and connections for efficient neural network." NIPS. 2015.
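The algorithm slide is a figure; a hedged NumPy sketch of the loop it describes (the training step is a stand-in, not from the paper):

```python
import numpy as np

def prune_by_magnitude(W, sparsity):
    """Zero the smallest-magnitude fraction of weights; return W and a mask."""
    k = int(W.size * sparsity)
    thresh = np.partition(np.abs(W).ravel(), k)[k]
    mask = (np.abs(W) >= thresh).astype(W.dtype)
    return W * mask, mask

# Iterative pruning + retraining (sketch):
#   for sparsity in (0.5, 0.7, 0.9):           # gradually increase sparsity
#       W, mask = prune_by_magnitude(W, sparsity)
#       for _ in range(retrain_steps):
#           W -= lr * grad_of_loss(W)           # hypothetical training step
#           W *= mask                           # pruned weights stay at zero
```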

Magnitude-based method: Iterative Pruning + Retraining (Experiment: AlexNet)

Layer   Weights   FLOP    Act%   Weights%   FLOP%
conv1   35K       211M    88%    84%        84%
conv2   307K      448M    52%    38%        33%
conv3   885K      299M    37%    35%        18%
conv4   663K      224M    40%    37%        14%
conv5   442K      150M    34%    37%        14%
fc1     38M       75M     36%    9%         3%
fc2     17M       34M     40%    9%         3%
fc3     4M        8M      100%   25%        10%
Total   61M       1.5B    54%    11%        30%

Han, Song, et al. "Learning both weights and connections for efficient neural network." NIPS. 2015.

Magnitude-based method: Iterative Pruning + Retraining (Experiment: Tradeoff)
Han, Song, et al. "Learning both weights and connections for efficient neural network." NIPS. 2015.

Pruning with rehabilitation: Dynamic Network Surgery (Motivation)
With one-shot pruning, pruned connections have no chance to come back, and incorrect pruning may cause severe accuracy loss. Dynamic Network Surgery therefore aims to:
◦ Avoid the risk of irretrievable network damage.
◦ Improve the learning efficiency.
Guo, Yiwen, et al. "Dynamic Network Surgery for Efficient DNNs." NIPS. 2016.

Pruning with rehabilitation: Dynamic Network Surgery (Formulation)
Guo, Yiwen, et al. "Dynamic Network Surgery for Efficient DNNs." NIPS. 2016.

Pruning with rehabilitation: Dynamic Network Surgery (Algorithm)
Guo, Yiwen, et al. "Dynamic Network Surgery for Efficient DNNs." NIPS. 2016.
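A hedged NumPy sketch of the surgery step: two thresholds (values here are illustrative) create a hysteresis band, and because the full-precision weights keep training, a pruned connection can grow back and be spliced in again.

```python
import numpy as np

def surgery(W, mask, t_low, t_high):
    """Update the pruning mask: prune weak weights, splice strong ones back.

    Requiring t_low < t_high leaves a band where the current mask
    decision is kept, avoiding oscillating prune/splice decisions.
    """
    mask = mask.copy()
    mask[np.abs(W) < t_low] = 0.0    # prune weak connections
    mask[np.abs(W) > t_high] = 1.0   # resurrect strong ones
    return mask

# Training sketch: forward/backward use W * mask, but the update is
# applied to the dense W, so "dead" weights can recover:
#   mask = surgery(W, mask, t_low=0.01, t_high=0.05)
#   W -= lr * grad_of_loss(W * mask)   # hypothetical gradient step
```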

Pruning with rehabilitation: Dynamic Network Surgery (Experiment on AlexNet)

Layer   Parameters   Weights% (Han et al. 2015)   Weights% (DNS)
conv1   35K          84%                          53.8%
conv2   307K         38%                          40.6%
conv3   885K         35%                          29.0%
conv4   664K         37%                          32.3%
conv5   443K         37%                          32.5%
fc1     38M          9%                           3.7%
fc2     17M          9%                           6.6%
fc3     4M           25%                          4.6%
Total   61M          11%                          5.7%

Guo, Yiwen, et al. "Dynamic Network Surgery for Efficient DNNs." NIPS. 2016.

Outline
Matrix Factorization
Weight Pruning
◦ Magnitude-based method
◦ Hessian-based method
  ◦ Diagonal Hessian-based method
  ◦ Full Hessian-based method
Quantization method
Pruning + Quantization + Encoding
Design small architecture: SqueezeNet

Diagonal Hessian-based method: Optimal Brain Damage
The idea of model compression & speedup traces back to 1990. OBD is theoretically more "optimal" than the current state of the art, but much more computationally inefficient.
Delete parameters with small "saliency".
◦ Saliency: the effect of deleting a parameter on the training error.
Proposes a theoretically justified saliency measure.

Diagonal Hessian-based method: Optimal Brain Damage (Formulation)
LeCun, Yann, et al. "Optimal brain damage." NIPS. Vol. 2. 1989.
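The formulation slide is an image; the paper's derivation, reconstructed here, expands the training error to second order and keeps only the diagonal Hessian terms:

```latex
\delta E \approx \underbrace{\sum_k g_k\,\delta u_k}_{=\,0 \text{ at a minimum}}
  + \tfrac{1}{2}\sum_k h_{kk}\,\delta u_k^{2}
  + \underbrace{\tfrac{1}{2}\sum_{j\neq k} h_{jk}\,\delta u_j\,\delta u_k}_{\text{dropped (diagonal approx.)}}
\qquad\Rightarrow\qquad
s_k = \tfrac{1}{2}\, h_{kk}\, u_k^{2}
```

Here the u_k are the parameters, g_k = ∂E/∂u_k, and h_kk are diagonal Hessian entries; s_k is the saliency of deleting u_k (setting δu_k = -u_k).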

Diagonal Hessian-based method: Optimal Brain Damage (Algorithm)
LeCun, Yann, et al. "Optimal brain damage." NIPS. Vol. 2. 1989.
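The algorithm slide is a figure; as a minimal sketch, ranking parameters by OBD saliency given a (precomputed) diagonal Hessian:

```python
import numpy as np

def obd_saliency(w, h_diag):
    """OBD saliency s_k = 0.5 * h_kk * w_k^2 (diagonal Hessian approximation)."""
    return 0.5 * h_diag * w**2

w = np.array([0.9, -0.05, 0.4, 0.01])   # parameters (illustrative values)
h = np.array([1.0, 2.0, 0.5, 3.0])      # diagonal Hessian estimates
prune_order = np.argsort(obd_saliency(w, h))
print(prune_order)                      # lowest-saliency weights first
```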

Diagonal Hessian-based method: Optimal Brain Damage (Experiment: OBD vs. Magnitude)
Deletion based on saliency performs better than deletion based on magnitude.
LeCun, Yann, et al. "Optimal brain damage." NIPS. Vol. 2. 1989.

Diagonal Hessian-based method: Optimal Brain Damage (Experiment: Retraining)
How does retraining help? Comparison: without retraining vs. with retraining.
LeCun, Yann, et al. "Optimal brain damage." NIPS. Vol. 2. 1989.

Full Hessian-based method: Optimal Brain Surgeon
Motivation:
◦ A more accurate estimation of saliency.
◦ Optimal weight updates.
Advantages:
◦ More accurate saliency estimation.
◦ Directly provides the weight update that minimizes the (quadratic approximation of the) change in the objective function.
Disadvantages:
◦ More computation than OBD.
◦ Weight updates are derived from the local quadratic approximation, not from directly minimizing the objective function.
Hassibi, Babak, and David G. Stork. "Second order derivatives for network pruning: Optimal brain surgeon." NIPS, 1993.

Full Hessian-based method: Optimal Brain Surgeon (Formulation)
Hassibi, Babak, and David G. Stork. "Second order derivatives for network pruning: Optimal brain surgeon." NIPS, 1993.
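The formulation slide is an image; the paper's two key equations, the saliency of weight w_q and the compensating update to all remaining weights, are:

```latex
L_q = \frac{w_q^{2}}{2\,[\mathbf{H}^{-1}]_{qq}},
\qquad
\delta\mathbf{w} = -\,\frac{w_q}{[\mathbf{H}^{-1}]_{qq}}\;\mathbf{H}^{-1}\mathbf{e}_q
```

where H is the full Hessian of the error and e_q is the unit vector selecting weight q; the update drives w_q exactly to zero while minimizing the quadratic increase in error.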

Full Hessian-based method: Optimal Brain Surgeon (Algorithm)
Hassibi, Babak, and David G. Stork. "Second order derivatives for network pruning: Optimal brain surgeon." NIPS, 1993.
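A hedged NumPy sketch of one OBS step, assuming the inverse Hessian is already available (computing it is the expensive part the slides flag as OBS's disadvantage):

```python
import numpy as np

def obs_prune_one(w, H_inv):
    """Prune the weight with the smallest saliency L_q = w_q^2 / (2 [H^-1]_qq)
    and apply the compensating update to the remaining weights."""
    saliency = w**2 / (2 * np.diag(H_inv))
    q = int(np.argmin(saliency))
    dw = -(w[q] / H_inv[q, q]) * H_inv[:, q]   # H^-1 e_q is column q
    return w + dw, q                            # w[q] becomes exactly 0

w = np.array([1.0, 0.1, -0.5])
H = np.array([[2.0, 0.2, 0.0],
              [0.2, 1.0, 0.1],
              [0.0, 0.1, 3.0]])
w_new, q = obs_prune_one(w, np.linalg.inv(H))
print(q, np.round(w_new, 4))   # pruned index and compensated weights
```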

Full Hessian-based method: Optimal Brain Surgeon
Hassibi, Babak, and David G. Stork. "Second order derivatives for network pruning: Optimal brain surgeon." NIPS, 1993.

Outline
Matrix Factorization
Weight Pruning
Quantization method
◦ Full Quantization
  ◦ Fixed-point format
  ◦ Code book
◦ Quantization with full-precision copy
Pruning + Quantization + Encoding
Design small architecture: SqueezeNet

Full Quantization: Fixed-point format
Gupta, Suyog, et al. "Deep Learning with Limited Numerical Precision." ICML. 2015.

Full Quantization: Fixed-point format (Rounding Modes)
Gupta, Suyog, et al. "Deep Learning with Limited Numerical Precision." ICML. 2015.
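The rounding-mode slide is a figure; here is a hedged NumPy sketch of the two modes from the paper for a fixed-point format <IL, FL> with resolution eps = 2^-FL (saturation to the <IL, FL> range is omitted for brevity):

```python
import numpy as np

rng = np.random.default_rng(0)

def round_nearest(x, fl):
    """Round-to-nearest in steps of 2^-fl."""
    eps = 2.0 ** -fl
    return np.round(x / eps) * eps

def round_stochastic(x, fl):
    """Round up with probability proportional to the remainder,
    so rounding is unbiased: E[round(x)] = x."""
    eps = 2.0 ** -fl
    floor = np.floor(x / eps) * eps
    prob_up = (x - floor) / eps
    return floor + eps * (rng.random(np.shape(x)) < prob_up)

x = np.full(100_000, 0.3)
print(round_nearest(x, 2)[0])          # always 0.25
print(round_stochastic(x, 2).mean())   # ~0.3 on average
```

Unbiasedness is what lets training survive very low precision: small gradient updates are not systematically rounded away.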

Multiply and accumulate (MACC) operation
Gupta, Suyog, et al. "Deep Learning with Limited Numerical Precision." ICML. 2015.

Full Quantization: Fixed-point format (Experiment on MNIST with CNNs)
Gupta, Suyog, et al. "Deep Learning with Limited Numerical Precision." ICML. 2015.

Full Quantization: Fixed-point format (Experiment on CIFAR-10 with fully connected DNNs)
Gupta, Suyog, et al. "Deep Learning with Limited Numerical Precision." ICML. 2015.

Full Quantization: Code book
Gong, Yunchao, et al. "Compressing deep convolutional networks using vector quantization." arXiv preprint arXiv:1412.6115 (2014).
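A hedged SciPy sketch of scalar codebook quantization (the simplest of the paper's variants): cluster the weights with k-means, then store k float centroids plus a log2(k)-bit index per weight.

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def codebook_quantize(W, k):
    """k-means codebook quantization of a weight matrix."""
    centroids, labels = kmeans2(W.reshape(-1, 1), k, minit="++")
    W_q = centroids[labels].reshape(W.shape)
    return W_q, centroids, labels

rng = np.random.default_rng(0)
W = rng.standard_normal((256, 256))
W_q, centroids, labels = codebook_quantize(W, k=16)   # 4-bit indices
print(np.abs(W - W_q).mean())                         # quantization error
```

Product quantization (PQ), as in the slide that follows, applies the same idea to sub-vectors of each row instead of individual scalars.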

Full Quantization: Code book (Experiment on PQ)

Full Quantization: Code book
Gong, Yunchao, et al. "Compressing deep convolutional networks using vector quantization." arXiv preprint arXiv:1412.6115 (2014).

Outline
Matrix Factorization
Weight Pruning
Quantization method
◦ Full quantization
◦ Quantization with full-precision copy
  ◦ BinaryConnect
  ◦ BNN
Pruning + Quantization + Encoding
Design small architecture: SqueezeNet

Quantization with full-precision copy: BinaryConnect (Motivation)
Use only two possible values (e.g., +1 or -1) for the weights.
Replaces many multiply-accumulate operations with simple accumulations: fixed-point adders are much less expensive, in both area and energy, than fixed-point multiply-accumulators.
Courbariaux, et al. "BinaryConnect: Training deep neural networks with binary weights during propagations." NIPS. 2015.

Quantization with full-precision copy: BinaryConnect (Binarization)
Courbariaux, et al. "BinaryConnect: Training deep neural networks with binary weights during propagations." NIPS. 2015.
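The binarization slide is a figure; the paper's two binarization modes, sketched in NumPy:

```python
import numpy as np

rng = np.random.default_rng(0)

def binarize_det(w):
    """Deterministic: w_b = +1 if w >= 0, else -1."""
    return np.where(w >= 0, 1.0, -1.0)

def binarize_stoch(w):
    """Stochastic: P(w_b = +1) = hard_sigmoid(w) = clip((w + 1) / 2, 0, 1)."""
    p = np.clip((w + 1.0) / 2.0, 0.0, 1.0)
    return np.where(rng.random(w.shape) < p, 1.0, -1.0)
```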

Quantization with full-precision copy: BinaryConnect
1. Given the DNN input, compute the unit activations layer by layer, up to the top layer, which is the output of the DNN. This step is referred to as the forward propagation.
2. Given the DNN target, compute the training objective's gradient w.r.t. each layer's activations, starting from the top layer and going down layer by layer until the first hidden layer. This step is referred to as the backward propagation, or backward phase of back-propagation.
3. Compute the gradient w.r.t. each layer's parameters, then update the parameters using their computed gradients and their previous values. This step is referred to as the parameter update.
Courbariaux, et al. "BinaryConnect: Training deep neural networks with binary weights during propagations." NIPS. 2015.

Quantization with full-precision copy: BinaryConnect
BinaryConnect binarizes the weights only during the forward and backward propagations (steps 1 and 2), not during the parameter update (step 3).
Courbariaux, et al. "BinaryConnect: Training deep neural networks with binary weights during propagations." NIPS. 2015.

Quantization with full-precision copy: BinaryConnect
1. Binarize the weights and perform the forward pass.
2. Back-propagate the gradient based on the binarized weights.
3. Update the full-precision weights.
4. Return to step 1.
Courbariaux, et al. "BinaryConnect: Training deep neural networks with binary weights during propagations." NIPS. 2015.
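A hedged sketch of one such training step (the forward_backward callback is a stand-in for steps 1-2, not a real API). The point is that gradients are computed with binary weights but applied to the full-precision copy, which accumulates the small updates that binarization alone would erase.

```python
import numpy as np

def binaryconnect_step(w_full, x, y, lr, forward_backward):
    """One BinaryConnect update (sketch)."""
    w_bin = np.where(w_full >= 0, 1.0, -1.0)    # step 1: binarize
    grad = forward_backward(w_bin, x, y)        # steps 1-2 (hypothetical fn)
    w_full = w_full - lr * grad                 # step 3: update real weights
    return np.clip(w_full, -1.0, 1.0)           # keep weights in [-1, 1]
```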

Quantization with full-precision copy: BinaryConnect
Courbariaux, et al. "BinaryConnect: Training deep neural networks with binary weights during propagations." NIPS. 2015.

Quantization with full-precision copy: Binarized Neural Networks (Motivation)
Neural networks with both binary weights and binary activations, at run-time and when computing the parameters' gradients at train time.
Courbariaux, Matthieu, et al. "Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or -1." arXiv preprint arXiv:1602.02830 (2016).
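BNN binarizes activations with sign() and propagates gradients through this non-differentiable function using the straight-through estimator described in the paper; a minimal NumPy sketch (function names are mine):

```python
import numpy as np

def sign_forward(x):
    """Binary activation: +1 or -1."""
    return np.where(x >= 0, 1.0, -1.0)

def sign_backward(x, grad_out):
    """Straight-through estimator: pass the gradient through sign()
    unchanged, but cancel it where |x| > 1 (saturated units)."""
    return grad_out * (np.abs(x) <= 1.0)
```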

Quantization with full-precision copy: Binarized Neural Networks
Courbariaux, Matthieu, et al. "Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or -1." arXiv preprint arXiv:1602.02830 (2016).

Outline
Matrix Factorization
Weight Pruning
Quantization method
Pruning + Quantization + Encoding
◦ Deep Compression
Design small architecture: SqueezeNet

Pruning + Quantization + Encoding: Deep Compression
Han, Song, Huizi Mao, and William J. Dally. "Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding." arXiv preprint arXiv:1510.00149 (2015).

Pruning + Quantization + Encoding: Deep Compression
1. Choose a neural network architecture.
2. Train the network until a reasonable solution is obtained.
3. Prune the network with the magnitude-based method until a reasonable solution is obtained.
4. Quantize the weights with the k-means-based method until a reasonable solution is obtained.
5. Further compress the network with Huffman coding.
Han, Song, Huizi Mao, and William J. Dally. "Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding." arXiv preprint arXiv:1510.00149 (2015).
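A hedged sketch of step 5: building a Huffman code over the k-means cluster indices, so frequent indices get short codes and rare ones long codes (the index values below are illustrative).

```python
import heapq
from collections import Counter

def huffman_code(symbols):
    """Return a {symbol: bitstring} Huffman code for a symbol sequence."""
    counts = Counter(symbols)
    if len(counts) == 1:                       # degenerate single-symbol case
        return {next(iter(counts)): "0"}
    # Heap entries: (frequency, tiebreaker, {symbol: code-so-far})
    heap = [(n, i, {s: ""}) for i, (s, n) in enumerate(counts.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        n1, _, c1 = heapq.heappop(heap)        # two least-frequent subtrees
        n2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (n1 + n2, tie, merged))
        tie += 1
    return heap[0][2]

indices = [3, 3, 3, 7, 7, 1, 3, 0, 3, 7]       # e.g. 4-bit cluster indices
code = huffman_code(indices)
bits = sum(len(code[s]) for s in indices)
print(code, bits, "bits vs", 4 * len(indices), "bits fixed-width")
```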


Outline
Matrix Factorization
Weight Pruning
Quantization method
Pruning + Quantization + Encoding
Design small architecture: SqueezeNet

Design small architecture: SqueezeNet
Compression schemes on a pre-trained model VS designing a small CNN architecture from scratch (can it also preserve accuracy?)

SqueezeNet Design Strategies
Strategy 1. Replace 3x3 filters with 1x1 filters.
◦ Parameters per filter: (3x3 filter) = 9 * (1x1 filter)
Strategy 2. Decrease the number of input channels to 3x3 filters.
◦ Total # of parameters: (# of input channels) * (# of filters) * (# of parameters per filter)
Strategy 3. Downsample late in the network so that convolution layers have large activation maps.
◦ Size of activation maps is determined by the size of the input data and by the choice of layers in which to downsample in the CNN architecture.
Iandola, Forrest N., et al. "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size."

Microarchitecture – Fire Module
A Fire module consists of:
◦ A squeeze convolution layer: only 1x1 filters (s1x1 of them).
◦ An expand layer: a mixture of e1x1 1x1 filters and e3x3 3x3 filters.
Iandola, Forrest N., et al. "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size."
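A hedged PyTorch sketch of the Fire module (the paper defines the structure; this exact code and its names are mine):

```python
import torch
import torch.nn as nn

class Fire(nn.Module):
    """SqueezeNet Fire module: squeeze with 1x1 filters, then expand
    with a mix of 1x1 and 3x3 filters, concatenated along channels."""
    def __init__(self, in_ch, s1x1, e1x1, e3x3):
        super().__init__()
        self.squeeze = nn.Conv2d(in_ch, s1x1, kernel_size=1)
        self.expand1 = nn.Conv2d(s1x1, e1x1, kernel_size=1)
        self.expand3 = nn.Conv2d(s1x1, e3x3, kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.relu(self.squeeze(x))
        return torch.cat([self.relu(self.expand1(x)),
                          self.relu(self.expand3(x))], dim=1)

# fire2 in the paper: 96 channels in -> squeeze to 16 -> expand 64 + 64
y = Fire(96, 16, 64, 64)(torch.randn(1, 96, 55, 55))
print(y.shape)   # torch.Size([1, 128, 55, 55])
```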

Microarchitecture – Fire Module
Strategy 2. Decrease the number of input channels to 3x3 filters.
◦ Total # of parameters: (# of input channels) * (# of filters) * (# of parameters per filter)
◦ Squeeze layer: how much can we limit s1x1? Setting s1x1 < (e1x1 + e3x3) limits the # of input channels to the 3x3 filters.
Strategy 1. Replace 3x3 filters with 1x1 filters.
◦ Parameters per filter: (3x3 filter) = 9 * (1x1 filter)
◦ How much can we replace 3x3 with 1x1? (e1x1 vs. e3x3?)
Iandola, Forrest N., et al. "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size."

Parameters in Fire Module
The # of expand filters (ei): ei = ei,1x1 + ei,3x3
The % of 3x3 filters in the expand layer (pct3x3): ei,3x3 = pct3x3 * ei
The squeeze ratio (SR): si,1x1 = SR * ei
Iandola, Forrest N., et al. "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size."
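To make these knobs concrete, a small hedged calculation of one Fire module's weight count as a function of SR and pct3x3 (biases omitted; the numbers are illustrative):

```python
def fire_params(in_ch, e, sr, pct3x3):
    """Weights in one Fire module (biases omitted)."""
    s1x1 = int(sr * e)          # squeeze filters
    e3x3 = int(pct3x3 * e)      # 3x3 expand filters
    e1x1 = e - e3x3             # 1x1 expand filters
    squeeze = in_ch * s1x1 * 1 * 1
    expand = s1x1 * e1x1 * 1 * 1 + s1x1 * e3x3 * 3 * 3
    return squeeze + expand

# Smaller SR shrinks the module; a larger squeeze layer grows it:
print(fire_params(96, 128, sr=0.125, pct3x3=0.5))   # 11776
print(fire_params(96, 128, sr=0.75,  pct3x3=0.5))   # 64512
```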

Macroarchitecture
Strategy 3. Downsample late in the network so that convolution layers have large activation maps.
◦ Size of activation maps is determined by the size of the input data and the choice of layers in which to downsample.
◦ Placing pooling layers relatively late keeps activation maps large through most of the network, which helps preserve accuracy.
Iandola, Forrest N., et al. "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size."

Macroarchitecture
Iandola, Forrest N., et al. "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size."

Evaluation of Results
Iandola, Forrest N., et al. "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size."

Further Compression on 4.8 MB?
Further compression: Deep Compression + quantization.
Iandola, Forrest N., et al. "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size."

Takeaway Points
Compress pre-trained networks:
• On a single layer:
  • Fully connected layer: SVD
  • Convolutional layer: flattened convolutions
• Weight pruning:
  • Magnitude-based pruning is simple and effective, making it the first choice for weight pruning.
  • Retraining is important for model compression.
• Weight quantization with a full-precision copy can prevent vanishing gradients.
• Weight pruning, quantization, and encoding are independent; all three can be combined for a better compression ratio.
Design a smaller CNN architecture:
• Example: SqueezeNet, which uses the Fire module and delays pooling to later stages.

Reading List
• Denton, Emily L., et al. "Exploiting linear structure within convolutional networks for efficient evaluation." Advances in Neural Information Processing Systems. 2014.
• Jin, Jonghoon, Aysegul Dundar, and Eugenio Culurciello. "Flattened convolutional neural networks for feedforward acceleration." arXiv preprint arXiv:1412.5474 (2014).
• Gong, Yunchao, et al. "Compressing deep convolutional networks using vector quantization." arXiv preprint arXiv:1412.6115 (2014).
• Han, Song, et al. "Learning both weights and connections for efficient neural network." Advances in Neural Information Processing Systems. 2015.
• Guo, Yiwen, Anbang Yao, and Yurong Chen. "Dynamic Network Surgery for Efficient DNNs." Advances in Neural Information Processing Systems. 2016.
• Gupta, Suyog, et al. "Deep Learning with Limited Numerical Precision." ICML. 2015.
• Courbariaux, Matthieu, Yoshua Bengio, and Jean-Pierre David. "BinaryConnect: Training deep neural networks with binary weights during propagations." Advances in Neural Information Processing Systems. 2015.
• Courbariaux, Matthieu, et al. "Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or -1." arXiv preprint arXiv:1602.02830 (2016).
• Han, Song, Huizi Mao, and William J. Dally. "Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding." arXiv preprint arXiv:1510.00149 (2015).
• Iandola, Forrest N., et al. "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size." arXiv preprint arXiv:1602.07360 (2016).