A Specialized Arithmetic Block for FPGA-Based Acceleration of CNNs Alan Mishchenko October 17, 2020
Neural Networks: LeNet-5 (1998)
Xilinx DSP Block
This block implements Out = Out + a * b. It is overdesigned for use in HW accelerators, which perform only K x K-bit multiplication where K ≤ 8. However, there is a work-around: https://www.xilinx.com/support/documentation/white_papers/wp486-deep-learning-int8.pdf
Xilinx DSP Block in 2x Mode
Implementing Out1 = Out1 + a * c and Out2 = Out2 + b * c in one cycle
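The 2x mode can be illustrated with a small arithmetic sketch: two narrow operands are packed into one wide operand, multiplied once by the shared operand, and the two products are then separated. The 18-bit offset, the signed 8-bit operand range, and the function name `packed_multiply` are assumptions made for this illustration; they are not taken from the slide.

```python
SHIFT = 18  # assumed bit offset between the two packed operands


def packed_multiply(a, b, c):
    """Return (a*c, b*c) using a single wide multiplication."""
    for v in (a, b, c):
        assert -128 <= v <= 127, "operands are assumed to be signed 8-bit"
    wide = (a << SHIFT) + b          # pack a into the upper bits, b into the lower bits
    product = wide * c               # one multiplication: a*c*2^SHIFT + b*c
    half = 1 << (SHIFT - 1)
    # The low SHIFT bits hold b*c as a signed value.
    low = ((product + half) % (1 << SHIFT)) - half
    # Removing the low part leaves a*c scaled by 2^SHIFT.
    high = (product - low) >> SHIFT
    return high, low


print(packed_multiply(3, -5, 7))
```

The separation step works because |b * c| stays below 2^17, so the lower product never overlaps the upper one.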
Motivation
- The goal is to have faster FPGA-based acceleration
  - To have faster acceleration, we need more DSPs
  - To create more DSPs from LUTs, we need smaller data bit-width
  - To have smaller data bit-width, we need to improve quantization
- We consider only post-training quantization
  - For both weights (W) and activations (A)
- Traditionally, 8-bit quantization (8W/8A) is widely used
- The best results are
  - 4W/8A (1% loss) (https://arxiv.org/pdf/1912.09356.pdf)
  - 4W/4A (3% loss) (https://arxiv.org/pdf/1911.07190v2.pdf)
- Contributions of this presentation
  - Demonstrate experimentally that 4W/4A is possible
  - Describe a method to quantize for 4W/4A
  - Design a new hardware DSP module for 4W/4A
An Acceleration Case Study
- CNN: Resnet-50 trained on ImageNet (https://arxiv.org/abs/1512.03385)
  - 224x224 image size, 53 convo layers, 8 GOP (4B MACC), 25M weights
- FPGA platform: Ultra96-V2 board (https://www.avnet.com) with Xilinx Zynq UltraScale+ ZU3EG
  - 216 4KB BRAMs (~0.86 MB), 360 DSPs, 70K LUTs, 140K FFs
  - We use ~30K LUTs for the accelerator and ~30K LUTs for extra 40 DSPs
- With 400 DSPs (360 built-in + 40 LUT-based) @ 200 MHz, we get
  - 400 DSP * 200M / 4B MACC = 20 FPS
  - If we use 2x clock frequency in DSPs, we get 40 FPS
  - If we use 2x bit-width trick in DSPs, we get 80 FPS
  - If we synthesize additional 800 DSPs, we get 160 FPS
- We can build more LUT-based DSPs if we reduce data bit-width
  - The limit is: no more than 30,000 LUTs / 800 DSPs = 37.5 LUTs/DSP
  - We may be able to achieve this using 4W/4A quantization!
- 160 FPS for Resnet-50 is 160 FPS * 8 GOP = 1,280 GOP/sec (INT4)
- Assuming power consumption is 2 W, we get 640 GOP/sec/W
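The throughput arithmetic of this case study can be reproduced with a few lines. This is a minimal sketch that assumes one MACC per DSP per cycle in the baseline; the function and variable names are hypothetical, and the 160 FPS figure is taken from the slide rather than derived.

```python
def frames_per_second(num_dsps, clock_hz, macc_per_frame, macc_per_dsp_per_cycle=1):
    """Peak frame rate if every DSP is busy in every cycle."""
    return num_dsps * clock_hz * macc_per_dsp_per_cycle / macc_per_frame


MACC_PER_FRAME = 4_000_000_000  # Resnet-50: about 4B MACC per image

base_fps = frames_per_second(400, 200_000_000, MACC_PER_FRAME)
double_clock_fps = frames_per_second(400, 400_000_000, MACC_PER_FRAME)
packed_fps = frames_per_second(400, 400_000_000, MACC_PER_FRAME,
                               macc_per_dsp_per_cycle=2)

lut_budget_per_dsp = 30_000 / 800       # LUTs available per LUT-based DSP

target_fps = 160                        # figure stated on the slide
gop_per_sec = target_fps * 8            # 8 GOP per image
gop_per_sec_per_watt = gop_per_sec / 2  # assuming 2 W power consumption

print(base_fps, double_clock_fps, packed_fps)
print(lut_budget_per_dsp, gop_per_sec, gop_per_sec_per_watt)
```

These are peak numbers: they ignore memory bandwidth and any cycles in which the DSPs are idle.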
Experiment to Confirm 4W/4A
Single-precision floating-point number (32 bits)
For each weight and activation, set mantissa bits 0, 1, 2, … 18 to zero.
Can we have good accuracy? If yes, then 4W/4A is feasible!
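The experiment can be sketched as a bit mask applied to each float32 value. A single-precision number has 23 stored mantissa bits, so zeroing bits 0 to 18 leaves the 4 most significant ones. The helper name `truncate_mantissa` is hypothetical, and the sketch truncates rather than rounds, which is one possible reading of the slide.

```python
import struct


def truncate_mantissa(x, zeroed_bits=19):
    """Zero the lowest `zeroed_bits` mantissa bits of x stored as float32."""
    # Reinterpret the float32 bit pattern as an unsigned 32-bit integer.
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    mask = 0xFFFFFFFF & ~((1 << zeroed_bits) - 1)
    # Sign and exponent are untouched; only low mantissa bits are cleared.
    (y,) = struct.unpack("<f", struct.pack("<I", bits & mask))
    return y


print(truncate_mantissa(1.0625))   # 1 + 2^-4 uses mantissa bit 19, which is kept
print(truncate_mantissa(1.03125))  # 1 + 2^-5 uses mantissa bit 18, which is cleared
```

Applying this to every weight and activation of a trained network and measuring accuracy gives a quick indication of whether 4 mantissa bits are enough.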
Quantization to 4 bits
- We begin by balancing activations
  - This allows us to represent activations with mantissa only
- Next, we balance weights channel-wise
- Normally, we round each float value to the nearest integer
- However, for better quantization, we need to choose the rounding direction to minimize the error at the layer output
  - This can be done by a greedy algorithm, which is faster than the one presented in https://arxiv.org/pdf/1912.09356.pdf
- In the end, we have weights of each channel represented by 4-bit mantissa and 4-bit shared exponent
  - The storage requirements are 4*N+3, where N is the number of weights and 3 stands for (1) shared exponent, (2) bias exponent, (3) bias mantissa
- We need only a 4x4 multiplier!
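The per-channel representation (a 4-bit mantissa per weight plus one shared exponent) can be sketched as follows. This sketch uses plain round-to-nearest, not the greedy rounding-direction algorithm mentioned above, whose details are not given in the slides. The symmetric mantissa range of -7..7, the function names, and the unbounded exponent are assumptions for illustration.

```python
def quantize_channel(weights, mantissa_max=7):
    """Return (mantissas, exponent) with weight ~= mantissa * 2**exponent."""
    max_abs = max((abs(w) for w in weights), default=0.0)
    if max_abs == 0.0:
        return [0 for _ in weights], 0
    # Find the smallest exponent whose range still covers the largest weight.
    exponent = 0
    while max_abs > mantissa_max * 2.0 ** exponent:
        exponent += 1
    while max_abs <= mantissa_max * 2.0 ** (exponent - 1):
        exponent -= 1
    scale = 2.0 ** exponent
    # Round to nearest and clamp to the signed 4-bit mantissa range.
    mantissas = [max(-mantissa_max, min(mantissa_max, round(w / scale)))
                 for w in weights]
    return mantissas, exponent


def dequantize_channel(mantissas, exponent):
    """Reconstruct approximate float weights from the quantized form."""
    return [m * 2.0 ** exponent for m in mantissas]


mantissas, exponent = quantize_channel([0.5, -0.25, 0.875, 0.0])
print(mantissas, exponent)
print(dequantize_channel(mantissas, exponent))
```

With round-to-nearest, each weight is off by at most half of 2**exponent; a method that picks the rounding direction by the error at the layer output may accept a larger per-weight error in exchange for a smaller output error.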
Quantization
- Quantization turns floating-point computations into fixed-point ones: https://sahnimanas.github.io/post/quantization-in-tflite/
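Proposed DSP Module
[Block diagram: the 4-bit activation A and the 4-bit W mantissa enter a Multiplier, which produces an 8-bit product. A Shifter, controlled by the 4-bit W exponent (shared for one channel), produces a 16-bit value. An Adder adds it to the 16-bit Out Register (Accumulator). Annotations of ~20 LUT6 appear at the adder/shifter stage and at the multiplier.]
- This module performs Out = Out + A * W
  - Bias B is added by setting A = 1 and W = B
  - Initial value V (operation "add to") is added by setting A = V and W = 1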
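A behavioral sketch of this multiply-shift-accumulate datapath is given below. The slides do not specify signedness, shift direction, or overflow behavior, so the signed 4-bit operands, the left shift, and the two's-complement wrap-around of the 16-bit accumulator are assumptions, as are the names `wrap16`, `mac_step`, and `dot_product`.

```python
def wrap16(x):
    """Wrap an integer to signed 16-bit two's complement (assumed overflow rule)."""
    return ((x + (1 << 15)) % (1 << 16)) - (1 << 15)


def mac_step(out, a, w_mantissa, w_exponent):
    """One cycle of Out = Out + A * W, with W = w_mantissa * 2**w_exponent."""
    assert -8 <= a <= 7 and -8 <= w_mantissa <= 7, "assumed signed 4-bit inputs"
    assert 0 <= w_exponent <= 15, "assumed unsigned 4-bit exponent"
    product = a * w_mantissa         # 4x4 multiplier, result fits in 8 bits
    shifted = product << w_exponent  # shifter driven by the shared exponent
    return wrap16(out + shifted)     # 16-bit adder and accumulator register


def dot_product(activations, mantissas, w_exponent, bias_mantissa=0, bias_exponent=0):
    """Accumulate one output channel, then add the bias by setting A = 1, W = B."""
    out = 0
    for a, m in zip(activations, mantissas):
        out = mac_step(out, a, m, w_exponent)
    return mac_step(out, 1, bias_mantissa, bias_exponent)


print(dot_product([1, 2, 3], [4, -2, 7], w_exponent=1, bias_mantissa=5))
```

The bias is fed through the same datapath as an ordinary product, which is why it needs its own mantissa and exponent in storage.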
Conclusion
- Demonstrated experimentally that 4W/4A is possible
- Outlined a method to quantize for this bit-width
- Designed a hardware module to match this quantization
Resnet-50 Trained on ImageNet
Network "proto/resnet050.prototxt", printed by nncomp version 200522 on Sat May 23 06:19:36 2020
[Per-layer table, layers -1 to 33. Columns: N, F0, F1, Type, Chan, K, G, P, R, Size, Addr, Param, Activ, Macc, Cycle]
Resnet-50 Trained on ImageNet
[Per-layer table, layers 34 to 56. Columns: N, F0, F1, Type, Chan, K, G, P, R, Size, Addr, Param, Activ, Macc, Cycle]
Totals: 53 conv layers; Param 25530472 (97.39 MB); Activ 10941421 (41.74 MB); Macc 3855925248; Cycle 15062208; 7.71 GOPS; 13.3 FPS
Yolo-Nano Trained on COCO
Network "cnn/yolo-nano-coco.cnn", printed by nncomp version 200814 on Tue Aug 18 16:38:22 2020
[Per-layer table, layers -1 to 33. Columns: N, F0, F1, Type, Chan, K, G, P, R, S, Size, Addr, Param, Activ, Macc, Cycle]
Yolo-Nano Trained on COCO
[Per-layer table, layers 34 to 67. Columns: N, F0, F1, Type, Chan, K, G, P, R, Size, Addr, Param, Activ, Macc, Cycle]
Totals: 59 conv layers; Param 864190 (3.30 MB); Activ 8042700 (30.68 MB); Macc 274726400; Cycle 1073148; 0.55 GOP; 186.4 FPS