A Versatile Software Systolic Execution Model for GPU Memory-bound Kernels

1. Tokyo Institute of Technology, Dept. of Mathematical and Computing Science, Tokyo, Japan
2. National Institute of Advanced Industrial Science and Technology, Tokyo, Japan
3. AIST-Tokyo Tech Real World Big-Data Computation Open Innovation Laboratory, National Institute of Advanced Industrial Science and Technology
4. RIKEN Center for Computational Science, Hyogo, Japan

Background
• Post-Moore era: specialization
• Memory-bound problems [1, 2]
• Solutions:
  • Reducing the required memory bandwidth
  • Improving data locality
  • Efficient inter-thread communication
[Figure: Moore's-era computing systems (reliable memory, exact compute) vs. post-Moore computing systems (specialization: low/mixed precision, reconfigurable hardware, low-power/unreliable memory, inexact compute, neuromorphic computing, quantum computing)]

[1] M. Wahib and N. Maruyama, "Scalable Kernel Fusion for Memory-Bound GPU Applications," SC '14: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, New Orleans, LA, 2014, pp. 191-202.
[2] J. Domke et al., "Double-Precision FPUs in High-Performance Computing: An Embarrassment of Riches?," 2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS), Rio de Janeiro, Brazil, 2019, pp. 78-88.

Memory-bound kernels on GPUs
• The memory-bound problem on GPUs: GPUs have a very high flops-to-bytes ratio
• Typical memory-bound kernels:
  • Scan operator
  • Summed Area Table (SAT) [1]
  • Convolution
  • Stencils
[Figure: flops-to-bytes ratio of GPUs in single precision]

[1] P. Chen, M. Wahib, S. Takizawa, R. Takano and S. Matsuoka, "Efficient Algorithms for the Summed Area Tables Primitive on GPUs," 2018 IEEE International Conference on Cluster Computing (CLUSTER), Belfast, 2018, pp. 482-493.

Motivation
• Systolic array architecture
  • Heavily researched parallel computer architecture
  • Simple yet effective for optimizing applications
  • Outstanding performance with limited memory bandwidth: high FLOPS, high FLOPS/watt
• A revived architecture
  • Driven by compute-intensive workloads
  • Google TPU, Nvidia Tensor Core
• Should we embrace systolic arrays? Specialization
• Innovate: build a software systolic array model (SSAM)
• Why is SSAM needed?
  • Programmability, generality
  • High performance
[Figure: Nvidia Tensor Core; Google TPU]

Problem statement
• How to mimic the behavior of a systolic array?
  • Understanding memory-bound kernels
  • Understanding hardware systolic arrays
• How to achieve high performance with a software systolic array? How to quantify the performance advantage?
• How to take advantage of the CUDA architecture? (see the shuffle sketch below)
  • Register cache & shared memory cache
  • In-register partial-sums computation
  • Coalesced access to global memory
  • Shuffle intrinsics for intra-warp communication
• Note: SSAM is not limited to Nvidia GPUs
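
As context for the shuffle bullet above, a minimal sketch (ours, not the paper's code) of the intra-warp communication primitive SSAM builds on: each thread caches one element in a register, and a neighbor's register is read with `__shfl_up_sync`.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each lane holds one value in a register; __shfl_up_sync pulls the value
// held by lane-1 without touching shared or global memory.
__global__ void neighborSum(const float *in, float *out) {
    int lane = threadIdx.x & 31;                    // lane id within the warp
    float v = in[threadIdx.x];                      // value cached in a register
    float left = __shfl_up_sync(0xffffffffu, v, 1); // lane 0 gets its own v back
    out[threadIdx.x] = v + (lane > 0 ? left : 0.0f);
}

int main() {
    float h[32], *d_in, *d_out;
    for (int i = 0; i < 32; ++i) h[i] = float(i);
    cudaMalloc(&d_in, sizeof(h)); cudaMalloc(&d_out, sizeof(h));
    cudaMemcpy(d_in, h, sizeof(h), cudaMemcpyHostToDevice);
    neighborSum<<<1, 32>>>(d_in, d_out);
    cudaMemcpy(h, d_out, sizeof(h), cudaMemcpyDeviceToHost);
    printf("out[5] = %.1f\n", h[5]);                // 5 + 4 = 9.0
    cudaFree(d_in); cudaFree(d_out);
    return 0;
}
```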

Contributions
1. We design a versatile software systolic array execution model (SSAM) on CUDA for optimizing memory-bound kernels
2. We build a performance model for analyzing data reuse and quantifying the efficiency and limitations of SSAM
3. We map typical memory-bound algorithms onto SSAM
  • 2D convolution, 2D/3D stencils
  • Achieve outstanding performance
    • 2.5x faster than Nvidia's NPP library for 2D convolution
    • Outperform the state-of-the-art stencil libraries for a wide range of stencils

Nvidia CUDA
• Built around an array of multithreaded streaming multiprocessors (SMs)
• CUDA execution hierarchy: Thread, Warp, Block, Grid
• CUDA memory hierarchy: Global memory, Shared memory, Constant memory, Register memory, L1/L2 cache
[Figure: memory hierarchy in CUDA GPUs — global GPU memory; per-block shared memory; per-thread registers]
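
A tiny illustrative sketch (names are ours) of how that execution hierarchy maps to indices inside a kernel:

```cuda
// Grid -> blocks -> warps -> threads: the ids every CUDA kernel can derive.
__global__ void whoAmI(int *out) {
    int globalId = blockIdx.x * blockDim.x + threadIdx.x; // grid-wide thread id
    int warpId   = threadIdx.x / warpSize;                // warp within the block
    int laneId   = threadIdx.x % warpSize;                // thread within the warp
    out[globalId] = 100 * warpId + laneId;
}
```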

Software Systolic Array Model (SSAM)
[Figure: (a) a hardware systolic array, where registers pass partial sums between processing elements; (b) the Software Systolic Array Execution Model (SSAM), where the CUDA threads in a warp act as processing elements, the register cache holds the data, and partial sums move intra-warp between threads]

Register cache vs. shared memory cache
• Advantages of the register cache
  • High bandwidth
  • Low latency (a register access is 1 cycle vs. 27 cycles for shared memory on V100)
  • Large capacity
• Disadvantages of the register cache (contrasted in the sketch below)
  • Only threads within a warp can exchange data
  • Warp size is limited to 32
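
A hedged side-by-side sketch (ours) of the same neighbor read done through shared memory and through the register cache: the shared-memory path needs a store, a synchronization, and a load, while the register path is a single intra-warp shuffle.

```cuda
// Shared-memory path: store, make visible to the warp, then load.
__global__ void viaShared(const float *in, float *out) {
    __shared__ float s[32];
    int t = threadIdx.x;
    s[t] = in[t];
    __syncwarp();                                   // required on Volta and later
    out[t] = s[t] + (t > 0 ? s[t - 1] : 0.0f);
}

// Register-cache path: one shuffle replaces the store/sync/load round trip.
__global__ void viaRegisters(const float *in, float *out) {
    int t = threadIdx.x;
    float v = in[t];                                // cached in a register
    float left = __shfl_up_sync(0xffffffffu, v, 1);
    out[t] = v + (t > 0 ? left : 0.0f);
}
```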

Mapping algorithms onto SSAM
• A regular algorithm can be formulated as a four-tuple
• Illustrative example: 1D convolution/stencil (see the sketch below)
[Figure: input vector X mapped to output vector Y]
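
A minimal 1D three-tap convolution in the SSAM style (our sketch, not the paper's kernel): each lane caches one input element in a register and reads the neighbors' registers with shuffles, so the filter is evaluated without re-reading memory except at the warp boundary.

```cuda
__global__ void conv1d3tap(const float *x, float *y, int n,
                           float w0, float w1, float w2) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int lane = threadIdx.x & 31;
    float c = (i < n) ? x[i] : 0.0f;                  // register cache
    float l = __shfl_up_sync(0xffffffffu, c, 1);      // x[i-1] from lane-1
    float r = __shfl_down_sync(0xffffffffu, c, 1);    // x[i+1] from lane+1
    // Lanes on the warp boundary fall back to a direct load.
    if (lane == 0)  l = (i > 0) ? x[i - 1] : 0.0f;
    if (lane == 31) r = (i + 1 < n) ? x[i + 1] : 0.0f;
    if (i < n) y[i] = w0 * l + w1 * c + w2 * r;
}
```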

SSAM-based 2D convolution
• 2D convolution:

  Y(i, j) = (X * W)(i, j) = Σ_{m=c..d} Σ_{n=a..b} X(i+m, j+n) · W(m, n)   (1)

  where M = d-c+1 and N = b-a+1 are the filter dimensions, * denotes convolution, and · denotes multiplication
• Implementation
  (a) Register cache: a sliding window over the threads of a CUDA warp (thread 0 .. thread 31) loads the input elements e0 .. e31 into registers
  (b) Compute partial sums: each thread accumulates its partial sums in registers
  (c) Transfer partial sums: partial sums move between threads to form the convolution results
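
As a concrete step behind stages (b) and (c), the double sum in Eq. (1) splits into per-row partial sums; each S_m is what a thread keeps in a register and forwards (this decomposition is our illustration of the slide's stages):

```latex
Y(i,j) \;=\; \sum_{m=c}^{d} S_m(i,j), \qquad
S_m(i,j) \;=\; \sum_{n=a}^{b} X(i+m,\, j+n)\cdot W(m,n)
```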

An example of partial sums
• 8 threads in a single warp, each caching one element (e) in the register cache
• The animation (slides 12-24 of the original deck) steps through how partial sums are accumulated in registers and passed from thread to thread, systolic-array style; a code sketch of the same idea follows
[Figure: animation frames; only the thread labels 0-7, the cached elements e, and the register cache are recoverable]
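
A toy version of that animation in code (our sketch for one warp and a K-tap filter, not the paper's implementation): each lane keeps its input element stationary in a register, and in every step it adds its weighted element to the partial sum shuffled in from the neighboring lane, so a finished result emerges after K steps.

```cuda
__global__ void systolicPartialSums(const float *x, const float *w,
                                    float *y, int K) {
    int lane = threadIdx.x;                 // launch with one 32-thread warp
    float e = x[lane];                      // stationary input element
    float psum = 0.0f;
    for (int k = K - 1; k >= 0; --k) {
        psum += w[k] * e;                   // add this lane's contribution
        if (k > 0)                          // pass the partial sum to lane-1
            psum = __shfl_down_sync(0xffffffffu, psum, 1);
    }
    // Lane i now holds y[i] = sum_k w[k] * x[i+k]; the last K-1 lanes lack
    // right neighbors and would be patched at the warp boundary in a real kernel.
    y[lane] = psum;
}
```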

Proposed algorithm on CUDA
• Parameters
  • T: data type
  • B: block size
  • P: number of elements processed by each thread
  • M: filter width
  • N: filter height
• Kernel structure: Load, Compute, Store (skeleton below)
[Figure: 2D convolution CUDA kernel and 2D stencil CUDA kernel listings]
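
The kernel listings did not survive extraction, so here is a hedged skeleton of the Load/Compute/Store structure using the slide's parameters T, B, and P. For brevity it computes a 1D 3-point stencil entirely in registers; it illustrates the structure, not the paper's actual kernels.

```cuda
template <typename T, int B, int P>
__global__ void ssamKernel(const T *__restrict__ in, T *__restrict__ out, int n) {
    T reg[P + 2];                                  // register cache + 2 halo cells
    int base = (blockIdx.x * B + threadIdx.x) * P; // launch with blockDim.x == B

    // Load: each thread caches its P elements plus a one-element halo.
    for (int p = -1; p <= P; ++p) {
        int i = base + p;
        reg[p + 1] = (i >= 0 && i < n) ? in[i] : T(0);
    }
    // Compute: all arithmetic stays in registers.
    T res[P];
    for (int p = 0; p < P; ++p)
        res[p] = T(0.25) * reg[p] + T(0.5) * reg[p + 1] + T(0.25) * reg[p + 2];

    // Store: write the P results back to global memory.
    for (int p = 0; p < P; ++p)
        if (base + p < n) out[base + p] = res[p];
}
// Example instantiation: ssamKernel<float, 256, 4><<<blocks, 256>>>(d_in, d_out, n);
```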

Performance model
• Micro-benchmarking [1]
  • Understand the characteristics of GPUs
  • Measure the constant parameters: latency of arithmetic operations, of shared memory, and of shuffle
• Justify the performance advantage of in-register computation in SSAM
  • Convolution kernel as the example: partial-sums computation for one element
• Guide the design of the partial-sums transfer path
  • The partial-sums transfer path is critical to the performance of SSAM

[1] H. Wong, M. Papadopoulou, M. Sadooghi-Alvandi and A. Moshovos, "Demystifying GPU microarchitecture through microbenchmarking," 2010 IEEE International Symposium on Performance Analysis of Systems & Software (ISPASS), White Plains, NY, 2010, pp. 235-246.
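
A minimal latency probe in the spirit of Wong et al. [1] (our sketch, not their harness): a dependent pointer-chasing loop timed with clock64(), so each access must wait for the previous one and the per-access latency is exposed.

```cuda
__global__ void latencyProbe(const unsigned int *chain, unsigned int steps,
                             long long *result) {
    unsigned int j = 0;
    long long t0 = clock64();
    for (unsigned int s = 0; s < steps; ++s)
        j = chain[j];                       // serialized, dependent loads
    long long t1 = clock64();
    result[0] = (t1 - t0) / steps;          // average cycles per access
    result[1] = j;                          // keep the chain live
}
// Launch with a single thread; fill `chain` so chain[i] points to the next
// element of a random cycle to defeat prefetching.
```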

The measured latency of different operations
• We report the measured latency of each operation in the table
[Table: measured latencies of arithmetic, shared memory, and shuffle operations]

Efficient partial-sums computation in SSAM
• Recall Eq. (1), where M = d-c+1, N = b-a+1, * denotes convolution, and · denotes multiplication
• Two ways to transfer partial sums between threads:
  (1) Using shared memory — latency L_shared
  (2) Using shuffle (SSAM) — latency L_shuffle
• Carefully design SSAM such that:
  • (2) < (1)
  • With optimal partial-sum paths
  • With the highest locality
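
The slide's latency formulas did not survive extraction; a simplified stand-in model (ours) makes the design goal concrete. If a transfer path performs n partial-sum hand-offs, the shared-memory path pays a store plus a dependent load per hop, while the shuffle path pays one instruction per hop:

```latex
L_{\mathrm{shared}} \;\approx\; n\,(\lambda_{\mathrm{store}} + \lambda_{\mathrm{load}}),
\qquad
L_{\mathrm{shuffle}} \;\approx\; n\,\lambda_{\mathrm{shfl}}
```

SSAM is designed so that L_shuffle < L_shared, which holds whenever one shuffle is cheaper than a shared-memory store/load round trip (shared-memory accesses cost 27 cycles on V100, per the earlier slide).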

Evaluation environment
• k: stencil order; FPP: FLOPs per point
[Table: evaluation environment and benchmark stencils]

Evaluation for 2D convolution
• 2D convolution evaluated in single precision against ArrayFire, Nvidia NPP, Halide, cuDNN, and cuFFT
• Observations
  • SSAM achieves the highest performance and scalability
  • The performance gap becomes smaller on the V100 GPU than on the P100
[Figure: runtime comparison; the lower, the better]

Evaluation for stencils in single precision
• Stencils: low order & high order, star shape & box shape, 2D & 3D
• "original", "reordered", and "unrolled" variants are provided by Rawat et al.
• Halide and ppcg can automatically generate code for GPUs
[Figures: single-precision throughput on the P100 and V100 GPUs; the higher, the better]

Evaluation for stencils in double precision
• Stencils: low order & high order, star shape & box shape, 2D & 3D
• "original", "reordered", and "unrolled" variants are provided by Rawat et al.
• Halide and ppcg can automatically generate code for GPUs
[Figures: double-precision throughput on the P100 and V100 GPUs; the higher, the better]

Evaluation for stencils with temporal/spatial blocking
• STENCILGEN: a DSL for stencil computation (Rawat et al.)
• Diffusion: a 3d7pt stencil library proposed by Maruyama et al.
• Bricks: a general stencil library targeting both CPUs and GPUs
[Figure: throughput in GCells/s; the higher, the better]

Insight into the Pascal and Volta architectures
• The L1 cache in Volta is enhanced
  • Capacity: more than 7x larger than in Pascal; V100 provides up to 128 KB
  • Access latency: about 2.8x faster on V100 (28 cycles) than on P100 (82 cycles)
• The L2 cache in Volta is improved
  • Capacity grows by 50%: 4096 KB on P100 vs. 6144 KB on V100
  • Access latency improves by up to 10%: ≤234 cycles on P100 vs. ≤193 cycles on V100
• Result: the gap narrows between applications that are manually optimized to cache data in registers (or shared memory) and those that access global memory directly

Conclusion
• Inspired by the mechanism of the hardware systolic array, we build a versatile software systolic array execution model
• Outstanding performance for memory-bound kernels: 2D convolution, 2D/3D stencils, scan, Summed Area Table [1], etc.
• Performance of 2D convolution
  • 2.5x faster than Nvidia's NPP library
  • Up to 1.5x faster than the ArrayFire library
• We should embrace the systolic array architecture: efficient specialization

[1] P. Chen, M. Wahib, S. Takizawa, R. Takano and S. Matsuoka, "Efficient Algorithms for the Summed Area Tables Primitive on GPUs," 2018 IEEE International Conference on Cluster Computing (CLUSTER), Belfast, 2018, pp. 482-493.

Future work
• Generate SSAM code automatically using the polyhedral model
  • The polyhedral model extracts the dependencies
• Overlap operations: load, compute, and store
• Register optimization: register allocation; alleviating register pressure
• Experiment with SSAM on compute-bound problems: matrix multiplication, 3D convolution

https://github.com/pengdada/ssam-sc19