

GPU Parallel Computing Zehuan Wang HPC Developer Technology Engineer © NVIDIA Corporation 2013

Access The Power of GPU: three approaches for accelerating applications are Libraries, OpenACC Directives, and Programming Languages. © NVIDIA Corporation 2013

GPU Accelerated Libraries: "drop-in" acceleration for your applications, e.g. NVIDIA cuBLAS, cuSPARSE, cuFFT, cuRAND and NPP, plus the IMSL Library and CenterSpace NMath, covering matrix algebra on GPU and multicore, GPU-accelerated linear algebra, vector/signal/image processing, building-block algorithms, and C++ templated parallel algorithms. © NVIDIA Corporation 2013

GPU Programming Languages:
Numerical analytics: MATLAB, Mathematica, LabVIEW
Fortran: OpenACC, CUDA Fortran
C: OpenACC, CUDA C
C++: CUDA C++, Thrust, Hemi, ArrayFire
Python: Anaconda Accelerate, PyCUDA, Copperhead
.NET: CUDAfy.NET, Alea.cuBase
developer.nvidia.com/language-solutions © NVIDIA Corporation 2013

GPU Architecture © NVIDIA Corporation 2013

GPU: Massively Parallel Coprocessor. A GPU is a coprocessor to the CPU (host), has its own DRAM, and runs 1000s of threads in parallel. Single precision: 4.58 TFlop/s; double precision: 1.31 TFlop/s. © NVIDIA Corporation 2013

Heterogeneous Parallel Computing: Logic() runs on the CPU, Compute() on the GPU. The CPU is latency-optimized for fast serial processing; the GPU is throughput-optimized for fast parallel processing. © NVIDIA Corporation 2013

GPU in Computer System: the GPU is connected to the CPU chipset by PCIe (16 GB/s one way, 32 GB/s both ways); the CPU attaches to its own DDR3 DRAM. © NVIDIA Corporation 2013

GPU High-Level View: an array of Streaming Multiprocessors (SMs), each a set of CUDA cores, plus global memory. © NVIDIA Corporation 2013

GK110 SM. Control unit: 4 warp schedulers, 8 instruction dispatchers. Execution units: 192 single-precision CUDA cores, 64 double-precision CUDA cores, 32 SFUs, 32 LD/ST units. Memory: 64K 32-bit registers; 64 KB split between L1 cache and shared memory; texture cache; constant cache. © NVIDIA Corporation 2013

Kepler/Fermi Memory Hierarchy: 3 levels, very similar to the CPU. Registers (spill to local memory); caches: shared memory, L1 cache, L2 cache, constant cache, texture cache; global memory. © NVIDIA Corporation 2013

Kepler/Fermi Memory Hierarchy (diagram): each SM (SM-0 … SM-N) has its own registers, L1 cache & shared memory, constant cache, and texture cache; all SMs share the L2 cache and global memory. © NVIDIA Corporation 2013

Basic Concepts: the CPU offloads computation to the GPU and transfers data between CPU memory and GPU memory over the PCIe bus. GPU computing is all about two things: transferring data between CPU and GPU, and doing parallel computing on the GPU. © NVIDIA Corporation 2013

GPU Programming Basics © NVIDIA Corporation 2013

How To Get Started. CUDA C/C++: download CUDA drivers, compilers & samples (all-in-one package) free from http://developer.nvidia.com/cuda-downloads. CUDA Fortran: PGI. OpenACC: PGI, CAPS, Cray. © NVIDIA Corporation 2013

CUDA Programming Basics. Hello World: basic syntax, compile & run. GPU memory management: malloc/free, memcpy. Writing parallel kernels: threads & blocks, memory hierarchy. © NVIDIA Corporation 2013

Heterogeneous Computing: a C program alternates sequential execution of serial code on the host (CPU) with parallel kernels, e.g. Kernel0<<<>>>() and Kernel1<<<>>>(), launched as grids of thread blocks on the device (GPU). The program executes on both CPU & GPU, similar to OpenMP's fork-join pattern; CUDA adds simple extensions to C/C++. © NVIDIA Corporation 2013

Hello World on CPU. hello_world.c:
#include <stdio.h>
void hello_world_kernel() { printf("Hello World\n"); }
int main() { hello_world_kernel(); }
Compile & run: gcc hello_world.c, then ./a.out
© NVIDIA Corporation 2013

Hello World on GPU. hello_world.cu:
#include <stdio.h>
__global__ void hello_world_kernel() { printf("Hello World\n"); }
int main() { hello_world_kernel<<<1, 1>>>(); }
Compile & run: nvcc hello_world.cu, then ./a.out
Notes: CUDA kernels live in .cu files compiled by nvcc; CUDA kernels are preceded by "__global__"; CUDA kernels are launched with "<<<…, …>>>".
© NVIDIA Corporation 2013
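
A minimal self-contained version of the same program (a sketch, not from the deck: the added cudaDeviceSynchronize() call makes the host wait for the kernel so the device-side printf output is flushed before the program exits):

```c
// hello_world.cu -- compile and run with: nvcc hello_world.cu && ./a.out
#include <stdio.h>

__global__ void hello_world_kernel()
{
    printf("Hello World\n");          // device-side printf
}

int main()
{
    hello_world_kernel<<<1, 1>>>();   // launch 1 block of 1 thread
    cudaDeviceSynchronize();          // wait for the kernel so its output appears
    return 0;
}
```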

Memory Spaces: the CPU and GPU have separate memory spaces; data is moved across the PCIe bus. Use functions to allocate/set/copy memory on the GPU, very similar to the corresponding C functions. © NVIDIA Corporation 2013

CUDA C/C++ Memory Allocation / Release. The host (CPU) manages device (GPU) memory:
cudaMalloc(void **pointer, size_t nbytes)
cudaMemset(void *pointer, int value, size_t count)
cudaFree(void *pointer)

int nbytes = 1024*sizeof(int);
int *d_a = 0;
cudaMalloc((void**)&d_a, nbytes);
cudaMemset(d_a, 0, nbytes);
cudaFree(d_a);
© NVIDIA Corporation 2013

Data Copies. cudaMemcpy(void *dst, const void *src, size_t nbytes, enum cudaMemcpyKind direction);
Returns after the copy is complete; blocks the CPU thread until all bytes have been copied; doesn't start copying until previous CUDA calls complete.
enum cudaMemcpyKind: cudaMemcpyHostToDevice, cudaMemcpyDeviceToHost, cudaMemcpyDeviceToDevice.
Non-blocking memcopies are also provided.
© NVIDIA Corporation 2013
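
For illustration, a small host-side sketch (not from the slides) that allocates device memory, copies an array to the device and back, and checks each call's return status; the CHECK macro is a convenience for this example, not part of the CUDA API:

```c
#include <stdio.h>
#include <cuda_runtime.h>

// Illustrative error-checking macro: print the error string and bail out of main().
#define CHECK(call)                                                       \
    do {                                                                  \
        cudaError_t err = (call);                                         \
        if (err != cudaSuccess) {                                         \
            fprintf(stderr, "CUDA error %s at %s:%d\n",                   \
                    cudaGetErrorString(err), __FILE__, __LINE__);         \
            return 1;                                                     \
        }                                                                 \
    } while (0)

int main()
{
    const int n = 16;
    size_t nbytes = n * sizeof(int);
    int h_a[16], h_b[16];
    for (int i = 0; i < n; ++i) h_a[i] = i;

    int *d_a = 0;
    CHECK(cudaMalloc((void**)&d_a, nbytes));
    CHECK(cudaMemcpy(d_a, h_a, nbytes, cudaMemcpyHostToDevice));  // host -> device
    CHECK(cudaMemcpy(h_b, d_a, nbytes, cudaMemcpyDeviceToHost));  // device -> host
    CHECK(cudaFree(d_a));

    for (int i = 0; i < n; ++i) printf("%d ", h_b[i]);
    printf("\n");
    return 0;
}
```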

Code Walkthrough 1: allocate CPU memory for n integers; allocate GPU memory for n integers; initialize GPU memory to zeros; copy from GPU to CPU; print the values. © NVIDIA Corporation 2013

Code Walkthrough 1
#include <stdio.h>

int main()
{
    int dimx = 16;
    int num_bytes = dimx*sizeof(int);
    int *d_a=0, *h_a=0; // device and host pointers

    h_a = (int*)malloc(num_bytes);
    cudaMalloc((void**)&d_a, num_bytes);

    cudaMemset(d_a, 0, num_bytes);
    cudaMemcpy(h_a, d_a, num_bytes, cudaMemcpyDeviceToHost);

    for(int i=0; i<dimx; i++) printf("%d ", h_a[i]);
    printf("\n");

    free(h_a);
    cudaFree(d_a);
    return 0;
}
© NVIDIA Corporation 2013

Compile & Run: nvcc main.cu, then ./a.out. Output: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 © NVIDIA Corporation 2013

Thread Hierarchy. 2-level hierarchy: blocks and grid. Block = a group of up to 1024 threads; grid = all blocks for a given kernel launch. E.g. 72 threads total: blockDim = 12, gridDim = 6. Threads within a block can synchronize their execution and communicate via shared memory. Sizes of the grid and blocks are specified at kernel launch:
dim3 grid(6, 1, 1), block(12, 1, 1);
kernel<<<grid, block>>>(…);
© NVIDIA Corporation 2013

IDs and Dimensions. Threads: 3D IDs, unique within a block. Blocks: 3D IDs, unique within a grid. Built-in variables: threadIdx (index within a block), blockIdx (index within the grid), blockDim (block dimensions), gridDim (grid dimensions). © NVIDIA Corporation 2013

GPU and Programming Model: software maps to hardware. Threads are executed by scalar processors (CUDA cores); thread blocks are executed on multiprocessors; a kernel is launched as a grid of thread blocks on the device. © NVIDIA Corporation 2013

Which thread do I belong to? With blockDim.x = 4 and gridDim.x = 4 (16 threads total):
threadIdx.x: 0 1 2 3 | 0 1 2 3 | 0 1 2 3 | 0 1 2 3
blockIdx.x:  0 0 0 0 | 1 1 1 1 | 2 2 2 2 | 3 3 3 3
idx = blockIdx.x*blockDim.x + threadIdx.x: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
© NVIDIA Corporation 2013
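
A quick way to check this mapping is to launch 4 blocks of 4 threads and let each thread print its coordinates; the who_am_i kernel below is illustrative, not from the deck:

```c
#include <stdio.h>

// Each thread reports its block-local index and its computed global index.
__global__ void who_am_i()
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    printf("blockIdx.x=%d threadIdx.x=%d -> idx=%d\n", blockIdx.x, threadIdx.x, idx);
}

int main()
{
    who_am_i<<<4, 4>>>();        // gridDim.x = 4, blockDim.x = 4 -> 16 threads
    cudaDeviceSynchronize();     // wait so the device printf output appears
    return 0;
}
```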

Code Walkthrough 2: Simple Kernel. Allocate memory on the GPU; copy the data from CPU to GPU; write a kernel to perform a vector addition; copy the result to the CPU; free the memory. © NVIDIA Corporation 2013

Vector Addition using C
void vec_add(float *x, float *y, int n)
{
    for (int i=0; i<n; ++i)
        y[i] = x[i] + y[i];
}

float *x = (float*)malloc(n*sizeof(float));
float *y = (float*)malloc(n*sizeof(float));
vec_add(x, y, n);
free(x); free(y);
© NVIDIA Corporation 2013

Vector Addition using CUDA C
__global__ void vec_add(float *x, float *y, int n)   // __global__: keyword for a CUDA kernel
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;     // thread index computation replaces the loop
    y[i] = x[i] + y[i];
}

float *d_x, *d_y;
cudaMalloc(&d_x, n*sizeof(float));
cudaMalloc(&d_y, n*sizeof(float));
cudaMemcpy(d_x, x, n*sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(d_y, y, n*sizeof(float), cudaMemcpyHostToDevice);
vec_add<<<n/128, 128>>>(d_x, d_y, n);
cudaMemcpy(y, d_y, n*sizeof(float), cudaMemcpyDeviceToHost);
cudaFree(d_x); cudaFree(d_y);
© NVIDIA Corporation 2013
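
The launch above assumes n is a multiple of 128 and the kernel does no bounds check. A self-contained variant (my sketch, not from the deck) that rounds the block count up and guards the out-of-range threads:

```c
#include <stdio.h>

__global__ void vec_add(const float *x, float *y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                               // guard: the last block may have extra threads
        y[i] = x[i] + y[i];
}

int main()
{
    const int n = 1000;                      // deliberately not a multiple of the block size
    float *x = (float*)malloc(n*sizeof(float));
    float *y = (float*)malloc(n*sizeof(float));
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    float *d_x, *d_y;
    cudaMalloc(&d_x, n*sizeof(float));
    cudaMalloc(&d_y, n*sizeof(float));
    cudaMemcpy(d_x, x, n*sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, y, n*sizeof(float), cudaMemcpyHostToDevice);

    int block = 128;
    int grid = (n + block - 1) / block;      // ceiling division: 8 blocks for n = 1000
    vec_add<<<grid, block>>>(d_x, d_y, n);

    cudaMemcpy(y, d_y, n*sizeof(float), cudaMemcpyDeviceToHost);
    printf("y[0]=%f y[n-1]=%f\n", y[0], y[n-1]);   // expect 3.0 for both

    cudaFree(d_x); cudaFree(d_y);
    free(x); free(y);
    return 0;
}
```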

GPU Memory Model Review: per-thread local memory, per-block shared memory, and per-device global memory, which persists across sequential kernel launches (Kernel 0, Kernel 1, …). © NVIDIA Corporation 2013

Global Memory: per-device, visible to sequential kernel launches (Kernel 0, Kernel 1, …). Data lifetime = from allocation to deallocation. Accessible by all threads as well as by the host (CPU). © NVIDIA Corporation 2013
Shared Memory. C/C++: __shared__ int a[SIZE]; allocated per thread block; data lifetime = block lifetime. Accessible by any thread in the thread block; not accessible to other thread blocks. © NVIDIA Corporation 2013

Registers (per-thread local storage). Automatic variables (scalar/array) inside kernels. Data lifetime = thread lifetime. Accessible only by the thread that declares them. © NVIDIA Corporation 2013

Example of Using Shared Memory. Applying a 1D stencil to a 1D array of elements: each output element is the sum of all input elements within a radius. For example, for radius = 3, each output element is the sum of 7 input elements (3 to the left, the element itself, and 3 to the right). © NVIDIA Corporation 2013

Example of Using Shared Memory: for input values 1 2 3 4 5 6 7 8 …, the output element centered on 4 with radius 3 is 1+2+3+4+5+6+7 = 28. © NVIDIA Corporation 2013

Kernel Code Using Global Memory. One element per thread:
__global__ void stencil(int* in, int* out)
{
    int globIdx = blockIdx.x * blockDim.x + threadIdx.x;
    int value = 0;
    for (int offset = -RADIUS; offset <= RADIUS; offset++)
        value += in[globIdx + offset];
    out[globIdx] = value;
}
A lot of redundant reads in neighboring threads: not an optimized approach. © NVIDIA Corporation 2013

Implementation with Shared Memory. One element per thread: read (BLOCK_SIZE + 2 * RADIUS) input elements from global memory into shared memory; compute BLOCK_SIZE output elements in shared memory; write BLOCK_SIZE output elements to global memory. The tile consists of the BLOCK_SIZE input elements corresponding to the output elements, plus a "halo" of RADIUS elements on the left and RADIUS elements on the right. © NVIDIA Corporation 2013

Kernel Code (RADIUS = 3, BLOCK_SIZE = 16)
__global__ void stencil(int* in, int* out)
{
    __shared__ int shared[BLOCK_SIZE + 2 * RADIUS];
    int globIdx = blockIdx.x * blockDim.x + threadIdx.x;
    int locIdx = threadIdx.x + RADIUS;
    shared[locIdx] = in[globIdx];
    if (threadIdx.x < RADIUS) {
        shared[locIdx - RADIUS] = in[globIdx - RADIUS];           // left halo
        shared[locIdx + BLOCK_SIZE] = in[globIdx + BLOCK_SIZE];   // right halo
    }
    __syncthreads();
    int value = 0;
    for (int offset = -RADIUS; offset <= RADIUS; offset++)
        value += shared[locIdx + offset];
    out[globIdx] = value;
}
© NVIDIA Corporation 2013
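
A runnable sketch around this kernel (my host code and a slightly re-indexed kernel, not from the deck: the input is padded with RADIUS zeros on each side so the halo loads at the array ends stay in bounds):

```c
#include <stdio.h>

#define RADIUS      3
#define BLOCK_SIZE  16
#define N           (4 * BLOCK_SIZE)    // 4 blocks of output elements

__global__ void stencil(const int* in, int* out)
{
    __shared__ int shared[BLOCK_SIZE + 2 * RADIUS];
    int globIdx = blockIdx.x * blockDim.x + threadIdx.x;
    int locIdx  = threadIdx.x + RADIUS;

    shared[locIdx] = in[globIdx + RADIUS];            // center element ("in" is padded)
    if (threadIdx.x < RADIUS) {
        shared[locIdx - RADIUS]     = in[globIdx];                        // left halo
        shared[locIdx + BLOCK_SIZE] = in[globIdx + RADIUS + BLOCK_SIZE];  // right halo
    }
    __syncthreads();

    int value = 0;
    for (int offset = -RADIUS; offset <= RADIUS; offset++)
        value += shared[locIdx + offset];
    out[globIdx] = value;
}

int main()
{
    int h_in[N + 2 * RADIUS] = {0};                   // zero-padded input
    for (int i = 0; i < N; ++i) h_in[i + RADIUS] = 1; // interior all ones
    int h_out[N];

    int *d_in, *d_out;
    cudaMalloc(&d_in,  sizeof(h_in));
    cudaMalloc(&d_out, sizeof(h_out));
    cudaMemcpy(d_in, h_in, sizeof(h_in), cudaMemcpyHostToDevice);

    stencil<<<N / BLOCK_SIZE, BLOCK_SIZE>>>(d_in, d_out);
    cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);

    for (int i = 0; i < N; ++i) printf("%d ", h_out[i]);  // interior outputs are 7
    printf("\n");
    cudaFree(d_in); cudaFree(d_out);
    return 0;
}
```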

Thread Synchronization Function. void __syncthreads(); synchronizes all threads in a thread block (threads are scheduled at run time, so their relative progress is otherwise unknown). Once all threads have reached this point, execution resumes normally. Used to avoid RAW / WAR / WAW hazards when accessing shared memory. Should be used in conditional code only if the conditional is uniform across the entire thread block; otherwise it may lead to deadlock. © NVIDIA Corporation 2013
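
A common pattern that relies on __syncthreads() with conditions that are uniform across the block is a shared-memory tree reduction. The sketch below is illustrative, not from the deck; one block of 256 threads sums 256 values:

```c
#include <stdio.h>

#define BLOCK 256

// Each block sums BLOCK input values into out[blockIdx.x].
__global__ void block_sum(const int *in, int *out)
{
    __shared__ int partial[BLOCK];
    int tid = threadIdx.x;
    partial[tid] = in[blockIdx.x * BLOCK + tid];
    __syncthreads();                          // all loads finish before anyone reads

    // "stride" is the same for every thread, so the __syncthreads() inside the
    // loop is reached uniformly by the whole block (no deadlock risk).
    for (int stride = BLOCK / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            partial[tid] += partial[tid + stride];
        __syncthreads();
    }
    if (tid == 0)
        out[blockIdx.x] = partial[0];
}

int main()
{
    int h_in[BLOCK], h_out;
    for (int i = 0; i < BLOCK; ++i) h_in[i] = 1;

    int *d_in, *d_out;
    cudaMalloc(&d_in, sizeof(h_in));
    cudaMalloc(&d_out, sizeof(int));
    cudaMemcpy(d_in, h_in, sizeof(h_in), cudaMemcpyHostToDevice);

    block_sum<<<1, BLOCK>>>(d_in, d_out);
    cudaMemcpy(&h_out, d_out, sizeof(int), cudaMemcpyDeviceToHost);
    printf("sum = %d\n", h_out);              // expect 256

    cudaFree(d_in); cudaFree(d_out);
    return 0;
}
```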

Kepler/Fermi Memory Hierarchy (recap of the diagram above): per-SM registers, L1 cache & shared memory, constant and texture caches; shared L2 cache and global memory. © NVIDIA Corporation 2013

Constant Cache. Global variables marked __constant__ are constant and can't be changed from device code. They are located in global memory but are cached by the constant cache. Good when many threads access the same address.
__constant__ int a = 10;
__global__ void kernel()
{
    a++;   // error: device code cannot write __constant__ data
}
© NVIDIA Corporation 2013
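
The host sets __constant__ data with cudaMemcpyToSymbol(). A small sketch (illustrative kernel and names, not from the deck):

```c
#include <stdio.h>

__constant__ float coeff[4];          // constant-cached; read-only in device code

__global__ void apply(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        // Every thread reads the same coeff[] addresses: a good fit for the constant cache.
        out[i] = coeff[0] + coeff[1] * in[i];
}

int main()
{
    const int n = 256;
    float h_coeff[4] = {1.0f, 2.0f, 0.0f, 0.0f};
    cudaMemcpyToSymbol(coeff, h_coeff, sizeof(h_coeff));   // host writes constant memory

    float h_in[n], h_out[n];
    for (int i = 0; i < n; ++i) h_in[i] = (float)i;

    float *d_in, *d_out;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);

    apply<<<(n + 127) / 128, 128>>>(d_in, d_out, n);
    cudaMemcpy(h_out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    printf("out[3] = %f\n", h_out[3]);   // expect 1 + 2*3 = 7
    cudaFree(d_in); cudaFree(d_out);
    return 0;
}
```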

Texture Cache (SMX). Saving data as texture provides hardware-accelerated filtered sampling of data (1D, 2D, 3D); a read-only data cache holds fetched samples and is backed by the L2 cache. Why use it? A separate pipeline from shared/L1, the highest miss bandwidth, and flexibility, e.g. unaligned accesses. © NVIDIA Corporation 2013

Texture Cache Unlocked in GK110 SMX: a new path for compute avoids the texture unit, allows a global address to be fetched and cached in the read-only data cache (backed by L2), and eliminates texture setup. It is managed automatically by the compiler; "const __restrict" indicates eligibility. © NVIDIA Corporation 2013

const __restrict. Annotate eligible kernel parameters with const __restrict; the compiler will automatically map loads to use the read-only data cache path.
__global__ void saxpy(float x, float y, const float * __restrict input, float * output)
{
    size_t offset = threadIdx.x + (blockIdx.x * blockDim.x);
    // Compiler will automatically use the read-only data cache for "input"
    output[offset] = (input[offset] * x) + y;
}
© NVIDIA Corporation 2013
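
A possible host-side launch of the saxpy kernel above (my usage sketch, not from the deck; the kernel is repeated so the program is self-contained, and n is assumed to be a multiple of the block size because the kernel has no bounds check):

```c
#include <stdio.h>

__global__ void saxpy(float x, float y, const float * __restrict input, float * output)
{
    size_t offset = threadIdx.x + (blockIdx.x * blockDim.x);
    output[offset] = (input[offset] * x) + y;   // "input" goes through the read-only cache
}

int main()
{
    const int n = 1024;                          // multiple of the block size
    float h_in[n], h_out[n];
    for (int i = 0; i < n; ++i) h_in[i] = (float)i;

    float *d_in, *d_out;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);

    saxpy<<<n / 256, 256>>>(2.0f, 1.0f, d_in, d_out);
    cudaMemcpy(h_out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    printf("out[10] = %f\n", h_out[10]);         // expect 2*10 + 1 = 21
    cudaFree(d_in); cudaFree(d_out);
    return 0;
}
```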

References. Manuals: Programming Guide, Best Practices Guide. Books: CUDA by Example (Tsinghua University Press edition). Training videos: GTC talks online (optimization, advanced optimization, plus hundreds of other GPU computing talks) at http://www.gputechconf.com/gtcnew/on-demand-gtc.php; NVIDIA GPU computing webinars at http://developer.nvidia.com/gpu-computing-webinars. Forum: http://cudazone.nvidia.cn/forum.php © NVIDIA Corporation 2013