CUDA Memory Model


© David Kirk/NVIDIA and Wen-mei W. Hwu, University of Illinois, 2007-2011, ECE 408 / CS 483 / ECE 498AL


Programmer View of CUDA Memories
• Each thread can:
  – Read/write per-thread registers (~1 cycle)
  – Read/write per-block shared memory (~5 cycles)
  – Read/write per-grid global memory (~500 cycles)
  – Read-only per-grid constant memory; its data will not be changed by the current grid (~5 cycles with caching)
[Figure: memory hierarchy. The host communicates with global and constant memory; within a grid, each block has its own shared memory, and each thread has its own registers]


CUDA Variable Type Qualifiers

  Variable declaration                       Memory     Scope    Lifetime
  int LocalVar;                              register   thread   thread
  __device__ __shared__ int SharedVar;       shared     block    block
  __device__ int GlobalVar;                  global     grid     application
  __device__ __constant__ int ConstantVar;   constant   grid     application

• __device__ is optional when used with __shared__ or __constant__
• Automatic variables without any qualifier reside in a register
  – Except per-thread arrays, which reside in global memory
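To make the table concrete, here is a minimal sketch (not from the slides; the kernel and variable names are hypothetical) showing where each qualifier places data:

#define TILE 16

__constant__ float scaleFactor;        // constant memory: grid scope, application lifetime
                                       // (set from the host with cudaMemcpyToSymbol)
__device__ float globalBias;           // global memory: grid scope, application lifetime

__global__ void qualifierSketch(float* out)
{
    int localVar = threadIdx.x;        // automatic variable without a qualifier: a register
    __shared__ float tile[TILE];       // shared memory: one copy per block, block lifetime

    tile[threadIdx.x % TILE] = localVar * scaleFactor;
    __syncthreads();                   // make the shared-memory writes visible to the whole block

    out[blockIdx.x * blockDim.x + threadIdx.x] = tile[threadIdx.x % TILE] + globalBias;
}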


Where to Declare Variables?
• Can the host access it?
  – Yes: global or constant; declare outside of any function
  – No: register (automatic), shared, or local; declare in the kernel


__global__ void MatrixMulKernel(float* d_M, float* d_N, float* d_P, int Width)
{
    // Shared-memory tiles for the two input matrices (declared 2D to match the [ty][tx] indexing used later)
    __shared__ float ds_M[TILE_WIDTH][TILE_WIDTH];
    __shared__ float ds_N[TILE_WIDTH][TILE_WIDTH];
    // … (the rest of the kernel appears on a later slide)


A Common Programming Strategy
• Global memory resides in device memory (DRAM), so access is slow
• A profitable way of performing computation on the device is therefore to tile the input data to take advantage of fast shared memory:
  – Partition the data into subsets that fit into shared memory
  – Handle each data subset with one thread block by:
    • Loading the subset from global memory into shared memory, using multiple threads to exploit memory-level parallelism
    • Performing the computation on the subset from shared memory; each thread can efficiently make multiple passes over any data element
    • Copying the results from shared memory back to global memory
A generic sketch of this pattern follows.
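The sketch below illustrates the three steps above. It is a hedged illustration, not code from the slides; TILE_SIZE, tiledKernelSketch, and the placeholder computation are hypothetical, and blockDim.x is assumed to equal TILE_SIZE.

#define TILE_SIZE 16

__global__ void tiledKernelSketch(const float* input, float* output, int n)
{
    // One subset of the input, staged in fast on-chip shared memory
    __shared__ float tile[TILE_SIZE];

    for (int base = 0; base < n; base += gridDim.x * TILE_SIZE) {
        int idx = base + blockIdx.x * TILE_SIZE + threadIdx.x;

        // 1. Cooperative load: each thread brings in one element of the subset
        tile[threadIdx.x] = (idx < n) ? input[idx] : 0.0f;
        __syncthreads();                       // wait until the whole tile is loaded

        // 2. Compute on the subset from shared memory (threads may make several passes over it)
        float result = 0.0f;
        for (int k = 0; k < TILE_SIZE; ++k)
            result += tile[k];                 // placeholder computation

        // 3. Copy results back to global memory
        if (idx < n)
            output[idx] = result;
        __syncthreads();                       // the tile must be consumed before it is overwritten
    }
}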


Shared Memory Blocking: Basic Idea
[Figure: without blocking, Thread 1, Thread 2, … each read the data "in" directly from global memory; with blocking, "in" is first copied from global memory into on-chip memory and the threads read it from there]


Blocking/Tiling
• Good: when threads have similar access timing
• Bad: when threads have very different timing
[Figure: access time lines for Thread 1 and Thread 2 in the good case (accesses close together in time) and the bad case (accesses far apart)]


Matrix-Matrix Multiplication using Shared Memory


Base Matrix Multiplication Kernel

__global__ void MatrixMulKernel(float* d_M, float* d_N, float* d_P, int Width)
{
    // Calculate the row index of the d_P element and of d_M
    int Row = blockIdx.y * TILE_WIDTH + threadIdx.y;
    // Calculate the column index of d_P and of d_N
    int Col = blockIdx.x * TILE_WIDTH + threadIdx.x;

    float Pvalue = 0;
    // Each thread computes one element of the block sub-matrix
    for (int k = 0; k < Width; ++k)
        Pvalue += d_M[Row*Width + k] * d_N[k*Width + Col];

    d_P[Row*Width + Col] = Pvalue;
}
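For context (not part of the slide), this kernel would typically be launched with a 2D grid of TILE_WIDTH x TILE_WIDTH thread blocks. A minimal hedged sketch, assuming Width is a multiple of TILE_WIDTH and that d_M, d_N, and d_P are already allocated and populated on the device:

dim3 dimBlock(TILE_WIDTH, TILE_WIDTH);
dim3 dimGrid(Width / TILE_WIDTH, Width / TILE_WIDTH);
MatrixMulKernel<<<dimGrid, dimBlock>>>(d_M, d_N, d_P, Width);
cudaDeviceSynchronize();   // wait for completion before copying d_P back to the host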


How about performance on Fermi?
• All threads access global memory for their input matrix elements
  – Two memory accesses (8 bytes) per floating-point multiply-add
  – 4 B of memory bandwidth per FLOP
  – 4 * 1,000 = 4,000 GB/s required to achieve the peak 1,000 GFLOPS rating
  – The available 150 GB/s limits the code to 37.5 GFLOPS
• The actual code runs at about 25 GFLOPS
• Need to drastically cut down memory accesses to get closer to the peak 1,000 GFLOPS
[Figure: CUDA memory hierarchy (host; grid of blocks with per-block shared memory and per-thread registers; global and constant memory)]
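As a quick check of the arithmetic above, restating the slide's own numbers in comment form:

// Bandwidth bound for the untiled kernel (numbers from the slide above)
//   2 loads * 4 B = 8 B of global traffic per multiply-add (2 FLOPs)  ->  4 B per FLOP
//   to reach peak:       1,000 GFLOPS * 4 B/FLOP = 4,000 GB/s of bandwidth would be needed
//   actually available:  150 GB/s / (4 B/FLOP)   = 37.5 GFLOPS upper bound (measured: about 25 GFLOPS)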


Outline of the Tiling Technique
• Identify a block/tile of global memory content that is accessed by multiple threads
• Load the block/tile from global memory into on-chip memory
• Have the multiple threads access their data from the on-chip memory
• Move on to the next block/tile


Idea: Use Shared Memory to Reuse Global Memory Data
• Each input element is read by WIDTH threads
• Load each input element into shared memory and have several threads use the local version to reduce the memory bandwidth
  – Tiled algorithms
[Figure: WIDTH x WIDTH matrices M, N, and P, with a thread (tx, ty) shown computing one element of P]


Work for Block (0, 0)
Row = 0 * blockDim.y + threadIdx.y  (Row = 0, 1)
Col = 0 * blockDim.x + threadIdx.x  (Col = 0, 1)
[Figure: a 4x4 example. Block (0, 0) computes the upper-left 2x2 sub-matrix of P (P0,0, P1,0, P0,1, P1,1) from the 4x4 matrices M and N]


Tiled Multiply
• Break up the execution of the kernel into phases so that the data accesses in each phase are focused on one subset (tile) of d_M and d_N
[Figure: Md, Nd, and Pd divided into TILE_WIDTH x TILE_WIDTH tiles; block (bx, by), with threads indexed by (tx, ty), computes the sub-matrix Pdsub]


Loading a Tile
• All threads in a block participate
  – Each thread loads one Md element and one Nd element in the basic tiled code
• Assign the loaded element to each thread such that the accesses within each warp to global memory are coalesced (more on this later); see the excerpt below.
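For reference, these are the two load statements from the tiled kernel shown a few slides later, excerpted here with comments on why the accesses coalesce (Row, Col, m, ds_M, and ds_N are defined in that kernel):

    // Each thread loads one d_M element and one d_N element for phase m
    ds_M[ty][tx] = d_M[Row * Width + m * TILE_WIDTH + tx];       // within a warp, tx varies fastest,
    ds_N[ty][tx] = d_N[Col + (m * TILE_WIDTH + ty) * Width];     // so consecutive threads touch consecutive addresses
    __syncthreads();   // every element of both tiles is in place before any thread uses them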


Work for Block (0, 0)
[Figure: the first tile of N (N0,0, N1,0, N0,1, N1,1) and the first tile of M (M0,0, M1,0, M0,1, M1,1) are copied from the full 4x4 matrices into shared memory (SM)]


Work for Block (0, 0)
[Figure: the tiles currently in shared memory (SM), N0,0, N1,0, N0,1, N1,1 and M0,0, M1,0, M0,1, M1,1, supply the operands for the first phase of Block (0, 0)'s computation]


Work for Block (0, 0)
[Figure: in the next phase, the second tile of N (N0,2, N1,2, N0,3, N1,3) and the corresponding tile of M are loaded into shared memory (SM)]


Work for Block (0, 0)
[Figure: the tiles now in shared memory (SM), including N0,2, N1,2, N0,3, N1,3, are used to complete Block (0, 0)'s elements of P]


Barrier Synchronization
• An API function call in CUDA: __syncthreads()
• All threads in the same block must reach the __syncthreads() before any can move on
• Best used to coordinate tiled algorithms
  – To ensure that all elements of a tile are loaded
  – To ensure that all elements of a tile are consumed
A small usage sketch follows.
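A minimal sketch of the two barrier uses listed above (a hypothetical kernel, not from the slides; it assumes blockDim.x == 256):

__global__ void barrierSketch(const float* in, float* out)
{
    __shared__ float tile[256];

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = in[i];
    __syncthreads();               // barrier 1: all elements of the tile are loaded

    // Each thread reads a neighbor's element, so the loads above must all have completed
    float v = tile[(threadIdx.x + 1) % blockDim.x];
    __syncthreads();               // barrier 2: the tile has been fully consumed
                                   // (matters when this pattern runs inside a loop, as in the tiled kernel)
    out[i] = v;
}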


Loading an M Tile
Upper-left corner of the M tile at step m:
    by * TILE_WIDTH * Width + m * TILE_WIDTH
Each thread uses ty and tx to load one element (upper-left corner + ty * Width + tx):
    index = by * TILE_WIDTH * Width + m * TILE_WIDTH + ty * Width + tx
          = (by * TILE_WIDTH + ty) * Width + m * TILE_WIDTH + tx
          = Row * Width + m * TILE_WIDTH + tx
[Figure: Md, Nd, and Pd divided into tiles; at step m, block (bx, by) loads the M tile whose rows correspond to Pdsub]


Tiled Matrix Multiplication Kernel

__global__ void MatrixMulKernel(float* d_M, float* d_N, float* d_P, int Width)
{
    __shared__ float ds_M[TILE_WIDTH][TILE_WIDTH];
    __shared__ float ds_N[TILE_WIDTH][TILE_WIDTH];

    int bx = blockIdx.x;  int by = blockIdx.y;
    int tx = threadIdx.x; int ty = threadIdx.y;

    // Identify the row and column of the d_P element to work on
    int Row = by * TILE_WIDTH + ty;
    int Col = bx * TILE_WIDTH + tx;

    float Pvalue = 0;
    // Loop over the d_M and d_N tiles required to compute the d_P element
    for (int m = 0; m < Width/TILE_WIDTH; ++m) {
        // Collaborative loading of d_M and d_N tiles into shared memory
        ds_M[ty][tx] = d_M[Row*Width + m*TILE_WIDTH + tx];
        ds_N[ty][tx] = d_N[Col + (m*TILE_WIDTH + ty)*Width];
        __syncthreads();

        for (int k = 0; k < TILE_WIDTH; ++k)
            Pvalue += ds_M[ty][k] * ds_N[k][tx];
        __syncthreads();
    }
    d_P[Row*Width + Col] = Pvalue;
}


Loading an N Tile
Upper-left corner of the N tile at step m:
    bx * TILE_WIDTH + m * TILE_WIDTH * Width
Each thread uses ty and tx to load one element (upper-left corner + ty * Width + tx):
    index = bx * TILE_WIDTH + m * TILE_WIDTH * Width + ty * Width + tx
          = bx * TILE_WIDTH + tx + (m * TILE_WIDTH + ty) * Width
          = Col + (m * TILE_WIDTH + ty) * Width
[Figure: Md, Nd, and Pd divided into tiles; at step m, block (bx, by) loads the N tile whose columns correspond to Pdsub]


First-order Size Considerations
• Each thread block should have many threads
  – TILE_WIDTH of 16 gives 16*16 = 256 threads
  – TILE_WIDTH of 32 gives 32*32 = 1,024 threads
• For TILE_WIDTH of 16, each block performs 2*256 = 512 float loads from global memory for 256 * (2*16) = 8,192 mul/add operations
  – q = # operations / memory access = 16
• For TILE_WIDTH of 32, each block performs 2*1,024 = 2,048 float loads from global memory for 1,024 * (2*32) = 65,536 mul/add operations
  – q = 32
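The same arithmetic can be written as compile-time constants. This is a sketch (not from the slides), counting the loads and operations per tile phase for TILE_WIDTH = 16:

#define TILE_WIDTH 16
enum {
    THREADS_PER_BLOCK = TILE_WIDTH * TILE_WIDTH,             // 256 threads
    LOADS_PER_PHASE   = 2 * THREADS_PER_BLOCK,               // 512 float loads (one d_M and one d_N element per thread)
    OPS_PER_PHASE     = THREADS_PER_BLOCK * 2 * TILE_WIDTH,  // 8,192 mul/add operations
    Q                 = OPS_PER_PHASE / LOADS_PER_PHASE      // q = 16 operations per global memory access
};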


Shared Memory and Threading
• Each SM in Fermi has 16 KB or 48 KB of shared memory
  – Shared memory size is implementation dependent!
  – For TILE_WIDTH = 16, each thread block uses 2*256*4 B = 2 KB of shared memory
  – Can potentially have up to 8 thread blocks actively executing
    • This allows up to 8*512 = 4,096 pending loads (2 per thread, 256 threads per block)
  – The next TILE_WIDTH, 32, would lead to 2*32*32*4 B = 8 KB of shared memory usage per thread block, allowing only 2 or 6 thread blocks active at the same time (with 16 KB or 48 KB of shared memory, respectively)
• Using 16 x 16 tiling, we reduce the accesses to global memory by a factor of 16
  – The 150 GB/s bandwidth can now support (150/4)*16 = 600 GFLOPS!


Summary: Typical Structure of a CUDA Program
• Global variable declarations
  – __host__
  – __device__ … __global__, __constant__, __texture__
• Function prototypes
  – __global__ void kernelOne(…)
  – float handyFunction(…)
• main()
  – allocate memory space on the device: cudaMalloc(&d_GlblVarPtr, bytes)
  – transfer data from host to device: cudaMemcpy(d_GlblVarPtr, h_Gl…)
  – execution configuration setup
  – kernel call: kernelOne<<<execution configuration>>>(args…)
  – transfer results from device to host: cudaMemcpy(h_GlblVarPtr, …)
  – optional: compare against golden (host-computed) solution
  – (repeat as needed)
• Kernel: void kernelOne(type args, …)
  – variable declarations: automatic, __shared__
    • automatic variables are transparently assigned to registers
  – __syncthreads()…
• Other functions
  – float handyFunction(int inVar, …);
A hedged end-to-end sketch of this structure follows.
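The following sketch follows the outline above, using vector addition as a stand-in kernel. The names (vecAddKernel, h_A, d_A, and so on) are hypothetical, and error checking is omitted for brevity:

#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

// Kernel: each thread adds one pair of elements (the automatic variable i lives in a register)
__global__ void vecAddKernel(const float* d_A, const float* d_B, float* d_C, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d_C[i] = d_A[i] + d_B[i];
}

int main(void)
{
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    float* h_A = (float*)malloc(bytes);
    float* h_B = (float*)malloc(bytes);
    float* h_C = (float*)malloc(bytes);
    for (int i = 0; i < n; ++i) { h_A[i] = 1.0f; h_B[i] = 2.0f; }

    // Allocate memory space on the device
    float *d_A, *d_B, *d_C;
    cudaMalloc(&d_A, bytes);  cudaMalloc(&d_B, bytes);  cudaMalloc(&d_C, bytes);

    // Transfer data from host to device
    cudaMemcpy(d_A, h_A, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, bytes, cudaMemcpyHostToDevice);

    // Execution configuration setup and kernel call
    dim3 block(256);
    dim3 grid((n + block.x - 1) / block.x);
    vecAddKernel<<<grid, block>>>(d_A, d_B, d_C, n);

    // Transfer results from device to host
    cudaMemcpy(h_C, d_C, bytes, cudaMemcpyDeviceToHost);

    // Optional: compare against a golden (host-computed) solution
    int ok = 1;
    for (int i = 0; i < n; ++i) if (h_C[i] != 3.0f) { ok = 0; break; }
    printf("%s\n", ok ? "PASSED" : "FAILED");

    cudaFree(d_A);  cudaFree(d_B);  cudaFree(d_C);
    free(h_A);  free(h_B);  free(h_C);
    return 0;
}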