Introduction to CUDA heterogeneous programming
Katia Oleinik, koleinik@bu.edu
Scientific Computing and Visualization, Boston University
CUDA overview: • Architecture • C language extensions • Terminology
CUDA basics: • Hello, World! • CUDA kernels • Blocks and threads
GPU memory: • Memory management • Parallel kernels • Thread synchronization • Race conditions and atomic operations
Architecture
NVIDIA Tesla M2070:
• Core clock: 1.15 GHz
• Single instruction
• 448 CUDA cores
• 1.15 x 1 x 448 = 515 Gigaflops double precision (peak)
• 1.03 Tflops single precision (peak)
• 3 GB total dedicated memory
Delivers performance at about 10% of the cost and 5% of the power of a CPU.
Architecture
CUDA:
• Compute Unified Device Architecture
• General Purpose Parallel Computing Architecture by NVIDIA
• Supports traditional OpenGL graphics
Architecture
Memory bandwidth: the rate at which data can be read from or stored into memory, expressed in bytes per second.
• Intel Xeon X5650: 32 GB/s
• Tesla M2070: 148 GB/s
Architecture
Tesla M2070 processor:
• Streaming Multiprocessors (SM): 14
• Streaming Processors on each SM: 32
• Total: 14 x 32 = 448 cores
Each Streaming Multiprocessor supports up to 1536 resident threads; a single block is limited to 1024 threads.
Architecture
CUDA follows the SIMT philosophy: Single Instruction, Multiple Thread. Problems that map well to the GPU are:
• Computationally intensive: the time spent on computation significantly exceeds the time spent on transferring data to and from GPU memory.
• Massively parallel: the computations can be broken down into hundreds or thousands of independent units of work.
Architecture
# Copy tutorial files
scc1 % cp -r /scratch/katia/cuda .
# Request interactive session on the node with GPU
scc1 % qrsh -l gpus=1
# Change directory
scc1-ha1 % cd deviceQuery
# Set environment variables to link to CUDA 5.0
scc1-ha1 % module load cuda/5.0
# Execute deviceQuery program
scc1-ha1 % ./deviceQuery
Architecture
Information that we will need later in this tutorial:
CUDA Driver Version / Runtime Version: 5.0 / 5.0
CUDA Capability Major/Minor version number: 2.0
Total amount of global memory: 5375 MBytes
(14) Multiprocessors x (32) CUDA Cores/MP: 448 CUDA Cores
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 32768
CUDA Architecture
Information that we will need later in this tutorial:
Warp size: 32
Maximum number of threads per multiprocessor: 1536
Maximum number of threads per block: 1024
Maximum sizes of each dimension of a block: 1024 x 1024 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 65535
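The same limits can be read from inside a program. The following is a minimal sketch, assuming a working CUDA runtime; the helper name countDevices is hypothetical and only the fields shown on the slides above are printed. It is not the source of the deviceQuery sample.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: number of CUDA devices, or 0 if the query fails.
int countDevices(void) {
    int n = 0;
    if (cudaGetDeviceCount(&n) != cudaSuccess) return 0;
    return n;
}

int main(void) {
    int n = countDevices();
    printf("Found %d CUDA device(s)\n", n);
    for (int dev = 0; dev < n; ++dev) {
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, dev) != cudaSuccess) continue;
        printf("Device %d: %s\n", dev, prop.name);
        printf("  Compute capability:      %d.%d\n", prop.major, prop.minor);
        printf("  Multiprocessors:         %d\n", prop.multiProcessorCount);
        printf("  Global memory:           %zu MBytes\n", prop.totalGlobalMem >> 20);
        printf("  Constant memory:         %zu bytes\n", prop.totalConstMem);
        printf("  Shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);
        printf("  Registers per block:     %d\n", prop.regsPerBlock);
        printf("  Warp size:               %d\n", prop.warpSize);
        printf("  Max threads per SM:      %d\n", prop.maxThreadsPerMultiProcessor);
        printf("  Max threads per block:   %d\n", prop.maxThreadsPerBlock);
        printf("  Max block dimensions:    %d x %d x %d\n",
               prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
        printf("  Max grid dimensions:     %d x %d x %d\n",
               prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
    }
    return 0;
}
```

Querying the limits at run time, rather than hard-coding them, lets the same launch configuration logic run on GPUs of different generations.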
CUDA Architecture
Query device capabilities and measure GPU/CPU bandwidth. This is a simple test program to measure the memcopy bandwidth of the GPU and the memcpy bandwidth across PCI-e.
# Change directory
scc1-ha1 % cd bandwidthTest
# Execute bandwidthTest program
scc1-ha1 % ./bandwidthTest
CUDA Terminology
• Host: the CPU and its memory (host memory)
• Device: the GPU and its memory (device memory)
CUDA: C Language Extensions
CUDA:
• Based on industry-standard C
• Language extensions allow heterogeneous programming
• APIs for memory and device management
Hello, Cuda!
CUDA: Basic example helloCuda1.cu
    #include <stdio.h>

    int main(void){
        printf("Hello, Cuda!\n");
        return(0);
    }
To build the program, use the nvcc compiler:
scc-he1: % nvcc -o helloCuda1 helloCuda1.cu
Hello, Cuda!
The CUDA language closely follows C/C++ syntax with a minimal set of extensions.
A function to be executed on the device (GPU) and called from host code is declared with __global__:
    __global__ void foo(){ ... }
The NVCC compiler compiles the functions that run on the device, and the host compiler (gcc) takes care of all other functions that run on the host (e.g. main()).
Hello, Cuda!
CUDA: Basic example helloCuda2.cu
    #include <stdio.h>

    __global__ void cudakernel(void){
        printf("Hello, I am CUDA kernel! Nice to meet you!\n");
    }
Hello, Cuda!
CUDA: Basic example helloCuda2.cu
    int main(void){
        printf("Hello, Cuda!\n");
        cudakernel<<<1, 1>>>();
        cudaDeviceSynchronize();
        printf("Nice to meet you too! Bye, CUDA\n");
        return(0);
    }
Hello, Cuda!
CUDA: Basic example helloCuda2.cu
    cudakernel<<<N, M>>>();
    cudaDeviceSynchronize();
Triple angle brackets indicate that the function will be executed on the device (GPU). Such a function is called a kernel. A kernel always has return type void.
The kernel launch returns control to the host immediately, without waiting for the kernel to finish. To prevent the program from exiting before the kernel has completed, we have to call cudaDeviceSynchronize().
CUDA: C Language Extensions
There are a number of CUDA functions:
• Device management: cudaGetDeviceCount(), cudaGetDeviceProperties()
• Error management: cudaGetLastError(), cudaSafeCall(), cudaCheckError()
• Device memory management: cudaMalloc(), cudaFree(), cudaMemcpy()
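Of the error-management names listed above, cudaGetLastError() is a CUDA runtime function, while cudaSafeCall() and cudaCheckError() are usually wrappers defined by the sample code itself. The sketch below shows one common way such a wrapper might look; the macro name CHECK_CUDA and the kernel emptyKernel are hypothetical and not taken from the tutorial files.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical wrapper macro (not part of the CUDA runtime): evaluates a
// runtime call and aborts with a readable message if it did not succeed.
#define CHECK_CUDA(call)                                             \
    do {                                                             \
        cudaError_t err_ = (call);                                   \
        if (err_ != cudaSuccess) {                                   \
            fprintf(stderr, "CUDA error at %s:%d: %s\n",             \
                    __FILE__, __LINE__, cudaGetErrorString(err_));   \
            exit(EXIT_FAILURE);                                      \
        }                                                            \
    } while (0)

// Hypothetical kernel that does nothing; it only gives us a launch to check.
__global__ void emptyKernel(void) {}

int main(void) {
    float *d_buf = NULL;
    CHECK_CUDA(cudaMalloc((void **)&d_buf, 256 * sizeof(float)));

    emptyKernel<<<1, 1>>>();
    CHECK_CUDA(cudaGetLastError());       // reports launch errors
    CHECK_CUDA(cudaDeviceSynchronize());  // reports errors raised while the kernel ran

    CHECK_CUDA(cudaFree(d_buf));
    printf("All CUDA calls succeeded\n");
    return 0;
}
```

A kernel launch has no return value, so the launch is checked with cudaGetLastError() right after it, and execution errors surface at the next synchronizing call.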
Hello, Cuda!
CUDA: Basic example helloCuda2.cu
To build the program, use the nvcc compiler:
scc-he1: % nvcc -o helloCuda2 helloCuda2.cu -arch sm_20
The ability to print from within a kernel is not available on the earliest GPU architectures; it was added with Compute Capability 2.0. To request Compute Capability 2.0 support, we add the -arch sm_20 option to the compilation command line.
Hello, Cuda!
CUDA: Basic example helloCudaBlock.cu
    #include <stdio.h>

    __global__ void cudakernel(void){
        printf("Hello, I am CUDA block %d!\n", blockIdx.x);
    }

    int main(void){
        ...
        cudakernel<<<16, 1>>>();
        ...
    }
To simplify the compilation process we will use a Makefile:
% make helloCudaBlock
CUDA: C Language Extensions
CUDA provides special variables for thread identification in the kernel:
    uint3 threadIdx;  // thread ID within the block
    uint3 blockIdx;   // block ID within the grid
    dim3 blockDim;    // number of threads per block
    dim3 gridDim;     // number of blocks in the grid
In the simple one-dimensional case, we use only the first component of each variable, e.g. threadIdx.x
CUDA: Blocks and Threads
[Figure: execution alternates between serial code on the host and kernels (Kernel A, Kernel B) on the device]
CUDA: C Language Extensions
CUDA: Basic example helloCudaThread.cu
    #include <stdio.h>

    __global__ void cudakernel(void){
        printf("Hello, I am CUDA thread %d!\n", threadIdx.x);
    }

    int main(void){
        ...
        cudakernel<<<1, 16>>>();
        ...
    }
CUDA: Blocks and Threads
• One kernel is executed on the device at a time
• Many threads execute each kernel
• Each thread executes the same code (SPMD)
• Threads are grouped into thread blocks
• A kernel is a grid of thread blocks
• Threads are scheduled as sets of warps
• A warp is a group of 32 threads
• An SM executes the same instruction on all threads in the warp
• Blocks cannot synchronize with each other and can run in any order
Vector Addition Example
CUDA: vectorAdd.cu
    __global__ void vectorAdd(const float *A, const float *B, float *C, int numElements){
        int i = blockDim.x * blockIdx.x + threadIdx.x;
        if (i < numElements) {
            C[i] = A[i] + B[i];
        }
    }
Vector Addition Example
CUDA: vectorAdd.cu
[Figure: four blocks (blockIdx.x = 0, 1, 2, 3), each containing eight threads with threadIdx.x = 0 ... 7]
    int i = blockDim.x * blockIdx.x + threadIdx.x;
Unlike blocks, threads within a block have mechanisms to communicate and synchronize.
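To see the index formula at work, the sketch below launches the 4 blocks x 8 threads layout from the figure and has every thread store its own global index. The kernel name writeGlobalIndex and the helper countIndexMismatches are hypothetical, and the helper assumes at most 32 threads in total to keep the host buffer simple.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each thread writes its global index into the slot with that index.
__global__ void writeGlobalIndex(int *out, int n) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n) out[i] = i;
}

// Hypothetical helper: launches blocks x threadsPerBlock threads (at most 32
// in total) and returns how many slots do not hold their own index,
// or -1 if a CUDA call fails or the configuration is too large.
int countIndexMismatches(int blocks, int threadsPerBlock) {
    int h_out[32];
    int n = blocks * threadsPerBlock;
    if (n < 1 || n > 32) return -1;

    int *d_out = NULL;
    if (cudaMalloc((void **)&d_out, n * sizeof(int)) != cudaSuccess) return -1;

    writeGlobalIndex<<<blocks, threadsPerBlock>>>(d_out, n);
    cudaError_t err = cudaMemcpy(h_out, d_out, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    if (err != cudaSuccess) return -1;

    int mismatches = 0;
    for (int i = 0; i < n; ++i) {
        if (h_out[i] != i) ++mismatches;
    }
    return mismatches;
}

int main(void) {
    // Same layout as the figure: 4 blocks of 8 threads cover indices 0..31.
    printf("Mismatches: %d\n", countIndexMismatches(4, 8));
    return 0;
}
```

Thread 3 of block 2, for example, computes 8 * 2 + 3 = 19, so every element of the array is handled by exactly one thread.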
Vector Addition Example
CUDA: vectorAdd.cu, device memory allocation
    int main(void)
    {
        ...
        float *d_A = NULL;
        err = cudaMalloc((void **)&d_A, size);

        float *d_B = NULL;
        err = cudaMalloc((void **)&d_B, size);

        float *d_C = NULL;
        err = cudaMalloc((void **)&d_C, size);
        ...
    }
Vector Addition Example
CUDA: vectorAdd.cu
    int main(void)
    {
        ...
        // Copy input values to the device (A is the host array)
        cudaMemcpy(d_A, A, size, cudaMemcpyHostToDevice);
        ...
    }
Vector Addition Example
CUDA: vectorAdd.cu
    int main(void)
    {
        ...
        // Launch the Vector Add CUDA Kernel
        int threadsPerBlock = 256;
        // Round up so that every element is covered by a thread
        int blocksPerGrid = (numElements + threadsPerBlock - 1) / threadsPerBlock;
        vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, numElements);
        err = cudaGetLastError();
        ...
    }
Vector Addition Example
CUDA: vectorAdd.cu
    int main(void)
    {
        ...
        // Copy result back to host (C is the host array)
        cudaMemcpy(C, d_C, size, cudaMemcpyDeviceToHost);

        // Clean-up
        cudaFree(d_A);
        cudaFree(d_B);
        cudaFree(d_C);
        ...
    }
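The vector addition slides show the program in fragments. Put together, the steps (allocate, copy in, launch, copy out, free) might look like the sketch below. It is an illustration assembled from those fragments, not the tutorial's vectorAdd.cu; the helper vectorAddMaxError, the host array names and the test data are assumptions.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cmath>
#include <cuda_runtime.h>

__global__ void vectorAdd(const float *A, const float *B, float *C, int numElements) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < numElements) C[i] = A[i] + B[i];
}

// Hypothetical helper: adds two vectors on the GPU and returns the largest
// absolute difference from the CPU result, or -1.0f if a CUDA call fails.
float vectorAddMaxError(int numElements) {
    size_t size = numElements * sizeof(float);
    float *h_A = (float *)malloc(size);
    float *h_B = (float *)malloc(size);
    float *h_C = (float *)malloc(size);
    if (!h_A || !h_B || !h_C) { free(h_A); free(h_B); free(h_C); return -1.0f; }
    for (int i = 0; i < numElements; ++i) {
        h_A[i] = 0.5f * i;   // arbitrary test data
        h_B[i] = 2.0f * i;
    }

    // Device memory allocation
    float *d_A = NULL, *d_B = NULL, *d_C = NULL;
    cudaError_t err = cudaMalloc((void **)&d_A, size);
    if (err == cudaSuccess) err = cudaMalloc((void **)&d_B, size);
    if (err == cudaSuccess) err = cudaMalloc((void **)&d_C, size);

    // Copy input values to the device
    if (err == cudaSuccess) err = cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) err = cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

    // Launch the kernel with enough blocks to cover every element
    if (err == cudaSuccess) {
        int threadsPerBlock = 256;
        int blocksPerGrid = (numElements + threadsPerBlock - 1) / threadsPerBlock;
        vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, numElements);
        err = cudaGetLastError();
    }

    // Copy result back to host (this call waits for the kernel to finish)
    if (err == cudaSuccess) err = cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);

    float maxErr = -1.0f;
    if (err == cudaSuccess) {
        maxErr = 0.0f;
        for (int i = 0; i < numElements; ++i)
            maxErr = fmaxf(maxErr, fabsf(h_C[i] - (h_A[i] + h_B[i])));
    }

    // Clean-up (cudaFree on a NULL pointer is a no-op)
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    free(h_A); free(h_B); free(h_C);
    return maxErr;
}

int main(void) {
    printf("Max error: %g\n", vectorAddMaxError(50000));
    return 0;
}
```

With 50000 elements and 256 threads per block the launch uses 196 blocks; the last block is only partly used, which is why the kernel guards the access with i < numElements.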
Timing CUDA kernel
CUDA: vectorAddTime.cu
    float memsettime;
    cudaEvent_t start, stop;

    // initialize CUDA timer
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start, 0);

    // CUDA Kernel
    ...

    // stop CUDA timer
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&memsettime, start, stop);  // elapsed time in milliseconds
    printf(" *** CUDA execution time: %f *** \n", memsettime);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
Timing CUDA kernel
CUDA: vectorAddTime.cu
scc-ha1 % make
# specify the number of threads per block
scc-ha1 % vectorAddTime 128
Explore how the CUDA kernel execution time depends on the block size. Remember:
• A CUDA Streaming Multiprocessor executes threads in warps (32 threads)
• There is a maximum of 1024 threads per block (for our GPU)
• There is a maximum of 1536 threads per multiprocessor (for our GPU)
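The three limits above already explain part of what the timing experiment shows. The host-only sketch below estimates, for a given block size, how many warps a block contains and how many threads can be resident on one multiprocessor. Both helper names are hypothetical. The limit of 8 resident blocks per multiprocessor is an assumption that does not appear on the slides (it is the value usually quoted for Compute Capability 2.x), and register and shared-memory limits are ignored, so the result is only an upper bound.

```cuda
#include <cstdio>

// Hypothetical helper: number of warps needed for one block (rounded up).
int warpsPerBlock(int threadsPerBlock, int threadsPerWarp) {
    return (threadsPerBlock + threadsPerWarp - 1) / threadsPerWarp;
}

// Hypothetical helper: upper bound on threads resident on one multiprocessor.
// Only whole blocks fit, and at most maxBlocksPerSM of them (assumed limit).
int residentThreadsPerSM(int threadsPerBlock, int maxThreadsPerSM, int maxBlocksPerSM) {
    int blocks = maxThreadsPerSM / threadsPerBlock;
    if (blocks > maxBlocksPerSM) blocks = maxBlocksPerSM;
    return blocks * threadsPerBlock;
}

int main(void) {
    const int threadsPerWarp = 32;     // warp size reported by deviceQuery
    const int maxThreadsPerSM = 1536;  // reported by deviceQuery
    const int maxBlocksPerSM = 8;      // assumption, not from the slides
    const int sizes[] = {32, 64, 128, 192, 256, 512, 768, 1024};

    printf("block size  warps/block  resident threads per SM\n");
    for (int k = 0; k < 8; ++k) {
        int b = sizes[k];
        printf("%10d  %11d  %23d\n", b,
               warpsPerBlock(b, threadsPerWarp),
               residentThreadsPerSM(b, maxThreadsPerSM, maxBlocksPerSM));
    }
    return 0;
}
```

Under these assumptions a block of 1024 threads leaves a third of the multiprocessor's thread capacity unused, because a second block of that size does not fit within 1536 threads, while very small blocks run into the block-count limit instead.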
Dot Product
CUDA: dotProd1.cu
[Figure: the pairwise products a0*b0, a1*b1, a2*b2, a3*b3 are summed into C]
C = A * B = (a0, a1, a2, a3) * (b0, b1, b2, b3) = a0*b0 + a1*b1 + a2*b2 + a3*b3
Dot Product
CUDA: dotProd1.cu
• A block of threads shares common memory, called shared memory.
• Shared memory is extremely fast on-chip memory.
• To declare shared memory, use the __shared__ keyword.
• Shared memory is not visible to the threads in other blocks.
Dot Product
CUDA: dotProd1.cu
    #define N 512
    __global__ void dot( int *a, int *b, int *c ) {
        // Shared memory for results of multiplication
        __shared__ int temp[N];
        temp[threadIdx.x] = a[threadIdx.x] * b[threadIdx.x];

        // Thread 0 sums the pairwise products
        if( threadIdx.x == 0 ) {
            int sum = 0;
            for( int i = 0; i < N; i++ )
                sum += temp[i];
            *c = sum;
        }
    }
What if thread 0 starts to calculate the sum before the other threads have completed their calculations?
Thread Synchronization
CUDA: dotProd1.cu
    #define N 512
    __global__ void dot( int *a, int *b, int *c ) {
        // Shared memory for results of multiplication
        __shared__ int temp[N];
        temp[threadIdx.x] = a[threadIdx.x] * b[threadIdx.x];

        // Wait until every thread in the block has stored its product
        __syncthreads();

        // Thread 0 sums the pairwise products
        if( threadIdx.x == 0 ) {
            int sum = 0;
            for( int i = 0; i < N; i++ )
                sum += temp[i];
            *c = sum;
        }
    }
Thread Synchronization
CUDA: dotProd1.cu
    int main(void)
    {
        ...
        // copy input vectors to the device
        ...
        // Launch CUDA kernel
        dot<<<1, N>>>(dev_A, dev_B, dev_C);
        ...
        // copy the result from the device
        ...
    }
But the vector length is limited by the maximum block size. Can we use multiple blocks?
Race Condition
CUDA: dotProd2.cu
[Figure: Block 0 computes a0*b0 ... a3*b3 and Block 1 computes a4*b4 ... a7*b7; each block adds its partial sum to C]
Race Condition
CUDA: dotProd2.cu
    #define N (2048*2048)
    #define THREADS_PER_BLOCK 512
    __global__ void dotProductKernel( int *a, int *b, int *c ) {
        __shared__ int temp[THREADS_PER_BLOCK];
        int index = threadIdx.x + blockIdx.x * blockDim.x;
        temp[threadIdx.x] = a[index] * b[index];
        __syncthreads();

        if( threadIdx.x == 0 ) {
            int sum = 0;
            for( int i = 0; i < THREADS_PER_BLOCK; i++ )
                sum += temp[i];
            *c += sum;   // unprotected read-modify-write shared by all blocks
        }
    }
Blocks interfere with each other: the update of *c is a race condition.
Race Condition
CUDA: dotProd2.cu
    #define N (2048*2048)
    #define THREADS_PER_BLOCK 512
    __global__ void dotProductKernel( int *a, int *b, int *c ) {
        __shared__ int temp[THREADS_PER_BLOCK];
        int index = threadIdx.x + blockIdx.x * blockDim.x;
        temp[threadIdx.x] = a[index] * b[index];
        __syncthreads();

        if( threadIdx.x == 0 ) {
            int sum = 0;
            for( int i = 0; i < THREADS_PER_BLOCK; i++ )
                sum += temp[i];
            atomicAdd(c, sum);   // uninterruptible read-modify-write
        }
    }
Atomic Operations
Race condition: program behavior depends on the relative timing of multiple event sequences. It can occur when an implied read-modify-write is interruptible.
Atomic operations make the read-modify-write uninterruptible:
• atomicAdd()
• atomicSub()
• atomicMin()
• atomicMax()
• atomicInc()
• atomicDec()
• atomicExch()
• atomicCAS()
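To try the multi-block dot product end to end, a host program has to zero the result before the launch, because every block adds into it. The sketch below is one possible driver around the atomicAdd() kernel; it is not the tutorial's dotProd2.cu. The helpers dotOnDevice and dotDemoDifference, the smaller vector length and the test data are assumptions, and error checks are left out for brevity.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define N (8 * 512)            // smaller than on the slide, to keep the demo light
#define THREADS_PER_BLOCK 512

__global__ void dotProductKernel(const int *a, const int *b, int *c) {
    __shared__ int temp[THREADS_PER_BLOCK];
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    temp[threadIdx.x] = a[index] * b[index];
    __syncthreads();

    // Thread 0 of each block adds the block's partial sum to the result
    if (threadIdx.x == 0) {
        int sum = 0;
        for (int i = 0; i < THREADS_PER_BLOCK; i++) sum += temp[i];
        atomicAdd(c, sum);
    }
}

// Hypothetical helper: dot product of two host vectors computed on the GPU.
// n must be a multiple of THREADS_PER_BLOCK. Error checks omitted for brevity.
int dotOnDevice(const int *h_a, const int *h_b, int n) {
    int *d_a = NULL, *d_b = NULL, *d_c = NULL;
    size_t bytes = n * sizeof(int);
    int result = 0;

    cudaMalloc((void **)&d_a, bytes);
    cudaMalloc((void **)&d_b, bytes);
    cudaMalloc((void **)&d_c, sizeof(int));
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);
    cudaMemset(d_c, 0, sizeof(int));   // the blocks accumulate into *d_c

    dotProductKernel<<<n / THREADS_PER_BLOCK, THREADS_PER_BLOCK>>>(d_a, d_b, d_c);

    // This copy waits for the kernel to finish before reading the result
    cudaMemcpy(&result, d_c, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return result;
}

// Hypothetical helper: GPU result minus CPU result for a small test case.
int dotDemoDifference(void) {
    static int a[N], b[N];
    int expected = 0;
    for (int i = 0; i < N; i++) {
        a[i] = i % 7;          // arbitrary small values, no overflow
        b[i] = 2;
        expected += a[i] * b[i];
    }
    return dotOnDevice(a, b, N) - expected;
}

int main(void) {
    printf("GPU result minus CPU result: %d\n", dotDemoDifference());
    return 0;
}
```

Replacing atomicAdd(c, sum) with *c += sum in this sketch reintroduces the race: with 8 blocks the printed difference may then be nonzero on some runs.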
CUDA Best Practices
NVIDIA's link: http://docs.nvidia.com/cuda-c-best-practices-guide/index.html
1. Assess: locate the slowest part of the code
    gcc -O2 -g -pg myprog.c
    gprof ./a.out > profile.txt
2. Parallelize: use CUDA to parallelize the code; use optimized cu* libraries if possible
3. Optimize: overlap data transfers, fine-tune operation sequences
4. Deploy: compare the outcome with the original expectations
CUDA Debugging
• CUDA-GDB: GNU Debugger that runs on Linux and Mac: http://developer.nvidia.com/cuda-gdb
• The NVIDIA Parallel Nsight debugging and profiling tool for Microsoft Windows Vista and Windows 7 is available as a free plugin for Microsoft Visual Studio: http://developer.nvidia.com/nvidia-parallel-nsight
This tutorial has been made possible by the Scientific Computing and Visualization group at Boston University.
Katia Oleinik, koleinik@bu.edu
http://www.bu.edu/tech/research/training/tutorials/list/