ECE 408 / CS 483: Applied Parallel Programming

Lecture 2: Introduction to CUDA C

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2011, ECE 408/CS 483, University of Illinois, Urbana-Champaign
CUDA / OpenCL – Execution Model

• Integrated host+device app C program
  – Serial or modestly parallel parts in host C code
  – Highly parallel parts in device SPMD kernel C code

Execution alternates between the two:

    Serial Code (host)
    Parallel Kernel (device):  KernelA<<<nBlk, nTid>>>(args);
    Serial Code (host)
    Parallel Kernel (device):  KernelB<<<nBlk, nTid>>>(args);
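As a minimal sketch of this pattern (the placeholder kernels KernelA and KernelB and the 256-thread block size are our assumptions; the slide names them only generically):

```cuda
// Sketch of the host/device alternation. KernelA and KernelB are
// placeholder SPMD kernels that each process d_data in parallel.
__global__ void KernelA(float *d, int n) { /* highly parallel phase A */ }
__global__ void KernelB(float *d, int n) { /* highly parallel phase B */ }

void run(float *d_data, int n) {
    int nBlk = (n + 255) / 256, nTid = 256;
    // ... serial host code ...
    KernelA<<<nBlk, nTid>>>(d_data, n);   // parallel part runs on the device
    // ... more serial host code ...
    KernelB<<<nBlk, nTid>>>(d_data, n);   // second parallel part
}
```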
Arrays of Parallel Threads

• A CUDA kernel is executed by a grid (array) of threads
  – All threads in a grid run the same kernel code (SPMD)
  – Each thread has an index that it uses to compute memory addresses and make control decisions
  – The thread index is supplied by the hardware

Threads 0, 1, 2, …, 254, 255 each execute:

    i = blockIdx.x * blockDim.x + threadIdx.x;
    C_d[i] = A_d[i] + B_d[i];

For example, with 256 threads per block, thread 2 of block 3 computes i = 3 * 256 + 2 = 770 and therefore works on element 770; every thread gets a distinct element purely from its indices.
Thread Blocks: Scalable Cooperation

• Divide the thread array (grid) into multiple blocks
  – Threads within a block cooperate via shared memory, atomic operations, and barrier synchronization
  – Threads in different blocks cannot cooperate

Thread Block 0, Thread Block 1, …, Thread Block N-1 each contain threads 0 … 255 running the same code:

    i = blockIdx.x * blockDim.x + threadIdx.x;
    C_d[i] = A_d[i] + B_d[i];
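To make the cooperation point concrete, here is a minimal sketch (not from the lecture) in which the threads of one block stage data in shared memory and synchronize at a barrier before reading a neighbor's value:

```cuda
// Sketch of intra-block cooperation: adjacent-element sums via shared
// memory and a barrier. Assumes 256-thread blocks and n a multiple of
// 256, so no bounds check on i is needed.
__global__ void adjSum(float *in, float *out, int n) {
    __shared__ float tile[256];          // visible to all threads in this block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = in[i];           // each thread stages one element
    __syncthreads();                     // barrier: the whole tile is loaded
    if (threadIdx.x < blockDim.x - 1)    // read a neighbor's staged value
        out[i] = tile[threadIdx.x] + tile[threadIdx.x + 1];
}
```

Note that this only works because both threads are in the same block; there is no such barrier across blocks.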
blockIdx and threadIdx

• Each thread uses its indices to decide what data to work on
  – blockIdx: 1D, 2D, or 3D (3D as of CUDA 4.0)
  – threadIdx: 1D, 2D, or 3D
• Simplifies memory addressing when processing multidimensional data
  – Image processing
  – Solving PDEs on volumes
  – …
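As an illustration of the 2D case (a sketch with names of our choosing, not the lecture's code), a kernel over a width × height image can map each thread to one pixel:

```cuda
// 2D-indexing sketch: one thread per pixel of a row-major image.
// img, width, and height are illustrative names.
__global__ void brighten(float *img, int width, int height) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (col < width && row < height)
        img[row * width + col] += 0.1f;   // linearize the 2D index
}

// Launched with a 2D grid of 16x16 blocks covering the image:
//   dim3 DimBlock(16, 16, 1);
//   dim3 DimGrid((width + 15) / 16, (height + 15) / 16, 1);
//   brighten<<<DimGrid, DimBlock>>>(img_d, width, height);
```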
Vector Addition – Conceptual View

[Figure: vectors A and B are added element by element to produce vector C, i.e. C[i] = A[i] + B[i] for i = 0 … N-1.]
Vector Addition – Traditional C Code

    // Compute vector sum C = A + B
    void vecAdd(float *A, float *B, float *C, int n)
    {
        for (int i = 0; i < n; i++)
            C[i] = A[i] + B[i];
    }

    int main()
    {
        // Memory allocation for A_h, B_h, and C_h
        // I/O to read A_h and B_h, N elements each
        …
        vecAdd(A_h, B_h, C_h, N);
    }
Heterogeneous Computing: vecAdd Host Code

[Figure: CPU with host memory and GPU with device memory; parts 1–3 below move data between them.]

    void vecAdd(float *A, float *B, float *C, int n)
    {
        int size = n * sizeof(float);
        float *A_d, *B_d, *C_d;
        …
        // Part 1: Allocate device memory for A, B, and C;
        //         copy A and B to device memory

        // Part 2: Kernel launch code – have the device
        //         perform the actual vector addition

        // Part 3: Copy C from the device memory;
        //         free the device vectors
    }
Partial Overview of CUDA Memories

• Device code can:
  – R/W per-thread registers
  – R/W per-grid global memory
• Host code can:
  – Transfer data to/from per-grid global memory

[Figure: device grid with Block (0,0) and Block (1,0), each holding threads with per-thread registers; all blocks share global memory, which the host can access.]

We will cover more later.
CUDA Device Memory Management API Functions

• cudaMalloc()
  – Allocates an object in the device global memory
  – Two parameters:
    • Address of a pointer to the allocated object
    • Size of the allocated object in bytes
• cudaFree()
  – Frees an object from device global memory
  – One parameter: pointer to the freed object
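A minimal usage sketch of the allocate/free pair (error handling omitted; the size computation assumes n floats):

```cuda
float *A_d;                          // device pointer, set by cudaMalloc
int size = n * sizeof(float);        // allocation size in bytes

cudaMalloc((void **)&A_d, size);     // pass the ADDRESS of the pointer
// ... use A_d in kernels ...
cudaFree(A_d);                       // release the device allocation
```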
Host-Device Data Transfer API Functions

• cudaMemcpy()
  – Memory data transfer
  – Requires four parameters:
    • Pointer to destination
    • Pointer to source
    • Number of bytes copied
    • Type/direction of transfer
  – Transfer to device is asynchronous
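For example, copying n floats from a host array A to a device array A_d, and the result C_d back to C (a sketch; the complete host code on the next slide uses exactly these calls):

```cuda
int size = n * sizeof(float);

// Host-to-device: destination first, then source, byte count, direction
cudaMemcpy(A_d, A, size, cudaMemcpyHostToDevice);

// Device-to-host, after the kernel has produced C_d
cudaMemcpy(C, C_d, size, cudaMemcpyDeviceToHost);
```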
vecAdd Host Code

    void vecAdd(float *A, float *B, float *C, int n)
    {
        int size = n * sizeof(float);
        float *A_d, *B_d, *C_d;

        // 1. Transfer A and B to device memory
        cudaMalloc((void **)&A_d, size);
        cudaMemcpy(A_d, A, size, cudaMemcpyHostToDevice);
        cudaMalloc((void **)&B_d, size);
        cudaMemcpy(B_d, B, size, cudaMemcpyHostToDevice);

        // Allocate device memory for C
        cudaMalloc((void **)&C_d, size);

        // 2. Kernel invocation code – to be shown later
        …

        // 3. Transfer C from device to host
        cudaMemcpy(C, C_d, size, cudaMemcpyDeviceToHost);

        // Free device memory for A, B, C
        cudaFree(A_d); cudaFree(B_d); cudaFree(C_d);
    }
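In practice one usually checks the cudaError_t that each of these API calls returns. A common pattern (our addition, not part of the lecture code) is a small checking macro:

```cuda
#include <stdio.h>
#include <stdlib.h>

// One conventional way to surface CUDA API errors; illustrative only.
#define CUDA_CHECK(call)                                          \
    do {                                                          \
        cudaError_t err = (call);                                 \
        if (err != cudaSuccess) {                                 \
            fprintf(stderr, "CUDA error at %s:%d: %s\n",          \
                    __FILE__, __LINE__, cudaGetErrorString(err)); \
            exit(EXIT_FAILURE);                                   \
        }                                                         \
    } while (0)

// Usage: CUDA_CHECK(cudaMalloc((void **)&A_d, size));
```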
Example: Vector Addition Kernel

Device code:

    // Compute vector sum C = A + B
    // Each thread performs one pair-wise addition
    __global__
    void vecAddKernel(float *A_d, float *B_d, float *C_d, int n)
    {
        int i = threadIdx.x + blockDim.x * blockIdx.x;
        if (i < n) C_d[i] = A_d[i] + B_d[i];
    }

Host code:

    void vecAdd(float *A, float *B, float *C, int n)
    {
        // A_d, B_d, C_d allocations and copies omitted
        // Run ceil(n/256) blocks of 256 threads each
        vecAddKernel<<<ceil(n/256.0), 256>>>(A_d, B_d, C_d, n);
    }
Integer Number of Thread Blocks

• For a vector of 1000 elements
  – ceil(1000/256) = 4 blocks of 256 threads: 256 * 4 = 1024 threads
  – The last thread block has 24 threads with no elements to process, which is why the kernel guards with if (i < n)

Thread Blocks 0 … N-1 all run the same code:

    i = blockIdx.x * blockDim.x + threadIdx.x;
    C_d[i] = A_d[i] + B_d[i];
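A common way to compute this rounded-up block count in pure integer arithmetic (our sketch; the lecture's dim3-based version appears on a later slide):

```cuda
int threadsPerBlock = 256;
// Round up: for n = 1000, (1000 + 255) / 256 = 4 blocks cover all elements
int numBlocks = (n + threadsPerBlock - 1) / threadsPerBlock;
vecAddKernel<<<numBlocks, threadsPerBlock>>>(A_d, B_d, C_d, n);
```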
Example: Vector Addition Kernel (cont.)

The same example, now highlighting the host code (the __global__ vecAddKernel device code is unchanged from the previous slide):

    void vecAdd(float *A, float *B, float *C, int n)
    {
        // A_d, B_d, C_d allocations and copies omitted
        // Run ceil(n/256) blocks of 256 threads each
        vecAddKernel<<<ceil(n/256.0), 256>>>(A_d, B_d, C_d, n);
    }
More on Kernel Launch

Host code:

    void vecAdd(float *A, float *B, float *C, int n)
    {
        // A_d, B_d, C_d allocations and copies omitted
        // Run ceil(n/256) blocks of 256 threads each
        dim3 DimGrid(n/256, 1, 1);
        if (n % 256) DimGrid.x++;     // round up for a partial last block
        dim3 DimBlock(256, 1, 1);
        vecAddKernel<<<DimGrid, DimBlock>>>(A_d, B_d, C_d, n);
    }

• Any call to a kernel function is asynchronous from CUDA 1.0 on; explicit synchronization is needed for blocking (see the sketch below).
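One way to block the host until the kernel finishes (a sketch; cudaDeviceSynchronize is the modern call):

```cuda
vecAddKernel<<<DimGrid, DimBlock>>>(A_d, B_d, C_d, n);  // returns immediately

// Block the host until all previously launched device work completes.
// A subsequent cudaMemcpy on the same default stream would also wait
// for the kernel implicitly.
cudaDeviceSynchronize();
```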
Kernel Execution in a Nutshell

Host code:

    __host__
    void vecAdd(float *A, float *B, float *C, int n)
    {
        dim3 DimGrid(ceil(n/256.0), 1, 1);
        dim3 DimBlock(256, 1, 1);
        vecAddKernel<<<DimGrid, DimBlock>>>(A_d, B_d, C_d, n);
    }

Device code:

    __global__
    void vecAddKernel(float *A_d, float *B_d, float *C_d, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) C_d[i] = A_d[i] + B_d[i];
    }

[Figure: the kernel's blocks Blk 0 … Blk N-1 are scheduled onto the GPU's multiprocessors M0 … Mk, which share the device RAM.]
More on CUDA Function Declarations

|                                | Executed on the: | Only callable from the: |
|--------------------------------|------------------|-------------------------|
| __device__ float DeviceFunc()  | device           | device                  |
| __global__ void KernelFunc()   | device           | host                    |
| __host__ float HostFunc()      | host             | host                    |

• __global__ defines a kernel function
  – Each “__” consists of two underscore characters
  – A kernel function must return void
• __device__ and __host__ can be used together
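Combining the two qualifiers lets one definition compile for both sides (an illustrative sketch, not from the slide):

```cuda
// Compiled twice: once as a host function and once as a device
// function, so both host code and kernels can call it.
__host__ __device__
float clampf(float x, float lo, float hi) {
    return x < lo ? lo : (x > hi ? hi : x);
}
```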
Compiling a CUDA Program

[Figure: compilation flow]
  Integrated C program with CUDA extensions
    → NVCC compiler splits it into:
        – Host code → host C compiler/linker
        – Device code (PTX) → device just-in-time compiler
    → both run on a heterogeneous computing platform with CPUs and GPUs
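In practice the whole flow is driven by a single nvcc invocation; assuming the vector-addition source lives in a file named vecadd.cu (a name we choose here, not the slide's):

```
nvcc vecadd.cu -o vecadd   # nvcc separates host and device code,
                           # compiles each, and links the executable
./vecadd
```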
QUESTIONS?