Basic CUDA Programming Shin-Kai Chen skchen@twins.ee.nctu.edu.tw VLSI Signal Processing Laboratory Department of Electronics Engineering National Chiao Tung University
What will you learn in this lab? • Concept of multicore accelerator • Multithreaded/multicore programming • Memory optimization
Slides • Mostly from Prof. Wen-Mei Hwu of UIUC – http://courses.ece.uiuc.edu/ece498/al/Syllabus.html
CUDA – Hardware? Software?
Host-Device Architecture • CPU (host) • GPU w/ local DRAM (device)
G80 CUDA Mode – A Device Example [Figure: host feeds an input assembler and thread execution manager; multiple multiprocessors, each with a parallel data cache and texture unit, perform loads/stores to global memory]
Functional Units in G80 • Streaming Multiprocessor (SM) – 1 instruction decoder (1 instruction / 4 cycles) – 8 streaming processors (SP) – Shared memory [Figure: SM 0 and SM 1, each with a multithreaded instruction unit (MT IU), SPs, shared memory, and threads t0 t1 t2 … tm grouped into blocks]
Setup CUDA for Windows
CUDA Environment Setup • Get a GPU that supports CUDA – http://www.nvidia.com/object/cuda_learn_products.html • Download CUDA – http://www.nvidia.com/object/cuda_get.html • CUDA driver • CUDA toolkit • CUDA SDK (optional) • Install CUDA • Test CUDA – Device Query
Setup CUDA for Visual Studio • From scratch – http://forums.nvidia.com/index.php?showtopic=30273 • CUDA VS Wizard – http://sourceforge.net/projects/cudavswizard/ • Modified from existing project
Lab 1: First CUDA Program
CUDA Computing Model
Data Manipulation between Host and Device • cudaError_t cudaMalloc( void** devPtr, size_t count ) – Allocates count bytes of linear memory on the device and returns a pointer to the allocated memory in *devPtr • cudaError_t cudaMemcpy( void* dst, const void* src, size_t count, enum cudaMemcpyKind kind ) – Copies count bytes from the memory area pointed to by src to the memory area pointed to by dst – kind indicates the direction of the transfer: • cudaMemcpyHostToHost • cudaMemcpyHostToDevice • cudaMemcpyDeviceToHost • cudaMemcpyDeviceToDevice • cudaError_t cudaFree( void* devPtr ) – Frees the memory space pointed to by devPtr
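The three APIs above can be sketched together as follows. This is a minimal illustration, not the lab's actual main.cu: the variable names are made up, and checking of the returned cudaError_t values is omitted for brevity.

```cuda
#include <cuda_runtime.h>

#define SIZE 8192  // number of int elements, matching the lab's array size

int main(void)
{
    int host_a[SIZE], host_b[SIZE];
    int *dev_a = NULL, *dev_b = NULL;

    // Allocate SIZE * sizeof(int) bytes of device memory for each array
    cudaMalloc((void**)&dev_a, SIZE * sizeof(int));
    cudaMalloc((void**)&dev_b, SIZE * sizeof(int));

    // Copy the input array from host to device
    cudaMemcpy(dev_a, host_a, SIZE * sizeof(int), cudaMemcpyHostToDevice);

    /* ... launch the GPU kernel here ... */

    // Copy the result array back from device to host
    cudaMemcpy(host_b, dev_b, SIZE * sizeof(int), cudaMemcpyDeviceToHost);

    // Release device memory
    cudaFree(dev_a);
    cudaFree(dev_b);
    return 0;
}
```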
Example • Functionality: – Given an integer array A holding 8192 elements – For each element in array A, calculate A[i]256 and leave the result in B[i]
Now, go and finish your first CUDA program !!!
• Download http://twins.ee.nctu.edu.tw/~skchen/lab1.zip • Open the project with Visual C++ 2008 ( lab1/cuda_lab.vcproj ) – main.cu • Random input generation, output validation, result reporting – device.cu • Launch of GPU kernel, GPU kernel code – parameter.h • Fill in the appropriate APIs – GPU_kernel() in device.cu
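A first, single-threaded version of the kernel and its launch might look like the sketch below. The names GPU_kernel and SIZE follow the lab files, but the signature and the per-element operation are placeholders to be filled in from the lab spec:

```cuda
// Sketch of a first kernel: one block, one thread, processing all elements
__global__ void GPU_kernel(int *A, int *B)
{
    // A single thread walks the whole array in this first version
    for (int i = 0; i < SIZE; i++)
        B[i] = A[i];  // replace with the per-element operation from the lab spec
}

// Launched from host code with a single block holding a single thread:
//   GPU_kernel<<<1, 1>>>(dev_a, dev_b);
```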
Lab 2: Make the Parallel Code Faster
Parallel Processing in CUDA • Parallel code can be partitioned into blocks and threads – cuda_kernel<<<nBlk, nTid>>>(…) • Multiple tasks will be initialized, each with a different block ID and thread ID • The tasks are dynamically scheduled – Tasks within the same block will be scheduled on the same streaming multiprocessor • Each task takes care of a single data partition according to its block ID and thread ID
Locate Data Partition by Built-in Variables • Built-in variables – gridDim • x, y – blockIdx • x, y – blockDim • x, y, z – threadIdx • x, y, z
Data Partition for Previous Example When processing 64 integer data: cuda_kernel<<<2, 2>>>(…) int total_task = gridDim.x * blockDim.x ; int task_sn = blockIdx.x * blockDim.x + threadIdx.x ; int length = SIZE / total_task ; int head = task_sn * length ;
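Putting the four lines above into a kernel, each task then loops over its own contiguous partition. A sketch, assuming the same A/B arrays and SIZE constant as in Lab 1:

```cuda
__global__ void cuda_kernel(int *A, int *B)
{
    int total_task = gridDim.x * blockDim.x;               // tasks in total
    int task_sn    = blockIdx.x * blockDim.x + threadIdx.x; // this task's serial number
    int length     = SIZE / total_task;                     // elements per task
    int head       = task_sn * length;                      // first element of this task

    // Each task walks its own contiguous slice of the array
    for (int i = head; i < head + length; i++)
        B[i] = A[i];  // replace with the per-element operation from the lab spec
}
```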
Processing Single Data Partition
Parallelize Your Program !!!
• Partition kernel into threads – Increase nTid from 1 to 512 – Keep nBlk = 1 • Group threads into blocks – Adjust nBlk and see if it helps • Maintain total number of threads below 512, e.g. nBlk * nTid < 512
Lab 3: Resolve Memory Contention
Parallel Memory Architecture • Memory is divided into banks to achieve high bandwidth • Each bank can service one address per cycle • Successive 32-bit words are assigned to successive banks
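On G80-class hardware the shared memory has 16 banks, so the bank that serves a given 32-bit word can be computed as in this small illustrative helper (bank_of is a made-up name, not a CUDA API):

```cuda
// Which shared-memory bank serves the 32-bit word at word_index?
__device__ int bank_of(int word_index)
{
    // Successive 32-bit words fall into successive banks,
    // wrapping around every 16 words on G80 (16 banks).
    return word_index % 16;
}
```

When the threads of a half-warp hit 16 different banks, the accesses proceed in parallel; when two threads hit the same bank, their accesses are serialized.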
Lab 2 Review When processing 64 integer data: cuda_kernel<<<1, 4>>>(…)
How about Interleaved Accessing? When processing 64 integer data: cuda_kernel<<<1, 4>>>(…)
Implementation of Interleaved Accessing cuda_kernel<<<1, 4>>>(…) • head = task_sn • stripe = total_task
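With interleaved accessing, head and stripe replace the head and length of Lab 2: each task starts at its serial number and steps over all other tasks on every iteration. A sketch of the modified kernel, under the same assumptions as before:

```cuda
__global__ void cuda_kernel(int *A, int *B)
{
    int total_task = gridDim.x * blockDim.x;
    int task_sn    = blockIdx.x * blockDim.x + threadIdx.x;
    int head       = task_sn;     // start at this task's serial number
    int stripe     = total_task;  // step over all other tasks each iteration

    // Neighboring tasks touch neighboring elements on every iteration
    for (int i = head; i < SIZE; i += stripe)
        B[i] = A[i];  // replace with the per-element operation from the lab spec
}
```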
Improve Your Program !!!
• Modify the original kernel code in an interleaving manner – cuda_kernel() in device.cu • Adjust nBlk and nTid as in Lab 2 and examine the effect – Maintain total number of threads below 512, e.g. nBlk * nTid < 512
Thank You • http://twins.ee.nctu.edu.tw/~skchen/lab3.zip • Final project issue – Subject: • Porting & optimizing any algorithm on any multi-core – Demo: • 1 week after the final exam @ ED412 – Group: • 1 ~ 2 persons per group * Group members & demo time should be registered after the final exam @ ED412