Lecture 8: Manycore GPU Programming with CUDA

Courtesy: Prof. Christopher Cooper's and Prof. Chowdhury's course note slides are used in this lecture note.

Moore's Law
• Transistor count of integrated circuits doubles every two years

The Need for Multicore Architecture
• Hard to keep raising the clock speed (frequency)
  - Power consumption and heat generation become too high
• The number of cores may still increase

Many-core GPUs
• Motivation
  - Originally driven by the insatiable market demand for real-time, high-definition 3D graphics
  - The programmable GPU has evolved into a highly parallel, multithreaded, many-core processor with tremendous computational horsepower and very high memory bandwidth
• GPGPU
  - General-Purpose computing on the GPU (Graphics Processing Unit)
  - Utilization of the GPU (which typically handles computations for graphics) to perform general-purpose computation (traditionally handled by the CPU)

Processor: Multicore vs. Many-core
• Multicore direction (CPU): 2~8 cores
  - Typically handles general-purpose computation
  - Seeks to maintain/increase the execution speed of sequential programs
  - Complex: out-of-order execution, multiple instruction issue, branch prediction, pipelining, large caches, ... while moving into multiple cores
  - Ex) The Intel i7 has 4 cores (a hexa-core version was released recently)
• Many-core direction (GPU): 100~3000 cores
  - Focuses on the execution throughput of parallel applications
  - Simple: in-order, single instruction issue
  - Large number of smaller cores

Many-core GPU
• NVIDIA GTX 780 Ti
  - Around 3000 cores on a single chip
  - Economical price: a mass-market product (around $700)
  - Easy to program: CUDA

GPU
• Specially designed for highly parallel applications
  - Programmable using high-level languages (C/C++)
  - Supports standard 32-bit floating-point precision
  - Lots of GFLOPS

GPU
• Fast processing must come with high bandwidth!
• Simpler memory models and fewer constraints allow high bandwidth
• Memory bandwidth
  - The rate at which data can be read from or stored into memory by a processor
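To make the definition concrete, a back-of-the-envelope calculation (the numbers are the GTX 780 Ti's published memory specifications, not taken from the slides):

  bandwidth = (memory bus width / 8) x effective memory clock
            = (384 bits / 8 bits per byte) x 7 Gbps
            = 48 bytes x 7 x 10^9 per second = 336 GB/s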

GPU
• The GPU is specialized for compute-intensive, highly data-parallel computation
  - Exactly what graphics rendering needs: the same program is executed on many data elements in parallel
  - Ex) matrix computation; geometry (vertex) + pixel processing
• More transistors are devoted to data processing rather than data caching and flow control
• This motivates many application developers to move the computationally intensive parts of their software to GPUs for execution

Applications
• 3D rendering
  - Large sets of pixels and vertices are mapped to parallel threads
• Image and media processing applications, such as post-processing of rendered images, video encoding and decoding, image scaling, stereo vision, and pattern recognition
  - Can map image blocks and pixels to parallel processing threads
• Many other kinds of algorithms are accelerated by data-parallel processing
  - From general signal processing or physics simulation to computational finance or computational biology

CPU vs GPU
• CPU: Optimized for sequential code performance
  - Sophisticated control logic
    - Allows instructions from a single thread to execute in parallel, or even out of order
    - Branch prediction
  - Large cache memory
    - Reduces instruction and data access latencies
  - Powerful ALU: reduced operation latency
[Figure: CPU die (large control logic, large cache, a few powerful ALUs, DRAM) vs GPU die (many small ALUs, DRAM). CPU vs GPU: fundamentally different design philosophies]

CPU vs GPU
• GPU: Optimized for execution throughput of multiple threads
  - Originally built for fast (3D) video games, which require a massive number of floating-point calculations per frame
  - Minimizes control logic and cache memory
    - Much more chip area is dedicated to floating-point calculations
    - Boosts memory throughput
  - Energy efficient
  - Designed as a (data-parallel) numeric computing engine
[Figure: CPU die (control, cache, a few powerful ALUs, DRAM) vs GPU die (many small ALUs, DRAM). CPU vs GPU: fundamentally different design philosophies]

GPU Architecture
• GPUs consist of many simple cores
• Array of highly threaded streaming multiprocessors (SMs)
• Two or more SMs form a building block

GPU Chip Design
• A GPU core is a stream processor
• Stream processors are grouped into streaming multiprocessors (SMs)
  - An SM is basically a SIMD (single instruction, multiple data) processor

CPU vs GPU
• GPU: designed for many simple tasks
  - Maximizes throughput (# of tasks completed in a fixed time)
• CPU
  - Minimizes latency (time to complete a single task)

Winning Applications Use Both CPU and GPU
• GPUs will not perform well on some tasks on which CPUs perform well
• Use both CPUs and GPUs
  - Execute the essentially sequential parts on the CPU
  - Execute the numerically intensive parts on the GPU
• CUDA
  - Introduced by NVIDIA in 2007
  - Designed to support joint CPU/GPU execution of applications

Popularity of GPUs
• Performance
• Cost
  - Large marketplace & customer population
• Practical factors and easy accessibility
  - Ex) GE MRI with {clusters and GPU}
• Support of the IEEE floating-point standard
• CUDA
  - Programmers can use C/C++ programming tools
  - No longer need to go through a complex graphics interface

Why More Parallelism?
• Applications will continue to demand increased speed
• A good GPU implementation can achieve a speedup of more than 100x over sequential execution
• Supercomputing applications
  - Any application that requires data-parallel calculations, such as matrix calculations

CUDA (Compute Unified Device Architecture)
• Parallel computing framework developed by NVIDIA (works only on NVIDIA cards)
• Introduced in 2006
• General-purpose programming model
  - GPGPU (General-Purpose computing on the GPU)
  - Offers a computing API
  - Explicit GPU memory management
• Goal
  - Develop application software that transparently scales its parallelism to leverage the increasing number of processor cores

CUDA-enabled GPUs
• Warp: a group of threads for which the multiprocessor executes the same instruction at each clock cycle (32 threads on NVIDIA GPUs)

Compute Capability
• General specifications and features of a compute device
• Defined by a major revision number and a minor revision number
  - Ex) 1.3, 2.1, 3.5, 5.0
  - Major revision 5: Maxwell architecture
  - Major revision 3: Kepler architecture
  - Major revision 2: Fermi architecture
  - Major revision 1: Tesla architecture
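These numbers can also be queried at runtime. A minimal sketch (not from the original slides) using the standard CUDA runtime call cudaGetDeviceProperties:

#include <stdio.h>

int main(void)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // query device 0
    printf("compute capability: %d.%d\n", prop.major, prop.minor);
    printf("warp size: %d threads\n", prop.warpSize);
    return 0;
}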

CUDA – Main Features
• C/C++ with extensions
• Heterogeneous programming model
• Operates on both the CPU (host) and the GPU (device)

CUDA Devices and Threads
• Device
  - Is a coprocessor to the CPU (host)
  - Has access to its own DRAM (device memory)
  - Runs many threads in parallel
  - Is typically a GPU, but can also be another type of parallel processing device
• Data-parallel portions of an application are expressed as device kernels, which run on many threads
• Differences between GPU and CPU threads
  - GPU threads are extremely lightweight (little overhead for creation)
  - A GPU needs 1000s of threads for full efficiency (a multicore CPU needs only a few)

Processing Flow
• 1. Copy input data from CPU (host) memory to GPU (device) memory
• 2. Load the GPU program (kernel) and launch it
• 3. GPU cores execute the kernel in parallel
• 4. Copy the results from GPU memory back to CPU memory

Example 1: Hello World

#include <stdio.h>

void hello_world(void)
{
    printf("Hello World\n");
}

int main(void)
{
    hello_world();
    return 0;
}

Example 1: CUDA Hello World

#include <stdio.h>

__global__ void hello_world(void)
{
    printf("Hello World\n");
}

int main(void)
{
    hello_world<<<1, 5>>>();     // 1 block of 5 threads
    cudaDeviceSynchronize();     // flush device-side printf before exiting
    return 0;
}

Compile and Run
• Output (one line per thread; <<<1, 5>>> launches 5 threads):
Hello World
Hello World
Hello World
Hello World
Hello World
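The slides do not show the build step; assuming the source is saved as hello_world.cu, a typical build-and-run with NVIDIA's nvcc compiler driver is:

nvcc -o hello_world hello_world.cu
./hello_world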

C Language Extensions
• Function type qualifiers
  - __global__
    - Executed on the device (GPU)
    - Callable from the host (CPU) only
    - Functions must have a void return type
    - Any call to a __global__ function must specify the execution configuration for that call
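A minimal sketch of these rules in code (the scale kernel is a made-up example, not from the slides):

// __global__: runs on the device, must return void
__global__ void scale(float *x, float s)
{
    x[threadIdx.x] *= s;
}

int main(void)
{
    float *d_x;
    cudaMalloc((void**)&d_x, 256 * sizeof(float));

    // scale(d_x, 2.0f);            // error: a __global__ function cannot
    //                              // be called without a configuration
    scale<<<1, 256>>>(d_x, 2.0f);   // OK: execution configuration supplied

    cudaDeviceSynchronize();        // wait for the kernel to finish
    cudaFree(d_x);
    return 0;
}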

Grid, Block, Thread
• Tesla S2050, GeForce GTX 580
  - Max. grid size in each dimension: 65535 x 65535 x 1
  - Max. block size in each dimension: 1024 x 1024 x 64
  - Max. # of threads per block: 1024

C Language Extensions
• Execution configuration
  <<<blocksPerGrid, threadsPerBlock>>>
  - Ex) <<<1, 1>>>, <<<65535, 1024>>>
  - Equivalently, with dim3 variables:
    dim3 blocksPerGrid(65535, 1);
    dim3 threadsPerBlock(1024, 1, 1);
    <<<blocksPerGrid, threadsPerBlock>>>
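One practical detail of choosing the configuration: when the problem size N is not a multiple of the block size, round the grid size up and guard the kernel against out-of-range threads. A sketch (the fill kernel is hypothetical, not from the slides):

__global__ void fill(int *a, int n)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n)                   // the last block may have surplus threads
        a[i] = i;
}

int main(void)
{
    const int N = 1000;          // not a multiple of the block size
    int *dev_a;
    cudaMalloc((void**)&dev_a, N * sizeof(int));

    int threadsPerBlock = 256;
    // integer round-up: (1000 + 255) / 256 = 4 blocks (1024 threads in total)
    int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
    fill<<<blocksPerGrid, threadsPerBlock>>>(dev_a, N);

    cudaFree(dev_a);
    return 0;
}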

C Language Extensions
• Built-in variables
  - blockIdx = (blockIdx.x, blockIdx.y, blockIdx.z): three unsigned integers (uint3)
  - threadIdx = (threadIdx.x, threadIdx.y, threadIdx.z): three unsigned integers (uint3)
• Built-in vector types
  - dim3: integer vector type based on uint3, used to specify dimensions

#include <stdio.h>

__global__ void exec_conf(void)
{
    int ix = threadIdx.x + blockIdx.x * blockDim.x;
    printf("gridDim = (%d, %d, %d), blockDim = (%d, %d, %d)\n",
           gridDim.x, gridDim.y, gridDim.z,
           blockDim.x, blockDim.y, blockDim.z);
    printf("blockIdx = (%d, %d, %d), threadIdx = (%d, %d, %d), arrayIdx = %d\n",
           blockIdx.x, blockIdx.y, blockIdx.z,
           threadIdx.x, threadIdx.y, threadIdx.z, ix);
}

int main(void)
{
    exec_conf<<<2, 3>>>();       // 2 blocks of 3 threads
    cudaDeviceSynchronize();     // flush device-side printf before exiting
    return 0;
}

Compile and Run
• Output (the gridDim/blockDim line is printed once per thread, 6 times in total):
gridDim = (2, 1, 1), blockDim = (3, 1, 1)
blockIdx = (0, 0, 0), threadIdx = (0, 0, 0), arrayIdx = 0
blockIdx = (0, 0, 0), threadIdx = (1, 0, 0), arrayIdx = 1
blockIdx = (0, 0, 0), threadIdx = (2, 0, 0), arrayIdx = 2
blockIdx = (1, 0, 0), threadIdx = (0, 0, 0), arrayIdx = 3
blockIdx = (1, 0, 0), threadIdx = (1, 0, 0), arrayIdx = 4
blockIdx = (1, 0, 0), threadIdx = (2, 0, 0), arrayIdx = 5

#include <stdio.h>

__global__ void exec_conf(void)
{
    int ix = threadIdx.x + blockIdx.x * blockDim.x;
    int iy = threadIdx.y + blockIdx.y * blockDim.y;
    printf("gridDim = (%d, %d, %d), blockDim = (%d, %d, %d)\n",
           gridDim.x, gridDim.y, gridDim.z,
           blockDim.x, blockDim.y, blockDim.z);
    printf("blockIdx = (%d, %d, %d), threadIdx = (%d, %d, %d), arrayIdx = (%d, %d)\n",
           blockIdx.x, blockIdx.y, blockIdx.z,
           threadIdx.x, threadIdx.y, threadIdx.z, ix, iy);
}

int main(void)
{
    dim3 blocks(2, 2, 1);        // 4 blocks in a 2 x 2 grid
    dim3 threads(2, 2, 2);       // 8 threads per block
    exec_conf<<<blocks, threads>>>();
    cudaDeviceSynchronize();     // flush device-side printf before exiting
    return 0;
}
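To trace the index arithmetic by hand: with blocks(2, 2, 1) and threads(2, 2, 2), blockDim = (2, 2, 2), so the thread with threadIdx = (1, 0, 0) in the block with blockIdx = (1, 1, 0) computes ix = 1 + 1 * 2 = 3 and iy = 0 + 1 * 2 = 2. In total, 4 blocks x 8 threads = 32 threads each print both lines, in no guaranteed order.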

Example 3: Vector Sum (CPU version)

#include <stdio.h>

const int N = 128;

void add(int *a, int *b, int *c)
{
    for (int i = 0; i < N; i++) {
        c[i] = a[i] + b[i];
    }
}

int main(void)
{
    int a[N], b[N], c[N];

    for (int i = 0; i < N; i++) {
        a[i] = -i;
        b[i] = i * i;
    }

    add(a, b, c);

    for (int i = 0; i < N; i++) {
        printf("%d + %d = %d\n", a[i], b[i], c[i]);
    }
    return 0;
}

Example 3: Vector Sum (CUDA version)

#include <stdio.h>

const int N = 10;

__global__ void add(int *a, int *b, int *c)
{
    int tid = threadIdx.x;
    c[tid] = a[tid] + b[tid];
}

int main(void)
{
    int a[N], b[N], c[N];
    int *dev_a, *dev_b, *dev_c;

    // allocate device memory
    cudaMalloc((void**)&dev_a, N * sizeof(int));
    cudaMalloc((void**)&dev_b, N * sizeof(int));
    cudaMalloc((void**)&dev_c, N * sizeof(int));

    for (int i = 0; i < N; i++) {
        a[i] = -i;
        b[i] = i * i;
    }

    // copy the inputs to the device
    cudaMemcpy(dev_a, a, N * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dev_b, b, N * sizeof(int), cudaMemcpyHostToDevice);

    add<<<1, N>>>(dev_a, dev_b, dev_c);
    // add<<<N, 1>>>(dev_a, dev_b, dev_c);
    // add<<<128, 128>>>(dev_a, dev_b, dev_c);

    // copy the result back to the host
    cudaMemcpy(c, dev_c, N * sizeof(int), cudaMemcpyDeviceToHost);

    for (int i = 0; i < N; i++) {
        printf("%d + %d = %d\n", a[i], b[i], c[i]);
    }

    cudaFree(dev_a);
    cudaFree(dev_b);
    cudaFree(dev_c);
    return 0;
}
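A note on the two commented-out launches: the kernel above indexes with threadIdx.x, which only works for <<<1, N>>>. With <<<N, 1>>> each block has a single thread, so the index must come from blockIdx.x instead; a sketch of that variant (assumed, mirroring the original kernel):

__global__ void add(int *a, int *b, int *c)
{
    int tid = blockIdx.x;        // one thread per block: index by block
    if (tid < N)                 // guard in case more blocks than elements are launched
        c[tid] = a[tid] + b[tid];
}

The <<<128, 128>>> launch would combine both, with tid = threadIdx.x + blockIdx.x * blockDim.x and the same tid < N guard.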

Compile and Run
• Output:
0 + 0 = 0
-1 + 1 = 0
-2 + 4 = 2
-3 + 9 = 6
-4 + 16 = 12
-5 + 25 = 20
-6 + 36 = 30
-7 + 49 = 42
-8 + 64 = 56
-9 + 81 = 72