CUDA Overview: A Fast Introduction
João Gabriel Felipe Machado Gazolla
Advisor: Dr. Esteban Clua

Topics
• GPUs
• What is CUDA?
• Where to Download?
• How to Install
• Architecture
• Performance
• Visual Studio Integration
• Examples
• How to Learn More about CUDA?
• Study Plan
• References
• Discussion

Goal
“... Explain the basics of CUDA ...”

What is CUDA?
Compute Unified Device Architecture
CUDA is the computing engine in NVIDIA graphics processing units (GPUs) that is accessible to software developers through industry-standard programming languages.

CUDA Performance

CPU Scenario (specific code example)
• Population: 1024 soldiers, Soldier[0...1023]
• Fitness function: soldierScore(x)
• One processor scores one soldier at a time: soldierScore(soldier[i]) gives, for example, 12387 unit points
• Total time: (1024/1) * time(soldierScore())

GPU Scenario (specific code example)
• Population: 1024 soldiers, Soldier[0...1023]
• Fitness function: soldierScore(x)
• GeForce XXXX++ with 256 processors
• Soldier[i] ... Soldier[i+n] are scored at the same time: 12387 ... 12494 ... 15912 unit points
• Total time: (1024/256) * time(soldierScore())

What do I need to run CUDA?

Where to Download CUDA?

What to Download?

Is It Worth It?
5% faster? 20% faster? 300% faster? 900% faster?

Unified Architecture - CUDA
Low cost: supercomputing for the masses

Is It Worth It? Speedups
With a 100x speedup:
• 1 year becomes 3 days
• 1 day becomes 15 minutes
• 2 minutes become 1.2 seconds

Unified Architecture - CUDA


Example: Crowd Simulation
1,000 bodies

Architecture
• CPUs vs GPUs

GPU – The Evolution
• Fixed Function GPUs
• Programmable GPUs
• Unified Architecture

GPU – The Evolution: Fixed Function GPUs
• Non-programmable architecture
• No access to the processor
• Only APIs

GPU – The Evolution: Programmable GPUs
• Architecture oriented to computer graphics

Unified Architecture - CUDA

Getting VS 2008 for Free

VS 2008 Integration

Command line:
    $(CUDA_BIN_PATH)\nvcc.exe -ccbin "$(VCInstallDir)bin" -c -D_DEBUG -DWIN32 -D_CONSOLE -D_MBCS -Xcompiler /EHsc,/W3,/nologo,/Od,/Zi,/RTC1,/MDd -I"$(CUDA_INC_PATH)" -I./ -o $(ConfigurationName)\kernel.obj kernel.cu
Outputs:
    $(ConfigurationName)\kernel.obj

CUDA VS Wizard

CUDA and Linux

CUDA and Eclipse

Software Architecture

CUDA and Threads
Why program in threads?

CUDA and Threads
• How many threads have you ever created?
• CUDA allows thousands and thousands of threads = a cluster of threads

Threads – Management Costs
• CPU: few threads. Spending 1000 instructions to switch threads is acceptable.
• GPU: if we need thousands of threads, 1000 instructions per switch is NOT OK.

CUDA - Synchronization
• Must be explicit
• “…synchronization is accomplished using the function __syncthreads, which acts as a barrier or memory fence…”
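A minimal sketch of explicit synchronization, assuming 256 threads per block and an input made only of ones. The names blockSum and sumOfOnes are hypothetical and do not come from the slides. Every thread in a block loads one value into shared memory, and __syncthreads() keeps any thread from reading a neighbour's slot before it has been written.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

#define THREADS 256  // threads per block (assumed; must be a power of two here)

// Each block adds THREADS consecutive inputs and writes one partial sum.
__global__ void blockSum(const float *in, float *out) {
    __shared__ float cache[THREADS];
    int tid = threadIdx.x;
    cache[tid] = in[blockIdx.x * blockDim.x + tid];
    __syncthreads();  // all loads must finish before any thread reads cache[]

    // Tree reduction: half of the remaining threads stay active at each step.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride) cache[tid] += cache[tid + stride];
        __syncthreads();  // placed outside the if so every thread reaches it
    }
    if (tid == 0) out[blockIdx.x] = cache[0];
}

// Hypothetical host helper: sums n ones on the GPU (n a multiple of THREADS).
float sumOfOnes(int n) {
    int blocks = n / THREADS;
    std::vector<float> hIn(n, 1.0f), hOut(blocks);
    float *dIn, *dOut;
    cudaMalloc(&dIn, n * sizeof(float));
    cudaMalloc(&dOut, blocks * sizeof(float));
    cudaMemcpy(dIn, hIn.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    blockSum<<<blocks, THREADS>>>(dIn, dOut);
    cudaMemcpy(hOut.data(), dOut, blocks * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    float total = 0.0f;
    for (int b = 0; b < blocks; ++b) total += hOut[b];  // host combines the blocks
    return total;
}

int main() {
    printf("sum = %.1f\n", sumOfOnes(1024));
    return 0;
}
```

The barrier only covers threads of the same block, which is why the partial sums of the four blocks are combined on the host in this sketch.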

CUDA – Important Definitions
• CUDA extends the C language through kernels
• *.cu – CUDA files
• Each kernel is a function that will be executed N times on the device

Conventions

Functions in CUDA
• Each kind of function is described by where it is executed and where it is called from
• Combinations are also possible
• No recursion on the device (GPU)
• No static variables
• cudaMalloc()
• cudaFree()
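A short sketch of the function kinds, using the standard CUDA qualifiers __global__, __device__ and __host__. The function names square, plusOne, transform and transformOnGpu are hypothetical, chosen only for illustration. It also shows cudaMalloc() and cudaFree() bracketing the lifetime of a device buffer.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// __host__ __device__: one of the possible combinations, compiled for both sides.
__host__ __device__ float square(float x) { return x * x; }

// __device__: executed on the GPU, callable only from GPU code.
__device__ float plusOne(float x) { return x + 1.0f; }

// __global__: a kernel, executed on the GPU and launched from the host.
__global__ void transform(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = plusOne(square(data[i]));
}

// No qualifier means a host function: executed on and called from the CPU.
float transformOnGpu(float x) {
    float *d;
    cudaMalloc(&d, sizeof(float));  // allocate device memory
    cudaMemcpy(d, &x, sizeof(float), cudaMemcpyHostToDevice);
    transform<<<1, 1>>>(d, 1);
    cudaMemcpy(&x, d, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d);                    // release device memory
    return x;
}

int main() {
    printf("square(3) on the host  = %.1f\n", square(3.0f));
    printf("3*3 + 1 on the device = %.1f\n", transformOnGpu(3.0f));
    return 0;
}
```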

CUDA and the Limits of Memory Bandwidth
Reuse your data!

Architecture
• Hide implementation details
• HW evolution

Threads, Blocks and Grids
• One kernel, one grid
• Each block has many threads
• All threads inside a block share the same memory area
• Threads in different blocks do not share their local memory with each other
• Threads in different blocks cannot cooperate

Threads, Blocks and Grids
• Each block has up to 512 threads

Threads, Blocks and Grids

Variable     Type     Spec.
gridDim      dim3     Grid dimension
blockIdx     uint3    Index of the block in the grid
blockDim     dim3     Dimension of the block
threadIdx    uint3    Index of the thread in the block

    __global__ void KernelFunction(...);
    dim3 DimGrid(100, 10);      // Grid of 1000 blocks
    dim3 DimBlock(4, 8, 8);     // Each block has 256 threads
    size_t SharedMemBytes = 32;
    KernelFunction<<<DimGrid, DimBlock, SharedMemBytes>>>(...);
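The built-in variables can be combined into one unique number per thread. The sketch below assumes the same launch shape as the slide (a 100 x 10 grid of 4 x 8 x 8 blocks) but leaves out the shared-memory argument. The names writeGlobalId and countCorrectIds are hypothetical.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Every thread computes its own flat index and stores it at that position.
__global__ void writeGlobalId(int *out) {
    int blockId = blockIdx.y * gridDim.x + blockIdx.x;            // 2D grid -> 1D
    int threadId = (threadIdx.z * blockDim.y + threadIdx.y) * blockDim.x
                   + threadIdx.x;                                 // 3D block -> 1D
    int threadsPerBlock = blockDim.x * blockDim.y * blockDim.z;
    int gid = blockId * threadsPerBlock + threadId;
    out[gid] = gid;
}

// Hypothetical host helper: returns how many slots hold their own index.
int countCorrectIds() {
    dim3 DimGrid(100, 10);    // 1000 blocks
    dim3 DimBlock(4, 8, 8);   // 256 threads per block
    const int total = 100 * 10 * 4 * 8 * 8;  // 256000 threads in the grid
    std::vector<int> h(total, -1);
    int *d;
    cudaMalloc(&d, total * sizeof(int));
    cudaMemset(d, 0xFF, total * sizeof(int));  // start from -1 everywhere
    writeGlobalId<<<DimGrid, DimBlock>>>(d);
    cudaMemcpy(h.data(), d, total * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d);
    int ok = 0;
    for (int i = 0; i < total; ++i)
        if (h[i] == i) ++ok;
    return ok;
}

int main() {
    printf("%d of 256000 indices are correct\n", countCorrectIds());
    return 0;
}
```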


Some code...

    // Kernel definition
    __global__ void vecAdd(float* A, float* B, float* C) { ... }

    int main() {
        // Kernel invocation
        vecAdd<<<1, N>>>(A, B, C);
    }

• __global__ defines that it is a kernel
• Called on the host
• Executed on the device
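A complete program built around the fragment above might look like the following sketch. The kernel body is an assumption, because the slide elides it; N = 256 and the helper vecAddMatchesCpu are also hypothetical. With the <<<1, N>>> launch there is a single block, so threadIdx.x alone identifies the element.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define N 256  // assumed vector length; one block of N threads

// Kernel definition (body assumed: one element per thread)
__global__ void vecAdd(float *A, float *B, float *C) {
    int i = threadIdx.x;
    C[i] = A[i] + B[i];
}

// Hypothetical host helper: runs vecAdd and compares with a CPU result.
bool vecAddMatchesCpu() {
    float hA[N], hB[N], hC[N];
    for (int i = 0; i < N; ++i) { hA[i] = (float)i; hB[i] = 2.0f * i; }

    float *A, *B, *C;
    size_t bytes = N * sizeof(float);
    cudaMalloc(&A, bytes);
    cudaMalloc(&B, bytes);
    cudaMalloc(&C, bytes);
    cudaMemcpy(A, hA, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(B, hB, bytes, cudaMemcpyHostToDevice);

    // Kernel invocation: called on the host, executed on the device
    vecAdd<<<1, N>>>(A, B, C);

    cudaMemcpy(hC, C, bytes, cudaMemcpyDeviceToHost);
    cudaFree(A);
    cudaFree(B);
    cudaFree(C);

    for (int i = 0; i < N; ++i)
        if (hC[i] != hA[i] + hB[i]) return false;
    return true;
}

int main() {
    printf("vecAdd %s the CPU result\n",
           vecAddMatchesCpu() ? "matches" : "does NOT match");
    return 0;
}
```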


GPUs and the Travelling Salesman Problem: Kernel TSP Score

    calcScore<<<blocks, threadsPerBlock>>>(scoreSolD, vSolD, matD, dim);

    1|2|3|4|5|6|3|1|2|4|5|6|2|6|5|1|3|4|5|6|2|4|3|1|2|6|3|1|5|4

    __global__ void calcScore(float *score, int *sol, float *mat, int dim) {
        // Thread ID in X
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        // Calculate the initial position where the thread is going to work
        int pos = idx * dim;
        int temp;
        score[idx] = 0;
        // Part of the vector where the thread is going to work
        for (int i = pos; i < (pos + dim) - 1; ++i) {
            temp = sol[i] * dim + sol[i + 1];
            score[idx] += mat[temp];
        }
        // From the last element back to the first
        temp = sol[pos + dim - 1] * dim + sol[pos];
        score[idx] += mat[temp];
    }

Suggested Exercise...
Sum two matrices...
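One possible way to approach the exercise is sketched below, assuming row-major matrices stored as flat arrays and 16 x 16 thread blocks. The names matAdd and matAddMatchesCpu are hypothetical. The bounds check lets the grid be rounded up when the matrix size is not a multiple of the block size.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// One thread per matrix element; matrices are row-major flat arrays.
__global__ void matAdd(const float *A, const float *B, float *C,
                       int rows, int cols) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (row < rows && col < cols) {  // threads past the edge do nothing
        int i = row * cols + col;
        C[i] = A[i] + B[i];
    }
}

// Hypothetical host helper: adds two matrices on the GPU and checks the result.
bool matAddMatchesCpu(int rows, int cols) {
    int n = rows * cols;
    size_t bytes = n * sizeof(float);
    std::vector<float> hA(n), hB(n), hC(n);
    for (int i = 0; i < n; ++i) { hA[i] = (float)i; hB[i] = 2.0f * i; }

    float *A, *B, *C;
    cudaMalloc(&A, bytes);
    cudaMalloc(&B, bytes);
    cudaMalloc(&C, bytes);
    cudaMemcpy(A, hA.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(B, hB.data(), bytes, cudaMemcpyHostToDevice);

    dim3 block(16, 16);  // 256 threads per block
    dim3 grid((cols + block.x - 1) / block.x,   // round up in both directions
              (rows + block.y - 1) / block.y);
    matAdd<<<grid, block>>>(A, B, C, rows, cols);

    cudaMemcpy(hC.data(), C, bytes, cudaMemcpyDeviceToHost);
    cudaFree(A);
    cudaFree(B);
    cudaFree(C);

    for (int i = 0; i < n; ++i)
        if (hC[i] != hA[i] + hB[i]) return false;
    return true;
}

int main() {
    printf("matAdd %s the CPU result\n",
           matAddMatchesCpu(100, 70) ? "matches" : "does NOT match");
    return 0;
}
```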

- CUDA does not generate random numbers
- CUDA has no sorting methods
- In CUDA everything is vectors (arrays)

Trends...
GPUs will probably disappear...

Scalability

Nvidia - CUDA Education

Nvidia - CUDA Reference

Learning CUDA - Dr. Dobb’s

Study Plan
• Study the CUDA reference
• Get access to a CUDA-supported device, or emulate one
• Watch the Nvidia CUDA casts
• Make it work on Linux and/or Windows
• Watch David Kirk's (Illinois Univ.) CUDA casts
• Read the Dr. Dobb’s articles

References
http://www.nvidia.com/object/cuda_develop.html
• Quickstart guide
• Programming guide
• Reference manual
• Toolkit release notes
• SDK release notes (Windows)

Thanks
• Esteban Clua - http://www.ic.uff.br/~esteban/
• Bruno Cardoso Lopes - http://www.brunocardoso.cc/
• Rodolfo Jardim de Azevedo - http://www.ic.unicamp.br/~rodolfo
• Marcelo Zamith - http://www.ic.uff.br/~mzamith/

Download of the presentation: www.tinyurl.com/mjpktf
Cache Tuning – Global Cyber Bridges
Doubts? Comments? Extras?