CUDA Lecture 5: CUDA at the University of Akron
Prepared 6/23/2011 by T. O'Neil for 3460:677, Fall 2011, The University of Akron.

Overview: CUDA Equipment
• Your own PCs running G80 emulators
  - Better debugging environment
  - Sufficient for the first couple of weeks
• Your own PCs with a CUDA-enabled GPU
• NVIDIA boards in department
  - GeForce family of processors for high-performance gaming
  - Tesla C2070 for high-performance computing: no graphics output (?) and more memory

Summary: NVIDIA Technology

Description | Card Models | Where Available
Low Power | Ion | Netbooks in CAS 241
Consumer Graphics Processors | GeForce 8500 GT, GeForce 9600 GT | Add-in cards in Dell Optiplex 745s in department
2nd Generation GPUs | GeForce GTX 275 | In Dell Precision T3500s in department
Fermi GPUs | GeForce GTX 480 | In select Dell Precision T3500s in department
Fermi GPUs | Tesla C2070 | In Dell Precision T7500 Linux server (tesla.cs.uakron.edu)

Hardware View, Consumer Procs.
• Basic building block is a "streaming multiprocessor" (SM)
• Different chips have different numbers of these SMs:

Product | SMs | Compute Capability
GeForce 8500 GT | 2 | v1.1
GeForce 9500 GT | 4 | v1.1
GeForce 9600 GT | 8 | v1.1

Hardware View, 2nd Generation
• Basic building block is a "streaming multiprocessor" with
  - 8 cores, each with 2048 registers
  - up to 128 threads per core
  - 16 KB of shared memory
  - 8 KB cache for constants held in device memory
• Different chips have different numbers of these SMs:

Product | SMs | Bandwidth | Memory | Compute Capability
GTX 275 | 30 | 127 GB/s | 1-2 GB | v1.3

Hardware View, Fermi
• Each streaming multiprocessor has
  - 32 cores, each with 1024 registers
  - up to 48 threads per core
  - 64 KB of shared memory / L1 cache
  - 8 KB cache for constants held in device memory
• There is also a unified 384 KB L2 cache
• Different chips again have different numbers of SMs:

Product | SMs | Bandwidth | Memory | Compute Capability
GTX 480 | 15 | 180 GB/s | 1.5 GB | v2.0
Tesla C2070 | 14 | 140 GB/s | 6 GB ECC | v2.1

Different Compute Capabilities

Feature | v1.1 | v1.3, 2.x
Integer atomic functions operating on 64-bit words in global memory | no | yes
Integer atomic functions operating on 32-bit words in shared memory | no | yes
Warp vote functions | no | yes
Double-precision floating-point operations | no | yes
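
To make the first two rows concrete, here is a minimal sketch (not from the original slides; the kernel and variable names are illustrative) of a kernel that relies on 32-bit integer atomics in shared and global memory:

    // Per-block vote counting that depends on 32-bit integer atomics.
    __global__ void countPositives(const int *in, int n, int *globalCount)
    {
        __shared__ int blockCount;                  // one shared counter per block
        if (threadIdx.x == 0) blockCount = 0;
        __syncthreads();

        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n && in[i] > 0)
            atomicAdd(&blockCount, 1);              // 32-bit atomic in shared memory
        __syncthreads();

        if (threadIdx.x == 0)
            atomicAdd(globalCount, blockCount);     // 32-bit atomic in global memory
    }

On a v1.1 target the shared-memory atomic is not available, which is exactly the distinction the table records.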

Different Compute Capabilities

Feature | v1.1, 1.3 | v2.x
3D grid of thread blocks | no | yes
Floating-point atomic addition operating on 32-bit words in global and shared memory | no | yes
__ballot() | no | yes
__threadfence_system() | no | yes
__syncthreads_count(), __syncthreads_and(), __syncthreads_or() | no | yes
Surface functions | no | yes

Common Technical Specifications

Spec | Value
Maximum x- or y-dimension of a grid of thread blocks | 65536
Maximum dimensionality of a thread block | 3
Maximum z-dimension of a block | 64
Warp size | 32
Maximum number of resident blocks per multiprocessor | 8
Constant memory size | 64 KB
Cache working set per multiprocessor for constant memory | 8 KB
Maximum width for a 1D texture reference bound to linear memory | 2^27
Maximum width, height and depth for a 3D texture reference bound to linear memory or a CUDA array | 2048 x 2048 x 2048
Maximum number of textures that can be bound to a kernel | 128
Maximum number of instructions per kernel | 2 million
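
These limits matter when choosing a launch configuration. Below is a minimal sketch (not from the slides; scale_kernel, launch_scale and n are hypothetical names) that picks a block size that is a multiple of the warp size and a grid size within the maximum grid dimension:

    __global__ void scale_kernel(float *data, int n, float s)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= s;
    }

    void launch_scale(float *d_data, int n, float s)
    {
        int threadsPerBlock = 256;                        // a multiple of the warp size (32)
        int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
        // blocks must stay within the maximum x-dimension of a grid (table above);
        // a very large n would need a 2D grid or a grid-stride loop instead.
        scale_kernel<<<blocks, threadsPerBlock>>>(d_data, n, s);
    }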

Different Technical Specifications

Spec | v1.1 | v1.3 | v2.x
Maximum number of resident warps per multiprocessor | 24 | 32 | 48
Maximum number of resident threads per multiprocessor | 768 | 1024 | 1536
Number of 32-bit registers per multiprocessor | 8 K | 16 K | 32 K

Different Technical Specifications

Spec | v1.1, 1.3 | v2.x
Maximum dimensionality of a grid of thread blocks | 2 | 3
Maximum x- or y-dimension of a block | 512 | 1024
Maximum number of threads per block | 512 | 1024
Maximum amount of shared memory per multiprocessor | 16 KB | 48 KB
Number of shared memory banks | 16 | 32
Amount of local memory per thread | 16 KB | 512 KB
Maximum width for a 1D texture reference bound to a CUDA array | 8192 | 32768

Different Technical Specifications

Spec | v1.1, 1.3 | v2.x
Maximum width and number of layers for a 1D layered texture reference | 8192 x 512 | 16384 x 2048
Maximum width and height for a 2D texture reference bound to linear memory or a CUDA array | 65536 x 32768 | 65536 x 65536
Maximum width, height, and number of layers for a 2D layered texture reference | 8192 x 512 | 16384 x 2048
Maximum width for a 1D surface reference bound to a CUDA array | Not supported | 8192
Maximum width and height for a 2D surface reference bound to a CUDA array | Not supported | 8192 x 8192
Maximum number of surfaces that can be bound to a kernel | Not supported | 8

Overview: CUDA Components
• CUDA (Compute Unified Device Architecture) is NVIDIA's program development environment:
  - based on C with some extensions
  - C++ support increasing steadily
  - FORTRAN support provided by PGI compiler
  - lots of example code and good documentation: 2-4 week learning curve for those with experience of OpenMP and MPI programming
  - large user community on NVIDIA forums
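
As a concrete illustration of "C with some extensions", here is a minimal sketch of a complete CUDA program (not taken from the lecture; the kernel name add and the array names are illustrative). The __global__ qualifier, the built-in thread indices, and the <<<...>>> launch syntax are the extensions in question:

    #include <stdio.h>
    #include <cuda_runtime.h>

    // __global__ marks a kernel: C code that runs on the GPU, launched from the host.
    __global__ void add(const float *a, const float *b, float *c, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // built-in thread/block indices
        if (i < n) c[i] = a[i] + b[i];
    }

    int main(void)
    {
        const int n = 1024;
        size_t bytes = n * sizeof(float);
        float h_a[n], h_b[n], h_c[n];
        for (int i = 0; i < n; i++) { h_a[i] = i; h_b[i] = 2.0f * i; }

        float *d_a, *d_b, *d_c;
        cudaMalloc(&d_a, bytes); cudaMalloc(&d_b, bytes); cudaMalloc(&d_c, bytes);
        cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

        add<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n); // <<<grid, block>>> launch syntax

        cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
        printf("c[10] = %f\n", h_c[10]);

        cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
        return 0;
    }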

Overview: CUDA Components
• When installing CUDA on a system, there are 3 components:
  - driver
      - low-level software that controls the graphics card
      - usually installed by sys-admin
  - toolkit
      - nvcc CUDA compiler
      - some profiling and debugging tools
      - various libraries
      - usually installed by sys-admin in /usr/local/cuda

Overview: CUDA Components
• SDK
  - lots of demonstration examples
  - a convenient Makefile for building applications
  - some error-checking utilities
  - not supported by NVIDIA
  - almost no documentation
  - often installed by user in own directory
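
The SDK's error-checking utilities (the cutil macros) are convenient but, as noted, unsupported. A minimal sketch of the same idea using only the CUDA runtime API is shown below; the CHECK macro is my own illustrative name, not an SDK or toolkit symbol:

    #include <stdio.h>
    #include <stdlib.h>
    #include <cuda_runtime.h>

    // Wrap runtime calls so failures are reported with file and line.
    #define CHECK(call)                                                     \
        do {                                                                \
            cudaError_t err = (call);                                       \
            if (err != cudaSuccess) {                                       \
                fprintf(stderr, "CUDA error %s at %s:%d\n",                 \
                        cudaGetErrorString(err), __FILE__, __LINE__);       \
                exit(1);                                                    \
            }                                                               \
        } while (0)

    // Example use:
    //   CHECK(cudaMalloc(&d_ptr, bytes));
    //   kernel<<<grid, block>>>(...);
    //   CHECK(cudaGetLastError());      // catches launch-configuration errors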

Accessing the Tesla Card
• Remotely access the front end: ssh tesla.cs.uakron.edu
  - ssh sends your commands over an encrypted stream so your passwords, etc., can't be sniffed over the network

Accessing the Tesla Card
• The first time you do this:
  - After login, run /root/gpucomputingsdk_3.2.16_linux.run and just take the default answers to get your own personal copy of the SDK.
  - Then
        cd ~/NVIDIA_GPU_Computing_SDK/C
        make -j 12 -k
    will build all that can be built.

Accessing the Tesla Card
• The first time you do this:
  - Binaries end up in ~/NVIDIA_GPU_Computing_SDK/C/bin/linux/release
  - In particular, the header file <cutil_inline.h> is in ~/NVIDIA_GPU_Computing_SDK/C/common/inc
• You can then get a summary of technical specs and compute capabilities by executing ~/NVIDIA_GPU_Computing_SDK/C/bin/linux/release/deviceQuery
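
deviceQuery simply reads the properties the runtime exposes. A minimal sketch of the same query written against the runtime API (not from the slides):

    #include <stdio.h>
    #include <cuda_runtime.h>

    int main(void)
    {
        int count = 0;
        cudaGetDeviceCount(&count);
        for (int d = 0; d < count; d++) {
            cudaDeviceProp prop;
            cudaGetDeviceProperties(&prop, d);
            printf("Device %d: %s\n", d, prop.name);
            printf("  compute capability      %d.%d\n", prop.major, prop.minor);
            printf("  multiprocessors         %d\n", prop.multiProcessorCount);
            printf("  shared memory per block %zu bytes\n", prop.sharedMemPerBlock);
            printf("  max threads per block   %d\n", prop.maxThreadsPerBlock);
        }
        return 0;
    }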

CUDA Makefile
• Two choices:
  - use nvcc within a standard Makefile
  - use the special Makefile template provided in the SDK
• The SDK Makefile provides some useful options:
  - make emu=1
      - uses an emulation library for debugging on a CPU
  - make dbg=1
      - activates run-time error checking
• In general just use a standard Makefile

Sample Tesla Makefile

    GENCODE_ARCH := -gencode=arch=compute_10,code="sm_10,compute_10" -gencode=arch=compute_13,code="sm_13,compute_13" -gencode=arch=compute_20,code="sm_20,compute_20"
    INCLOCS := -I$(HOME)/NVIDIA_GPU_Computing_SDK/shared/inc -I$(HOME)/NVIDIA_GPU_Computing_SDK/C/common/inc
    LIBLOCS := -L/usr/local/cuda/lib64 -L/usr/local/cuda/lib -L$(HOME)/NVIDIA_GPU_Computing_SDK/C/lib
    LIBS = -lcutil_x86_64

    <progName>: <progName>.cuh
            nvcc $(GENCODE_ARCH) $(INCLOCS) <progName>.cu $(LIBLOCS) $(LIBS) -o <progName>

Compiling a CUDA Program
• Parallel Thread Execution (PTX)
  - Virtual machine and ISA
  - Programming model
  - Execution resources and state

Compilation
• Any source file containing CUDA extensions must be compiled with NVCC
• NVCC is a compiler driver
  - Works by invoking all the necessary tools and compilers like cudacc, g++, cl, ...
• NVCC outputs
  - C code (host CPU code)
      - Must then be compiled with the rest of the application using another tool
  - PTX
      - Object code directly, or PTX source interpreted at runtime

Linking
• Any executable with CUDA code requires two dynamic libraries
  - The CUDA runtime library (cudart)
  - The CUDA core library (cuda)

Debugging Using the Device Emulation Mode
• An executable compiled in device emulation mode (nvcc -deviceemu) runs completely on the host using the CUDA runtime
  - No need of any device and CUDA driver
  - Each device thread is emulated with a host thread

Debugging Using the Device Emulation Mode
• Running in device emulation mode, one can
  - Use host native debug support (breakpoints, inspection, etc.)
  - Access any device-specific data from host code and vice versa
  - Call any host function from device code (e.g. printf) and vice versa
  - Detect deadlock situations caused by improper usage of __syncthreads
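
For example, because each device thread is a host thread under emulation, a kernel can call printf directly. A minimal sketch (the kernel name is illustrative; on real hardware printf in device code requires compute capability 2.x or later):

    #include <stdio.h>

    // Under -deviceemu this runs as ordinary host threads, so printf behaves
    // like any other host function call made from "device" code.
    __global__ void debugKernel(const float *x, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n && x[i] < 0.0f)
            printf("thread %d saw negative value %f\n", i, x[i]);
    }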

Device Emulation Mode Pitfalls
• Emulated device threads execute sequentially, so simultaneous access of the same memory location by multiple threads could produce different results
• Dereferencing device pointers on the host or host pointers on the device can produce correct results in device emulation mode, but will generate an error in device execution mode
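
A minimal sketch of the second pitfall (the function and variable names are mine, not from the slides):

    #include <cuda_runtime.h>

    void pitfall_demo(void)
    {
        float *d_data;
        cudaMalloc(&d_data, 100 * sizeof(float));

        // In emulation mode d_data is an ordinary host pointer, so the line
        // below would appear to work; on a real device it dereferences GPU
        // memory from the CPU and fails.
        // d_data[0] = 1.0f;                       // WRONG on real hardware

        // Portable version: move data through the API instead.
        float one = 1.0f;
        cudaMemcpy(d_data, &one, sizeof(float), cudaMemcpyHostToDevice);

        cudaFree(d_data);
    }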

Floating Point
• Results of floating-point computations will differ slightly because of
  - Different compiler outputs, instruction sets
  - Use of extended precision for intermediate results
• There are various options to force strict single precision on the host
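
A practical consequence: compare host and device results with a tolerance rather than for exact equality. A minimal sketch (the helper name and tolerance handling are my own):

    #include <math.h>
    #include <stdio.h>

    // Compare a GPU result against a CPU reference with a relative tolerance,
    // since bit-exact agreement between the two is not expected.
    int results_match(const float *gpu, const float *cpu, int n, float tol)
    {
        for (int i = 0; i < n; i++) {
            float denom = fabsf(cpu[i]) > 1e-6f ? fabsf(cpu[i]) : 1.0f;
            if (fabsf(gpu[i] - cpu[i]) / denom > tol) {
                printf("mismatch at %d: gpu=%g cpu=%g\n", i, gpu[i], cpu[i]);
                return 0;
            }
        }
        return 1;   // all elements within tolerance
    }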

Nexus
• New Visual Studio-based GPU integrated development environment
• http://developer.nvidia.com/object/nexus.html
• Available in Beta (as of October 2009)

End Credits
• Based on original material from
  - http://en.wikipedia.com/wiki/CUDA, accessed 6/22/2011
  - The University of Akron: Charles Van Tilburg
  - The University of Illinois at Urbana-Champaign: David Kirk, Wen-mei W. Hwu
  - Oxford University: Mike Giles
  - Stanford University: Jared Hoberock, David Tarjan
• Revision history: last updated 6/23/2011.