CUDA Odds and Ends
Joseph Kider
University of Pennsylvania
CIS 565 - Fall 2011

Sources
- Patrick Cozzi, Spring 2011
- NVIDIA CUDA Programming Guide
- CUDA by Example
- Programming Massively Parallel Processors

Agenda
- Atomic Functions
- Page-Locked Host Memory
- Streams
- Graphics Interoperability

Atomic Functions
What is the value of count if 8 threads execute ++count?

    __device__ unsigned int count = 0;
    // ...
    ++count;

Atomic Functions
- Read-modify-write atomic operation
  - Guaranteed no interference from other threads
  - No guarantee on order
- Shared or global memory
- Requires compute capability 1.1 (> G80)
- See G.1 in the NVIDIA CUDA C Programming Guide for full compute capability requirements

Atomic Functions
What is the value of count if 8 threads execute atomicInc below?

    __device__ unsigned int count = 0;
    // ...
    // atomic ++count
    atomicInc(&count, 1);

Atomic Functions
How do you implement atomicInc?

    __device__ int atomicAdd(int *address, int val);

Atomic Functions
How do you implement atomicInc?

    __device__ int atomicAdd(int *address, int val)
    {
        // Made-up keyword:
        __lock(address)
        {
            *address += val;
        }
    }

Atomic Functions
- How do you implement atomicInc without locking?
- What if you were given an atomic compare-and-swap?

    int atomicCAS(int *address, int compare, int val);

Atomic Functions
atomicCAS pseudo-implementation:

    int atomicCAS(int *address, int compare, int val)
    {
        // Made-up keyword
        __lock(address)
        {
            int old = *address;
            *address = (old == compare) ? val : old;
            return old;
        }
    }

Atomic Functions
Example:

    *addr = 1;

    atomicCAS(addr, 1, 2); // returns 1, *addr = 2
    atomicCAS(addr, 1, 3); // returns 2, *addr = 2
    atomicCAS(addr, 2, 3); // returns 2, *addr = 3

Atomic Functions
Again, how do you implement atomicInc given atomicCAS?

    __device__ int atomicAdd(int *address, int val);

Atomic Functions

    __device__ int atomicAdd(int *address, int val)
    {
        // Read the original value at *address.
        int old = *address, assumed;
        do
        {
            assumed = old;
            // If the value at *address didn't change, increment it.
            // Otherwise, loop until atomicCAS succeeds.
            old = atomicCAS(address, assumed, assumed + val);
        } while (assumed != old);
        return old;
    }

The value of *address after this function returns is not necessarily the original value of *address + val. Why?

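The same compare-and-swap loop generalizes to operations the hardware does not provide natively. As a sketch, adapted from the pattern shown in the CUDA C Programming Guide (the function name atomicAddDouble is made up here), double-precision atomicAdd built on the 64-bit atomicCAS:

    // Reinterpret the double as a 64-bit integer so atomicCAS can
    // compare and swap it; loop until no other thread intervened.
    __device__ double atomicAddDouble(double *address, double val)
    {
        unsigned long long *addr = (unsigned long long *)address;
        unsigned long long old = *addr, assumed;
        do
        {
            assumed = old;
            old = atomicCAS(addr, assumed,
                __double_as_longlong(val + __longlong_as_double(assumed)));
        } while (assumed != old);
        return __longlong_as_double(old);
    }
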
Atomic Functions
Lots of atomics:

    // Arithmetic      // Bitwise
    atomicAdd()        atomicAnd()
    atomicSub()        atomicOr()
    atomicExch()       atomicXor()
    atomicMin()
    atomicMax()
    atomicInc()
    atomicDec()
    atomicCAS()

See B.10 in the NVIDIA CUDA C Programming Guide.

Atomic Functions
- How can threads from different blocks work together?
- Use atomics sparingly. Why?

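To make the first question concrete, a minimal sketch (not from the original slides; the kernel and its names are illustrative) of threads in different blocks cooperating through one global counter:

    __device__ unsigned int count = 0;

    // Threads from every block atomically increment the same global
    // counter; atomicAdd serializes the conflicting updates.
    __global__ void countPositive(const int *data, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n && data[i] > 0)
            atomicAdd(&count, 1);
    }

One answer to the second question: every conflicting atomic serializes, so a common pattern is to reduce within each block first (in shared memory) and issue a single atomicAdd per block.
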
Page-Locked Host Memory
- Page-locked memory: host memory that is essentially removed from virtual memory
- Also called pinned memory

Page-Locked Host Memory
- Benefit: overlap kernel execution and data transfers

(timeline diagram, omitted: normally, the data transfer finishes before kernel execution begins; with page-locked memory, the data transfer and kernel execution overlap)

See G.1 in the NVIDIA CUDA C Programming Guide for full compute capability requirements.

Page-Locked Host Memory
- Benefit: increased memory bandwidth for systems with a front-side bus
  - Up to ~2x throughput

Image from http://arstechnica.com/hardware/news/2009/10/day-of-nvidia-chipset-reckoning-arrives.ars

Page-Locked Host Memory
- Benefit: write-combined memory
  - Page-locked memory is cacheable by default
  - Allocate with cudaHostAllocWriteCombined to:
    - Avoid polluting L1 and L2 caches
    - Avoid snooping transfers across PCIe
    - Improve transfer performance up to 40% - in theory
  - Reading from write-combined memory is slow!
    - Only write to it from the host

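A minimal sketch of that allocation (the buffer size and pointer names are illustrative assumptions):

    // Allocate write-combined, page-locked host memory; write it from
    // the host only, then copy it to the device.
    const size_t size = 1024 * sizeof(float);
    float *hostPtr, *devPtr;
    cudaHostAlloc((void **)&hostPtr, size, cudaHostAllocWriteCombined);
    cudaMalloc((void **)&devPtr, size);
    for (int i = 0; i < 1024; ++i)
        hostPtr[i] = (float)i;   // host writes are fine; host reads are slow
    cudaMemcpy(devPtr, hostPtr, size, cudaMemcpyHostToDevice);
    cudaFree(devPtr);
    cudaFreeHost(hostPtr);
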
Page-Locked Host Memory
- Benefit: page-locked host memory can be mapped into the address space of the device on some systems
  - What systems allow this?
  - What does this eliminate?
  - What applications does this enable?
- Call cudaGetDeviceProperties() and check canMapHostMemory

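A sketch of that capability check and a mapped allocation (device 0 and the buffer size are assumptions):

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    if (prop.canMapHostMemory)
    {
        // Must be called before any CUDA context is created on the device.
        cudaSetDeviceFlags(cudaDeviceMapHost);

        const size_t size = 1024 * sizeof(float);
        float *hostPtr, *devPtr;
        cudaHostAlloc((void **)&hostPtr, size, cudaHostAllocMapped);
        cudaHostGetDevicePointer((void **)&devPtr, hostPtr, 0);
        // Kernels can now access hostPtr through devPtr with no
        // explicit cudaMemcpy ("zero-copy" access).
    }
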
Page-Locked Host Memory
- Usage:
  - cudaHostAlloc() / cudaMallocHost()
  - cudaFreeHost()
  - cudaMemcpyAsync()

See 3.2.5 in the NVIDIA CUDA C Programming Guide.

Page-Locked Host Memory
DEMO: CUDA SDK example bandwidthTest

Page-Locked Host Memory
- What's the catch?
  - Page-locked memory is scarce
    - Allocations will start failing long before allocations in pageable memory do
  - Reduces the amount of physical memory available to the OS for paging
    - Allocating too much will hurt overall system performance

Streams
- Stream: a sequence of commands that execute in order
- Streams may execute their commands out-of-order or concurrently with respect to other streams

(diagram, omitted: Stream A and Stream B, each a queue of Command 0, Command 1, Command 2)

Streams
- Is this a possible order?

(series of quiz diagrams, omitted: each places Stream A's and Stream B's Commands 0-2 on a shared timeline in a different interleaving; an order is possible as long as each stream's own commands remain in order)

Streams
- In CUDA, what commands go in a stream?
  - Kernel launches
  - Host <-> device memory transfers

Streams
Code example:
1. Create two streams
2. For each stream:
   1. Copy page-locked memory to the device
   2. Launch a kernel
   3. Copy memory back to the host
3. Destroy the streams

Stream Example (Step 1 of 3)

    cudaStream_t stream[2];

    // Create two streams
    for (int i = 0; i < 2; ++i)
    {
        cudaStreamCreate(&stream[i]);
    }

    // Allocate two buffers in page-locked memory
    float *hostPtr;
    cudaMallocHost(&hostPtr, 2 * size);

Stream Example (Step 2 of 3)

    // Commands are assigned to, and executed by, streams
    for (int i = 0; i < 2; ++i)
    {
        cudaMemcpyAsync(/* ... */, cudaMemcpyHostToDevice, stream[i]);
        kernel<<<100, 512, 0, stream[i]>>>(/* ... */);
        cudaMemcpyAsync(/* ... */, cudaMemcpyDeviceToHost, stream[i]);
    }

Stream Example (Step 3 of 3)

    for (int i = 0; i < 2; ++i)
    {
        // Blocks until commands complete
        cudaStreamDestroy(stream[i]);
    }

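Putting the three steps together, a self-contained sketch with the elided arguments filled in; the kernel body, grid size, and buffer sizes are illustrative assumptions, not part of the original example:

    #include <cuda_runtime.h>

    __global__ void kernel(float *d, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            d[i] *= 2.0f; // placeholder work
    }

    int main()
    {
        const int n = 100 * 512;               // elements per stream (assumed)
        const size_t size = n * sizeof(float);

        cudaStream_t stream[2];
        for (int i = 0; i < 2; ++i)
            cudaStreamCreate(&stream[i]);

        float *hostPtr, *devPtr;
        cudaMallocHost((void **)&hostPtr, 2 * size); // page-locked host buffer
        cudaMalloc((void **)&devPtr, 2 * size);

        for (int i = 0; i < 2; ++i)
        {
            cudaMemcpyAsync(devPtr + i * n, hostPtr + i * n, size,
                            cudaMemcpyHostToDevice, stream[i]);
            kernel<<<100, 512, 0, stream[i]>>>(devPtr + i * n, n);
            cudaMemcpyAsync(hostPtr + i * n, devPtr + i * n, size,
                            cudaMemcpyDeviceToHost, stream[i]);
        }

        for (int i = 0; i < 2; ++i)
            cudaStreamDestroy(stream[i]); // blocks until each stream finishes

        cudaFree(devPtr);
        cudaFreeHost(hostPtr);
        return 0;
    }
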
Streams
- Assume a device whose compute capability supports:
  - Overlap of data transfer and kernel execution
  - Concurrent data transfer
- How can the streams overlap?

See G.1 in the NVIDIA CUDA C Programming Guide for more on compute capabilities.

Streams
- Can we have more overlap than this?

(timeline diagram, omitted: Stream A and Stream B each perform host-to-device transfer, kernel execution, then device-to-host transfer, with the two streams' stages partially overlapped)

- Can we have this?

(timeline diagram, omitted: the same stages, with Stream B's work overlapping Stream A's even further)

Streams
- Implicit synchronization
  - An operation that requires a dependency check to see if a kernel finished executing blocks all kernel launches from any stream until the checked kernel is finished

See 3.2.6.5.3 in the NVIDIA CUDA C Programming Guide for all limitations.

Streams
- Can we have this?

(timeline diagram, omitted: Stream A's device-to-host transfer is dependent on its kernel's completion; Stream B's commands are blocked until the kernel from Stream A completes)

Streams
- Performance advice
  - Issue all independent commands before dependent ones
  - Delay synchronization (implicit or explicit) as long as possible

Streams
Rewrite this to allow concurrent kernel execution:

    for (int i = 0; i < 2; ++i)
    {
        cudaMemcpyAsync(/* ... */, stream[i]);
        kernel<<< /* ... */ stream[i]>>>();
        cudaMemcpyAsync(/* ... */, stream[i]);
    }

Streams

    for (int i = 0; i < 2; ++i) // to device
        cudaMemcpyAsync(/* ... */, stream[i]);

    for (int i = 0; i < 2; ++i)
        kernel<<< /* ... */ stream[i]>>>();

    for (int i = 0; i < 2; ++i) // to host
        cudaMemcpyAsync(/* ... */, stream[i]);

Streams
- Explicit synchronization
  - cudaThreadSynchronize(): blocks until commands in all streams finish
  - cudaStreamSynchronize(): blocks until commands in a stream finish

See 3.2.6.5 in the NVIDIA CUDA C Programming Guide for more synchronization functions.

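A brief sketch of the two calls in context, assuming the stream[] array created in the earlier example:

    // Wait on one stream without stalling work queued in the other.
    cudaStreamSynchronize(stream[0]); // only stream[0] must finish
    // ... stream[1]'s commands may still be running here ...

    cudaThreadSynchronize();          // now wait for every stream
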
Timing with Stream Events
- Events can be added to a stream to monitor the device's progress
- An event is completed when all commands in the stream preceding it complete

Timing with Stream Events

    cudaEvent_t start, stop;

    // Create two events. Each will record the time
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Record events before and after each stream is assigned its work
    cudaEventRecord(start, 0);
    for (int i = 0; i < 2; ++i)
    {
        // ...
    }
    cudaEventRecord(stop, 0);

    // Delay additional commands until after the stop event completes
    cudaEventSynchronize(stop);

    // Compute elapsed time
    float elapsedTime;
    cudaEventElapsedTime(&elapsedTime, start, stop);

    // cudaEventDestroy(...)

Graphics Interoperability
- What applications use both CUDA and OpenGL/Direct3D?
  - CUDA -> GL
  - GL -> CUDA
- If CUDA and GL cannot share resources, what is the performance implication?

Graphics Interoperability
- Graphics interop: map a GL resource into the CUDA address space
  - Buffers
  - Textures
  - Renderbuffers

Graphics Interoperability
OpenGL buffer interop:
1. Assign the device for GL interop: cudaGLSetGLDevice()
2. Register the GL resource with CUDA: cudaGraphicsGLRegisterBuffer()
3. Map it: cudaGraphicsMapResources()
4. Get the mapped pointer: cudaGraphicsResourceGetMappedPointer()
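
A sketch of the four calls in sequence (the buffer object vbo is assumed to exist already; error checking omitted):

    GLuint vbo;                        // an existing OpenGL buffer object
    cudaGraphicsResource *resource;

    cudaGLSetGLDevice(0);              // 1. assign device (before other CUDA calls)

    cudaGraphicsGLRegisterBuffer(      // 2. register the GL buffer with CUDA
        &resource, vbo, cudaGraphicsMapFlagsNone);

    cudaGraphicsMapResources(1, &resource, 0);   // 3. map it

    float *devPtr;                     // 4. get the mapped pointer
    size_t numBytes;
    cudaGraphicsResourceGetMappedPointer((void **)&devPtr, &numBytes, resource);

    // ... launch kernels that read/write devPtr ...

    cudaGraphicsUnmapResources(1, &resource, 0); // unmap before GL uses the buffer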