Synchronization

These notes introduce:
• Ways to achieve thread synchronization
• __syncthreads()
• cudaThreadSynchronize()

ITCS 4/5145 Parallel Programming, B. Wilkinson, July 11, 2012. CUDASynchronization.ppt

Thread Barrier Synchronization

When we divide a computation into parallel parts to be done concurrently by independent threads, we often need all threads to finish their computation before the next stage of the computation begins.

In parallel programming, this is called barrier synchronization: all threads wait when they reach the barrier until every thread has reached that point, and then they are all released to continue.

CUDA synchronization

CUDA provides a synchronization barrier routine for the threads within each block:

__syncthreads()

This routine is used within a kernel. Threads wait at this point until all threads in the block have reached it, and then they are all released.

NOTE: it only synchronizes with other threads in the same block.

Threads only synchronize with other threads in the block

Kernel code:

__global__ void mykernel() {
    ...
    __syncthreads();
    ...
}

[Figure: Block 0 through Block n-1 each execute the kernel code. Each block has its own separate barrier, after which its threads continue.]
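The per-block barrier can be illustrated with a small sketch. The kernel name reverseTiles, the helper runReverseTiles, and the sizes B and T are hypothetical choices for this example and do not come from the slides. Each block reverses its own tile of the array, so every thread must wait for the rest of its block to finish loading before it reads an element that another thread wrote.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define T 256  // threads per block (assumed value for this sketch)

// Each block reverses its own T-element tile of the array in place.
__global__ void reverseTiles(int *data) {
    __shared__ int tile[T];
    int local = threadIdx.x;
    int global = blockIdx.x * blockDim.x + local;

    tile[local] = data[global];          // stage 1: each thread loads one element
    __syncthreads();                     // wait until the whole block has loaded
    data[global] = tile[T - 1 - local];  // stage 2: read what another thread wrote
}

// Returns 0 if every tile was reversed, nonzero otherwise.
int runReverseTiles() {
    const int B = 4, N = B * T;
    int host[N];
    for (int i = 0; i < N; i++) host[i] = i;

    int *dev;
    if (cudaMalloc(&dev, N * sizeof(int)) != cudaSuccess) return -1;
    cudaMemcpy(dev, host, N * sizeof(int), cudaMemcpyHostToDevice);
    reverseTiles<<<B, T>>>(dev);
    cudaMemcpy(host, dev, N * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dev);

    for (int i = 0; i < N; i++) {
        int block = i / T, local = i % T;
        if (host[i] != block * T + (T - 1 - local)) return 1;
    }
    return 0;
}

int main() {
    printf("%s\n", runReverseTiles() == 0 ? "tiles reversed" : "failed");
    return 0;
}
```

No block ever reads another block's tile, which is why a barrier limited to the block is sufficient here.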

__syncthreads() constraints

All threads in the block must reach a particular __syncthreads() routine or deadlock occurs.

Multiple __syncthreads() can be used in a kernel, but each one is unique. Hence we cannot have:

if (...) {
    ...
    __syncthreads();
} else {
    ...
    __syncthreads();
}

and expect threads going through different paths to be synchronized. They must all go through the if clause or all go through the else clause.
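One common way to stay within this constraint is to keep the divergent work inside the branches and move the barrier after them, so that every thread reaches the same __syncthreads(). The following is a minimal sketch of that idea. The kernel name branchThenSync, the helper runBranchThenSync, and the values of B and T are hypothetical.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define T 64  // threads per block (assumed; must be even for the i ^ 1 pairing)

__global__ void branchThenSync(int *out) {
    __shared__ int buf[T];
    int i = threadIdx.x;

    // Divergent work is fine; the barrier itself stays outside the branch.
    if (i % 2 == 0) {
        buf[i] = i * 10;
    } else {
        buf[i] = -i;
    }
    __syncthreads();  // a single barrier that every thread in the block reaches

    // Now it is safe to read a value written on the other branch.
    out[blockIdx.x * blockDim.x + i] = buf[i ^ 1];
}

// Returns 0 if every thread saw its neighbour's value, nonzero otherwise.
int runBranchThenSync() {
    const int B = 2, N = B * T;
    int host[N];
    int *dev;
    if (cudaMalloc(&dev, N * sizeof(int)) != cudaSuccess) return -1;
    branchThenSync<<<B, T>>>(dev);
    cudaMemcpy(host, dev, N * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dev);

    for (int g = 0; g < N; g++) {
        int j = (g % T) ^ 1;  // index of the neighbouring thread in the block
        int expected = (j % 2 == 0) ? j * 10 : -j;
        if (host[g] != expected) return 1;
    }
    return 0;
}

int main() {
    printf("%s\n", runBranchThenSync() == 0 ? "ok" : "failed");
    return 0;
}
```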

Global Kernel Barrier

Unfortunately, no global kernel barrier routine is available in CUDA. Often we want to synchronize all threads in a computation. To do that, we have to use workarounds such as returning from the kernel and placing a barrier in the CPU code.

The following could be used in the CPU code:

…
myKernel<<<B,T>>>( … );
cudaThreadSynchronize();
…

cudaThreadSynchronize() waits until all preceding commands in all “streams” have completed. It is not needed if there is an existing synchronous CUDA call such as cudaMemcpy(). (cudaThreadSynchronize() has since been deprecated in favor of cudaDeviceSynchronize().)

Achieving global synchronization through multiple kernel launches

Kernel launches are efficiently implemented:
- Minimal hardware overhead
- Little software overhead

So could do:

for (i = 0; i < n; i++) {
    myKernel<<<B,T>>>( … );
    cudaThreadSynchronize();
}

Recursion: not allowed within a kernel, but can be used in host code to launch kernels.
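A runnable sketch of this launch-per-step pattern follows. The kernel rotateLeft, the helper runRotate, and the sizes are hypothetical, and the sketch uses cudaDeviceSynchronize(), the non-deprecated replacement for cudaThreadSynchronize(). Each step reads a neighbouring element that may belong to a different block, so a step may only start once the whole grid has finished the previous one. The kernel boundary provides that guarantee.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// One step: every element takes the value of its right-hand neighbour
// (wrapping around), which crosses block boundaries.
__global__ void rotateLeft(const int *in, int *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[(i + 1) % n];
}

// Rotates an array by `steps` positions using one kernel launch per step.
// Returns 0 if the result is as expected, nonzero otherwise.
int runRotate(int steps) {
    const int B = 8, T = 128, N = B * T;
    int host[N];
    for (int i = 0; i < N; i++) host[i] = i;

    int *a, *b;
    if (cudaMalloc(&a, N * sizeof(int)) != cudaSuccess) return -1;
    if (cudaMalloc(&b, N * sizeof(int)) != cudaSuccess) { cudaFree(a); return -1; }
    cudaMemcpy(a, host, N * sizeof(int), cudaMemcpyHostToDevice);

    for (int s = 0; s < steps; s++) {
        rotateLeft<<<B, T>>>(a, b, N);
        cudaDeviceSynchronize();       // host-side barrier: the whole grid is done
        int *tmp = a; a = b; b = tmp;  // output of this step is input to the next
    }

    cudaMemcpy(host, a, N * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(a);
    cudaFree(b);

    for (int i = 0; i < N; i++)
        if (host[i] != (i + steps) % N) return 1;
    return 0;
}

int main() {
    printf("%s\n", runRotate(5) == 0 ? "ok" : "failed");
    return 0;
}
```

Kernels issued to the same stream already execute in launch order, so the explicit call mainly matters when the host needs to know a step has finished. It is kept here to mirror the loop above.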

Code Example

N-body problem

Need to compute forces on each body in each time interval, then update the positions and velocities of the bodies, and then repeat.

for (t = 0; t < tmax; t++) {    // for each time period, force calculation on all bodies
    cudaMemcpy(dev_A, A, arraySize, cudaMemcpyHostToDevice);    // data to GPU
    bodyCal<<<B,T>>>(dev_A);                                    // kernel call
    cudaMemcpy(A, dev_A, arraySize, cudaMemcpyDeviceToHost);    // updated data
}    // end of time period loop

No explicit synchronization is needed, as cudaMemcpy provides that here.

Reasoning behind not having CUDA global synchronization within GPU

• Expensive to implement for a large number of GPU processors.
• Synchronizing only at the block level allows blocks to be executed in any order on the GPU.
• Can use different sizes of blocks depending upon the resources of the GPU: so-called “transparent scalability.”

Other ways to achieve global synchronization (if it cannot be avoided)

• CUDA memory fence __threadfence(), which waits for the calling thread's memory operations to be visible to other threads, but is probably not usable for synchronization.
• Write your own code for the kernel that implements global synchronization. How? (Using atomics and critical sections, see next.)
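As a hedged illustration of how __threadfence() and an atomic counter can coordinate blocks without a full barrier, the sketch below follows the commonly used "last block finishes the job" pattern. It is not a global barrier, because no block waits for another. Each block publishes a partial sum, and whichever block happens to finish last adds the partial sums together. The names sumAll, blocksDone, and runSumAll, and the sizes, are hypothetical.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define T 128  // threads per block (assumed; a power of two for the reduction)

__device__ unsigned int blocksDone = 0;  // hypothetical counter of finished blocks

__global__ void sumAll(const int *in, volatile int *partial, int *total) {
    __shared__ int tile[T];
    int t = threadIdx.x;
    tile[t] = in[blockIdx.x * blockDim.x + t];
    __syncthreads();

    // Tree reduction within the block; the barrier is outside the if.
    for (int stride = T / 2; stride > 0; stride /= 2) {
        if (t < stride) tile[t] += tile[t + stride];
        __syncthreads();
    }

    if (t == 0) {
        partial[blockIdx.x] = tile[0];  // publish this block's partial sum
        __threadfence();                // make that write visible before counting
        unsigned int ticket = atomicInc(&blocksDone, gridDim.x);
        if (ticket == gridDim.x - 1) {  // this block is the last one to finish
            int sum = 0;
            for (unsigned int b = 0; b < gridDim.x; b++) sum += partial[b];
            *total = sum;
            blocksDone = 0;             // reset so the kernel can be launched again
        }
    }
}

// Returns 0 if the GPU total matches the host total, nonzero otherwise.
int runSumAll() {
    const int B = 16, N = B * T;
    int host[N], expected = 0;
    for (int i = 0; i < N; i++) { host[i] = i % 7; expected += host[i]; }

    int *in, *partial, *total;
    if (cudaMalloc(&in, N * sizeof(int)) != cudaSuccess) return -1;
    if (cudaMalloc(&partial, B * sizeof(int)) != cudaSuccess) return -1;
    if (cudaMalloc(&total, sizeof(int)) != cudaSuccess) return -1;
    cudaMemcpy(in, host, N * sizeof(int), cudaMemcpyHostToDevice);

    sumAll<<<B, T>>>(in, partial, total);

    int result = 0;
    cudaMemcpy(&result, total, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(in);
    cudaFree(partial);
    cudaFree(total);
    return result == expected ? 0 : 1;
}

int main() {
    printf("%s\n", runSumAll() == 0 ? "ok" : "failed");
    return 0;
}
```

The fence orders each block's write to partial before its increment of the counter, so the last block can rely on all partial sums being visible. A true barrier, in which every block waits for all the others, would additionally require all blocks to be resident on the GPU at the same time.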

Discussion points

• Writing to global memory to enforce synchronization is expensive.

Questions