Structuring Parallel Algorithms
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009, ECE 498AL, University of Illinois, Urbana-Champaign

Slides from:
• David Kirk/NVIDIA and Wen-mei W. Hwu
• ECE 498AL, University of Illinois, Urbana-Champaign
• Podcasts of their lectures (recommended):
  – http://courses.ece.illinois.edu/ece498/al/Syllabus.html

Key Parallel Programming Steps
1) To find the concurrency in the problem
2) To structure the algorithm to translate concurrency into performance
3) To implement the algorithm in a suitable programming environment
4) To execute and tune the performance of the code on a parallel system
Unfortunately, these have not been separated into levels of abstraction that can be dealt with independently.

Algorithm
• A step by step procedure that is guaranteed to terminate, such that each step is precisely stated and can be carried out by a computer
  – Definiteness: the notion that each step is precisely stated
  – Effective computability: each step can be carried out by a computer
  – Finiteness: the procedure terminates
• Multiple algorithms can be used to solve the same problem
  – Some require fewer steps
  – Some exhibit more parallelism
  – Some have larger memory footprint than others

Choosing Algorithm Structure
[Figure: decision tree, starting from "Start"]
• Task-centric
  – Linear: Task Parallelism
  – Recursive: Divide and Conquer
• Data Flow centric
  – Regular: Pipeline
  – Irregular: Event Driven
• Data-centric
  – Linear: Geometric Decomposition
  – Recursive: Recursive Data

A simple implementation
• Assume we have already loaded array into __shared__ float partialSum[]
    unsigned int t = threadIdx.x;
    // Each iteration doubles the stride and halves the number of active threads.
    for (unsigned int stride = 1; stride < blockDim.x; stride *= 2)
    {
        // Wait until every partial sum of the previous iteration has been written.
        __syncthreads();
        if (t % (2*stride) == 0)
            partialSum[t] += partialSum[t+stride];
    }

Mapping a Divide and Conquer Algorithm
[Figure: reduction tree over the array elements (horizontal axis) across iterations (vertical axis). In iteration 1, threads 0, 2, 4, 6, 8, 10 compute the pairwise sums 0+1, 2+3, 4+5, 6+7, 8+9, 10+11; iteration 2 yields the sums of elements 0..3 and 4..7; iteration 3 yields the sums of elements 0..7 and 8..15.]
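The mapping of threads to array elements in this reduction can be traced on a CPU. The following is a minimal sketch, not the course's code: the inner loop over t plays the role of the threads of one block, and completing that loop before the stride doubles stands in for __syncthreads(). The function names blockReduceModel and activeThreads are hypothetical, and the array size is assumed to be a power of two.

```cpp
#include <cstddef>
#include <vector>

// CPU model of the interleaved tree reduction for one block.
// Assumes partialSum.size() is a power of two (or zero).
float blockReduceModel(std::vector<float> partialSum) {
    const std::size_t blockDim = partialSum.size();
    for (std::size_t stride = 1; stride < blockDim; stride *= 2) {
        // One pass over t models all threads of the block running one iteration.
        for (std::size_t t = 0; t < blockDim; ++t) {
            // Only threads whose index is a multiple of 2*stride do an addition.
            if (t % (2 * stride) == 0) {
                partialSum[t] += partialSum[t + stride];
            }
        }
    }
    return partialSum.empty() ? 0.0f : partialSum[0];
}

// Number of threads that perform an addition in the iteration with this stride.
std::size_t activeThreads(std::size_t blockDim, std::size_t stride) {
    return blockDim / (2 * stride);
}
```

For a 16-element array, activeThreads gives 8, 4, 2, and 1 for strides 1, 2, 4, and 8: half of the threads are already idle in the first iteration, and the active ones are spread further apart as the stride grows.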

Tiled (Stenciled) Algorithms are Important for Geometric Decomposition
• A framework for memory data sharing and reuse by increasing data access locality.
  – Tiled access patterns allow small cache/scratchpad memories to hold on to data for re-use.
  – For matrix multiplication, a 16 x 16 thread block performs 2*256 = 512 float loads from device memory for 256 * (2*16) = 8,192 mul/add operations.
• A convenient framework for organizing threads (tasks)
[Figure: tiled matrix multiplication of M and N into P, highlighting the sub-matrix Psub; labels: bx, by, tx, ty (0 .. bsize-1), WIDTH, BLOCK_WIDTH, BLOCK_SIZE]
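The load and operation counts quoted for a 16 x 16 block can be checked with a sequential model of the tiled multiplication. This is a sketch under stated assumptions rather than the lecture's kernel: the struct TiledResult and the function tiledMatMul are hypothetical names, the arrays Ms and Ns stand in for the shared-memory tiles of one block, matrices are square and row-major, and the width is assumed to be a multiple of the tile size.

```cpp
#include <cstddef>
#include <vector>

struct TiledResult {
    std::vector<float> P;      // width x width product, row-major
    std::size_t deviceLoads;   // floats copied from M and N into the tiles
    std::size_t mulAdds;       // multiplies and adds, counted separately
};

// CPU model of tiled P = M * N. Assumes width is a multiple of tile.
TiledResult tiledMatMul(const std::vector<float>& M, const std::vector<float>& N,
                        std::size_t width, std::size_t tile) {
    TiledResult r{std::vector<float>(width * width, 0.0f), 0, 0};
    std::vector<float> Ms(tile * tile), Ns(tile * tile);
    for (std::size_t by = 0; by < width; by += tile) {          // block row
        for (std::size_t bx = 0; bx < width; bx += tile) {      // block column
            for (std::size_t m = 0; m < width; m += tile) {     // tile step
                // Load phase: "thread" (ty, tx) loads one element of M and one of N.
                for (std::size_t ty = 0; ty < tile; ++ty) {
                    for (std::size_t tx = 0; tx < tile; ++tx) {
                        Ms[ty * tile + tx] = M[(by + ty) * width + (m + tx)];
                        Ns[ty * tile + tx] = N[(m + ty) * width + (bx + tx)];
                        r.deviceLoads += 2;
                    }
                }
                // Compute phase: each loaded element is reused `tile` times.
                for (std::size_t ty = 0; ty < tile; ++ty) {
                    for (std::size_t tx = 0; tx < tile; ++tx) {
                        for (std::size_t k = 0; k < tile; ++k) {
                            r.P[(by + ty) * width + (bx + tx)] +=
                                Ms[ty * tile + k] * Ns[k * tile + tx];
                            r.mulAdds += 2;
                        }
                    }
                }
            }
        }
    }
    return r;
}
```

With width = 16 and tile = 16 there is a single block and a single tile step, so the counters report 512 loads and 8,192 mul/add operations, a ratio of 16 operations per float loaded.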

Increased Work per Thread for even more locality
• Each thread computes two elements of Pdsub
  – Reduced loads from global memory (Md) to shared memory
• Reduced instruction overhead
• More work done in each iteration
[Figure: Md, Nd, and Pd with two Pdsub tiles; labels: bx, by, tx, ty (0 .. TILE_WIDTH-1), TILE_WIDTH, WIDTH]
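The saving in Md loads can be made concrete with a sequential model in which one block covers two Pd tiles. This is a sketch, not the lecture's kernel: it assumes the two Pd elements of a thread lie one tile apart in the same row, so that a single Md tile serves both; LoadCount and matMulTwoPerThread are hypothetical names, and the width is assumed to be a multiple of twice the tile size.

```cpp
#include <cstddef>
#include <vector>

struct LoadCount {
    std::size_t mdLoads = 0;   // floats read from Md
    std::size_t ndLoads = 0;   // floats read from Nd
};

// CPU model of tiled Pd = Md * Nd where each "thread" computes two Pd elements.
// Assumes width is a multiple of 2 * tile; matrices are square and row-major.
LoadCount matMulTwoPerThread(const std::vector<float>& Md, const std::vector<float>& Nd,
                             std::vector<float>& Pd, std::size_t width, std::size_t tile) {
    LoadCount c;
    Pd.assign(width * width, 0.0f);
    std::vector<float> Ms(tile * tile), Ns0(tile * tile), Ns1(tile * tile);
    for (std::size_t by = 0; by < width; by += tile) {
        for (std::size_t bx = 0; bx < width; bx += 2 * tile) {   // two Pd tiles per block
            for (std::size_t m = 0; m < width; m += tile) {
                // Load phase: one Md element and two Nd elements per thread.
                for (std::size_t ty = 0; ty < tile; ++ty) {
                    for (std::size_t tx = 0; tx < tile; ++tx) {
                        Ms[ty * tile + tx]  = Md[(by + ty) * width + (m + tx)];
                        Ns0[ty * tile + tx] = Nd[(m + ty) * width + (bx + tx)];
                        Ns1[ty * tile + tx] = Nd[(m + ty) * width + (bx + tile + tx)];
                        c.mdLoads += 1;
                        c.ndLoads += 2;
                    }
                }
                // Compute phase: the same Ms value feeds both partial products.
                for (std::size_t ty = 0; ty < tile; ++ty) {
                    for (std::size_t tx = 0; tx < tile; ++tx) {
                        float p0 = 0.0f, p1 = 0.0f;
                        for (std::size_t k = 0; k < tile; ++k) {
                            p0 += Ms[ty * tile + k] * Ns0[k * tile + tx];
                            p1 += Ms[ty * tile + k] * Ns1[k * tile + tx];
                        }
                        Pd[(by + ty) * width + (bx + tx)] += p0;
                        Pd[(by + ty) * width + (bx + tile + tx)] += p1;
                    }
                }
            }
        }
    }
    return c;
}
```

For a 4 x 4 problem with 2 x 2 tiles the model reads 16 floats from Md, against 32 when each block covers a single Pd tile, while the Nd loads stay at 32. Under these assumptions the total drops from 64 to 48 loads, a quarter less global memory traffic.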

Double Buffering - a frequently used algorithm pattern
• One could double buffer the computation, getting better instruction mix within each thread
  – This is classic software pipelining in ILP compilers
Without double buffering:
    Loop {
      Load current tile to shared memory
      syncthreads()
      Compute current tile
      syncthreads()
    }
With double buffering:
    Load next tile from global memory
    Loop {
      Deposit current tile to shared memory
      syncthreads()
      Load next tile from global memory
      Compute current tile
      syncthreads()
    }

Double Buffering
• Deposit blue tile from register into shared memory
• Syncthreads
• Load orange tile into register
• Compute blue tile
• Deposit orange tile into shared memory
• ….
[Figure: tiles of Md and Nd feeding the sub-matrix Pdsub of Pd; labels: bx, by, tx, ty (0 .. TILE_WIDTH-1), TILE_WIDTH, WIDTH]
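The order of the deposit, load, and compute steps can be sketched as a sequential model of one block. This is an illustration under assumptions, not the lecture's kernel: the function name doubleBufferedTileSums is hypothetical, the vector reg models per-thread registers, the vector shared models the shared-memory tile, and the per-tile computation is a plain sum used as a placeholder for the real tile computation.

```cpp
#include <cstddef>
#include <vector>

// CPU model of the double-buffered loop. `global` holds consecutive tiles of
// tileSize floats; the result is one sum per complete tile.
std::vector<float> doubleBufferedTileSums(const std::vector<float>& global,
                                          std::size_t tileSize) {
    std::vector<float> sums;
    if (tileSize == 0) return sums;
    const std::size_t numTiles = global.size() / tileSize;
    if (numTiles == 0) return sums;

    std::vector<float> reg(tileSize), shared(tileSize);

    // Prologue: load the first tile from global memory into registers.
    for (std::size_t i = 0; i < tileSize; ++i) reg[i] = global[i];

    for (std::size_t tile = 0; tile < numTiles; ++tile) {
        // Deposit the current tile from registers into shared memory.
        shared = reg;
        // (a barrier such as __syncthreads() would go here on the GPU)

        // Issue the load of the next tile into registers before computing.
        if (tile + 1 < numTiles) {
            for (std::size_t i = 0; i < tileSize; ++i)
                reg[i] = global[(tile + 1) * tileSize + i];
        }

        // Compute on the current tile, which lives in shared memory.
        float sum = 0.0f;
        for (float v : shared) sum += v;
        sums.push_back(sum);
        // (second barrier here, before shared memory is overwritten)
    }
    return sums;
}
```

On a CPU these statements simply run in order. On a GPU the point of the reordering is that the load of the next tile is issued before the computation on the current tile and its result is not needed until the next deposit, so the memory latency can overlap with the arithmetic.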

One can trade more work for increased parallelism
• Diamond search algorithm for motion estimation: work efficient but sequential
  – Popular in traditional CPU implementations
• Exhaustive search: totally parallel but work inefficient
  – Popular in HW and parallel implementations
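The trade-off between the two searches can be illustrated by counting cost evaluations. The sketch below is a simplification and not necessarily the variant the slide refers to: it uses only a small four-neighbour diamond, takes the matching cost as an arbitrary callable instead of computing block differences on images, and all names (SearchResult, CostFn, exhaustiveSearch, smallDiamondSearch) are hypothetical.

```cpp
#include <cstdlib>
#include <functional>
#include <limits>

struct SearchResult {
    int dx, dy;        // best motion vector found
    int evaluations;   // number of cost evaluations performed
};

using CostFn = std::function<int(int, int)>;

// Exhaustive search over a (2R+1) x (2R+1) window. Every candidate is
// independent of the others, so all evaluations could run in parallel.
SearchResult exhaustiveSearch(const CostFn& cost, int R) {
    SearchResult best{0, 0, 0};
    int bestCost = std::numeric_limits<int>::max();
    for (int dy = -R; dy <= R; ++dy) {
        for (int dx = -R; dx <= R; ++dx) {
            const int c = cost(dx, dy);
            ++best.evaluations;
            if (c < bestCost) { bestCost = c; best.dx = dx; best.dy = dy; }
        }
    }
    return best;
}

// Simplified small-diamond search: move to the best of four neighbours until
// none improves. Each step needs the result of the previous one (sequential).
SearchResult smallDiamondSearch(const CostFn& cost, int R) {
    SearchResult cur{0, 0, 1};
    int curCost = cost(0, 0);
    const int ox[4] = {1, -1, 0, 0};
    const int oy[4] = {0, 0, 1, -1};
    for (;;) {
        int bestK = -1;
        int bestCost = curCost;
        for (int k = 0; k < 4; ++k) {
            const int nx = cur.dx + ox[k], ny = cur.dy + oy[k];
            if (std::abs(nx) > R || std::abs(ny) > R) continue;  // stay in window
            const int c = cost(nx, ny);
            ++cur.evaluations;
            if (c < bestCost) { bestCost = c; bestK = k; }
        }
        if (bestK < 0) return cur;   // the centre is a local minimum
        cur.dx += ox[bestK];
        cur.dy += oy[bestK];
        curCost = bestCost;
    }
}
```

On a cost surface with a single minimum both functions return the same vector, but the exhaustive search evaluates the whole window while the diamond search evaluates only a short chain of neighbourhoods. On real image data a greedy search of this kind can stop in a local minimum, which the exhaustive search cannot.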

An MPEG Algorithm based on Data Parallelism
• Loops distributed – DOALL style
  – Replicates instructions and tables across accelerators
• If instructions and data are too large for local memory…
  – Large memory transfers required to preserve data
  – Saturation of and contention for communication resources can leave computation resources idle
Memory bandwidth constrains performance

Loop fusion & memory privatization
• Stage loops fused into single DOALL macroblock loop
  – Memory privatization reduces main memory access
• Replicates instructions and tables across processors
  – Local memory constraints may prevent this technique
Novel dimensions of parallelism reduce communication
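The difference between distributed stage loops and a single fused macroblock loop can be sketched with two placeholder stages. This is an illustration only: stageA and stageB are hypothetical stand-ins for real MPEG stages, each macroblock is reduced to one int, and the intermediateWords output is an assumed bookkeeping value that records how much intermediate storage each version needs.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical per-macroblock stages standing in for real MPEG stages.
inline int stageA(int x) { return x * 2; }
inline int stageB(int x) { return x + 1; }

// Distributed loops: each stage is its own DOALL loop, so the output of
// stage A for every macroblock is stored before stage B starts.
std::vector<int> distributedLoops(const std::vector<int>& mb,
                                  std::size_t& intermediateWords) {
    std::vector<int> tmp(mb.size()), out(mb.size());
    for (std::size_t i = 0; i < mb.size(); ++i) tmp[i] = stageA(mb[i]);   // DOALL 1
    for (std::size_t i = 0; i < mb.size(); ++i) out[i] = stageB(tmp[i]);  // DOALL 2
    intermediateWords = tmp.size();
    return out;
}

// Fused loop: one DOALL loop over macroblocks. The intermediate value is
// private to the iteration and never has to travel through main memory.
std::vector<int> fusedLoop(const std::vector<int>& mb,
                           std::size_t& intermediateWords) {
    std::vector<int> out(mb.size());
    for (std::size_t i = 0; i < mb.size(); ++i) {
        const int priv = stageA(mb[i]);   // privatized intermediate
        out[i] = stageB(priv);
    }
    intermediateWords = 1;
    return out;
}
```

Both versions produce the same output. The fused version keeps one private intermediate per iteration instead of one per macroblock, at the price that every processor now needs the instructions and tables of both stages.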

Pipeline or “Spatial Computing” Model
• Each PE performs as one pipeline stage in macroblock processing
• Imbalanced stages result in idle resources
• Takes advantage of direct, accelerator to accelerator communication
• Not very effective in CUDA but can be effective for Cell
Efficient point-to-point communication can enable new mo…
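The cost of imbalanced stages can be estimated with a small timing model. This is a sketch under simplifying assumptions that do not come from the slides: each stage handles one item at a time and in order, takes a fixed time per item, and has unbounded buffering toward the next stage. The function names pipelineFinishTime and pipelineUtilization are hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Time at which the last of numItems items leaves the last stage.
long pipelineFinishTime(const std::vector<long>& stageTime, std::size_t numItems) {
    std::vector<long> done(stageTime.size(), 0);   // when each stage last finished
    for (std::size_t n = 0; n < numItems; ++n) {
        long ready = 0;                            // when item n left the previous stage
        for (std::size_t s = 0; s < stageTime.size(); ++s) {
            // A stage starts item n once the item has arrived and the stage is free.
            const long start = std::max(ready, done[s]);
            done[s] = start + stageTime[s];
            ready = done[s];
        }
    }
    return stageTime.empty() ? 0 : done.back();
}

// Fraction of the total PE time (stages x finish time) spent on useful work.
double pipelineUtilization(const std::vector<long>& stageTime, std::size_t numItems) {
    const long total = pipelineFinishTime(stageTime, numItems);
    if (total == 0) return 0.0;
    long busy = 0;
    for (long t : stageTime) busy += t * static_cast<long>(numItems);
    return static_cast<double>(busy) /
           (static_cast<double>(total) * static_cast<double>(stageTime.size()));
}
```

Two three-stage pipelines with the same total work per item show the effect: with stage times {2, 2, 2} and 100 items the model finishes at time 204 with about 98% utilization, while with stage times {1, 4, 1} it finishes at time 402 and the PEs are busy less than half of the time, because the slowest stage sets the rate for all of them.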

Small decisions in algorithms can have major effect on parallelism. (H.263 motion estimation example)
• Different algorithms may expose different levels of parallelism while achieving desired result
• In motion estimation, can use previous vectors (either from space or time) as guess vectors