Structuring Parallel Algorithms
David Kirk/NVIDIA and Wen-mei W. Hwu

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009, ECE 498 AL, University of Illinois, Urbana-Champaign

Slides from:
• David Kirk/NVIDIA and Wen-mei W. Hwu
• ECE 498 AL, University of Illinois, Urbana-Champaign
• Podcasts of their lectures (recommended): http://courses.ece.illinois.edu/ece498/al/Syllabus.html

Key Parallel Programming Steps
1) Find the concurrency in the problem
2) Structure the algorithm to translate the concurrency into performance
3) Implement the algorithm in a suitable programming environment
4) Execute and tune the performance of the code on a parallel system

Unfortunately, these steps have not been separated into levels of abstraction that can be dealt with independently.

Algorithm
• A step-by-step procedure that is guaranteed to terminate, such that each step is precisely stated and can be carried out by a computer
  – Definiteness: each step is precisely stated
  – Effective computability: each step can be carried out by a computer
  – Finiteness: the procedure terminates
• Multiple algorithms can be used to solve the same problem
  – Some require fewer steps
  – Some exhibit more parallelism
  – Some have larger memory footprints than others

Choosing Algorithm Structure
Start from how the problem is organized, then follow the structure down to a pattern:
• Task-centric
  – Linear → Task Parallelism
  – Recursive → Divide and Conquer
• Data-flow-centric
  – Regular → Pipeline
  – Irregular → Event Driven
• Data-centric
  – Linear → Geometric Decomposition
  – Recursive → Recursive Data

A simple implementation
• Assume we have already loaded the array into __shared__ float partialSum[]

    unsigned int t = threadIdx.x;
    for (unsigned int stride = 1; stride < blockDim.x; stride *= 2) {
        __syncthreads();                  // finish the previous round before reusing partialSum
        if (t % (2*stride) == 0)          // every (2*stride)-th thread owns a partial sum
            partialSum[t] += partialSum[t + stride];
    }
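For context, here is the fragment embedded in a complete kernel. This is a minimal sketch, not code from the slides: the kernel name reduceSum, the bounds handling, and the assumption that blockDim.x is a power of two are ours, and each block writes one partial result that a second launch (or the host) must still combine.

    #define BLOCK_SIZE 256

    // Each block reduces BLOCK_SIZE elements of `in` to one element of `out`.
    // Assumes blockDim.x == BLOCK_SIZE, a power of two.
    __global__ void reduceSum(const float *in, float *out, unsigned int n) {
        __shared__ float partialSum[BLOCK_SIZE];
        unsigned int t = threadIdx.x;
        unsigned int i = blockIdx.x * blockDim.x + t;

        partialSum[t] = (i < n) ? in[i] : 0.0f;   // pad the tail with zeros

        for (unsigned int stride = 1; stride < blockDim.x; stride *= 2) {
            __syncthreads();
            if (t % (2*stride) == 0)
                partialSum[t] += partialSum[t + stride];
        }
        if (t == 0)                               // thread 0 holds the block's total
            out[blockIdx.x] = partialSum[0];
    }

One known weakness of this indexing: the % test leaves the surviving threads scattered across warps, so every warp stays busy even when few of its threads do useful work; the usual fix is a stride-halving loop that keeps the active threads contiguous.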

Mapping a Divide and Conquer Algorithm
[Figure: reduction tree over the array elements. In the first iteration, threads 0, 2, 4, 6, 8, 10, ... each add a neighboring pair (0+1, 2+3, 4+5, 6+7, 8+9, 10+11, ...); later iterations combine the partial sums (0..3, 4..7, then 0..7, 8..15, ...) until one thread holds the total.]

Tiled (Stenciled) Algorithms are Important for Geometric Decomposition
• A framework for memory data sharing and reuse by increasing data access locality
  – Tiled access patterns allow small cache/scratchpad memories to hold on to data for re-use
  – For matrix multiplication, a 16 x 16 thread block performs 2*256 = 512 loads of floats from device memory for 256 * (2*16) = 8,192 mul/add operations
• A convenient framework for organizing threads (tasks)
[Figure: WIDTH x WIDTH matrices M, N, and P; thread block (bx, by), with threads indexed by (tx, ty) in 0..bsize-1, computes one bsize x bsize sub-block Psub of P.]
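The tiled pattern is easiest to see in code. The sketch below is ours, not the slides': the kernel name tiledMatMul is made up, and Width is assumed to be a multiple of TILE_WIDTH.

    #define TILE_WIDTH 16

    // Each block computes one TILE_WIDTH x TILE_WIDTH tile of Pd = Md * Nd.
    __global__ void tiledMatMul(const float *Md, const float *Nd,
                                float *Pd, int Width) {
        __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
        __shared__ float Nds[TILE_WIDTH][TILE_WIDTH];

        int row = blockIdx.y * TILE_WIDTH + threadIdx.y;
        int col = blockIdx.x * TILE_WIDTH + threadIdx.x;
        float Pvalue = 0.0f;

        for (int m = 0; m < Width / TILE_WIDTH; ++m) {
            // Two global loads per thread per phase...
            Mds[threadIdx.y][threadIdx.x] = Md[row * Width + m * TILE_WIDTH + threadIdx.x];
            Nds[threadIdx.y][threadIdx.x] = Nd[(m * TILE_WIDTH + threadIdx.y) * Width + col];
            __syncthreads();

            // ...feed 2*TILE_WIDTH mul/add operations out of shared memory.
            for (int k = 0; k < TILE_WIDTH; ++k)
                Pvalue += Mds[threadIdx.y][k] * Nds[k][threadIdx.x];
            __syncthreads();
        }
        Pd[row * Width + col] = Pvalue;
    }

Per phase, the 256 threads of a block issue 2*256 = 512 global loads and perform 256 * (2*16) = 8,192 mul/add operations, exactly the ratio quoted above.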

Increased Work per Thread for even more locality
• Each thread computes two elements of Pdsub
• Reduced loads from global memory (Md) to shared memory
• Reduced instruction overhead
  – More work done in each iteration
[Figure: Md, Nd, and Pd; each thread block now covers two adjacent TILE_WIDTH x TILE_WIDTH sub-tiles Pdsub of Pd.]
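One way to realize this, sketched under our own naming (pairing two horizontally adjacent output tiles is an illustrative choice): each thread keeps two accumulators, and the Md tile staged in shared memory is reused for both outputs, so three global loads per thread per phase produce two result elements instead of the four loads the one-element version would need.

    #define TILE_WIDTH 16   // as in the previous sketch

    // Grid x-dimension shrinks by half: each block covers two output tiles.
    __global__ void tiledMatMul2(const float *Md, const float *Nd,
                                 float *Pd, int Width) {
        __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
        __shared__ float Nds[TILE_WIDTH][2 * TILE_WIDTH];

        int row = blockIdx.y * TILE_WIDTH + threadIdx.y;
        int col = blockIdx.x * (2 * TILE_WIDTH) + threadIdx.x;
        float Pvalue0 = 0.0f, Pvalue1 = 0.0f;

        for (int m = 0; m < Width / TILE_WIDTH; ++m) {
            // One Md load serves both accumulators; two Nd loads, one per tile.
            Mds[threadIdx.y][threadIdx.x] = Md[row * Width + m * TILE_WIDTH + threadIdx.x];
            Nds[threadIdx.y][threadIdx.x] = Nd[(m * TILE_WIDTH + threadIdx.y) * Width + col];
            Nds[threadIdx.y][threadIdx.x + TILE_WIDTH] =
                Nd[(m * TILE_WIDTH + threadIdx.y) * Width + col + TILE_WIDTH];
            __syncthreads();

            for (int k = 0; k < TILE_WIDTH; ++k) {
                Pvalue0 += Mds[threadIdx.y][k] * Nds[k][threadIdx.x];
                Pvalue1 += Mds[threadIdx.y][k] * Nds[k][threadIdx.x + TILE_WIDTH];
            }
            __syncthreads();
        }
        Pd[row * Width + col] = Pvalue0;
        Pd[row * Width + col + TILE_WIDTH] = Pvalue1;
    }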

Double Buffering - a frequently used algorithm pattern
• One could double buffer the computation, getting a better instruction mix within each thread
  – This is classic software pipelining in ILP compilers

Single buffered:

    Loop {
        Load current tile to shared memory
        syncthreads()
        Compute current tile
        syncthreads()
    }

Double buffered:

    Load next tile from global memory
    Loop {
        Deposit current tile to shared memory
        syncthreads()
        Load next tile from global memory
        Compute current tile
        syncthreads()
    }
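In CUDA the "load next tile" step goes through registers, since global-to-shared transfers pass through a thread's registers anyway. A hedged sketch applying the pattern to the tiled multiply above (our names and assumptions, as before):

    // Double-buffered variant of tiledMatMul: while the current tile is being
    // consumed from shared memory, the next one is fetched into registers.
    __global__ void dbufMatMul(const float *Md, const float *Nd,
                               float *Pd, int Width) {
        __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
        __shared__ float Nds[TILE_WIDTH][TILE_WIDTH];

        int row = blockIdx.y * TILE_WIDTH + threadIdx.y;
        int col = blockIdx.x * TILE_WIDTH + threadIdx.x;
        float Pvalue = 0.0f;

        // Prefetch the first tile into registers.
        float mReg = Md[row * Width + threadIdx.x];
        float nReg = Nd[threadIdx.y * Width + col];

        for (int m = 0; m < Width / TILE_WIDTH; ++m) {
            // Deposit the prefetched (current) tile into shared memory.
            Mds[threadIdx.y][threadIdx.x] = mReg;
            Nds[threadIdx.y][threadIdx.x] = nReg;
            __syncthreads();

            // Issue the next tile's loads; they can overlap the mul/adds below.
            if (m + 1 < Width / TILE_WIDTH) {
                mReg = Md[row * Width + (m + 1) * TILE_WIDTH + threadIdx.x];
                nReg = Nd[((m + 1) * TILE_WIDTH + threadIdx.y) * Width + col];
            }
            for (int k = 0; k < TILE_WIDTH; ++k)
                Pvalue += Mds[threadIdx.y][k] * Nds[k][threadIdx.x];
            __syncthreads();
        }
        Pd[row * Width + col] = Pvalue;
    }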

Double Buffering
• Deposit blue tile from registers into shared memory
• Syncthreads
• Load orange tile into registers
• Compute blue tile
• Deposit orange tile into shared memory
• ...
[Figure: Md, Nd, and Pd with the current (blue) and next (orange) tiles highlighted.]

One can trade more work for increased parallelism
• Diamond search algorithm for motion estimation: work-efficient but sequential
  – Popular in traditional CPU implementations
• Exhaustive search: totally parallel but work-inefficient
  – Popular in HW and parallel implementations
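To make the contrast concrete, here is an illustrative exhaustive-search kernel (not from the slides; every name and parameter is hypothetical): one thread per candidate motion vector, each independently computing a sum of absolute differences (SAD) for one 16 x 16 macroblock. The reference frame is assumed padded so that no candidate reads out of bounds.

    #define SEARCH_RANGE 16   // candidate offsets in [-16, +15] per dimension
    #define MB 16             // 16 x 16 macroblock

    // Launch with one 32 x 32 thread block per macroblock (mbx, mby):
    // all 1,024 candidates are evaluated in parallel.
    __global__ void fullSearchSAD(const unsigned char *cur, const unsigned char *ref,
                                  int width, int mbx, int mby, int *sad) {
        int dx = (int)threadIdx.x - SEARCH_RANGE;   // this thread's candidate (dx, dy)
        int dy = (int)threadIdx.y - SEARCH_RANGE;

        int sum = 0;
        for (int y = 0; y < MB; ++y)
            for (int x = 0; x < MB; ++x) {
                int c = cur[(mby + y) * width + (mbx + x)];
                int r = ref[(mby + dy + y) * width + (mbx + dx + x)];
                sum += abs(c - r);
            }
        // One SAD per candidate; a separate reduction picks the minimum.
        sad[threadIdx.y * (2 * SEARCH_RANGE) + threadIdx.x] = sum;
    }

Diamond search would evaluate only a handful of these candidates, but each step depends on the previous step's best match, so there is far less to run in parallel.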

An MPEG Algorithm based on Data Parallelism
• Loops distributed: DOALL style
  – Replicates instructions and tables across accelerators
• If instructions and data are too large for local memory...
  – Large memory transfers are required to preserve data
  – Saturation of, and contention for, communication resources can leave computation resources idle
Memory bandwidth constrains performance.

Loop fusion & memory privatization
• Stage loops are fused into a single DOALL macroblock loop (see the sketch below)
  – Memory privatization reduces main-memory accesses
• Replicates instructions and tables across processors
  – Local memory constraints may prevent this technique
Novel dimensions of parallelism reduce communication.
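A minimal host-side C sketch of what fusion plus privatization does to the intermediate data; all names, sizes, and stage bodies here are made up for illustration.

    #include <string.h>

    #define NUM_MB 396   /* macroblocks per frame (illustrative) */
    #define BLK    64    /* coefficients per macroblock (illustrative) */

    /* Hypothetical stage bodies, stubbed so the sketch compiles. */
    static void transform(const short *in, short *out) { memcpy(out, in, BLK * sizeof *out); }
    static void quantize(const short *in, short *out)  { for (int i = 0; i < BLK; ++i) out[i] = in[i] >> 3; }

    /* Before fusion: one DOALL loop per stage. `coeffs` is a full-frame
       intermediate that must round-trip through main memory. */
    void stagesDistributed(const short frame[NUM_MB][BLK], short out[NUM_MB][BLK]) {
        static short coeffs[NUM_MB][BLK];
        for (int mb = 0; mb < NUM_MB; ++mb) transform(frame[mb], coeffs[mb]);
        for (int mb = 0; mb < NUM_MB; ++mb) quantize(coeffs[mb], out[mb]);
    }

    /* After fusion: one DOALL macroblock loop. The intermediate shrinks to a
       per-iteration buffer that each processor can privatize in local memory. */
    void stagesFused(const short frame[NUM_MB][BLK], short out[NUM_MB][BLK]) {
        for (int mb = 0; mb < NUM_MB; ++mb) {   /* iterations are independent */
            short priv[BLK];                    /* privatized intermediate */
            transform(frame[mb], priv);
            quantize(priv, out[mb]);
        }
    }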

Pipeline or "Spatial Computing" Model
• Each PE performs as one pipeline stage in macroblock processing
• Imbalanced stages result in idle resources
• Takes advantage of direct, accelerator-to-accelerator communication
• Not very effective in CUDA but can be effective for Cell
Efficient point-to-point communication can enable new models.

Small decisions in algorithms can have a major effect on parallelism (H.263 motion estimation example)
• Different algorithms may expose different levels of parallelism while achieving the desired result
• In motion estimation, one can use previous vectors (either from space or time) as guess vectors