Pattern Parallel Programming
B. Wilkinson
PatternProgIntro.ppt, modification date: Feb 21, 2016

Traditional programming approach
• Explicitly specify message passing (MPI)
• Low-level thread APIs (Pthreads, Java threads, OpenMP, …)
Both require programmers to use low-level routines.
We need to make parallel programming easier, more structured and more scalable, especially in an educational environment.

Pattern Programming Concept
The programmer begins by constructing the program using established computational or algorithmic “patterns” that provide a structure.
Design patterns have been part of software engineering for many years:
• Reusable solutions to commonly occurring problems*
• Patterns provide a guide to “best practices”, not a final implementation
• Provide a good, scalable design structure
• Avoid common problems with ad hoc designs
• Make it easier to reason about and debug programs
* http://en.wikipedia.org/wiki/Design_pattern_(computer_science)

Parallel Patterns
Advantages
• Abstract/hide the underlying computing environment
• Generally avoid deadlocks and race conditions
• Reduce source code size (lines of code)
• Lead to automated conversion into parallel programs without the need to write low-level MPI message-passing routines
• Allow hierarchical designs, with patterns embedded into patterns, and pattern operators to combine patterns
Disadvantages
• New approach to learn
• Takes away some of the freedom from the programmer
• Reduced performance (cf. using high-level languages instead of assembly language)

What parallel design patterns are we talking about?
• Low-level patterns:
  • fork-join
  • point-to-point
  • broadcast
  • scatter
  • gather, reduce, …
• Higher-level patterns forming a complete computation:
  • master-slave
  • workpool
  • pipeline
  • divide and conquer
  • stencil
  • map-reduce, …

Low-level MPI message-passing patterns
MPI point-to-point data transfer (send-receive)
[Figure: Source sends Data to Destination]
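The send-receive pair can be mimicked with two Python threads and a `queue.Queue` that stands in for the communication channel. This is a rough sketch, not MPI code (in MPI the pair would be `MPI_Send` and `MPI_Recv`). The function names `source`, `destination` and `point_to_point` are hypothetical. Python threads here only show the communication structure and give no speedup.

```python
import threading
import queue


def source(channel: queue.Queue, data) -> None:
    # "Send": put the data on the channel to the destination
    channel.put(data)


def destination(channel: queue.Queue, received: list) -> None:
    # "Receive": block until the data arrives, then keep it
    received.append(channel.get())


def point_to_point(data):
    channel = queue.Queue()
    received = []
    t_dst = threading.Thread(target=destination, args=(channel, received))
    t_src = threading.Thread(target=source, args=(channel, data))
    t_dst.start()  # the receiver may start first; get() simply waits
    t_src.start()
    t_src.join()
    t_dst.join()
    return received[0]


if __name__ == "__main__":
    print(point_to_point([1, 2, 3]))  # prints [1, 2, 3]
```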

Collective patterns
Broadcast Pattern
Sends the same data to each of a group of processes. A common pattern to get the same data to all processes, especially at the beginning of a computation.
[Figure: Source sends the same data to all Destinations]
Note: The patterns shown do not imply that an implementation performs them as drawn. Only the final result is the same in any parallel implementation; patterns do not describe the implementation.
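Below is a minimal simulation of the broadcast result. It assumes one thread per process, a hypothetical `broadcast` function, and a `queue.Queue` mailbox for each rank, with rank 0 as the source. The source uses a plain loop of sends. In line with the note above, a real implementation such as MPI's `MPI_Bcast` may organize the transfers differently (for example as a tree), but every rank ends up with the same data.

```python
import threading
import queue


def broadcast(data, num_procs):
    """Rank 0 (the source) sends the same data to every other rank.
    Returns the value each rank holds afterwards, indexed by rank."""
    mailboxes = [queue.Queue() for _ in range(num_procs)]
    results = [None] * num_procs

    def process(rank):
        if rank == 0:
            # Source: a simple loop of sends to every destination
            for dest in range(1, num_procs):
                mailboxes[dest].put(data)
            results[0] = data
        else:
            # Destination: wait for the data from the source
            results[rank] = mailboxes[rank].get()

    threads = [threading.Thread(target=process, args=(r,)) for r in range(num_procs)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```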

Scatter Pattern
Distributes a collection of data items to a group of processes. A common pattern to get data to all processes.
[Figure: Source sends different data to each Destination]
Usually the data sent are parts of an array.
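In the sketch below, the source splits an array into contiguous blocks and sends a different block to each rank, including itself. The thread-and-mailbox setup and the function name `scatter` are illustrative assumptions. The block-splitting rule is one common choice and is not taken from the slides.

```python
import threading
import queue


def scatter(array, num_procs):
    """Rank 0 sends a different contiguous block of `array` to each rank
    (itself included). Returns the block received by each rank."""
    mailboxes = [queue.Queue() for _ in range(num_procs)]
    received = [None] * num_procs

    def process(rank):
        if rank == 0:
            n = len(array)
            for dest in range(num_procs):
                # Integer arithmetic keeps block sizes within one of each other
                lo = dest * n // num_procs
                hi = (dest + 1) * n // num_procs
                mailboxes[dest].put(array[lo:hi])
        # Every rank, the source included, receives its own block
        received[rank] = mailboxes[rank].get()

    threads = [threading.Thread(target=process, args=(r,)) for r in range(num_procs)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return received
```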

Gather pattern
Essentially the reverse of the scatter pattern: it receives data items from a group of processes, and the data are collected at the destination in an array.
[Figure: Sources send Data to one Destination]
A common pattern, especially at the end of a computation, to collect results.
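Gather can be sketched the same way, here with a hypothetical `gather` function and a single inbox at the root. Messages may arrive in any order, so each one carries the sender's rank and the root places the value at that index of the result array. This rank-tagging approach is an illustrative assumption.

```python
import threading
import queue


def gather(local_values, root=0):
    """Each rank sends its local value to `root`, which stores it at index
    `rank` of the result array. Messages may arrive in any order."""
    num_procs = len(local_values)
    inbox = queue.Queue()
    result = [None] * num_procs

    def process(rank):
        # Tag the message with the sender's rank so the root can place it
        inbox.put((rank, local_values[rank]))
        if rank == root:
            for _ in range(num_procs):
                src, value = inbox.get()
                result[src] = value

    threads = [threading.Thread(target=process, args=(r,)) for r in range(num_procs)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result
```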

Reduce Pattern
A common pattern to get data back to the master from all processes and then aggregate it by combining the collected data into one answer.
The reduction needs to be an associative operation (e.g., 3 + (4 + 5) = (3 + 4) + 5) so that the implementation can group the partial combinations in any way. Being commutative as well (e.g., 3 + 4 = 4 + 3) lets operands be combined in any order, which gives more flexibility in the parallel implementation.
[Figure: Sources send Data to the Destination, where a reduction operation combines the collected data into one answer]
Note that subtraction is not associative, e.g., 3 - (4 - 5) ≠ (3 - 4) - 5, but one can use addition with negative numbers.
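To show why associativity matters, the sketch below compares a left-to-right reduction with a pairwise "tree" grouping that a parallel implementation might use. Both function names are hypothetical. The tree version keeps operand order and changes only the grouping. Addition gives the same answer either way. Subtraction does not, unless it is rewritten as addition of negative numbers.

```python
import operator
from functools import reduce


def sequential_reduce(op, values):
    # Left to right: ((v0 op v1) op v2) op v3 ...
    return reduce(op, values)


def tree_reduce(op, values):
    # Combine neighbouring pairs level by level, a grouping a parallel
    # implementation might use: (v0 op v1) op (v2 op v3) ...
    # Operand order is kept, so only the grouping changes.
    level = list(values)
    while len(level) > 1:
        nxt = [op(level[i], level[i + 1]) for i in range(0, len(level) - 1, 2)]
        if len(level) % 2 == 1:
            nxt.append(level[-1])  # an odd element is carried up unchanged
        level = nxt
    return level[0]


values = [3, 4, 5, 6]
print(sequential_reduce(operator.add, values), tree_reduce(operator.add, values))  # 18 18
print(sequential_reduce(operator.sub, values), tree_reduce(operator.sub, values))  # -12 0
# Subtraction rewritten as addition of negative numbers is safe to regroup
negated = [values[0]] + [-v for v in values[1:]]
print(tree_reduce(operator.add, negated))  # -12
```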

Collective all-to-all broadcast
Sources and destinations are the same processes.
A common all-to-all pattern, often used within a computation, is to send data from all processes to all processes: every process sends data to every other process (one-way).
[Figure: every process, acting as both source and destination, sends data to every other process]
Versions of this can be found in MPI.
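The all-to-all broadcast can be simulated with one mailbox per process. Each process sends its value to every other process and then collects the values it receives. The end result matches what MPI's `MPI_Allgather` produces, although the threads, mailboxes and the `all_to_all` name here are illustrative assumptions. The queues are unbounded, so sending everything before receiving cannot deadlock.

```python
import threading
import queue


def all_to_all(local_values):
    """Every process sends its value to every other process (one-way);
    each process ends up with the values of all processes, indexed by rank."""
    p = len(local_values)
    mailboxes = [queue.Queue() for _ in range(p)]
    results = [None] * p

    def process(rank):
        # Send to every other process
        for dest in range(p):
            if dest != rank:
                mailboxes[dest].put((rank, local_values[rank]))
        # Receive from every other process
        collected = [None] * p
        collected[rank] = local_values[rank]
        for _ in range(p - 1):
            src, value = mailboxes[rank].get()
            collected[src] = value
        results[rank] = collected

    threads = [threading.Thread(target=process, args=(r,)) for r in range(p)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```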

Some Higher-Level Message-Passing Patterns
Master/slave
The computation is divided into parts, which are passed out to slaves to perform; the slaves return their results. This is the basis of most parallel computing.
[Figure: Master (source/sink) with two-way connections to slave compute nodes]
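In the master/slave sketch below, the master splits the data into one part per slave and sends each part out. It then collects the slaves' results over a return channel (the two-way connection). The function names and the sum-of-squares workload are hypothetical examples. Each slave gets exactly one part, which is what separates this from the workpool below.

```python
import threading
import queue


def master_slave(data, num_slaves, work):
    """Master divides `data` into one part per slave, sends each part out,
    and collects each slave's result over a return channel."""
    to_slave = [queue.Queue() for _ in range(num_slaves)]
    to_master = queue.Queue()

    def slave(i):
        part = to_slave[i].get()        # receive a part from the master
        to_master.put((i, work(part)))  # send the result back

    slaves = [threading.Thread(target=slave, args=(i,)) for i in range(num_slaves)]
    for t in slaves:
        t.start()

    # Master: divide the computation into one part per slave
    n = len(data)
    for i in range(num_slaves):
        to_slave[i].put(data[i * n // num_slaves:(i + 1) * n // num_slaves])

    partial = [None] * num_slaves
    for _ in range(num_slaves):
        i, result = to_master.get()
        partial[i] = result
    for t in slaves:
        t.join()
    return partial


def sum_of_squares(part):
    return sum(x * x for x in part)


if __name__ == "__main__":
    print(sum(master_slave(list(range(10)), 3, sum_of_squares)))  # prints 285
```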

Workpool
Once a slave completes a task, the slave is given another task from the master's task queue, which gives the pattern a load-balancing quality. We need to differentiate between the master-slave pattern, which does not imply a task queue, and the workpool, which has a task queue.
[Figure: Master holding a task queue; each slave/worker receives a task from the task queue, returns a result, and is given another task if the task queue is not empty; the master aggregates the answers]
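The workpool sketch below makes one simplification: workers take tasks directly from a shared `queue.Queue` rather than asking the master for each task by message. The master fills the queue first, then aggregates (here, sums) the results once all workers finish. The `workpool` name and return shape are hypothetical. The per-worker task counts show the load-balancing effect, since faster workers end up taking more tasks.

```python
import threading
import queue


def workpool(tasks, num_workers, work):
    """Workers repeatedly take the next task from a shared task queue until it
    is empty; the master then aggregates (sums) the results.
    Returns (aggregate, tasks_done_per_worker)."""
    task_queue = queue.Queue()
    for task in tasks:
        task_queue.put(task)  # all tasks queued before workers start
    results = queue.Queue()
    tasks_done = [0] * num_workers

    def worker(wid):
        while True:
            try:
                task = task_queue.get_nowait()  # another task if queue not empty
            except queue.Empty:
                return
            results.put(work(task))
            tasks_done[wid] += 1

    threads = [threading.Thread(target=worker, args=(w,)) for w in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    # Master: aggregate the answers
    aggregate = 0
    while not results.empty():
        aggregate += results.get()
    return aggregate, tasks_done
```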

More Specialized High-level Patterns
Pipeline
[Figure: Master (source/sink) and a chain of slave (worker) stages, Stage 1, Stage 2, Stage 3, …, Stage n, linked by one-way connections]
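In the pipeline sketch below, each stage runs in its own thread and is connected to the next stage by a one-way queue. The master feeds items into stage 1 and collects them after the last stage. A sentinel object is passed down the pipeline to signal the end of the input. The sentinel and the `pipeline` function are illustrative assumptions. Items come out in input order because every stage is a single thread reading a FIFO queue.

```python
import threading
import queue


def pipeline(items, stages):
    """Pass each item through `stages` in order, one thread per stage."""
    queues = [queue.Queue() for _ in range(len(stages) + 1)]
    stop = object()  # sentinel marking the end of the input

    def stage(k, fn):
        while True:
            item = queues[k].get()
            if item is stop:
                queues[k + 1].put(stop)  # pass termination downstream
                return
            queues[k + 1].put(fn(item))  # one-way connection to the next stage

    threads = [threading.Thread(target=stage, args=(k, fn)) for k, fn in enumerate(stages)]
    for t in threads:
        t.start()

    for item in items:  # master as source
        queues[0].put(item)
    queues[0].put(stop)

    out = []
    while True:  # master as sink
        item = queues[-1].get()
        if item is stop:
            break
        out.append(item)
    for t in threads:
        t.join()
    return out
```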

Divide and Conquer
[Figure: tree of compute nodes; the problem is divided down the tree and the results are merged back up to the source/sink over two-way connections]
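Merge sort is one example of divide and conquer; the slides do not single out a specific algorithm. In the sketch below, each level splits its input in two, hands the left half to a child thread, sorts the right half itself, and merges the two sorted halves on the way back up the tree. The `parallel_merge_sort` name and the `depth` cutoff (below which recursion stays sequential) are hypothetical choices.

```python
import heapq
import threading


def parallel_merge_sort(data, depth=2):
    """Divide: split in two, spawning a thread for the left half while
    `depth` > 0. Merge: combine the sorted halves on the way back up."""
    if len(data) <= 1:
        return list(data)
    mid = len(data) // 2
    if depth > 0:
        result = {}

        def sort_left():
            result["left"] = parallel_merge_sort(data[:mid], depth - 1)

        child = threading.Thread(target=sort_left)  # child node takes the left half
        child.start()
        right = parallel_merge_sort(data[mid:], depth - 1)
        child.join()  # two-way connection: wait for the child's result
        left = result["left"]
    else:
        left = parallel_merge_sort(data[:mid], 0)
        right = parallel_merge_sort(data[mid:], 0)
    return list(heapq.merge(left, right))
```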

All-to-All
All compute nodes can communicate with all the other nodes.
[Figure: compute nodes with two-way connections to all other nodes; master as source/sink]

Stencil
All compute nodes can communicate only with neighboring nodes.
Usually a synchronous computation: it performs a number of iterations to converge on a solution, e.g., solving Laplace’s/heat equation. On each iteration, each node communicates with its neighbors to get their stored computed values.
[Figure: grid of compute nodes with two-way connections between neighbors; source/sink]
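The update below is a sequential sketch of one stencil iteration for Laplace's equation on a 2D grid, using a Jacobi-style five-point stencil. Each interior point becomes the average of its four neighbors. All reads come from the previous iteration's grid, which is what makes the step synchronous. Boundary values are held fixed. The `jacobi_sweep` name is hypothetical, and in a parallel version each node would first obtain its neighbors' values by message passing.

```python
def jacobi_sweep(grid):
    """One stencil iteration: each interior point becomes the average of its
    four neighbours (north, south, west, east). Boundary values stay fixed."""
    rows, cols = len(grid), len(grid[0])
    new = [row[:] for row in grid]  # write to a copy; read only old values
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            new[i][j] = 0.25 * (grid[i - 1][j] + grid[i + 1][j]
                                + grid[i][j - 1] + grid[i][j + 1])
    return new
```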

Iterative synchronous patterns
• A pattern is repeated until some termination condition occurs.
• Synchronization occurs at each iteration to establish the termination condition, which is often a global condition.
• Note that this is two patterns merged together sequentially, if we call “repeat iteration” a pattern.
[Figure: Pattern → Check termination condition → Repeat or Stop]
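In the sketch below, each thread owns one value, applies a step to it, and then waits at a `threading.Barrier`. The barrier's `action` callable runs in one thread after all threads arrive and before any is released. It evaluates a global termination condition over all values. The `iterate_until` name, the halving step and the "all values below 1" condition are hypothetical examples.

```python
import threading


def iterate_until(values, step, done, max_iters=1000):
    """Each thread owns one value. On every iteration all threads apply `step`
    to their value, then meet at a barrier, where one thread evaluates the
    global termination condition `done` over all values."""
    vals = list(values)
    state = {"stop": False, "iters": 0}

    def check():
        # Barrier action: runs in one thread once all threads have arrived,
        # before any of them is released
        state["iters"] += 1
        state["stop"] = done(vals) or state["iters"] >= max_iters

    barrier = threading.Barrier(len(vals), action=check)

    def worker(rank):
        while True:
            vals[rank] = step(vals[rank])  # the repeated pattern
            barrier.wait()                 # synchronize each iteration
            if state["stop"]:
                return

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(len(vals))]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return vals, state["iters"]


if __name__ == "__main__":
    # Halve every value until all are below 1 (a global condition)
    print(iterate_until([8, 3, 20], lambda x: x / 2, lambda vs: all(v < 1 for v in vs)))
    # prints ([0.25, 0.09375, 0.625], 5)
```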

Iterative synchronous stencil pattern
Stencil: all compute nodes can communicate only with neighboring nodes.
Application: solving Laplace’s/heat equation, performing a number of iterations to converge on a solution.
[Figure: Stencil → Check termination condition → Repeat or Stop]
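The sketch below combines the stencil with a termination check for the 1D Laplace equation with fixed end values. Each interior point is repeatedly replaced by the average of its two neighbors. Iteration stops when the largest change in one iteration falls below a tolerance, which is a global condition. The function name, tolerance and grid size are illustrative assumptions. The converged solution should be close to a straight line between the two boundary values.

```python
def solve_laplace_1d(n, left, right, tol=1e-6, max_iters=100_000):
    """Iterative synchronous stencil: Jacobi iteration for u'' = 0 on n points
    with fixed ends. Returns (solution, iterations used)."""
    u = [0.0] * n
    u[0], u[-1] = left, right
    for it in range(1, max_iters + 1):
        new = u[:]
        for i in range(1, n - 1):
            new[i] = 0.5 * (u[i - 1] + u[i + 1])  # average of the two neighbours
        # Global termination condition: largest change this iteration
        change = max(abs(a - b) for a, b in zip(new, u))
        u = new
        if change < tol:
            return u, it
    return u, max_iters
```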

Iterative synchronous all-to-all pattern
Example: the N-body problem needs an “iterative synchronous all-to-all” pattern, where on each iteration all processes exchange data with each other.
[Figure: All-to-all → Check termination condition → Repeat or Stop]
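Below is a sequential sketch of one iteration of a 2D gravitational N-body step. Every body's acceleration uses the positions of all other bodies from the same snapshot, which is the data the all-to-all exchange would supply. All bodies are then updated together. The softening term `eps`, the unit constant `G`, and the semi-implicit Euler update are illustrative assumptions, not the slides' method. Because pairwise forces are equal and opposite, total momentum should stay essentially constant, which the test checks.

```python
import math


def nbody_step(pos, vel, mass, dt, G=1.0, eps=1e-3):
    """One synchronous iteration: accelerations from all other bodies
    (all-to-all), then every body updates from the same snapshot."""
    n = len(pos)
    acc = [[0.0, 0.0] for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            dx = pos[j][0] - pos[i][0]
            dy = pos[j][1] - pos[i][1]
            r2 = dx * dx + dy * dy + eps * eps  # softened squared distance
            inv_r3 = 1.0 / (r2 * math.sqrt(r2))
            acc[i][0] += G * mass[j] * dx * inv_r3
            acc[i][1] += G * mass[j] * dy * inv_r3
    # Semi-implicit Euler: update velocity first, then position
    new_vel = [[vel[i][0] + acc[i][0] * dt, vel[i][1] + acc[i][1] * dt] for i in range(n)]
    new_pos = [[pos[i][0] + new_vel[i][0] * dt, pos[i][1] + new_vel[i][1] * dt] for i in range(n)]
    return new_pos, new_vel
```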

Previous/Existing Work
Patterns have been explored in several projects.
• Industrial efforts
  – Intel Threading Building Blocks (TBB), Cilk Plus, Array Building Blocks (ArBB). Focus on very low-level patterns such as fork-join
• Universities:
  – University of Illinois at Urbana-Champaign and University of California, Berkeley
  – University of Torino/Università di Pisa, Italy
“Structured Parallel Programming: Patterns for Efficient Computation,” Michael McCool, James Reinders, Arch Robison, Morgan Kaufmann, 2012 (Intel tools: TBB, Cilk, ArBB)

Our approach
We have developed several tools at different levels of abstraction that avoid using low-level MPI and enable students to create working patterns very quickly.
• Suzaku framework – provides pre-written pattern-based routines and macros that hide the MPI code. Low-level patterns, workpool, …
• Paraguin compiler – a compiler-directive approach that creates MPI code. Patterns implemented include scatter-gather for a master-slave pattern, stencil, …
• Seeds framework – high-level Java-based software. Many patterns are implemented, including workpool, pipeline, synchronous iterative all-to-all, and stencil. Self-deploys and executes on any platform, local computers or distributed computers.
Historically, Seeds was developed first, as part of a UNC-C PhD project by Jeremy Villalobos, 2007–2011.

Acknowledgements
The Seeds framework was developed by Jeremy Villalobos in his PhD thesis “Running Parallel Applications on a Heterogeneous Environment with Accessible Development Practices and Automatic Scalability,” UNC-Charlotte, 2011. Extending this work to the teaching environment was supported by the National Science Foundation under grant "Collaborative Research: Teaching Multicore and Many-Core Programming at a Higher Level of Abstraction" #1141005/1141006 (2012–2015). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.

Questions