Towards a Science of Parallel Programming Keshav Pingali

Towards a Science of Parallel Programming Keshav Pingali University of Texas, Austin 1


Problem Statement
• Community has worked on parallel programming for more than 30 years
  – programming models
  – machine models
  – programming languages
  – …
• However, parallel programming is still a research problem
  – matrix computations, stencil computations, FFTs etc. are well-understood
  – each new application is a “new phenomenon”
    • few insights for irregular applications
• Thesis: we need a science of parallel programming
  – analysis: framework for thinking about parallelism in application
  – synthesis: produce an efficient parallel implementation of application
[Image: “The Alchemist”, Cornelius Bega (1663)]

Analogy: science of electro-magnetism
• Seemingly unrelated phenomena
• Unifying abstractions
• Specialized models that exploit structure

Organization of talk
• Seemingly unrelated parallel algorithms and data structures
  – Stencil codes
  – Delaunay mesh refinement
  – Event-driven simulation
  – Graph reduction of functional languages
  – …
• Unifying abstractions
  – Operator formulation of algorithms
  – Amorphous data-parallelism
  – Galois programming model
  – Baseline parallel implementation
• Specialized implementations that exploit structure
  – Structure of algorithms
  – Optimized compiler and runtime system support for different kinds of structure
• Ongoing work

Some parallel algorithms 5


Examples

Application/domain        Algorithm
Meshing                   Generation/refinement/partitioning
Compilers                 Iterative and elimination-based dataflow algorithms
Functional interpreters   Graph reduction, static and dynamic dataflow
Maxflow                   Preflow-push, augmenting paths
Minimal spanning trees    Prim, Kruskal, Boruvka
Event-driven simulation   Chandy-Misra-Bryant, Jefferson Timewarp
AI                        Message-passing algorithms
Stencil computations      Jacobi, Gauss-Seidel, red-black ordering
Sparse linear solvers     Sparse MVM, sparse Cholesky factorization

Stencil computation: Jacobi iteration
• Finite-difference method for solving PDEs
  – discrete representation of domain: grid
• Values at interior points are updated using values at neighbors
  – values at boundary points are fixed
• Data structure:
  – dense arrays
• Parallelism:
  – values at all interior points can be computed simultaneously
  – parallelism is not dependent on input values
• Compiler can find the parallelism
  – spatial loops are DO-ALL loops

[Figure: 5-point stencil centered at (i, j) with neighbors (i-1, j), (i+1, j), (i, j-1), (i, j+1)]

    // Jacobi iteration with 5-point stencil
    // initialize array A
    for time = 1, nsteps
      for <i, j> in [2, n-1]x[2, n-1]
        temp(i, j) = 0.25*(A(i-1, j) + A(i+1, j) + A(i, j-1) + A(i, j+1))
      for <i, j> in [2, n-1]x[2, n-1]
        A(i, j) = temp(i, j)
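The Jacobi pseudocode on this slide can be sketched in runnable form as follows. This is a minimal illustration, assuming the grid is stored as a list of lists with 0-based indices (the slide uses 1-based indices); the function name `jacobi` is hypothetical. Each interior update in a sweep reads only the array from the previous sweep, which is what makes the spatial loops DO-ALL loops.

```python
def jacobi(grid, nsteps):
    """Run nsteps Jacobi sweeps with a 5-point stencil on a 2D grid.

    Boundary values stay fixed; the input grid is not modified.
    """
    n_rows, n_cols = len(grid), len(grid[0])
    a = [row[:] for row in grid]
    for _ in range(nsteps):
        # temp starts as a copy so boundary values carry over unchanged.
        temp = [row[:] for row in a]
        # Every (i, j) reads only the old array a, so all interior points
        # could be computed simultaneously.
        for i in range(1, n_rows - 1):
            for j in range(1, n_cols - 1):
                temp[i][j] = 0.25 * (a[i - 1][j] + a[i + 1][j]
                                     + a[i][j - 1] + a[i][j + 1])
        a = temp
    return a
```

The two-array structure mirrors the `temp` array of the pseudocode: writing into a separate array is what removes dependences between iterations of the spatial loops.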

Delaunay Mesh Refinement
• Iterative refinement to remove badly shaped triangles:

    while there are bad triangles do {
      pick a bad triangle;
      find its cavity;
      retriangulate cavity; // may create new bad triangles
    }

• Don’t-care non-determinism:
  – final mesh depends on order in which bad triangles are processed
  – applications do not care which mesh is produced
• Data structure:
  – graph in which nodes represent triangles and edges represent triangle adjacencies
• Parallelism:
  – bad triangles with cavities that do not overlap can be processed in parallel
  – parallelism is very “input-dependent”
    • compilers cannot determine this parallelism
  – (Miller et al.) at runtime, repeatedly build interference graph and find maximal independent sets for parallel execution

    Mesh m = /* read in mesh */
    WorkList wl;
    wl.add(m.badTriangles());
    while (!wl.empty()) {
      Element e = wl.get();
      if (e no longer in mesh) continue;
      Cavity c = new Cavity(e);   // determine new cavity
      c.expand();
      c.retriangulate();          // re-triangulate region
      m.update(c);                // update mesh
      wl.add(c.badTriangles());
    }

Event-driven simulation
• Stations communicate by sending messages with time-stamps on FIFO channels
• Stations have internal state that is updated when a message is processed
• Messages must be processed in time-order at each station
• Data structure:
  – messages in event-queue, sorted in time-order
• Parallelism:
  – conservative: Chandy-Misra-Bryant
    • station fires when it has messages on all incoming edges and processes earliest message
    • requires null messages to avoid deadlock
  – optimistic: Jefferson time-warp
    • station can fire when it has an incoming message on any edge
    • requires roll-back if speculative conflict is detected
[Figure: stations A and B exchanging time-stamped messages]
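As a point of reference for the two parallel schemes named on this slide, the sketch below shows only the sequential baseline: one global event queue, with messages processed in time-order. It is neither Chandy-Misra-Bryant nor Time Warp. The function name `simulate`, the handler signature, and the station names are assumptions made for illustration.

```python
import heapq
from itertools import count


def simulate(initial_events, handlers):
    """Process time-stamped messages in global time order.

    initial_events: iterable of (time, station, payload).
    handlers: dict mapping a station name to a function
        (time, payload) -> list of (new_time, target_station, new_payload).
    Returns the list of processed (time, station, payload) tuples.
    """
    queue, tie = [], count()  # tie keeps insertion order for equal times
    for time, station, payload in initial_events:
        heapq.heappush(queue, (time, next(tie), station, payload))
    log = []
    while queue:
        time, _, station, payload = heapq.heappop(queue)
        log.append((time, station, payload))
        # Processing a message may create new messages (new active elements).
        for new_time, target, new_payload in handlers[station](time, payload):
            if new_time < time:
                raise ValueError("message sent into the past")
            heapq.heappush(queue, (new_time, next(tie), target, new_payload))
    return log
```

For example, with `handlers = {"A": lambda t, p: [(t + 2, "B", p)], "B": lambda t, p: []}`, station A forwards every message to B with a delay of 2, and B only consumes messages.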

Remarks on algorithms
• Diverse algorithms and data structures
• Exploiting parallelism in irregular algorithms is very complex
  – Miller et al. DMR implementation: interference graph + maximal independent sets
  – Jefferson Timewarp algorithm for event-driven simulation
• Algorithms:
  – parallelism can be very input-dependent
    • DMR, event-driven simulation, graph reduction, …
  – don’t-care non-determinism
    • has nothing to do with concurrency
    • DMR, graph reduction
  – activities created dynamically may interfere with existing activities
    • event-driven simulation, …
• Data structures:
  – relatively few algorithms use dense arrays
  – more common: graphs, trees, lists, priority queues, …

Organization of talk
• Seemingly unrelated parallel algorithms and data structures
  – Stencil codes
  – Delaunay mesh refinement
  – Event-driven simulation
  – Graph reduction of functional languages
  – …
• Unifying abstractions
  – Amorphous data-parallelism
  – Baseline parallel implementation for exploiting amorphous data-parallelism
• Specialized implementations that exploit structure
  – Structure of algorithms
  – Optimized compiler and runtime system support for different kinds of structure
• Ongoing work

Requirements
• Provide a model of parallelism in irregular algorithms
• Unified treatment of parallelism in regular and irregular algorithms
  – parallelism in regular algorithms must emerge as a special case of general model
  – (cf.) correspondence principles in Physics
• Abstractions should be effective
  – should be possible to write an interpreter to execute algorithms in parallel

Traditional abstraction
• Computation graph
  – nodes are computations
  – edges are dependences
• Parallelism
  – width of the computation graph
• Effective parallel computation graph model
  – dataflow model of Dennis, Arvind et al.
• Inadequate for irregular applications
  – dependences between computations are a function of input data
  – don’t-care non-determinism
  – conflicting work may be created dynamically
  – …
• Data structures play almost no role in this abstraction
  – in most programs, parallelism comes from data-parallelism (concurrent operations on data structure elements)
• New abstraction
  – data-centric: data structures play a central role
  – we will use graph ADT to illustrate concepts

Operator formulation of algorithms
• Algorithm = repeated application of operator to graph
  – active element:
    • node or edge where operator is applied
      – Jacobi: interior nodes of mesh
      – DMR: nodes representing bad triangles
      – Event-driven simulation: station with incoming message
  – neighborhood:
    • set of nodes and edges read/written to perform computation
      – Jacobi: nodes in stencil
      – DMR: cavity of bad triangle
      – Event-driven simulation: station
    • usually distinct from neighbors in graph
  – ordering:
    • order in which active elements must be executed in a sequential implementation
      – any order (Jacobi, DMR, graph reduction)
      – some problem-dependent order (event-driven simulation)
[Figure: graph with active nodes i1 to i5 and their neighborhoods]

Parallelism
• Amorphous data-parallelism:
  – parallelism in processing active nodes subject to
    • neighborhood constraints
    • ordering constraints
• Computations at two active elements are independent if
  – Neighborhoods do not overlap
  – More generally, neither of them writes to an element in the intersection of the neighborhoods
• Unordered active elements
  – In principle, independent active elements can be processed in parallel
  – How do we find independent active elements?
• Ordered active elements
  – Independence is not enough since elements can become active dynamically (see example)
  – How do we determine what is safe to execute in parallel?
• How do we make this model effective?
[Figure: graph with active nodes i1 to i5; event-driven example with stations A, B, C and time-stamped messages]
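The independence condition on this slide can be written down directly as a predicate over read and write sets. The sketch below is only an illustration: the `Activity` class and its field names are hypothetical, and graph elements are represented by plain strings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Activity:
    """Application of an operator at one active element (hypothetical type)."""
    name: str
    reads: frozenset    # graph elements the operator reads
    writes: frozenset   # graph elements the operator writes

    @property
    def neighborhood(self):
        return self.reads | self.writes


def independent(a, b):
    """True if neither activity writes an element shared by both neighborhoods."""
    shared = a.neighborhood & b.neighborhood
    return not (shared & a.writes) and not (shared & b.writes)
```

Non-overlapping neighborhoods are the special case in which `shared` is empty. Two activities that only read a common element also pass the test.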

Galois programming model (PLDI 2007)
• Program written in terms of abstractions in model
• Programming model: sequential, OO
• Graph class: provided by Galois library
  – specialized versions to exploit structure (see later)
• Galois set iterators: for iterating over unordered and ordered sets of active elements
  – for each e in Set S do B(e)
    • evaluate B(e) for each element in set S
    • no a priori order on iterations
    • set S may get new elements during execution
  – for each e in OrderedSet S do B(e)
    • evaluate B(e) for each element in set S
    • perform iterations in order specified by OrderedSet
    • set S may get new elements during execution

DMR using Galois iterators:

    Mesh m = /* read in mesh */
    Set ws;
    ws.add(m.badTriangles()); // initialize ws
    for each tr in Set ws do { // unordered Set iterator
      if (tr no longer in mesh) continue;
      Cavity c = new Cavity(tr);
      c.expand();
      c.retriangulate();
      m.update(c);
      ws.add(c.badTriangles()); // bad triangles
    }
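The sequential semantics of the unordered set iterator can be sketched in a few lines. This is not the Galois library API; `for_each` and `refine` are hypothetical names, and the example replaces the triangle mesh with a one-dimensional toy analog in which a "bad" element is a segment longer than `max_len` and refinement splits it at its midpoint.

```python
def for_each(initial, body):
    """Apply body to every element of a work-set, in no particular order.

    body(item) returns an iterable of new active elements, so the set may
    grow while it is being iterated over.
    """
    workset = set(initial)
    while workset:
        item = workset.pop()          # arbitrary element: no a priori order
        workset.update(body(item))


def refine(segments, max_len):
    """Toy 1D analog of mesh refinement on a set of (lo, hi) segments."""
    mesh = set(segments)

    def split(seg):
        if seg not in mesh:           # element no longer in mesh: skip it
            return []
        lo, hi = seg
        mid = (lo + hi) / 2
        mesh.remove(seg)
        halves = [(lo, mid), (mid, hi)]
        mesh.update(halves)
        # Refinement may create new bad elements, which become active.
        return [s for s in halves if s[1] - s[0] > max_len]

    for_each([s for s in mesh if s[1] - s[0] > max_len], split)
    return sorted(mesh)
```

In this toy the final result happens not to depend on the processing order; in DMR it does, which is the don’t-care non-determinism the iterator is designed to permit.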

Galois parallel execution model
• Parallel execution model:
  – shared-memory
  – optimistic execution of Galois iterators
• Implementation:
  – master thread begins execution of program
  – when it encounters iterator, worker threads help by executing iterations concurrently
  – barrier synchronization at end of iterator
• Independence of neighborhoods:
  – software TM variety
  – logical locks on nodes and edges
• Ordering constraints for ordered set iterator:
  – execute iterations out of order but commit in order
  – cf. out-of-order CPUs
[Figure: master thread running main() with a for-each iterator; program threads operating on a graph with active nodes i1 to i5 in shared memory]

Parameter tool (PPoPP 2009)
• Idealized execution model:
  – unbounded number of processors
  – applying operator at an active node takes one time step
  – execute a maximal set of active nodes, subject to neighborhood and ordering constraints
• Measures amorphous data-parallelism in irregular program execution
• Useful as an analysis tool

Example: DMR
• Input mesh:
  – Produced by Triangle (Shewchuk)
  – 550K triangles
  – Roughly half are badly shaped
• Available parallelism:
  – How many non-conflicting triangles can be expanded at each time step?
• Parallelism intensity:
  – What fraction of the total number of bad triangles can be expanded at each step?
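The idealized execution model and the two measures defined on this slide can be illustrated with a small sketch. It makes strong simplifying assumptions that the real tool does not: activities are unordered, their neighborhoods are known up front as sets, processing an activity creates no new work, and the maximal set is chosen greedily in sorted name order. The function name `parallelism_profile` is hypothetical.

```python
def parallelism_profile(neighborhoods):
    """Return [(available_parallelism, intensity), ...], one pair per time step.

    neighborhoods: dict mapping an activity name to the set of graph
    elements it touches.
    """
    pending = dict(neighborhoods)
    profile = []
    while pending:
        claimed, chosen = set(), []
        # Greedy scan builds a maximal (not necessarily maximum) set of
        # activities whose neighborhoods do not overlap.
        for act in sorted(pending):
            if claimed.isdisjoint(pending[act]):
                chosen.append(act)
                claimed |= pending[act]
        # Intensity: fraction of the currently active elements executed now.
        profile.append((len(chosen), len(chosen) / len(pending)))
        for act in chosen:
            del pending[act]
    return profile
```

With `{"t1": {1, 2}, "t2": {2, 3}, "t3": {4}, "t4": {3, 4}}` the first step executes `t1` and `t3` (two of four active elements), and the remaining two activities conflict on element 3, so they take one step each.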

Examples
• Boruvka MST algorithm
  – Builds MST bottom-up
  – Unordered active elements
• Agglomerative clustering (AC)
  – Data-mining algorithm
  – Ordered active elements
• Similarity in parallelism profiles arises from similarity in algorithmic structure
[Figures: parallelism profiles. Boruvka: 10K node graph, avg degree 5. AC: 20K random points in 2D]
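For readers unfamiliar with Boruvka's algorithm, the following is a minimal sequential sketch, assuming an edge list of `(weight, u, v)` tuples over nodes numbered from 0 and a union-find structure for components; the function name `boruvka_mst` is hypothetical. The active elements are the components: in each round every component selects its cheapest outgoing edge, and those selections do not have to happen in any particular order.

```python
def boruvka_mst(num_nodes, edges):
    """Return the set of edges of a minimum spanning forest.

    edges: list of distinct (weight, u, v) tuples for an undirected graph.
    """
    parent = list(range(num_nodes))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    mst = set()
    while True:
        cheapest = {}  # component root -> cheapest edge leaving the component
        for edge in edges:
            _, u, v = edge
            ru, rv = find(u), find(v)
            if ru == rv:
                continue
            for root in (ru, rv):
                # Tuple comparison breaks weight ties consistently.
                if root not in cheapest or edge < cheapest[root]:
                    cheapest[root] = edge
        if not cheapest:
            break
        for w, u, v in cheapest.values():
            ru, rv = find(u), find(v)
            if ru != rv:          # the edge may already have been contracted
                parent[ru] = rv
                mst.add((w, u, v))
    return mst
```

Merging two components along the chosen edge is an edge contraction, which is why the algorithm builds the tree bottom-up.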

Summary
• Old abstraction: computation graphs
• New abstraction: operator formulation of algorithms
  – active elements
  – neighborhoods
  – ordering of active elements
• Amorphous data-parallelism
  – generalizes conventional data-parallelism
• Baseline execution model
  – Galois programming model
    • sequential, OO
    • uses new abstractions
  – optimistic parallel execution
• Parameter tool
  – provides estimates of amorphous data-parallelism in programs written using Galois programming model

Organization of talk
• Seemingly unrelated parallel algorithms and data structures
  – Stencil codes
  – Delaunay mesh refinement
  – Event-driven simulation
  – Graph reduction of functional languages
  – …
• Unifying abstractions
  – Operator formulation of algorithms
  – Amorphous data-parallelism
  – Galois programming model
  – Baseline parallel implementation
• Specialized implementations that exploit structure
  – Structure of algorithms
  – Optimized compiler and runtime system support for different kinds of structure
• Ongoing work

Key idea
• Baseline implementation is general but usually inefficient
  – (e.g.) dynamic scheduling of iterations is not needed for Jacobi since grid structure is known at compile-time
  – (e.g.) hand-written parallel implementations of DMR and Jacobi do not buffer updates to neighborhood until commit point
• Efficient execution requires exploiting structure in algorithms and data structures
• How do we talk about structure in algorithms?
  – Previous approaches: like descriptive biology
    • Mattson et al. book
    • Parallel programming patterns (PPP): Snir et al.
    • Berkeley dwarfs
    • …
  – Our approach: like molecular biology
    • based on amorphous data-parallelism framework

Algorithm abstractions
• iterative algorithms
  – topology
    • general graph
    • grid
    • tree
  – operator
    • morph: modifies structure of graph
    • local computation: only updates values on nodes/edges
    • reader: does not modify graph in any way
  – ordering
    • unordered
    • ordered
• Jacobi: topology: grid, operator: local computation, ordering: unordered
• DMR, graph reduction: topology: graph, operator: morph, ordering: unordered
• Event-driven simulation: topology: graph, operator: local computation, ordering: ordered

Morphs
• morph operator
  – refinement: DMR, Prim MST, Barnes-Hut tree building
  – coarsening
    • edge contraction: Metis, Kruskal MST, Boruvka MST, AC
    • node elimination: sparse Cholesky factorization
    • sub-graph elimination: elimination-based dataflow analysis
    • …
  – general: graph reduction
[Figures: edge contraction (nodes u and v merged into uv); node elimination]
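Edge contraction, one of the coarsening morphs listed here, can be sketched on a simple undirected graph stored as a dict of adjacency sets. The representation and the function name `contract_edge` are assumptions for illustration; the sketch keeps `u` as the name of the merged node (the figure labels it uv) and assumes the graph has no self-loops.

```python
def contract_edge(adj, u, v):
    """Contract the undirected edge (u, v) in place by merging v into u.

    adj: dict mapping each node to the set of its neighbors.
    """
    if v not in adj[u]:
        raise ValueError("no edge between u and v")
    # Remove v from the graph and redirect its edges to u.
    for w in adj.pop(v):
        adj[w].discard(v)
        if w != u:
            adj[w].add(u)
            adj[u].add(w)
    return adj
```

The neighborhood of this operator is `u`, `v`, and all neighbors of `v`, which is why two contractions can run in parallel only when those sets do not conflict.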

Reducing Overheads of Optimistic Parallel Execution 26


Graph partitioning (ASPLOS 2008)
• Algorithm structure:
  – general graph/grid + unordered active elements
• Optimization I:
  – partition the graph/grid and work-set between cores
  – data-centric work assignment: core gets active elements from its own partition
• Pros and cons:
  – eliminates contention for worklist
  – improves locality and can dramatically reduce conflicts
  – dynamic load-balancing may be needed
• Optimization II:
  – lock coarsening: associate logical locks with partitions, not graph elements
  – reduces overhead of lock management
• Over-decomposition may improve core utilization
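The two optimizations on this slide can be illustrated with a small sketch, assuming a precomputed map `partition_of` from graph node to partition id (how the partition is computed is outside the sketch). Both function names are hypothetical.

```python
from collections import defaultdict


def assign_work(active, partition_of):
    """Data-centric work assignment: one worklist per partition.

    Each active element goes to the worklist of the partition that owns it,
    so a core only takes work from its own partition.
    """
    worklists = defaultdict(list)
    for node in active:
        worklists[partition_of[node]].append(node)
    return dict(worklists)


def locks_needed(neighborhood, partition_of):
    """Lock coarsening: lock the partitions a neighborhood touches,
    rather than every graph element in it."""
    return {partition_of[node] for node in neighborhood}
```

An activity whose neighborhood lies inside a single partition needs one lock regardless of how many elements it touches; only neighborhoods that straddle a partition boundary can conflict with work on another core.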

Zero-copy implementation
• Cautious operator:
  – reads all the elements in its neighborhood before modifying any of them
  – (e.g.) Delaunay mesh refinement
• Algorithm structure:
  – cautious operator + unordered active elements
• Optimization: optimistic execution w/o buffering updates
  – grab locks on elements during read phase
    • conflict: someone else has lock, so release your locks
  – once update phase begins, no new locks will be acquired
    • update in-place w/o making copies
  – note: this is not two-phase locking
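The locking discipline described here can be sketched with Python's `threading.Lock`. This is an illustration under stated assumptions, not the Galois runtime: the `Node` class and `run_cautious` are hypothetical, the neighborhood is passed in as a list of distinct nodes (a real cautious operator discovers it while reading), and the caller is responsible for retrying an activity that reports a conflict.

```python
import threading


class Node:
    """Graph element with a value and a logical lock (hypothetical type)."""

    def __init__(self, value):
        self.value = value
        self.lock = threading.Lock()


def run_cautious(neighborhood, update):
    """Try to run one activity; return False if it hit a conflict.

    Read phase: acquire every lock without blocking. On conflict, release
    whatever was acquired; nothing has been modified, so no undo is needed.
    Update phase: no new locks are acquired and updates happen in place.
    """
    acquired = []
    for node in neighborhood:
        if node.lock.acquire(blocking=False):
            acquired.append(node)
        else:
            for held in acquired:
                held.lock.release()
            return False
    try:
        update(neighborhood)
    finally:
        for node in acquired:
            node.lock.release()
    return True
```

Because a conflict can only be detected before the first write, an aborted activity leaves the graph untouched, which is what removes the need to buffer updates or keep copies.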

Delaunay mesh refinement
• Algorithm structure:
  – general graph/grid + cautious operator + unordered active elements
• Optimizations:
  – partitioning + lock-coarsening + zero-buffering
  – very efficient implementations possible
• Maverick@TACC
  – 128-core Sun Fire E25K, 1.05 GHz
  – 64 dual-core processors
  – Sun Solaris
• Speed-up of 20 on 32 cores for refinement
• Mesh partitioning is still sequential
  – time for mesh partitioning starts to dominate after 8 processors (32 partitions)
  – Need parallel mesh partitioning

Survey propagation on Maverick
• SP is a heuristic for solving difficult SAT problems
• SP: general graph + cautious operator + unordered elements
• Implementation:
  – partitioning
  – lock coarsening
  – zero-buffering
[Figure: Survey propagation on Maverick (roughly 1500 clauses, 250-500 variables)]

Eliminating the Need for Optimistic Parallel Execution 31


Scheduling
• Baseline implementation
  – autonomous scheduling: no coordination between execution of different active elements
• Global coordination possible for some algorithms
  – Run-time scheduling: cautious operator + unordered active elements
    • execute all activities partially to determine neighborhoods
    • create interference graph and find independent set of activities
    • execute independent set of activities in parallel w/o synchronization
    • used in Gary Miller’s implementation of DMR
  – Just-in-time scheduling: local computation + structure-driven + cautious, unordered (e.g.) sparse MVM
    • Inspector-executor approach
  – Compile-time scheduling: previous case + graph is known at compile-time (e.g.) Jacobi
    • make all scheduling decisions at compile-time
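One possible reading of the inspector-executor approach for sparse matrix-vector multiplication is sketched below, assuming the matrix is given in coordinate (COO) form as parallel lists `rows`, `cols`, `vals`. The function names are hypothetical. The inspector runs once, when the sparsity structure becomes known, and groups the nonzeros by output row; the executor then reuses that schedule for every multiplication.

```python
def build_schedule(rows, num_rows):
    """Inspector: group nonzero indices by the output row they update."""
    schedule = [[] for _ in range(num_rows)]
    for k, r in enumerate(rows):
        schedule[r].append(k)
    return schedule


def execute(schedule, cols, vals, x):
    """Executor: compute y = A x using a precomputed schedule.

    Each task writes exactly one entry of y, so the tasks are independent
    and could be run in parallel without synchronization (this sketch runs
    them one after another).
    """
    return [sum(vals[k] * x[cols[k]] for k in task) for task in schedule]
```

The cost of the inspector is paid once and amortized over repeated executions, which fits iterative solvers where the same matrix is multiplied by many vectors.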

Ongoing work
• Algorithm studies:
  – divide-and-conquer algorithms
  – transforming ordered algorithms into unordered algorithms
  – intra-operator parallelism
    • important for some algorithms on dense graphs
• Language/programming model:
  – locality
  – incorporating scheduling information into Galois
    • program refinements?
• Compiler analysis
  – analyze and optimize code for operators
• Runtime system
  – adaptive control system for managing threads
• Application studies
  – Case studies of hand-optimized codes
    • understand hand optimizations
    • figure out how to incorporate them into system
  – Lonestar benchmark suite for irregular programs
    • joint work with Calin Cascaval’s group at IBM Yorktown Heights

Acknowledgements (in alphabetical order)
• Kavita Bala (Cornell)
• Martin Burtscher (UT Austin)
• Patrick Carribault (UT Austin)
• Calin Cascaval (IBM)
• Paul Chew (Cornell)
• Amber Hassaan (UT Austin)
• Tony Ingraffea (Cornell)
• Milind Kulkarni (UT Austin)
• Mario Mendez (UT Austin)
• Rajasekhar Inkulu (UT Austin)
• Donald Nguyen (UT Austin)
• Dimitrios Prountzos (UT Austin)
• Ganesh Ramanarayanan (Microsoft)
• Xin Sui (UT Austin)
• Bruce Walter (Cornell)
• Zifei Zhong (UT Austin)

Science of Parallel Programming
• Seemingly unrelated algorithms
• Unifying abstractions
• Specialized models that exploit structure