Frédéric Gava Bulk-Synchronous Parallel ML Implementation of the Parallel Superposition
Background Parallel programming Implicit Automatic parallelization BSML skeletons Data-parallelism Explicit Parallel extensions Concurrent programming
Projects 2002-2004: ACI Grid (LIFO, LACL, PPS, INRIA). Design of parallel and Grid libraries for OCaml. 2004-2007: ACI « Young researchers » (LIFO, LACL). Production of a programming environment in which certified parallel programs can be written and safely executed.
Outline I. The BSML language II. Multi-programming (superposition) III. Implementation of the superposition IV. Conclusion and future works
The BSML language
The BSML « spirit » Bugs grow faster than Moore's law. (G. Berry) • High-level language: fewer lines of code, hence fewer bugs • Certified library: fewer bugs. Small is beautiful. (R. H. Bisseling) • BSML uses only 5 primitives… Who would drive a non-deterministic car? (G. Berry) • Confluence property of the BSML semantics. French proverb: « All roads lead to Rome », but the best way is to choose the shortest one • One can give BSP costs to BSML programs • Differs from concurrent programming: cost and confluence
The BSP model BSP architecture: [Figure: processor/memory pairs (P/M) connected by a network and a unit of synchronization] Characterized by: p, number of processors; r, processor speed; L, global synchronization; g, phase of communication (at most 1 word sent or received by each processor)
Model of execution [Figure: super-steps i and i+1. Each super-step begins with local computing on each processor ($w_i$), followed by global (collective) communications between processors ($h_i\, g$) and a global synchronization ($L$); exchanged data is available for the next super-step.] $\mathrm{Cost}(i) = \left(\max_{0 \le x < p} w_x^i\right) + h_i\, g + L$
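The cost formula of this slide can be evaluated mechanically. The following is a minimal sketch, assuming hypothetical names (`bsp`, `superstep_cost`, `program_cost`) that are not part of BSML; local work is given as one float per processor and `h` is the size of the h-relation.

```ocaml
(* BSP parameters as named on the slides: p processors, g, and l for L.
   The record and function names are illustrative assumptions. *)
type bsp = { p : int; g : float; l : float }

(* Cost of one super-step: max local work + h * g + L. *)
let superstep_cost m ~w ~h =
  if Array.length w <> m.p then invalid_arg "superstep_cost: need p work values";
  let wmax = Array.fold_left max 0. w in
  wmax +. float_of_int h *. m.g +. m.l

(* Cost of a program: sum of the costs of its super-steps. *)
let program_cost m steps =
  List.fold_left (fun acc (w, h) -> acc +. superstep_cost m ~w ~h) 0. steps
```

For instance, with p = 4, g = 2 and L = 10, a super-step whose local works are 3, 5, 1 and 4 and whose h-relation has size 6 costs 5 + 12 + 10 = 27.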
Example: broadcast. Direct broadcast (one super-step): BSP cost = $p\, n\, g + L$. Broadcast with 2 super-steps: BSP cost = $2\, n\, g + 2\, L$
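To compare the two broadcast costs given on this slide, here is a small sketch that encodes both formulas as stated; the function names are hypothetical, and n is taken to be the size of the broadcast value.

```ocaml
(* Direct broadcast, one super-step: p * n * g + L (formula from the slide). *)
let direct_bcast ~p ~n ~g ~l = float_of_int (p * n) *. g +. l

(* Broadcast in two super-steps: 2 * n * g + 2 * L (formula from the slide). *)
let two_phase_bcast ~n ~g ~l = float_of_int (2 * n) *. g +. 2. *. l

(* True when the two-super-step version is predicted to be cheaper. *)
let two_phase_wins ~p ~n ~g ~l =
  two_phase_bcast ~n ~g ~l < direct_bcast ~p ~n ~g ~l
```

From the two formulas, the two-super-step version is cheaper exactly when (p - 2) n g > L, that is, for enough processors or large enough values.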
The BSML language [Diagram labels: λ-calculus, ML; parallel primitives, parallel constructions; BSλ-calculus, BSML] Structured parallelism as an explicit parallel extension of ML. Functional language with BSP cost predictions. Allows the implementation of skeletons. Implemented as a parallel library for the "Objective Caml" language, using a parallel data structure called parallel vector
A BSML program [Figure labels: replicated part; parallel part ($f_0, f_1, \dots, f_{p-1}$ and $g_0, g_1, \dots, g_{p-1}$); sequential part]
Parallel primitives of BSML Asynchronous primitives: creation of a vector (creation of local values), mkpar : (int → α) → α par; parallel point-wise application, apply : (α → β) par → α par → β par. Synchronous and communication primitives: communications, put : (int → α) par → (int → α) par; projection of local values (to be replicated), proj : α par → (int → α)
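To help read the types of the four primitives, here is a purely sequential toy model in which a parallel vector is an array with one cell per processor. This is an assumption made for illustration, not the BSML library: there is no parallelism, no communication and no synchronization barrier, and the fixed value returned by `bsp_p` is arbitrary.

```ocaml
(* Toy sequential model, NOT the BSML library. *)
let bsp_p () = 4                      (* arbitrary number of processors *)

type 'a par = 'a array                (* one cell per processor *)

(* mkpar f builds the vector <f 0, ..., f (p-1)>. *)
let mkpar (f : int -> 'a) : 'a par = Array.init (bsp_p ()) f

(* apply <f_0, ...> <v_0, ...> = <f_0 v_0, ...>, point-wise. *)
let apply (fs : ('a -> 'b) par) (vs : 'a par) : 'b par =
  Array.init (bsp_p ()) (fun i -> fs.(i) vs.(i))

(* put: on processor i, fs.(i) j is the value sent to processor j.
   In the result, processor j reads what i sent to it with (r.(j) i). *)
let put (fs : (int -> 'a) par) : (int -> 'a) par =
  Array.init (bsp_p ()) (fun j -> fun i -> fs.(i) j)

(* proj turns a vector into a replicated function from pids to values. *)
let proj (vs : 'a par) : int -> 'a = fun i -> vs.(i)
```

In this model, `proj (mkpar (fun i -> i * i)) 3` evaluates to 9, since processor 3 holds 3 * 3.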
Semantics Natural semantics: programming model, easy for proofs (Coq). Small-step semantics: easy for costs. Distributed semantics: execution model, makes asynchronous steps appear, close to a real implementation
Natural semantics • Semantics = set of axioms and inference rules • Easy to understand, makes proofs easier • Example: [inference rule shown as an image on the slide]
Small-step semantics • Semantics = set of rewriting rules • Using contexts for the strategy • Easier understanding of costs and errors • Example: [rewriting rule shown as an image on the slide, annotated with global cost and local costs]
Distributed semantics Semantics = set of parallel rewriting rules SPMD style: [Figure: the same program Prog is replicated on every processor, each copy holding its part of the parallel vector; natural evaluation versus distributed evaluation.] Prog:
    let scan op vec =
      let rec scan' fst lst op vec =
        if fst >= lst then vec
        else
          let mid = (fst + lst) / 2 in
          let vec' = mix mid (super (fun () -> scan' fst mid op vec)
                                    (fun () -> scan' (mid+1) lst op vec)) in
          let com = ... (* send wm to processes m+1…p-1 *) in
          let op' = ... (* applies op to wm and wi, m<i<p *) in
          parfun2 op' com vec'
      in scan' 0 (bsp_p () - 1) op vec
Multi-programming
Parallel composition Several programs on the same machine. Primitive of parallel composition: superposition. Divide-and-conquer BSP algorithms
Parallel Superposition super : (unit → α) → (unit → β) → α * β; super E1 E2 = (E1 (), E2 ()). Fusion of communications/synchronisations using super-threads. Keeps the BSP model. Pure functional semantics
Parallel Superposition [figure-only slide]
Implementation of the superposition
Semantics (1) Natural semantics, small-step semantics, and the solution, the super-threads [rules shown as images on the slide]
Semantics (2) Management of the communications and management of the superposition [rules shown as images on the slide]
Semantics-based implementation The semantics exposes 3 low-level primitives: Send, to send the data of the communication environment; Rcv, to receive them; Wait, to let a super-thread wait for its sibling. BSML primitives are thus simple calls to them (as in the small-step semantics). Super-threads can be implemented using threads. A scheduler for these threads is thus needed for the special management of our super-threads. The communication environment is just a hash table with the pids of the super-threads as keys
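The slide only states that the communication environment is a hash table keyed by super-thread pid. The sketch below is one hypothetical way to organise such a table with OCaml's standard `Hashtbl`; the message type and the names `register_msg` and `fuse_and_clear` are assumptions for illustration and are not the Send, Rcv and Wait primitives of the actual implementation.

```ocaml
(* Hypothetical message: a destination processor and a payload. *)
type msg = { dst : int; payload : string }

(* Communication environment: super-thread pid -> pending messages. *)
let env : (int, msg list) Hashtbl.t = Hashtbl.create 16

(* A super-thread records a message it wants sent at the next barrier. *)
let register_msg ~tid m =
  let pending = Option.value (Hashtbl.find_opt env tid) ~default:[] in
  Hashtbl.replace env tid (m :: pending)

(* At the barrier, the messages of all super-threads are gathered into a
   single exchange, tagged with their origin, and the table is emptied. *)
let fuse_and_clear () =
  let all =
    Hashtbl.fold
      (fun tid ms acc -> List.map (fun m -> (tid, m)) ms @ acc)
      env []
  in
  Hashtbl.reset env;
  all
```

Gathering every pending message before a single exchange is what would allow the communications and synchronisations of several super-threads to be fused into one BSP super-step.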
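Example: prefix computation. scan : (α → α → α) → α par → α par. $\mathrm{scan}\ (+)\ \langle v_0, \dots, v_{p-1} \rangle = \langle v_0,\ v_0+v_1,\ \dots,\ v_0+v_1+\dots+v_{p-1} \rangle$. Divide and conquer: $\mathrm{scan}\ (+)\ \langle v_0, \dots, v_m, \dots \rangle = \langle w_0, \dots, w_m, \dots \rangle$ and $\mathrm{scan}\ (+)\ \langle \dots, v_{m+1}, \dots, v_{p-1} \rangle = \langle \dots, w_{m+1}, \dots, w_{p-1} \rangle$; then $\langle w_0, \dots, w_m,\ w_m+w_{m+1},\ \dots,\ w_m+w_{p-1} \rangle = \langle v_0,\ v_0+v_1,\ \dots,\ v_0+\dots+v_{m+1},\ \dots,\ v_0+\dots+v_{p-1} \rangle$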
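The divide-and-conquer scheme of this example can be traced with a sequential toy model, in which a parallel vector is an array and `super` simply evaluates its two arguments one after the other. This is a sketch under those assumptions, not the BSML implementation: `super` and `mix` are local stand-ins written for this example, and the communication step is replaced by a direct array read.

```ocaml
(* Stand-in for superposition: evaluates both thunks, returns the pair. *)
let super e1 e2 =
  let a = e1 () in
  let b = e2 () in
  (a, b)

(* Stand-in for mix: cells 0..m come from v1, the others from v2. *)
let mix m (v1, v2) =
  Array.init (Array.length v1) (fun i -> if i <= m then v1.(i) else v2.(i))

let scan op vec =
  let rec scan' fst lst vec =
    if fst >= lst then vec
    else
      let mid = (fst + lst) / 2 in
      (* the two halves are independent sub-problems *)
      let vec' =
        mix mid (super (fun () -> scan' fst mid vec)
                       (fun () -> scan' (mid + 1) lst vec)) in
      (* "send w_m to the right half and combine": op w_m w_i, mid < i <= lst *)
      let wm = vec'.(mid) in
      Array.mapi (fun i w -> if mid < i && i <= lst then op wm w else w) vec'
  in
  scan' 0 (Array.length vec - 1) vec
```

With this model, `scan (+) [| 1; 2; 3; 4; 5 |]` evaluates to `[| 1; 3; 6; 10; 15 |]`; the operator is applied as `op wm w`, so the order is also right for a non-commutative operator such as string concatenation.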
Benchmarks [Figure: time (s) versus size of the polynomials, for three methods: direct method (BSML+MPI), D-a-C method with superposition, D-a-C method with juxtaposition]
Conclusion and future works
Conclusion BSML = BSP + ML. Superposition = primitive of parallel composition. Small-step semantics of the superposition. Distributed semantics given as a small-step one. Superposition implemented using threads, as in the small-step semantics
Future works Implementation using continuations (source-code transformation with the help of a type checker) and proof of equivalence using our semantics. Implementation of bigger algorithms for better benchmarks of BSML and its superposition. Implementation of parallel skeletons (management of tasks) using the superposition? BSP model-checking of high-level Petri nets (M-nets). The main difficulty: finding a non-trivial algorithm, as the concurrent programming community does. Possible, but needs more theoretical optimisations…
Thanks for your attention