SIAM Parallel Processing 2006 Feb 22 Mini Symposium
Mini Symposium: Adaptive Algorithms for Scientific Computing

• Adaptive, hybrid, oblivious: what do these terms mean?
• Taxonomy of autonomic computing [Ganek & Corbi 2003]:
  – self-configuring / self-healing / self-optimizing / self-protecting
• Objective: towards an analysis based on algorithm performance

Program:
• 9:45  Adaptive algorithms: theory and applications. Jean-Louis Roch et al., AHA Team, INRIA-CNRS Grenoble, France
• 10:15 Hybrids in exact linear algebra. Dave Saunders, U. Delaware, USA
• 10:45 Adaptive programming with hierarchical multiprocessor tasks. Thomas Rauber and Gudula Rünger, U. Bayreuth, Germany
• 11:15 Cache-oblivious algorithms. Michael Bender, Stony Brook U., USA
Adaptive algorithms: theory and applications
Van Dat Cung, Jean-Guillaume Dumas, Thierry Gautier, Guillaume Huard, Bruno Raffin, Jean-Louis Roch, Denis Trystram
IMAG-INRIA workgroup on "Adaptive and Hybrid Algorithms", Grenoble, France

Contents
I. Some criteria to analyze adaptive algorithms
II. Work-stealing and adaptive parallel algorithms
III. Adaptive parallel prefix computation
Why adaptive algorithms, and how?

• Resource availability is versatile → measures on resources
• Input data vary → measures on data
• Adaptation to improve performance:
  – Scheduling: partitioning, load-balancing, work-stealing
  – Choices in the algorithm: sequential / parallel(s), approximate / exact, in-memory / out-of-core, …
  – Calibration: tuning parameters (block size, cache, choice of instructions, …), priority management

An algorithm is "hybrid" iff there is a high-level choice between at least two algorithms, each of which could solve the same problem.
Modeling a hybrid algorithm

• Several algorithms solve the same problem f:
  – e.g. algo_f1, algo_f2(block size), …, algo_fk
  – each algo_fi may itself be recursive:

    algo_fi(n, …) {
      …
      f(n - 1, …);
      …
      f(n / 2, …);
      …
    }

• Adaptation: choose algo_fj for each call to f.

E.g. "practical" hybrids:
• ATLAS, GOTO, FFPACK
• FFTW
• cache-oblivious B-trees
• any parallel program with scheduling support: Cilk, Athapascan/Kaapi, NESL, TLib, …
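As a toy illustration of this model (a sketch; `f`, `choose_algo`, `algo_f1`, `algo_f2` are invented names, not from any of the cited libraries), each call to `f` dispatches to one of two interchangeable recursive algorithms, exactly as in the pseudocode above:

```python
def algo_f1(n):
    # decrease-and-conquer variant: recurses on n - 1
    if n <= 1:
        return 1
    return f(n - 1) + 1

def algo_f2(n):
    # divide-and-conquer variant: recurses on halves
    if n <= 1:
        return 1
    return f(n // 2) + f(n - n // 2)

def choose_algo(n):
    # adaptation point: in a real hybrid the choice could depend on
    # cache size, number of processors, input properties, ...
    return algo_f2 if n > 8 else algo_f1

def f(n):
    # every call to f re-decides which algorithm solves it
    return choose_algo(n)(n)
```

Both variants compute the same result (here, simply n), so the chooser is free to mix them at every level of the recursion.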
• How to manage the overhead due to choices?
• Classification 1/2:
  – Simple hybrid iff O(1) choices [e.g. block size in ATLAS, …]
  – Baroque hybrid iff an unbounded number of choices [e.g. recursive splitting factors in FFTW]
• Choices are either dynamic or pre-computed, based on input properties.
• Choices may or may not be based on architecture parameters.
• Classification 2/2: a hybrid is
  – Oblivious: the control flow depends neither on static properties of the resources nor on the input [e.g. cache-oblivious algorithms [Bender]]
  – Tuned: strategic choices are based on static parameters [e.g. block size w.r.t. cache, granularity]
    • engineered-tuned or self-tuned [e.g. ATLAS and GOTO libraries, FFTW; LinBox/FFLAS [Saunders et al.]]
  – Adaptive: dynamic self-configuration of the algorithm, based on input properties or resource circumstances discovered at run-time [e.g. idle processors, data properties; TLib [Rauber & Rünger]]
Examples

• BLAS libraries:
  – ATLAS: simple tuned (self-tuned)
  – GOTO: simple tuned (engineered-tuned)
  – LinBox / FFLAS: simple self-tuned, adaptive [Saunders et al.]
• FFTW:
  – halving factor: baroque tuned
  – stopping criterion: simple tuned
• Parallel algorithms and scheduling:
  – choice of parallel degree: e.g. TLib [Rauber & Rünger]
  – work-stealing schedule: baroque hybrid
Adaptive algorithms: theory and applications
Van Dat Cung, Jean-Guillaume Dumas, Thierry Gautier, Guillaume Huard, Bruno Raffin, Jean-Louis Roch, Denis Trystram
INRIA-CNRS project on "Adaptive and Hybrid Algorithms", Grenoble, France

Contents
I. Some criteria to analyze adaptive algorithms
II. Work-stealing and adaptive parallel algorithms
III. Adaptive parallel prefix computation
Work-stealing (1/2)

Work W1 = # total operations performed
Depth W∞ = # operations on a critical path (= parallel time on infinitely many resources)

• Work-stealing = "greedy" schedule, but distributed and randomized
• Each processor manages locally the tasks it creates
• When idle, a processor steals the oldest ready task from a remote, non-idle victim processor (chosen at random)
Work-stealing (2/2)

Work W1 = # total operations performed
Depth W∞ = # operations on a critical path (= parallel time on infinitely many resources)

• Interests:
  – suited to heterogeneous architectures, with a slight modification [Bender-Rabin 02]
  – with good probability, near-optimal schedule on p processors with average speed Πave:
      Tp < W1/(p·Πave) + O(W∞/Πave)
  – NB: # successful steals = # task migrations < p·W∞ [Blumofe 98, Narlikar 01, Bender 02]
• Implementation: work-first principle [Cilk, Kaapi]
  – local parallelism is implemented by sequential function calls
  – restrictions to ensure validity of the default sequential schedule: series-parallel (Cilk), reference order (Kaapi)
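The stealing discipline above (owners pop their newest task, idle processors steal the oldest task of a random non-empty victim) can be sketched as a toy discrete-time simulation. This illustrates only the policy, not the Cilk or Kaapi implementation; all names and the flat task model are invented for the example:

```python
import random
from collections import deque

def work_steal_schedule(tasks, p, seed=0):
    """Toy simulation: `tasks` is a list of task durations (in ticks), all
    initially on worker 0's deque. Owners pop the newest task (LIFO);
    idle workers steal the oldest task (FIFO end) from a random non-empty
    victim. Returns the makespan in ticks."""
    rng = random.Random(seed)
    deques = [deque(tasks)] + [deque() for _ in range(p - 1)]
    remaining = [0] * p          # ticks left on each worker's current task
    time = 0
    while any(deques) or any(remaining):
        for w in range(p):
            if remaining[w] == 0:
                if deques[w]:
                    remaining[w] = deques[w].pop()          # local LIFO pop
                else:
                    victims = [v for v in range(p) if v != w and deques[v]]
                    if victims:                              # random victim,
                        v = rng.choice(victims)              # steal oldest task
                        remaining[w] = deques[v].popleft()
        time += 1
        for w in range(p):
            if remaining[w] > 0:
                remaining[w] -= 1
    return time
```

With four tasks of 3 ticks each, two workers finish in 6 ticks (optimal), while one worker needs the full 12: the schedule is greedy, and steals only happen on idleness.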
Work-stealing and adaptability

• Work-stealing allocates processors to tasks transparently to the application, with provable performance:
  – supports the addition of new resources
  – supports resilience of resources and fault tolerance (crash faults, network, …): checkpoint/restart mechanisms with provable performance [Porch, Kaapi, …]
• "Baroque hybrid" adaptation: an implicit dynamic choice between two algorithms:
  – a sequential (local) algorithm: depth-first (the default choice)
  – a parallel algorithm: breadth-first
  The choice is performed at runtime, depending on resource idleness.
• Well suited to applications where a fine-grain parallel algorithm is also a good sequential algorithm [Cilk]:
  – parallel divide & conquer computations
  – tree searching, branch & X, …
  → suited when both the sequential and the parallel algorithm perform (almost) the same number of operations
But often parallelism has a cost!

• Solution: mix a sequential and a parallel algorithm.
• Basic technique: run the parallel algorithm down to a certain "grain", then use the sequential one.
  – Problem: W∞ increases, and so do the number of migrations and the inefficiency.
• Work-preserving speed-up [Bini-Pan 94] = cascading [Jaja 92]: a careful interplay of both algorithms builds one with both W∞ small and W1 = O(Wseq):
  – divide the sequential algorithm into blocks;
  – each block is computed with the (non-optimal) parallel algorithm.
  – Drawback: sequential at coarse grain and parallel at fine grain.
• Adaptive granularity, the dual approach: parallelism is extracted at run-time from any sequential task.
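The basic fixed-grain technique can be sketched as follows; `GRAIN`, `par_sum` and `seq_sum` are illustrative names, and the reduction (a sum) stands in for any divide-and-conquer computation whose halves a work-stealing runtime could run on different processors:

```python
GRAIN = 1024  # tuning parameter: below this size, stay sequential

def seq_sum(a, lo, hi):
    # the plain sequential algorithm, used at the leaves
    s = 0
    for i in range(lo, hi):
        s += a[i]
    return s

def par_sum(a, lo, hi):
    # parallel (divide-and-conquer) structure down to the grain, then
    # switch to the sequential loop; the two recursive calls are the
    # tasks a scheduler could distribute
    if hi - lo <= GRAIN:
        return seq_sum(a, lo, hi)
    mid = (lo + hi) // 2
    return par_sum(a, lo, mid) + par_sum(a, mid, hi)
```

The drawback noted on the slide shows up directly: whatever `GRAIN` is chosen, it is sequential below the threshold and parallel above it, independently of how many processors are actually idle at run-time.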
Self-adaptive grain algorithm

Based on the work-first principle: always execute a sequential algorithm, to reduce parallelism overhead
⇒ use the parallel algorithm only if a processor becomes idle, by extracting parallelism from the sequential computation.

Hypothesis: two algorithms:
  – one sequential: SeqCompute
  – one parallel: LastPartComputation: at any time, it is possible to extract parallelism from the remaining computations of the sequential algorithm.

Examples:
  – iterated product [Vernizzi 05]
  – MPEG-4 / H.264 [Bernard 06]
  – gzip / compression [Kerfali 04]
  – prefix computation [Traoré 06]
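A minimal sequential sketch of this scheme, with invented names: `seq_compute` is the default loop and `extract_par` plays the role of LastPartComputation, splitting off the second half of the *remaining* iterations for a thief. Real runtimes such as Kaapi handle concurrent steal requests; that synchronization is omitted here:

```python
class AdaptiveSum:
    """Sum of a[lo:hi] whose remaining work can be stolen at any time."""

    def __init__(self, a, lo, hi):
        self.a, self.lo, self.hi = a, lo, hi
        self.result = 0

    def extract_par(self):
        """LastPartComputation: give away the second half of what is left,
        returning a new task for the thief (or None if too little remains)."""
        remaining = self.hi - self.lo
        if remaining < 2:
            return None
        mid = self.lo + remaining // 2
        stolen = AdaptiveSum(self.a, mid, self.hi)
        self.hi = mid            # the owner keeps only the first half
        return stolen

    def seq_compute(self):
        """SeqCompute: the default sequential loop over what is left."""
        while self.lo < self.hi:
            self.result += self.a[self.lo]
            self.lo += 1
        return self.result
```

Because the owner always runs the plain sequential loop, the parallelism overhead is paid only when a steal actually occurs, which is the point of the work-first principle.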
Adaptive algorithms: theory and applications
Van Dat Cung, Jean-Guillaume Dumas, Thierry Gautier, Guillaume Huard, Bruno Raffin, Jean-Louis Roch, Denis Trystram
INRIA-CNRS project on "Adaptive and Hybrid Algorithms", Grenoble, France

Contents
I. Some criteria to analyze adaptive algorithms
II. Work-stealing and adaptive parallel algorithms
III. Adaptive parallel prefix computation
Prefix computation: an example where parallelism always costs

π1 = a0*a1, π2 = a0*a1*a2, …, πn = a0*a1*…*an

• Sequential algorithm:
    for (i = 1; i <= n; i++) π[i] = π[i-1] * a[i];
  W1 = W∞ = n

• Parallel algorithm [Ladner-Fischer]: combine adjacent pairs a0*a1, a2*a3, …, an-1*an; recursively compute the prefix of size n/2 on these products (yielding π1, π3, …, πn); then one more product per even position yields π2, π4, …, πn-1.
  W∞ = 2·log n, but W1 = 2·n: twice as expensive as the sequential algorithm …
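Both algorithms of the slide can be written out directly (a sketch, generic in the operation `op`; the parallel version is expressed as a recursion whose pair products and fix-up step are the parts that would run in parallel):

```python
from operator import mul

def seq_prefix(a, op=mul):
    # sequential prefix: n-1 operations, but depth n-1 (W1 = W_inf)
    out = [a[0]]
    for x in a[1:]:
        out.append(op(out[-1], x))
    return out

def par_prefix(a, op=mul):
    # Ladner-Fischer scheme: combine adjacent pairs, recurse on the n/2
    # products (this gives the prefixes at odd positions), then one more
    # operation per even position. Depth O(log n), work ~2n.
    n = len(a)
    if n == 1:
        return a[:]
    pairs = [op(a[2*i], a[2*i + 1]) for i in range(n // 2)]
    sub = par_prefix(pairs, op)          # prefixes at positions 1, 3, 5, ...
    out = [None] * n
    out[0] = a[0]
    for i in range(n // 2):
        out[2*i + 1] = sub[i]
        if 2*i + 2 < n:
            out[2*i + 2] = op(sub[i], a[2*i + 2])
    return out
```

The two functions return identical results; the parallel one simply pays roughly twice the operations in exchange for logarithmic depth, which is the trade-off the slide quantifies.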
Adaptive prefix computation

– Any (parallel) prefix performs at least W1 ≥ 2n - W∞ operations.
– Strict lower bound on p identical processors: Tp ≥ 2n/(p+1), reached by a block algorithm + pipeline [Nicolau et al. 2000]

Application of the adaptive scheme:
  – one process performs the main "sequential" computation;
  – the other, work-stealer, processes compute parallel "segmented" prefixes.
Near-optimal performance on processors with changing speeds:
  Tp < 2n/((p+1)·Πave) + O(log n / Πave)
i.e. close to the lower bound.
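A sequential sketch of the combining step, under the simplifying assumption that the block decomposition produced by the steals is known in advance (`blocks` is an invented parameter; in the real scheme the split emerges at run-time from idle processors):

```python
from operator import mul

def seq_prefix(a, op=mul):
    # plain sequential prefix of one block
    out = [a[0]]
    for x in a[1:]:
        out.append(op(out[-1], x))
    return out

def adaptive_prefix(a, blocks, op=mul):
    """blocks: list of (lo, hi) index ranges covering a, one per process.
    Phase 1 (parallel in the real scheme): each stolen block computes its
    local, 'segmented' prefix, which is missing the factor of everything
    to its left. Phase 2 (on preemption): each block is rebased with one
    extra operation per element, left to right."""
    local = [seq_prefix(a[lo:hi], op) for lo, hi in blocks]
    out = list(local[0])                 # the main process's block is final
    for seg in local[1:]:
        left = out[-1]                   # total of everything to the left
        out.extend(op(left, x) for x in seg)
    return out
```

The segmented prefixes are exactly the πi = a5*…*ai quantities of the animation: correct relative to the block's start, and fixed up with the preempted left total when the main process reaches the block boundary.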
Adaptive prefix on 3 processors (animation, 6 slides)

[Figure: the main sequential process computes π1, π2, π3, π4 from a1…a4. On a steal request, workstealer 1 takes the block a5…a8 and computes the segmented prefix πi = a5*…*ai; workstealer 2 takes a9…a12 and computes πi = a9*…*ai. When the main process reaches a block boundary it preempts the stealer's partial results (e.g. obtaining π8, then π11), finishes with π12, and the implicit critical path stays on the sequential process.]
Adaptive prefix: some experiments
(joint work with Daouda Traoré)

[Graphs: time (s) vs. #processors for a prefix of 10,000 elements on an 8-processor SMP (IA-64 / Linux); curves for the parallel and adaptive algorithms, with and without external load.]

• Single-user context: adaptive is equivalent to the sequential algorithm on 1 processor, to the optimal 2-processor parallel algorithm on 2 processors, …, to the optimal 8-processor parallel algorithm on 8 processors.
• Multi-user context: adaptive is the fastest, with a 15% benefit over a static-grain algorithm.
The prefix race: sequential / parallel fixed / adaptive

Competitors: adaptive on 8 processors; parallel on 7, 6, 5, 4, 3 and 2 processors; sequential.
On each of the 10 executions, the adaptive version completes first.
Conclusion

Adaptive: which choices, and how to choose?

Illustration: adaptive parallel prefix based on work-stealing:
  – a self-tuned, baroque hybrid: O(p log n) choices
  – achieves near-optimal performance while being processor-oblivious

A generic adaptive scheme to implement parallel algorithms with provable performance.
Mini Symposium: Adaptive Algorithms for Scientific Computing

• Adaptive, hybrid, oblivious: what do these terms mean?
• Taxonomy of autonomic computing [Ganek & Corbi 2003]:
  – self-configuring / self-healing / self-optimizing / self-protecting
• Objective: towards an analysis based on algorithm performance

• 9:45  Adaptive algorithms: theory and applications. Jean-Louis Roch et al., AHA Team, INRIA-CNRS Grenoble, France
• 10:15 Hybrids in exact linear algebra. Dave Saunders, U. Delaware, USA
• 10:45 Adaptive programming with hierarchical multiprocessor tasks. Thomas Rauber, U. Bayreuth, Germany
• 11:15 Cache-oblivious algorithms. Michael Bender, Stony Brook U., USA
Questions?