DAG Scheduling with Reliability, GridSolve, Fault Tolerance

- DAG Scheduling with Reliability
- GridSolve
- Fault Tolerance in Open MPI

Asim YarKhan, Zhiao Shi, Jack Dongarra
VGrADS Workshop, April 2007

Task Graph Scheduling with Reliability
• Scheduling task graphs in a heterogeneous system is proven to be NP-hard
• With the increasing size of the system, the reliability of the system should be addressed
  – Application failure prevention
    – Task duplication
    – Checkpointing
  – Application failure avoidance
    – Guarantee that the probability of application failure is kept as low as possible
• Jack Dongarra, Emmanuel Jeannot, Erik Saule, Zhiao Shi, "Bi-objective Scheduling Algorithms for Optimizing Makespan and Reliability on Heterogeneous Systems," in Proceedings of the 19th Annual ACM Symposium on Parallelism in Algorithms and Architectures (SPAA '07)

Bi-objective Optimization Problem
• Task graph $G = (V, E)$, a DAG
  – Task $v_i \in V$ has $o_i$ operations, and $|V| = n$
  – Edge $e_i = e(i, j) \in E$ has an associated $l_i$ (communication cost from $v_i$ to $v_j$)
• Processor $p_j \in P$, $|P| = m$
  – $\tau_j$ is the unit execution time on $p_j$
  – $\lambda_j$ is the failure rate of $p_j$ (constant failure model)
  – $C^j_{max}$ is the completion time on processor $j$
• Failure model follows an exponential law
  – The success rate of task $i$ (with $o_i$ ops) on processor $j$ is $e^{-\lambda_j o_i \tau_j}$
  – The success rate of the application is $p_{succ} = e^{-\sum_j C^j_{max} \lambda_j}$
• Goal: minimize the makespan and maximize $p_{succ}$
  – Conflicting objectives: optimizing one objective will adversely affect the other
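The exponential failure model can be sketched in a few lines. This is a minimal illustration, assuming that a task with `ops` operations on a processor with unit execution time `unit_time` and constant failure rate `failure_rate` succeeds with probability exp(-failure_rate * ops * unit_time), and that the application succeeds when every processor survives until its own completion time. The function names and all numeric values are hypothetical, not taken from the slides.

```python
import math

def task_success(ops, unit_time, failure_rate):
    """Probability that one task finishes before its processor fails."""
    return math.exp(-failure_rate * ops * unit_time)

def application_success(completion_times, failure_rates):
    """Probability that every processor survives until its completion time."""
    exponent = sum(c * lam for c, lam in zip(completion_times, failure_rates))
    return math.exp(-exponent)

# Hypothetical two-processor platform (illustrative values only).
completion_times = [6.0, 9.0]   # completion time of each processor
failure_rates = [0.02, 0.01]    # failure rate of each processor

p_succ = application_success(completion_times, failure_rates)
print(f"application success probability: {p_succ:.4f}")
```

Because the exponent sums completion time times failure rate over all processors, shifting work to a faster but less reliable processor can shorten the makespan while lowering the success probability, which is the conflict between the two objectives.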

Generalizing Scheduling Heuristics
• HEFT (Heterogeneous Earliest Finish Time)
  – Well-known task graph scheduling heuristic
  – At each scheduling step, assign the task to the processor that gives the Earliest Finish Time (EFT)
• RHEFT (Reliable HEFT) to minimize failure costs
  – At each step, assign task $i$ to the processor $j$ that has the minimum product $T^{end}_j \lambda_j$
  – Easily extended to other heuristics
• Example
  – $T^{end}_1 = 6$, $T^{end}_2 = 9$: HEFT picks processor 1
  – $T^{end}_1 \lambda_1 = 12$, $T^{end}_2 \lambda_2 = 9$: RHEFT picks processor 2
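The two selection rules differ only in the quantity they minimize, which the following sketch shows for the example numbers on this slide. The failure rates 2 and 1 are inferred from the products 12 and 9 (12 / 6 and 9 / 9); the function names are hypothetical, and the sketch covers a single scheduling step, not a full scheduler.

```python
def pick_heft(finish_times):
    """HEFT step: index of the processor with the earliest finish time."""
    return min(range(len(finish_times)), key=lambda j: finish_times[j])

def pick_rheft(finish_times, failure_rates):
    """RHEFT step: index of the processor minimizing finish time * failure rate."""
    return min(range(len(finish_times)),
               key=lambda j: finish_times[j] * failure_rates[j])

finish_times = [6, 9]    # Tend for processors 1 and 2, from the slide
failure_rates = [2, 1]   # inferred from the products 12 and 9

heft_choice = pick_heft(finish_times)
rheft_choice = pick_rheft(finish_times, failure_rates)

# Report 1-based processor numbers to match the slide.
print("HEFT picks processor", heft_choice + 1)
print("RHEFT picks processor", rheft_choice + 1)
```

Swapping the key function is the whole change, which is why the same generalization applies to other list-scheduling heuristics.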

Reliability/Makespan Tradeoff
• Choose a subset of processors (by some ordering)
  – Choose k out of 100 processors
• Ordering the processors reliable first: high makespan for a low number of processors
• Ordering the processors fastest first: reliability increases as the number of processors increases to 15
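Choosing k processors under a given ordering amounts to a sort followed by a slice. The sketch below assumes each processor is described by a unit execution time and a failure rate; the `Processor` class, the helper names, and the three sample processors are hypothetical and serve only to show how the two orderings select different subsets.

```python
from dataclasses import dataclass

@dataclass
class Processor:
    name: str
    unit_time: float      # time per operation (smaller is faster)
    failure_rate: float   # failures per unit time (smaller is more reliable)

def choose_subset(processors, k, key):
    """Keep the first k processors under the ordering given by `key`."""
    return sorted(processors, key=key)[:k]

def reliable_first(p):
    return p.failure_rate

def fastest_first(p):
    return p.unit_time

# Hypothetical platform of three processors.
procs = [
    Processor("a", unit_time=1.0, failure_rate=0.03),
    Processor("b", unit_time=2.0, failure_rate=0.01),
    Processor("c", unit_time=0.5, failure_rate=0.02),
]

reliable = [p.name for p in choose_subset(procs, 2, reliable_first)]
fastest = [p.name for p in choose_subset(procs, 2, fastest_first)]
print("reliable first:", reliable)
print("fastest first:", fastest)
```

The scheduling heuristic then runs on the chosen subset only, so k and the ordering together act as the user's control over the tradeoff.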

Reliability/Makespan Tradeoff
• Ordering the processors smallest first: makespan OK, and reliability decreases from optimal at 1
• Tradeoff between HEFT and RHEFT
  – Tradeoff variable at k = 100: a value of 0 gives RHEFT, a value of 1 gives HEFT

Task Graphs: Conclusion
• Studied the problem of optimizing DAGs with 2 objectives
  – Minimize makespan
  – Maximize reliability
• Created a simple way to generalize heuristics (e.g. HEFT to RHEFT) to allow for reliability
• Characterized a way to let the user trade off reliability and makespan by choosing a subset of processors

GridSolve
• Grid-based, client-agent-server RPC system
  – Resource discovery, dynamic problem solving capabilities, load balancing, fault tolerance, asynchronous calls, disconnected operation, network data storage, ...
• Dynamic service bindings
  – Client does not need to have stubs for services
• API provides a variety of methods
  – Blocking, non-blocking, task farms, disconnected, ...
• Easy to add additional services by wrapping libraries
• Prime goal: ease of use
  – Supports the proposed standard GridRPC client API
  – Matlab, Octave, IDL: enable desktop-based Grid computing

GridSolve: Scheduling
• Joint work with Emmanuel Jeannot, INRIA
• History-based performance estimation
  – Uses the historical performance of a specific problem on a specific server, with a template-based linear model, to build a more accurate service performance model
• Communication cost estimates
  – Client estimates communication costs for a subset of servers via a simple probe
• Perturbation model for scheduling
  – The agent uses an interaction model of the jobs currently executing on the servers to schedule jobs (includes estimated completion times)
  – Accounts for the effect of one job on another
• Emmanuel Jeannot, Keith Seymour, Asim YarKhan, and Jack J. Dongarra, "Improved Runtime and Transfer Time Prediction Mechanisms in a Network Enabled Servers Middleware," Parallel Processing Letters, 2007
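One way to picture history-based estimation is a least-squares line fitted to past runs of one problem on one server. The sketch below is an assumption-laden illustration, not GridSolve's actual model: the template term (here n cubed), the history records, and every function name are hypothetical, since the slide does not say which template or features are used.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

def template(n):
    """Hypothetical complexity template for the service: work grows as n^3."""
    return n ** 3

# Hypothetical history of (problem size, measured seconds) on one server.
history = [(100, 0.5), (200, 2.1), (400, 14.3)]

xs = [template(size) for size, _ in history]
ys = [seconds for _, seconds in history]
intercept, slope = fit_line(xs, ys)

def predict(n):
    """Estimated runtime in seconds for problem size n on this server."""
    return intercept + slope * template(n)

print(f"predicted runtime for n=300: {predict(300):.2f} s")
```

Each new measurement can be appended to the history and the fit repeated, so the estimate follows the observed behavior of that specific server.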

GridSolve: Workflow support
• Work in progress, with Alexey Lastovetsky (UC Dublin) et al.
• GridSolve is being enhanced to support workflow
  – Described by a DAG submitted at the client
  – DAG parsed by client into GridSolve calls
  – Data transferred from server to server
• Scheduling is currently static
  – Done at DAG submission time, uses agent knowledge to create a smart mapping of the services to the resources
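Static scheduling of a workflow can be sketched as a topological ordering of the DAG followed by a one-time choice of server for each service. The example below is a simplified illustration using Python's standard `graphlib` module; the workflow, the per-server time estimates, and the function name are hypothetical, and the mapping rule (pick the lowest estimate) ignores server load and data transfer, which a real agent would take into account.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each service maps to the services it depends on.
workflow = {
    "factor": {"load"},
    "solve": {"factor"},
    "report": {"solve", "load"},
}

# Hypothetical agent knowledge: estimated seconds per service on each server.
estimates = {
    "load":   {"s1": 2.0, "s2": 3.0},
    "factor": {"s1": 9.0, "s2": 4.0},
    "solve":  {"s1": 5.0, "s2": 1.5},
    "report": {"s1": 0.5, "s2": 0.8},
}

def static_mapping(workflow, estimates):
    """Order services so dependencies run first, then fix a server for each."""
    order = list(TopologicalSorter(workflow).static_order())
    return [(svc, min(estimates[svc], key=estimates[svc].get)) for svc in order]

mapping = static_mapping(workflow, estimates)
for service, server in mapping:
    print(f"{service} -> {server}")
```

The mapping is computed once, at submission time, and is not revised while the workflow runs, which is what makes the schedule static.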

GridSolve: Summary
• Beta versions of GridSolve have been released
  – Currently v0.15
  – New version with scheduling options to be released soon
• Integration possibility
  – As in NetSolve/GrADS
  – vgES can act as a resource manager for the GridSolve agent
  – Software and data are migrated on demand
• GrADS/NetSolve integration for SC05

Open MPI
• Open source (BSD style) MPI-2 implementation (v1.2)
• Thread safe
• Dynamic process spawning
• Uses multiple network interfaces simultaneously via a single library
• Tunable (e.g. collective operations can be tuned to platform)
• In development: network and process fault tolerance

Open MPI: Fault Tolerance
• Notes: blue = not maintained; white = still functional; red = maintained and to be incorporated in Open MPI

Open MPI: Fault Tolerance
• Open MPI plans on supporting the following fault tolerance techniques:
  – LAM/MPI style automatic, coordinated process checkpoint and restart (expected SC07)
  – MPICH-V style automatic, uncoordinated process checkpoint with message logging techniques (expected SC07)
  – User-directed and communicator-driven fault tolerance, similar to the techniques implemented in FT-MPI (expected 2008)

The End