The EMAN Application: An Update

EMAN Oversimplified
[Diagram labels: Electron Micrographs, Particles, Preliminary 3D Model, Refine, Final 3D Model]

EMAN Refinement Process
[Diagram: workflow beginning at Start, with components proc3d, volume, make3diter, project3d, proc2d, make3d, classalign2, classesbymra; legend: Parallel component, Seq. component]

Recent non-EMAN Experiments
• Montage workflows show the value of scheduling (joint work with Ewa Deelman et al.)

Recent EMAN Experiments
• Experiments with the GroEL data set (“small”: 200 MB input file replicated on all clusters)
  - All different grid configurations
• Conclusion:
  - Significant advantage to using performance estimates, if they are accurate
  - Significant disadvantage to being “too smart”

Where do we go from here?
• Scalability
• Better heuristics
• Incorporating queueing systems
- Putting together a big testbed for SC04: ~80 nodes at UH (IA-32 + IA-64), ~60 nodes at Rice (IA-64), ~4 nodes at Baylor College of Medicine (IA-32)
- Run a big calculation (with real data?)
- Problem: IP addresses can’t be permanent (at Rice)
- Scheduler improvements under design (Anirban)
- Performance prediction under load needed (Gabi?)
- Schedule a cluster as a cluster
- Needs model of queue delay

Backup slides after this

Heuristic Workflow Scheduling: Results
• With heuristic workflow scheduling, workflow completion times improve by an order of magnitude (>20 times) over random scheduling on a heterogeneous platform
• Workflow completion time is within 10% of that achieved by a very expensive AI scheduler that doesn’t scale to 2047 jobs

Heuristic Workflow Scheduling: Results
• Simulation results for workflow completion times for different “Montage” workflows
• Improvement of >20% on a homogeneous platform
• Preliminary results from joint work with Ewa Deelman et al. at USC ISI

Results with GroEL data
• EMAN data set: 200 MB input file replicated on all clusters
• Used new performance models
• Ran the refinement cycle for the GroEL data using the new version of EMAN
• Testbed: 6 nodes [i2-53 to i2-58] on the UoH mckinley cluster and 7 nodes [torc1 to torc7] on the UTK torc cluster
• Analyzed the makespan of the most computationally intensive step in the workflow: classesbymra [clsbymra]
• Compared heuristic scheduling using performance models with random scheduling

Results: Unloaded Resources
                Heuristic_run1   Heuristic_run2   Random_run1    Random_run2
Exectime(uh)    12m45s [38]      12m40s [38]      5m22s [15]     6m44s [19]
Exectime(utk)   11m50s [60]      11m45s [60]      15m39s [83]    15m58s [79]
Makespan        12m45s           12m40s           15m39s         15m58s
• Used 2 nodes [i2-55 to i2-56] on the mckinley cluster and 7 nodes [torc1 to torc7] on the torc cluster
• The number in brackets after each execution time is the number of clsbymra instances mapped to that site
• rank[clsbymra_i][UH_mc_j]=7.60; rank[clsbymra_i][Utk_mc_j]=16.37
• Conclusion: Very accurate relative performance models on different heterogeneous platforms, combined with heuristic scheduling, result in near-optimal load balance of the clsbymra instances when the grid resources are relatively unloaded [dedicated resources]
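The load balance reported for the unloaded runs can be illustrated with a minimal sketch. It assumes (the slide does not say this) that each rank value is an estimated per-instance cost of clsbymra on one node of that site. It also uses a simple greedy earliest-finish heuristic as a stand-in for the actual scheduler. The function name `greedy_schedule` and the site labels are hypothetical.

```python
import heapq


def greedy_schedule(num_instances, sites):
    """Map identical task instances to nodes, earliest estimated finish first.

    sites: dict of site name -> (number of nodes, estimated cost per instance).
    The cost is ASSUMED to be the slide's rank value; its units are unknown.
    Returns (instances per site, estimated makespan in the same units).
    """
    # Heap entry: (finish time if this node takes one more instance,
    #              site name, node index, instances already on the node).
    heap = [
        (cost, name, node, 0)
        for name, (nodes, cost) in sites.items()
        for node in range(nodes)
    ]
    heapq.heapify(heap)

    counts = {name: 0 for name in sites}
    makespan = 0.0
    for _ in range(num_instances):
        finish, name, node, assigned = heapq.heappop(heap)
        counts[name] += 1
        makespan = max(makespan, finish)
        cost = sites[name][1]
        # The node now holds assigned + 1 instances; one more would finish here.
        heapq.heappush(heap, ((assigned + 2) * cost, name, node, assigned + 1))
    return counts, makespan


# 98 clsbymra instances, 2 mckinley nodes and 7 torc nodes, ranks from the slide.
counts, makespan = greedy_schedule(98, {"uh": (2, 7.60), "utk": (7, 16.37)})
print(counts, round(makespan, 2))
```

Under these assumptions the 98 instances split 38/60 between the two sites, which is the mapping shown for the heuristic runs. The sketch says nothing about how the real scheduler reached that mapping.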

Results: Loaded Resources
                Heuristic_run    Random_run
Exectime(uh)    16m41s [60]      9m38s [44]
Exectime(utk)   7m51s [38]       10m28s [54]
Makespan        16m41s           10m28s
• Used 5 nodes [i2-54 to i2-57] on the mckinley cluster and 7 nodes [torc1 to torc7] on the torc cluster; i2-54 to i2-57 were highly loaded; torcs not loaded
• rank[clsbymra_i][UH_mc_j]=7.60; rank[clsbymra_i][Utk_mc_j]=16.37
• Uneven load balance due to loading of the UH machines
• Conclusion: Performance-model-based scheduling works only when the underlying set of resources is reliable [or reserved in advance]. NWS predictions may not be enough.

Results: Inaccurate Performance Models
                Heuristic_run1   Heuristic_run2   Random_run1    Random_run2
Exectime(uh)    21m10s [77]      21m29s [77]      6m5s [49]      4m44s [41]
Exectime(utk)   3m54s [21]       3m55s [21]       9m13s [49]     11m51s [57]
Makespan        21m10s           21m29s           9m13s          11m51s
• Used 6 nodes [i2-53 to i2-58] on the mckinley cluster and 7 nodes [torc1 to torc7] on the torc cluster; torcs not loaded; UoH machines moderately loaded
• rank[clsbymra_i][UH_mc_j]=4.57; rank[clsbymra_i][Utk_mc_j]=16.37
• Performance model for the UoH machines was way off
• Conclusion: Inaccurate relative performance models on different heterogeneous platforms result in poor load balance of the clsbymra instances. [Note that the numbers here reflect loss of performance due to both inaccurate performance models and moderate load on UoH]
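The makespan figures for these runs can be rechecked with a short calculation. This sketch assumes that the makespan of a run is the longer of its two site execution times, which is consistent with the reported values. The `idle_gap` field is a hypothetical illustrative measure of imbalance, not something defined on the slides.

```python
import re


def to_seconds(text):
    """Convert a duration written like '21m10s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+)s", text)
    if match is None:
        raise ValueError(f"unrecognized duration: {text!r}")
    minutes, seconds = map(int, match.groups())
    return 60 * minutes + seconds


# (Exectime(uh), Exectime(utk)) for each run, taken from the reported results.
runs = {
    "Heuristic_run1": ("21m10s", "3m54s"),
    "Heuristic_run2": ("21m29s", "3m55s"),
    "Random_run1": ("6m5s", "9m13s"),
    "Random_run2": ("4m44s", "11m51s"),
}

summary = {}
for name, times in runs.items():
    secs = [to_seconds(t) for t in times]
    summary[name] = {
        "makespan": max(secs),              # slower site determines completion
        "idle_gap": max(secs) - min(secs),  # time the faster site sits idle
    }

for name, row in summary.items():
    print(name, row)
```

In the heuristic runs one site is idle for most of the makespan, which is the poor load balance the conclusion refers to.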

Results with rdv data
• Successfully ran the refinement cycle for rdv data using EMAN version 1.6 with the GrADSoft code base
• Medium/large data set: 2 GB input file replicated on all clusters
• New performance models for the components
• Testbed:
  - Six nodes [i2-53 to i2-58] at the mckinley cluster at University of Houston (IA-64)
  - Seven single-processor nodes [torc1 to torc7] in the torc cluster at University of Tennessee, Knoxville (IA-32)

Results: rdv data with unloaded resources
(table values listed column by column, in their original order)
Component Name: proc3d, project3d, proc2d, classesbymra, classalign2, make3d, proc3d
Resource(s) Chosen: i2-58, i2-53 to i2-58, torc1-7, i2-53 to i2-58, i2-58
# instances: 1, 1, 1, 68 [i2-*], 42 [torc*], 379, 1, 1, 1
Output directory: GrADS_27111, GrADS_31914, GrADS_5765, GrADS_9850, GrADS_9849, GrADS_27496, GrADS_16325, GrADS_27520, GrADS_13198
Component Exec. Time: <1 min, 1 h 48 min, <1 min, 84 h 30 min, 81 h 41 min, 45 min, 47 min, <1, <1
• Accurate relative performance models on different heterogeneous platforms, combined with heuristic scheduling, result in optimal load balance of the classesbymra instances when the resources are unloaded