

Resource Specification Prediction Model
Richard Huang, ryhuang@cs.ucsd.edu
Joint work with Henri Casanova and Andrew Chien

Introduction
§ Advances in networking technology
§ 10-40 Gbps aggregate bandwidth
§ Optical fibres
§ More resources mean bigger problems can be solved
§ Increasing deployment of clusters
§ Decreasing hardware prices
§ More choices in cluster vendors
§ Increasing availability of open source cluster management tools (e.g. ROCKS)
§ Large-scale distributed environments (LSDEs) can be used to run large-scale loosely synchronous apps such as scientific workflows

Running Applications on LSDEs
• One key challenge in running scientific workflows is resource selection

What's the problem?
• Different resource selection systems are out there (such as VGES)
• How does one go about writing the resource specification?
• We don't know of any other work that addresses this problem

Solution depends on…
• Application (DAG) characteristics
• Type of scheduling algorithm employed
• Types of resources available

Assumptions
• Resources are plentiful
  - Therefore we can pick the right size RC
• Resources are dedicated or have advance reservation
  OR
  Underlying middleware (such as VGES) can take care of interfacing with batch queue systems
• Bandwidth is reasonably plentiful
  - We don't deal with network contention
• Performance models exist for applications, so we know task runtimes

Resource Specification Prediction Model
§ Empirical Model uses input of DAG characteristics and optional utility function
§ Heuristic Prediction Model predicts the best scheduling heuristic to use
§ Size Prediction Model predicts the optimal resource collection (RC) size

Strategy in Formulating Prediction Model
1. Determine relevant DAG characteristics
2. Define what the best RC should be
3. Execute reference scheduling heuristic on an observation set of DAG configurations while varying relevant DAG characteristics
4. Derive model from the observation set results that predicts the best RC size

Relevant DAG Characteristics
• DAG size
• Communication-to-computation ratio (CCR)
• Amount of parallelism
• Regularity among tasks from different DAG levels
• Other possible characteristics:
  - DAG height and average number of tasks per level (subsumed by above characteristics)

Define best RC size
• Take an application and run it on different numbers of hosts
• Best RC size is where increasing the number of hosts does not improve performance (knee value)
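The knee definition above can be sketched in code. This is only an illustration under stated assumptions: the function name `find_knee`, the `threshold` parameter, and the rule "smallest size whose time is within a fraction of the best observed time" are not from the slides, which do not say exactly how the knee is detected.

```python
def find_knee(hosts, times, threshold=0.01):
    """Return the smallest host count whose turnaround time is within
    `threshold` (a fraction, e.g. 0.01 for 1%) of the best time observed.

    hosts: candidate RC sizes, in increasing order
    times: measured turnaround time for each candidate size
    """
    if not hosts or len(hosts) != len(times):
        raise ValueError("hosts and times must be non-empty and equal length")
    best = min(times)
    limit = best * (1.0 + threshold)
    for size, t in zip(hosts, times):
        # First (smallest) size that is "good enough" is taken as the knee.
        if t <= limit:
            return size
    return hosts[-1]  # not reached: the size achieving `best` always qualifies


# Hypothetical measurements: times flatten out past 32 hosts.
example_hosts = [8, 16, 32, 64, 128]
example_times = [100.0, 60.0, 41.0, 40.5, 40.4]
knee_loose = find_knee(example_hosts, example_times, threshold=0.02)
knee_tight = find_knee(example_hosts, example_times, threshold=0.001)
```

A looser threshold picks a smaller RC because it tolerates a slightly worse turnaround time in exchange for fewer hosts.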

Observation Set
DAG Characteristics   Values
DAG size              100, 500, 1000, 5000, 10000
CCR                   0.01, 0.3, 0.5, 0.8, 1.0
Parallelism           0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9
Regularity            0.01, 0.3, 0.5, 0.8, 1.0
• Instantiate 10 random DAGs for each DAG configuration
• Vary number of tasks per level randomly while maintaining parallelism, regularity, and size
• Idea is to run scheduling heuristics on each DAG configuration and try to see if we can detect some trends
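To get a feel for the size of this observation set, the grid can be enumerated as below. The sketch assumes every combination of the four characteristics is used (a full cross product), which the slide does not state explicitly; the dictionary keys are illustrative names.

```python
from itertools import product

# Values taken from the observation set table.
DAG_SIZES = [100, 500, 1000, 5000, 10000]
CCRS = [0.01, 0.3, 0.5, 0.8, 1.0]
PARALLELISM = [0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
REGULARITY = [0.01, 0.3, 0.5, 0.8, 1.0]
INSTANCES_PER_CONFIG = 10  # random DAGs instantiated per configuration


def observation_configs():
    """Yield one dict per DAG configuration (assumed full cross product)."""
    for size, ccr, par, reg in product(DAG_SIZES, CCRS, PARALLELISM, REGULARITY):
        yield {"size": size, "ccr": ccr, "parallelism": par, "regularity": reg}


configs = list(observation_configs())
total_dags = len(configs) * INSTANCES_PER_CONFIG
```

Under that assumption there are 5 x 5 x 7 x 5 = 875 configurations, so 8750 random DAGs in total.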

Size Prediction Model Formulation
• For better tractability, at first consider only parallelism (α) and regularity (β) for each (size, CCR) pair
• Knee size seems to approximately double for every 0.1 increase in α
• Knee size decreases with increase in regularity
• Based on tables similar to the sample table, hypothesize that the prediction could be modeled as 2^(aα + bβ + c)

Sample observation set knee values
Regularity:  0.01 0.3 0.5 0.8 1.0
Parallelism: 0.3 0.4 0.5 0.6 0.7 0.8 0.9
Knee values: 34 32 22 18 14 14 52 36 28 24 22 20 80 62 58 50 56 42 136 140 128 112 94 128 312 280 248 212 196 464 456 448 448 432 496 440 432 392

Size Prediction Model Formulation
• We need to solve for a, b, and c for each (size, CCR) pair
• Use linear regression to do planar fit of logarithm of knee value
• Interpolation between different DAG sizes and CCR values
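A minimal sketch of the planar fit, assuming the model has the form knee = 2^(a*alpha + b*beta + c), where `alpha` stands for parallelism and `beta` for regularity (these names, and the assignment of a to parallelism and b to regularity, are assumptions). Taking log2 of the knee turns the fit into ordinary least squares on a plane. The sketch solves the normal equations directly in pure Python; it is not necessarily the regression procedure the authors used, and the sample data is synthetic rather than taken from the slides.

```python
import math


def _solve3(m, v):
    """Solve a 3x3 linear system by Gauss-Jordan elimination with pivoting."""
    aug = [m[i][:] + [v[i]] for i in range(3)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("samples do not determine a unique plane")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(3):
            if r != col:
                f = aug[r][col] / aug[col][col]
                for k in range(col, 4):
                    aug[r][k] -= f * aug[col][k]
    return tuple(aug[i][3] / aug[i][i] for i in range(3))


def fit_knee_plane(samples):
    """Least-squares fit of log2(knee) = a*alpha + b*beta + c.

    samples: iterable of (alpha, beta, knee) with knee > 0.
    Returns (a, b, c).
    """
    xtx = [[0.0] * 3 for _ in range(3)]
    xty = [0.0] * 3
    for alpha, beta, knee in samples:
        row = (alpha, beta, 1.0)
        y = math.log2(knee)
        for i in range(3):
            xty[i] += row[i] * y
            for j in range(3):
                xtx[i][j] += row[i] * row[j]
    return _solve3(xtx, xty)


def predict_knee(alpha, beta, coeffs):
    """Evaluate the fitted model 2^(a*alpha + b*beta + c)."""
    a, b, c = coeffs
    return 2.0 ** (a * alpha + b * beta + c)


# Synthetic data generated from known coefficients, only to exercise the fit.
TRUE_COEFFS = (10.0, -1.0, 2.0)
synthetic = [
    (al, be, predict_knee(al, be, TRUE_COEFFS))
    for al in (0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9)
    for be in (0.01, 0.3, 0.5, 0.8, 1.0)
]
fitted = fit_knee_plane(synthetic)
```

With a coefficient of about 10 on `alpha`, each 0.1 increase in parallelism doubles the predicted knee, which matches the trend described on the previous slide. One such fit would be made per (size, CCR) pair; interpolation between pairs is not covered here.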

Model Validation
• Two workloads
  - Randomly generated DAGs (range of different DAG characteristics)
  - Montage DAGs
• Performance metric: application turnaround time (scheduling time + makespan)
• Cost: derived from Amazon's Elastic Compute Cloud price of $0.10 per hour for a 1.7 GHz processor
• Use brute force method to calculate optimal size
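The two validation quantities can be sketched as follows. The cost formula is an assumption: it bills every host in the RC for the whole turnaround time at the quoted hourly rate, with fractional hours allowed and no adjustment for clock rate. The slides do not spell out how cost was computed, and the example numbers are invented.

```python
RATE_PER_HOST_HOUR = 0.10  # dollars per host-hour, the rate quoted on the slide


def turnaround(scheduling_time_s, makespan_s):
    """Application turnaround time in seconds: scheduling time + makespan."""
    return scheduling_time_s + makespan_s


def cost(num_hosts, turnaround_s, rate=RATE_PER_HOST_HOUR):
    """Assumed cost model: all hosts are billed for the full turnaround time."""
    return num_hosts * (turnaround_s / 3600.0) * rate


def relative_diff_pct(value, reference):
    """Percentage difference of `value` relative to `reference`."""
    return (value - reference) / reference * 100.0


# Hypothetical run: predicted RC (100 hosts) vs. brute-force optimal RC (120 hosts).
predicted_t = turnaround(scheduling_time_s=20.0, makespan_s=3620.0)
optimal_t = turnaround(scheduling_time_s=30.0, makespan_s=3570.0)

perf_degradation = relative_diff_pct(predicted_t, optimal_t)
relative_cost = relative_diff_pct(cost(100, predicted_t), cost(120, optimal_t))
```

A positive `perf_degradation` means the predicted RC is slower than the optimal one; a negative `relative_cost` means it is cheaper.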

Randomly generated DAGs

DAG size   Average Predicted Size Diff.   Perform. Degrad.   Relative Cost
Observ. set DAG sizes
100        9.59%                          0.18%              -6.75%
500        11.49%                         0.22%              -5.29%
1000       9.62%                          0.32%              -4.32%
5000       13.27%                         0.77%              -4.72%
Midpoint DAG sizes
300        13.41%                         0.34%              -11.31%
750        11.85%                         0.29%              -5.59%
3000       14.97%                         1.08%              -9.98%

• Peak performance degradation was 15%, but on average, most were below 1%
• Prediction model predicted smaller RC sizes (reduces costs)

Comparison with using DAG width

DAG size   Average Perform. Degrad.   Relative Cost
Observ. set DAG sizes
100        0.50%                      144.8%
500        0.20%                      425.7%
1000       0.45%                      562.9%
5000       22.66%                     998.1%
Midpoint DAG sizes
300        0.30%                      219.2%
750        0.26%                      503.0%
3000       6.80%                      759.8%

• Using DAG width to try to maximize parallelism costs a lot more!
• Similar performance degradation as our model for smaller DAGs
• As DAG size increases, performance degrades rapidly as scheduling times become larger because of larger RC sizes

Performance Cost Tradeoff
• User can specify optional utility function
• For example: 1% performance degradation for every 10% in cost
• Knee threshold of 2% provides best utility in this example
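One way to read the example utility function is that the user accepts 1% of performance degradation for every 10% of cost saved. The sketch below encodes that reading as a linear utility and picks the knee threshold that maximizes it. Both the functional form and the candidate numbers are assumptions made for illustration; they are not the data behind the slide's 2% result.

```python
def utility(perf_degradation_pct, relative_cost_pct, cost_per_perf=10.0):
    """Assumed linear utility: cost savings (in %) divided by `cost_per_perf`,
    minus performance degradation (in %). Higher is better.

    relative_cost_pct is negative when the RC is cheaper than the optimal one.
    """
    savings_pct = -relative_cost_pct
    return savings_pct / cost_per_perf - perf_degradation_pct


def best_threshold(candidates, cost_per_perf=10.0):
    """candidates maps knee threshold (%) -> (perf_degradation_pct, relative_cost_pct).
    Returns the threshold with the highest utility."""
    return max(
        candidates,
        key=lambda th: utility(candidates[th][0], candidates[th][1], cost_per_perf),
    )


# Invented values, chosen only to exercise the functions.
hypothetical = {
    0.5: (0.1, -2.0),
    1.0: (0.2, -12.0),
    2.0: (1.5, -20.0),
    5.0: (4.0, -30.0),
}
chosen = best_threshold(hypothetical)
```

Changing `cost_per_perf` shifts the balance: a user who values performance more would raise it, which favours lower thresholds and larger RCs.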

Montage

1629-tasks
Thresh.   Perf. Degr.   Relative Costs
0.1%      0.08%         11.2%
0.5%      0.04%         7.5%
1.0%      0.01%         0.6%
2.0%      0.89%         -13.5%
5.0%      0.75%         -30.8%
10.0%     4.18%         -48.2%

4469-tasks (Perf. Degr., Relative Costs): 0.00%, -2.4%; 0.00%, -4.0%; 1.35%, -21.2%; 1.81%, -30.4%; 4.67%, -51.0%

• Different thresholds did not degrade performance too much
• Better savings at higher thresholds (using fewer hosts)

Sensitivity Analysis
• Previous results all based on homogeneous clock rates and reference scheduling heuristic
• Should look at how model reacts to:
  - Different levels of clock rate heterogeneity
  - Different scheduling heuristics

Impact of Clock Rate Heterogeneity

Impact of Scheduling Heuristics

Summary
• We devised an empirical model to predict good RC size
• Our model maintains good performance over a range of DAG configurations, a range of resource heterogeneity, and different scheduling heuristics
• We have shown that our model leads to good application performance at often reduced costs from optimal RC size

Future Work
• Heuristic Prediction Model to predict which heuristic to use given input DAG and optional utility function
• How do we degrade gracefully when the resource selection system cannot return the desired resource collection?
• Translate output of our model into input to different resource selection systems