Resource Specification Prediction Model
Richard Huang, ryhuang@cs.ucsd.edu
Joint work with Henri Casanova and Andrew Chien

Introduction
§ Advances in networking technology
  § 10-40 Gbps aggregate bandwidth
  § Optical fibres
§ More resources mean bigger problems can be solved
§ Increasing deployment of clusters
  § Decreasing hardware prices
  § More choices in cluster vendors
  § Increasing availability of open source cluster management tools (e.g. ROCKS)
§ Large-scale distributed environments (LSDEs) can be used to run large-scale, loosely synchronous applications such as scientific workflows

Running Applications on LSDEs
• One key challenge in running scientific workflows is resource selection

What’s the problem?
• Different resource selection systems are out there (such as VGES)
• How does one go about writing the resource specification?
• We don’t know of any other work that addresses this problem

Solution depends on…
• Application (DAG) characteristics
• Type of scheduling algorithm employed
• Types of resources available

Assumptions
• Resources are plentiful
  — Therefore we can pick the right size RC
• Resources are dedicated or have advance reservation
  OR
  underlying middleware (such as VGES) can take care of interfacing with batch queue systems
• Bandwidth is reasonably plentiful
  — We don’t deal with network contention
• Performance models are available for applications, so we know task runtimes

Resource Specification Prediction Model
§ Empirical model takes DAG characteristics and an optional utility function as input
§ Heuristic Prediction Model predicts the best scheduling heuristic to use
§ Size Prediction Model predicts the optimal resource collection (RC) size

Strategy in Formulating Prediction Model
1. Determine relevant DAG characteristics
2. Define what the best RC should be
3. Execute reference scheduling heuristic on an observation set of DAG configurations while varying relevant DAG characteristics
4. Derive model from the observation set results that predicts the best RC size

Relevant DAG Characteristics
• DAG size
• Communication-to-computation ratio (CCR)
• Amount of parallelism
• Regularity among tasks from different DAG levels
• Other possible characteristics:
  — DAG height and average number of tasks per level (subsumed by above characteristics)

Define best RC size
• Take an application and run it on different numbers of hosts
• Best RC size is the point beyond which increasing the number of hosts does not improve performance (knee value)

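The knee definition above can be sketched in a few lines. This is only one possible reading: it assumes the knee is the smallest host count whose turn-around time is within a chosen fractional threshold of the best time observed. The function name `knee_size` and the measurements are hypothetical, not taken from the slides.

```python
from typing import Sequence


def knee_size(hosts: Sequence[int], times: Sequence[float], threshold: float = 0.0) -> int:
    """Return the smallest host count whose time is within `threshold`
    (a fraction, e.g. 0.02 for 2%) of the best time observed."""
    if len(hosts) != len(times) or not hosts:
        raise ValueError("hosts and times must be non-empty and the same length")
    best = min(times)
    # Walk host counts in increasing order and stop at the first one
    # that is already good enough relative to the best run.
    for n, t in sorted(zip(hosts, times)):
        if t <= best * (1.0 + threshold):
            return n
    raise AssertionError("unreachable: the best time always qualifies")


# Hypothetical measurements: turn-around time (seconds) per RC size.
hosts = [2, 4, 8, 16, 32, 64]
times = [400.0, 210.0, 120.0, 101.0, 100.0, 100.0]

print(knee_size(hosts, times))        # strict knee
print(knee_size(hosts, times, 0.02))  # knee with a 2% tolerance
```

A looser threshold trades a small slowdown for a smaller RC, which is the lever the later performance/cost slides rely on.
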
Observation Set
DAG Characteristics   Values
DAG size              100, 500, 1000, 5000, 10000
CCR                   0.01, 0.3, 0.5, 0.8, 1.0
Parallelism           0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9
Regularity            0.01, 0.3, 0.5, 0.8, 1.0
• Instantiate 10 random DAGs for each DAG configuration
• Vary number of tasks per level randomly while maintaining parallelism, regularity, and size
• Idea is to run scheduling heuristics on each DAG configuration and see whether we can detect trends

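To get a feel for the size of the observation set, the configurations can be enumerated directly. The sketch assumes the set is the full cross product of the listed values, which the slide does not state outright. The constant names are illustrative.

```python
from itertools import product

# Values copied from the observation set table.
DAG_SIZES = [100, 500, 1000, 5000, 10000]
CCRS = [0.01, 0.3, 0.5, 0.8, 1.0]
PARALLELISM = [0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
REGULARITY = [0.01, 0.3, 0.5, 0.8, 1.0]
DAGS_PER_CONFIG = 10  # random instances per configuration

# Assumption: every combination of the four characteristics is used.
configs = list(product(DAG_SIZES, CCRS, PARALLELISM, REGULARITY))
total_dags = len(configs) * DAGS_PER_CONFIG

print(len(configs), "configurations,", total_dags, "DAG instances")
```
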
Size Prediction Model Formulation
• For better tractability, at first consider only parallelism ($\alpha$) and regularity ($\beta$) for each (size, CCR) pair
• Knee size seems to approximately double for every 0.1 increase in $\alpha$
• Knee size decreases with increase in regularity
• Based on tables similar to the sample below, hypothesize that the prediction could be modeled as $2^{a\alpha + b\beta + c}$
Sample observation set knee values (regularity: 0.01, 0.3, 0.5, 0.8, 1.0; parallelism: 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9):
34 32 22 18 14 14 52 36 28 24 22 20 80 62 58 50 56 42 136 140 128 112 94 128 312 280 248 212 196 464 456 448 448 432 496 440 432 392

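The two observations pin down rough expectations for the coefficients. Writing $\alpha$ for parallelism and $\beta$ for regularity (symbol names assumed here), a short derivation under the hypothesized exponential form gives:

```latex
\[
  \text{knee}(\alpha,\beta) = 2^{a\alpha + b\beta + c}
  \quad\Longrightarrow\quad
  \frac{\text{knee}(\alpha + 0.1,\beta)}{\text{knee}(\alpha,\beta)} = 2^{0.1a}
\]
\[
  2^{0.1a} \approx 2 \;\Longrightarrow\; a \approx 10,
  \qquad
  \text{knee decreasing in } \beta \;\Longrightarrow\; b < 0 .
\]
```

These are order-of-magnitude expectations only. The fitted values depend on the (size, CCR) pair.
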
Size Prediction Model Formulation
• We need to solve for a, b, and c for each (size, CCR) pair
• Use linear regression to do a planar fit of the logarithm of the knee value
• Interpolate between different DAG sizes and CCR values

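A planar fit of the logarithm of the knee value might look roughly like the following. It is a minimal sketch, not the authors' implementation: it assumes base-2 logarithms, ordinary least squares via the normal equations, and synthetic noise-free data generated from made-up coefficients. All function names are hypothetical.

```python
import math


def solve3(A, y):
    """Solve a 3x3 linear system by Gauss-Jordan elimination with partial pivoting."""
    M = [row[:] + [v] for row, v in zip(A, y)]
    n = 3
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [x - f * p for x, p in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


def fit_knee_plane(alpha, beta, knee):
    """Least-squares fit of log2(knee) = a*alpha + b*beta + c."""
    rows = [(p, r, 1.0) for p, r in zip(alpha, beta)]  # last column is the intercept
    y = [math.log2(k) for k in knee]
    XtX = [[sum(row[i] * row[j] for row in rows) for j in range(3)] for i in range(3)]
    Xty = [sum(row[i] * v for row, v in zip(rows, y)) for i in range(3)]
    return tuple(solve3(XtX, Xty))


def predict_knee(a, b, c, alpha, beta):
    """Evaluate the hypothesized model 2^(a*alpha + b*beta + c)."""
    return 2.0 ** (a * alpha + b * beta + c)


# Synthetic data from made-up coefficients, only to check that the fit recovers them.
true_a, true_b, true_c = 10.0, -1.0, 2.0
grid = [(p, r) for p in (0.3, 0.5, 0.7, 0.9) for r in (0.01, 0.5, 1.0)]
alpha = [p for p, _ in grid]
beta = [r for _, r in grid]
knee = [predict_knee(true_a, true_b, true_c, p, r) for p, r in grid]

a, b, c = fit_knee_plane(alpha, beta, knee)
print(round(a, 6), round(b, 6), round(c, 6))
```

Fitting in log space turns the exponential model into a plane, so one (a, b, c) triple per (size, CCR) pair is enough to predict a knee for any parallelism and regularity in range.
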
Model Validation
• Two workloads
  — Randomly generated DAGs (range of different DAG characteristics)
  — Montage DAGs
• Performance metric: application turn-around time (scheduling time + makespan)
• Cost: derived from Amazon’s Elastic Compute Cloud price of $0.10 per hour for a 1.7 GHz processor
• Use brute force method to calculate optimal size

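The validation metrics can be made concrete with a small calculation. The sketch below assumes that cost is hosts × turn-around hours × the $0.10 rate, with no rounding up to whole hours, and that degradation and relative cost are percentage differences against the brute-force optimum. The slide does not spell out either definition, and the input numbers are hypothetical.

```python
RATE_PER_HOST_HOUR = 0.10  # USD per processor-hour, from the slide


def turnaround(scheduling_time_s: float, makespan_s: float) -> float:
    """Application turn-around time = scheduling time + makespan."""
    return scheduling_time_s + makespan_s


def cost(hosts: int, turnaround_s: float, rate: float = RATE_PER_HOST_HOUR) -> float:
    # Assumption: every host is billed for exactly the turn-around time.
    return hosts * (turnaround_s / 3600.0) * rate


def relative_change(value: float, reference: float) -> float:
    """Percentage difference of `value` against `reference`."""
    return (value - reference) / reference * 100.0


# Hypothetical runs: predicted RC size versus brute-force optimal RC size.
pred_t = turnaround(scheduling_time_s=20.0, makespan_s=3640.0)
opt_t = turnaround(scheduling_time_s=30.0, makespan_s=3570.0)

perf_degradation = relative_change(pred_t, opt_t)
relative_cost = relative_change(cost(40, pred_t), cost(50, opt_t))

print(f"degradation {perf_degradation:.2f}%, relative cost {relative_cost:.2f}%")
```

A negative relative cost means the predicted RC is cheaper than the optimal one, as in the result tables.
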
Randomly generated DAGs
DAG size   Average Predicted Size Diff.   Perform. Degrad.   Relative Cost
Observ. set DAG sizes
100        9.59%                          0.18%              -6.75%
500        11.49%                         0.22%              -5.29%
1000       9.62%                          0.32%              -4.32%
5000       13.27%                         0.77%              -4.72%
Midpoint DAG sizes
300        13.41%                         0.34%              -11.31%
750        11.85%                         0.29%              -5.59%
3000       14.97%                         1.08%              -9.98%
• Peak performance degradation was 15%, but on average, most were below 1%
• Prediction model predicted smaller RC sizes (reduces costs)

Comparison with using DAG width
DAG size   Average Perform. Degrad.   Relative Cost
Observ. set DAG sizes
100        0.50%                      144.8%
500        0.20%                      425.7%
1000       0.45%                      562.9%
5000       22.66%                     998.1%
Midpoint DAG sizes
300        0.30%                      219.2%
750        0.26%                      503.0%
3000       6.80%                      759.8%
• Using DAG width to try to maximize parallelism costs a lot more!
• Similar performance degradation to our model for smaller DAGs
• As DAG size increases, performance degrades rapidly because larger RC sizes make scheduling times longer

Performance Cost Tradeoff
• User can specify an optional utility function
• For example, accept 1% performance degradation for every 10% reduction in cost
• A knee threshold of 2% provides the best utility in this example

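One way to read the example utility function is as an exchange rate between cost savings and performance loss. The sketch below assumes a linear utility in which 10% of cost savings offsets 1% of degradation. The function names and candidate numbers are made up for illustration and are not the data behind the slide.

```python
def utility(perf_degradation_pct: float, cost_savings_pct: float,
            savings_per_degradation: float = 10.0) -> float:
    """Net benefit: savings converted to 'percent of performance', minus the loss."""
    return cost_savings_pct / savings_per_degradation - perf_degradation_pct


def best_threshold(candidates):
    """candidates: iterable of (threshold_pct, perf_degradation_pct, cost_savings_pct)."""
    return max(candidates, key=lambda c: utility(c[1], c[2]))[0]


# Hypothetical outcomes for four knee thresholds.
candidates = [
    (0.5, 0.1, 2.0),
    (1.0, 0.3, 5.0),
    (2.0, 0.6, 12.0),
    (5.0, 3.0, 20.0),
]

for thresh, degr, savings in candidates:
    print(thresh, round(utility(degr, savings), 2))
print("best threshold:", best_threshold(candidates))
```

With a different exchange rate, or a nonlinear utility, a different threshold would come out on top.
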
Montage
1629-tasks
Thresh.   Perf. Degr.   Relative Costs
0.1%      0.08%         11.2%
0.5%      0.04%         7.5%
1.0%      0.01%         0.6%
2.0%      0.89%         -13.5%
5.0%      0.75%         -30.8%
10.0%     4.18%         -48.2%
4469-tasks (Perf. Degr. / Relative Costs, by increasing threshold): 0.00% / -2.4%, 0.00% / -4.0%, 1.35% / -21.2%, 1.81% / -30.4%, 4.67% / -51.0%
• Different thresholds did not degrade performance too much
• Better savings at higher thresholds (using fewer hosts)

Sensitivity Analysis
• Previous results all based on homogeneous clock rates and reference scheduling heuristic
• Should look at how model reacts to:
  — Different levels of clock rate heterogeneity
  — Different scheduling heuristics

Impact of Clock Rate Heterogeneity
Impact of Scheduling Heuristics
Summary
• We devised an empirical model to predict a good RC size
• Our model maintains good performance over a range of DAG configurations, a range of resource heterogeneity, and different scheduling heuristics
• We have shown that our model leads to good application performance, often at reduced cost compared with the optimal RC size

Future Work
• Heuristic Prediction Model to predict which heuristic to use given input DAG and optional utility function
• How do we degrade gracefully when the resource selection system cannot return the desired resource collection?
• Translate output of our model into input to different resource selection systems