Predicting Queue Waiting Time For Individual User Jobs
Rich Wolski, Dan Nurmi, John Brevik, Graziano Obertelli, Ryan Garver
Computer Science Department, University of California, Santa Barbara
Problem: Predicting Delay in Batch Queues
• Time in queue is experienced as application delay
• Sounds like an easy problem, but
  – Distribution of load from users is a matter of some debate
  – Scheduling policy is partially hidden
  – Sites need to change the policies dynamically and without warning
  – Job execution times are difficult to predict
• Much research in this area over the past 20 years, but few solutions
  – Current commercial systems provide high variance estimates
      – On-line simulation based on max requested time
      – “expected” value predictions
  – Most sites simply disable these features
Hard Problem
For Scheduling: It’s all about the big Q
• Predictions of the form
  – “What is the maximum time my job will wait with X% certainty?”
  – “What is the minimum time my job will wait with X% certainty?”
• Requires two estimates if certainty is to be quantified
  – Estimate the (1-X) quantile for the distribution of availability => Qx
  – Estimate the upper or lower X% confidence bound on the statistic Qx => Q(x, b)
• If the estimates are unbiased, and the distribution is stationary, future availability duration will be larger than Q(x, b) X% of the time, guaranteed
Quantiles versus Moments
• Quantiles permit quantifiable predictions for individual jobs
  – “expectation” in relation to the mean is a misnomer => useful for throughput
• Example: 100 jobs, heavy tail, 6 orders of magnitude variation, random order
  – 95 jobs wait 10 seconds or less
  – 1 job waits 1000 seconds
  – 1 job waits 10000000 seconds
• Mean wait time: 111120 seconds
  – The “expected value”
• 0.95 quantile: 10 seconds
  – “95% chance” job will wait 10 seconds or less
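The mean-versus-quantile example can be checked numerically. The sketch below is illustrative only: the slide lists the 1000 s and 10000000 s jobs explicitly, and the three intermediate tail values (10^4, 10^5, 10^6 s) are an assumption chosen because they bring the count to 100 jobs and reproduce the stated mean. The helper `empirical_quantile` is a hypothetical name.

```python
import math
import statistics

# Wait times in seconds. 95 short waits plus a heavy tail. The 10**4, 10**5
# and 10**6 entries are an ASSUMPTION; the slide only lists 10**3 and 10**7.
waits = [10] * 95 + [10**3, 10**4, 10**5, 10**6, 10**7]

def empirical_quantile(samples, q):
    """Smallest sample value with at least a fraction q of samples at or below it."""
    ordered = sorted(samples)
    rank = math.ceil(q * len(ordered))  # 1-based rank of the order statistic
    return ordered[max(rank, 1) - 1]

mean_wait = statistics.mean(waits)
q95_wait = empirical_quantile(waits, 0.95)

print(f"mean wait:     {mean_wait:.1f} s")
print(f"0.95 quantile: {q95_wait} s")
```

The mean is dominated by the five tail jobs, while the 0.95 quantile describes what 95 of the 100 jobs actually experience.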
BMBP: A New Predictive Methodology
• New quantile estimator based on the Binomial distribution
  – Requires carefully engineered numerical system to deal with large-scale combinatorics
• New changepoint detector
  – Binomial method in a time series context is difficult
  – Need a system for determining
      – Stationary regions in the data
      – Minimum statistically meaningful history in each region
• New clustering methodology
  – More accurate estimates are possible if predictions are made from jobs with similar characteristics
  – Takes dynamic policy changes into account more effectively
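A binomial quantile bound of the kind named above can be sketched as follows. This is a minimal illustration of the standard order-statistic argument, not the BMBP implementation: it assumes independent, identically distributed wait times, and the function names are hypothetical. If the true q quantile is Q, the number of observations below Q is Binomial(n, q), so the k-th smallest observation is at least Q with probability P(Binomial(n, q) <= k - 1).

```python
from math import comb

def binom_cdf(k, n, p):
    """P(Binomial(n, p) <= k), computed by direct summation."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def upper_bound_index(n, q, conf):
    """1-based index k of the smallest order statistic that bounds the
    q quantile from above with confidence >= conf, or None if the
    history of n samples is too short to support that confidence."""
    for k in range(1, n + 1):
        if binom_cdf(k - 1, n, q) >= conf:
            return k
    return None

def predict_upper_bound(waits, q=0.95, conf=0.95):
    """Upper confidence bound on the q quantile of the observed wait times."""
    ordered = sorted(waits)
    k = upper_bound_index(len(ordered), q, conf)
    return None if k is None else ordered[k - 1]

# Minimum history for a 95% bound on the 0.95 quantile:
print(upper_bound_index(58, 0.95, 0.95))  # too short, no valid index
print(upper_bound_index(59, 0.95, 0.95))  # the largest observation qualifies
```

The `None` case corresponds to the "minimum statistically meaningful history" bullet: with too few observations, not even the maximum observed wait gives the requested confidence. Direct summation is adequate for small histories; large ones are where the numerical engineering mentioned on the slide becomes necessary.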
Ten Years of Supercomputing
See it In Action
• http://nws.cs.ucsb.edu/batchq
Predicting Things Upside Down
• Deadline scheduling: My job needs to start in the next X seconds for the results to be meaningful.
  – Amitava Mujumdar, Tharaka Devaditha, Adam Birnbaum (SDSC)
      – Need to run a 4 minute image reconstruction that completes in the next 8 minutes
• Given a
  – Machine
  – Queue
  – Processor count
  – Run time
  – Deadline
• What is the probability that a job will meet the deadline?
• http://nws.cs.ucsb.edu/batchq/invbqueue.php
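One way to invert a quantile bound into a deadline probability is sketched below. This is an assumption about how such an inversion could work, not a description of the invbqueue service: it scans quantile levels in 5% steps and reports the largest level whose binomial upper bound still fits in the slack between run time and deadline. All names and the toy history are hypothetical.

```python
from math import comb

def binom_cdf(k, n, p):
    """P(Binomial(n, p) <= k), computed by direct summation."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def quantile_upper_bound(ordered, q, conf):
    """Smallest order statistic bounding the q quantile from above with
    confidence >= conf, or None if the history is too short."""
    n = len(ordered)
    for k in range(1, n + 1):
        if binom_cdf(k - 1, n, q) >= conf:
            return ordered[k - 1]
    return None

def deadline_probability(waits, run_time, deadline, conf=0.95):
    """Largest quantile level (in 5% steps) whose upper bound on queue
    wait still lets a job of length run_time finish by the deadline."""
    slack = deadline - run_time  # longest tolerable queue wait
    if slack < 0:
        return 0.0
    ordered = sorted(waits)
    best = 0.0
    for pct in range(5, 100, 5):
        q = pct / 100
        bound = quantile_upper_bound(ordered, q, conf)
        if bound is not None and bound <= slack:
            best = q
    return best

# Toy history (seconds): 80 short waits and 20 long ones. Illustrative only.
history = [60] * 80 + [600] * 20
# A 4 minute job that must complete within 8 minutes.
print(deadline_probability(history, run_time=240, deadline=480))
```

Although 80% of the toy history waited under the slack, the reported level is lower because the confidence bound accounts for sampling uncertainty in the history.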
How Well Does it Work with an Application?
[Figure: EMAN workflow with panels labeled Electron Micrograph, Particles, Preliminary 3D Model, Refine, Final 3D Model]
EMAN has been developed at Baylor College of Medicine by the research group of Wah Chiu and Steven Ludtke, {wah, sludtke}@bcm.tmc.edu
VGrADS EMAN Batch Scheduler
• EMAN emulator
  – Run the EMAN scheduler to determine a job launch sequence
  – Launch the jobs by submitting them to the queues specified by the scheduler
  – When an EMAN job acquires the processors, exit and “sleep” the emulator for the predicted execution time
      – Saves system allocation time
  – Record the overall makespan
• Experiment:
  – Chicago TeraGrid, SDSC TeraGrid, NCSA TeraGrid and CNSI Dell at UCSB
  – 57 separate runs
• Results: mean observed and mean predicted makespans are not significantly different at alpha = 0.05
95% Upper Bound on Median
EMAN Turnaround Improvement
Virtual Resource Reservations
[Figure labels: now, submit time, 0.75]
• 75% is the target probability
• 356 total requests
• 257 total batch submissions
  – 99 requests resulted in initial ‘not possible’ response
• 192 slots successfully acquired
• 257 × 0.75 ≈ 193
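The reservation counts can be checked with a few lines of arithmetic. The numbers come from the slide; the binomial standard deviation is an added illustration that assumes independent submissions, which the slide does not state.

```python
import math

total_requests = 356
not_possible = 99          # requests answered with an initial 'not possible'
target_probability = 0.75
acquired = 192             # slots successfully acquired

submissions = total_requests - not_possible             # 257
expected_successes = submissions * target_probability   # 192.75, shown as 193
observed_rate = acquired / submissions

# Spread of a Binomial(submissions, target_probability) count,
# ASSUMING independent submissions (not stated on the slide).
std_dev = math.sqrt(submissions * target_probability * (1 - target_probability))

print(f"submissions:        {submissions}")
print(f"expected successes: {expected_successes:.2f}")
print(f"observed rate:      {observed_rate:.3f}")
print(f"binomial std dev:   {std_dev:.2f}")
```

The observed 192 successes sit within one slot of the expected count, well inside the spread that independent trials at the 75% target would produce.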
Clustering
• RMS ratio of BMBP with clustering to BMBP without
  – Both achieve 95% correctness
  – Measures additional “tightness” improvement through clustering
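A tightness comparison of this kind could be computed as sketched below. The slide does not define the RMS quantity, so this assumes it is the root-mean-square gap between the predicted bound and the observed wait. The function names and the toy numbers are hypothetical and are not results from the talk.

```python
import math

def correctness(actual, predicted):
    """Fraction of jobs whose observed wait did not exceed the predicted bound."""
    hits = sum(1 for a, p in zip(actual, predicted) if a <= p)
    return hits / len(actual)

def rms_error(actual, predicted):
    """Root-mean-square gap between predicted bound and observed wait."""
    squares = [(p - a) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squares) / len(squares))

# Toy data in seconds, illustrative only.
actual = [10, 20, 30, 40]
bound_without_clustering = [50, 50, 50, 50]
bound_with_clustering = [20, 30, 40, 50]

ratio = rms_error(actual, bound_with_clustering) / rms_error(actual, bound_without_clustering)
print(f"correctness without: {correctness(actual, bound_without_clustering):.2f}")
print(f"correctness with:    {correctness(actual, bound_with_clustering):.2f}")
print(f"RMS ratio (with / without): {ratio:.3f}")
```

A ratio below 1 at equal correctness means the clustered bounds are tighter without being wrong more often.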
FAQ
• What happens if everyone uses these predictions? Will it be stable?
  – Maybe
  – We do not consider jobs in queue
  – Automatic schedulers may cause destabilization
• What about autocorrelation (you idiot)?
  – Difficult to compute in this space
      – Error-prone for non-stationary series
      – Queues reorder the series
  – Autocorrelation is and is not an issue
      – Quantile estimation and clustering algorithm are relatively robust to autocorrelation
      – Change-point detector uses the autocorrelation it computes on the fly
• Not a guarantee since it can fail
  – All guarantees come with a failure probability
The Software
• Requires no special privileges
  – Predictions are better and “burn-in” shorter if scheduler logs are available => retrofit the log history
• Version 1: obsolete
  – NWS sensors run at each site
  – Prediction software runs at UCSB
  – Command-line tools and web page connect to UCSB
  – Stable, but does not support clustering
• Version 2: beta version
  – Supports automatic clustering
  – Prediction software can be run locally or at UCSB
  – Command-line tools locally or at UCSB
  – Web support at UCSB only
  – No packaging
• Version 3: end of the year
Batch Queue Prediction for Grid Systems
• A good point-valued prediction remains elusive
  – “expectation” sounds attractive but is really a misnomer
• Grid users certainly can use bounds instead
  – Early job completion is okay, typically
  – Bounds give a good intuitive feel for which queue will be quickest
• Deployment and integration underway
  – CDF FermiLab working (barely)
  – Condor integration
  – UCLA Grid tools
• Automatic schedulers are coming
  – EMAN doesn’t use ranges…it should
  – VGrADS is developing new schedulers (workflow)
  – NEESGrid and ISI are in development (workflow)
  – LEAD integration is underway (workflow)
  – Large-scale sensor network simulation
What’s Next?
• Open questions:
  – Does the availability of predictions affect load?
      – Rolling out production tools now and we will be monitoring
      – Job cancellation does not affect results
  – If it does, will allocations be stable?
      – Grid economies
• Virtual resource reservations (VGrADS)
  – Reservations must be integrated
  – Conditional prediction and resubmission
  – Replicated submissions (boost success probability)
  – Virtual Cluster??
• Thanks
  – NSF SCI, NSF NGS, VGrADS, SDSC, TACC, NCSA, Argonne
• rich@cs.ucsb.edu
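The "replicated submissions" idea can be illustrated with a short calculation. This sketch assumes the replicas succeed independently with the same probability, which is a strong simplification: queues at different sites may be correlated, and the talk does not state a model. The function names are hypothetical.

```python
def replicated_success(p, replicas):
    """Probability that at least one of `replicas` submissions starts in time,
    ASSUMING each succeeds independently with probability p."""
    return 1 - (1 - p) ** replicas

def replicas_needed(p, target):
    """Smallest number of independent replicas whose combined success
    probability reaches the target. Requires 0 < p < 1 and target < 1."""
    count = 1
    while replicated_success(p, count) < target:
        count += 1
    return count

# Two replicas at the 75% target probability used for reservations.
print(replicated_success(0.75, 2))
# Replicas needed to reach 99% under the independence assumption.
print(replicas_needed(0.75, 0.99))
```

Each replica also adds load to the queues it is submitted to, which ties back to the open question of whether predictions affect load.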