Towards Provision of Quality of Service Guarantees in Job Scheduling
Mohammad Islam, P. Sadayappan, Pavan Balaji, D. K. Panda
Computer Science and Engineering, The Ohio State University

Job Schedulers Today
§ Publicly Usable Supercomputer Centers and Compute Clusters
  – Becoming increasingly common (OSC, SDSC, etc.)
  – Allow people to run jobs and charge the user based on the job requirements
  – Job requirements include number of CPUs, CPU time, memory, etc.
  – Jobs are submitted by the user with resource requirements
    § Scheduled to execute at a later time
    § A dedicated job scheduler schedules the submitted jobs (e.g., PBS, Maui, Silver)
    § Users provide a run-time estimate together with the job
§ Independent Parallel Job Scheduling Model
  – Dynamically arriving independent parallel jobs
  – Resource mapping: submitted jobs to resources
  – Viewed as a two-dimensional chart: Processors vs. Time

Two Dimensional Scheduling Grid
[Figure: scheduling grid with Processors on one axis and Time on the other, marking the Current Time, the Running Jobs, and the Job Queue; jobs are labeled J1 to J6.]

Job Scheduling Overview
§ Submitted jobs have several associated parameters
  – Resources required
  – Arrival time (the time at which the job is submitted by the user)
  – Run-time estimate of the job (users need to provide a run-time estimate)
§ Queuing Order
  – FCFS (First Come First Served)
  – SJF (Shortest Job First)
§ Reservations are made for some of the jobs
  – Conservative backfilling model
  – EASY backfilling model

Backfilling Models
[Figure: Processors vs. Time chart showing the Running Jobs, the Current Time, and the Job Queue (jobs J1 to J6), with a Reservation Violation marked.]
Conservative gives reservations to all jobs; EASY gives reservations to just one job

Guarantees in Service
§ A number of techniques have been studied over the years
  – Backfilling (e.g., Conservative, EASY, No Guarantee)
  – Priority-based scheduling
    § Differentiated service to different classes of jobs
    § Soft real-time or best-effort guarantees on the completion time
§ Hard real-time or "deadline-based" scheduling
  – Allows users to specify the deadline they desire
  – Cost model based on resources used AND deadline specified
  – Requires a deadline-based scheduling algorithm

QoS for Job Scheduling
§ Two components in providing QoS
  – Job Scheduling Component
    § Admission control: can we meet the specified deadline?
    § Once admitted, a job cannot miss its specified deadline
    § Studied as a part of our previous work [islam03:qops]
  – Cost Model Component
    § Based on resources used AND deadline specified
    § More urgent jobs are charged more
    § Role of user characteristics in the charging model: tolerance of users to missed deadlines
    § We deal with this component of QoS in this paper
[islam03:qops] "QoPS: A QoS based scheme for Parallel Job Scheduling", M. Islam, P. Balaji, P. Sadayappan and D. K. Panda. Published in JSSPP '03 and LNCS '04.

Overview
§ Introduction and Motivation
§ Cost Model for Supercomputer Centers
§ Understanding User-Tolerance
§ Dealing with User-Tolerance
  – Artificial Slack
  – Kill-and-Restart
§ Experimental Results
§ Conclusions and Future Work

Cost Model in Supercomputer Centers
§ The current cost model is based on resources used
  – Cost is independent of the response time of the job
  – The user cannot request quicker service
§ Some schedulers allow differentiated service (e.g., NERSC)
  – Use high-priority, normal, and low-priority queues
  – The high-priority queue charges more; the low-priority queue charges less
  – No guarantees are provided to the user about the response time
  – The idea of charging the user based on the service/priority is still relevant!
§ Two components in the Cost Model
  – Resource Charge
  – QoS Charge

Cost Model Components
§ Resource Charge
  – Depends on the CPU, memory, disk space, etc.
  – We only consider the CPU resource; extendable to other resources
§ QoS Charge
  – Depends on the urgency of the job
  – Depends on the difficulty in scheduling the job; based on two components
    § Current load in the system
    § Slowdown of the category to which the job belongs (e.g., Short-Wide)
  – Slowdown is the factor by which the job is delayed compared to its runtime
Resource Charge = Processors x Run-Time
QoS Charge = Category Slowdown / Requested Slowdown

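The two charge formulas on this slide can be sketched in a few lines of Python. This is only an illustration: the `Job` class, the function names, the choice of hours as the time unit, and the example slowdown values are assumptions, and the slides do not say how the two charges are combined into a final price, so they are kept separate here.

```python
from dataclasses import dataclass


@dataclass
class Job:
    processors: int   # number of CPUs requested
    run_time: float   # run time, in hours (unit is an assumption)


def resource_charge(job: Job) -> float:
    """Resource Charge = Processors x Run-Time (CPU resource only)."""
    return job.processors * job.run_time


def qos_charge(category_slowdown: float, requested_slowdown: float) -> float:
    """QoS Charge = Category Slowdown / Requested Slowdown."""
    if requested_slowdown <= 0:
        raise ValueError("requested slowdown must be positive")
    return category_slowdown / requested_slowdown


if __name__ == "__main__":
    job = Job(processors=64, run_time=2.0)
    print(resource_charge(job))    # 128.0
    print(qos_charge(8.0, 2.0))    # 4.0: urgent request in a slow category
    print(qos_charge(8.0, 16.0))   # 0.5: relaxed request in the same category
```

A smaller requested slowdown (a more urgent job) raises the QoS charge, matching the slide's statement that more urgent jobs are charged more.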
Understanding User Tolerance
§ What if the requested deadline cannot be met?
  – Some users might submit another job, or submit to another supercomputer center
  – Some users might re-submit the job with another deadline
    § What new deadline? Trial and error?
  – We modified QoPS to provide some feedback about the best possible deadline
§ We model user tolerance based on a Tolerance Factor (TF)
  – The factor by which the user will accept an extension of the requested deadline
  – If (Earliest Possible Deadline) < (TF x Requested Deadline)
    § The user is assumed to accept the new deadline
  – Else
    § The user is assumed to reject the new deadline

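The acceptance rule can be written as a one-line predicate. In this sketch the function name is hypothetical, and both deadlines are assumed to be durations measured from the job's arrival time (the slide does not say whether deadlines are absolute or relative).

```python
def user_accepts(earliest_possible: float, requested: float,
                 tolerance_factor: float) -> bool:
    """Accept the offered deadline if it is within TF times the requested one.

    Both deadlines are durations from the arrival time (an assumption).
    """
    return earliest_possible < tolerance_factor * requested


if __name__ == "__main__":
    # Requested deadline of 10 hours, tolerance factor 1.5.
    print(user_accepts(12.0, 10.0, 1.5))   # True: 12 < 15
    print(user_accepts(20.0, 10.0, 1.5))   # False: 20 >= 15
```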
Feedback-based QoPS Algorithm
§ This is for the case when the user-requested deadline cannot be met
  – Say this deadline is D1
§ When a new job arrives:
  – Try to place the new job without disturbing any other job
  – Calculate the completion time of the job
  – This is a definitely feasible deadline (D2)
§ Use a binary search for N iterations to find a better deadline
  – For each iteration, run QoPS with the new deadline
  – If feasible, try a tighter deadline; else try a looser deadline

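The binary search between D1 and D2 might look like the sketch below. The callback `is_feasible` is a hypothetical stand-in for a full QoPS admission run with the candidate deadline, and the sketch assumes feasibility is monotonic (any deadline looser than a feasible one is also feasible), which the slides do not state.

```python
from typing import Callable


def best_feasible_deadline(d1: float, d2: float,
                           is_feasible: Callable[[float], bool],
                           iterations: int = 8) -> float:
    """Search between an infeasible deadline d1 and a feasible deadline d2.

    Returns the tightest deadline found to be feasible after the given
    number of iterations. `is_feasible` stands in for running QoPS.
    """
    lo, hi = d1, d2          # lo is known infeasible, hi is known feasible
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if is_feasible(mid):
            hi = mid         # feasible: try a tighter deadline
        else:
            lo = mid         # infeasible: try a looser deadline
    return hi


if __name__ == "__main__":
    # Toy feasibility rule: any deadline of at least 37 time units works.
    offered = best_feasible_deadline(10.0, 100.0, lambda d: d >= 37.0,
                                     iterations=20)
    print(round(offered, 3))
```

Because the search always returns `hi`, the offered deadline is one that was actually checked and found feasible (or D2 itself), never an untested midpoint.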
Impact of User Tolerance
• Increase in Resource Charge: intuitive
• Decrease in QoS Charge: as the users get more tolerant, the center loses QoS money!
• This is because the center accepts cheaper jobs which would otherwise have been rejected
• Overall profit of the supercomputer center depends on the ratio of the two charges

Dealing with User-Tolerance
§ User tolerance is not always a win!
  – Depends on the QoS charge to Resource charge ratio
    § If the QoS charge were 0, user tolerance would always be beneficial
    § If the Resource charge were 0, user tolerance would never be beneficial
  – We need some way to deal with this behavior of user tolerance
    § Allows us to maximize profit depending on the QoS and Resource charges
§ We propose two approaches to counter user tolerance
  – Providing Artificial Slack
  – Kill-and-Restart Mechanism

Artificial Slack
§ If a requested deadline cannot be met
  – Provide the best possible deadline to the job
    § Maximizes the revenue for the job
    § Creates a very tight schedule; difficulty in accepting later-arriving jobs
  – Provide the job with an artificial slack
    § Loses out on the revenue for the job
    § The overall schedule is loose enough to accept later-arriving jobs
  – The offered deadline is extended
    § Offered Deadline = Arrival Time + (Earliest Deadline - Arrival Time) x Slack Factor

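As a worked example of the offered-deadline formula, here is a minimal sketch. The function name is hypothetical, and the check that the Slack Factor is at least 1 is an assumption drawn from the word "extended" on the slide.

```python
def offered_deadline(arrival: float, earliest_deadline: float,
                     slack_factor: float) -> float:
    """Arrival Time + (Earliest Deadline - Arrival Time) x Slack Factor."""
    if slack_factor < 1.0:
        # Assumption: slack only ever loosens the earliest feasible deadline.
        raise ValueError("slack factor is assumed to be >= 1")
    return arrival + (earliest_deadline - arrival) * slack_factor


if __name__ == "__main__":
    # Job arrives at t=100; the earliest feasible deadline is t=160.
    print(offered_deadline(100.0, 160.0, 1.0))   # 160.0 (no slack)
    print(offered_deadline(100.0, 160.0, 1.5))   # 190.0 (50% extra slack)
```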
Kill and Restart
§ Some supercomputer centers, such as the OSC, support a kill-and-restart mechanism
§ A running job can be killed and restarted: no permanent modifications
§ Our approach utilizing the kill-and-restart mechanism
  – If a job's requested deadline cannot be met
    § Kill a running job and try scheduling this job
    § Verify that both the new job and the killed job can be scheduled
    § If they can be scheduled, accept the schedule
  – Note that this approach might waste resources

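The kill-and-restart check can be sketched as a loop over the running jobs. Everything named here is hypothetical: `fits` stands in for the scheduler's feasibility test, jobs are represented by their names, and trying victims in list order is an assumption, since the slides do not say which running job is chosen.

```python
from typing import Callable, List, Optional, Sequence

# Hypothetical feasibility check: given the jobs left running and the jobs
# that must be (re)scheduled, report whether all deadlines can still be met.
FitsFn = Callable[[Sequence[str], Sequence[str]], bool]


def try_kill_and_restart(new_job: str, running: List[str],
                         fits: FitsFn) -> Optional[str]:
    """Return the running job to kill so that new_job fits, or None."""
    for victim in running:
        still_running = [j for j in running if j != victim]
        # Both the new job and the killed job must remain schedulable.
        if fits(still_running, [new_job, victim]):
            return victim
    return None


if __name__ == "__main__":
    # Toy rule: the schedule only works once J2 is no longer running.
    toy_fits = lambda running, pending: "J2" not in running
    print(try_kill_and_restart("J9", ["J1", "J2", "J3"], toy_fits))   # J2
```

If no victim passes the check, the function returns `None` and the new job would be handled as in the case without kill-and-restart.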
Impact of Slack Factor (SF)
• Opposite impact to that of user tolerance!
• Drop in Resource charge: more small jobs are accepted, forcing larger jobs to be rejected
• Increase in QoS charge: the additional slack forces significantly delayed jobs to be rejected

Impact of Kill-and-Restart (TF = 4.0)
• Kill-and-Restart loses out on Resource charge due to wastage of resources
• It gains on the QoS charge due to its ability to accept later-arriving expensive jobs

Impact of Non-deadline Jobs (20% deadline)
• Benefits of Kill-and-Restart are even higher
• Benefits are seen in both cost components

Concluding Remarks
§ Considerable research exists on the topic of parallel job scheduling
§ The issue of provision of QoS has received little attention
§ In this paper, we extend our previous QoS-based scheme, QoPS
  – A feedback mechanism to provide the best possible deadline
  – Study the impact of user tolerance on overall revenues
  – Propose schemes to minimize the negative impacts of user tolerance
§ Demonstrated capabilities and issues based on simulation studies

Future Work
§ Incorporate the QoPS scheme into Maui/Moab at the OSC
§ Study QoS aspects for multi-site schedulers
§ Clusters with heterogeneous systems
§ Extend the scheme to multiple resources
§ Study QoS aspects for network flow scheduling

Thank You!
http://www.cse.ohio-state.edu/~saday
{islammo, balaji, saday, panda}@cse.ohio-state.edu

Backup Slides
The Basic QoPS Algorithm
§ When a job (JN+1) arrives
  – If J1, J2, …, JN are already present and scheduled in that order
  – Place the job (JN+1) at the start of all jobs
    § Try scheduling the jobs in that order
  – If all jobs are able to meet their deadlines, great! Admit it!
  – If some job fails, we have two options:
  – Option 1:
    § Consider the failed job as a critical job
    § Push the failed job to the start of the schedule and retry
    § 'k' such re-orderings of existing jobs are allowed
    § If (number of re-orderings > k) switch to Option 2
  – Option 2:
    § Back off exponentially in the position at which you try placing job (JN+1) and retry

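The admission loop on this slide can be illustrated with a heavily simplified sketch. It is not the authors' implementation: it assumes a single resource on which jobs run back to back in list order (the real scheduler places parallel jobs on a processors-vs-time grid), it counts the k re-orderings separately for each insertion position, and it reads "back off exponentially" as trying positions 0, N/2, 3N/4, and so on up to N. All names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Job:
    name: str
    run_time: float
    deadline: float   # absolute completion deadline


def first_violation(order: List[Job], now: float = 0.0) -> Optional[Job]:
    """Run jobs back to back (single-resource assumption); return the first
    job that would miss its deadline, or None if all deadlines are met."""
    t = now
    for job in order:
        t += job.run_time
        if t > job.deadline:
            return job
    return None


def candidate_positions(n: int) -> List[int]:
    """Insertion positions 0, n/2, 3n/4, ..., n (assumed back-off pattern)."""
    positions = [0]
    remaining = n
    while remaining > 0:
        remaining //= 2
        positions.append(n - remaining)
    return positions


def admit(schedule: List[Job], new_job: Job, k: int = 2,
          now: float = 0.0) -> Optional[List[Job]]:
    """Return a feasible job order that includes new_job, or None to reject."""
    for pos in candidate_positions(len(schedule)):          # Option 2 loop
        order = schedule[:pos] + [new_job] + schedule[pos:]
        reorders = 0
        while True:
            failed = first_violation(order, now)
            if failed is None:
                return order                                 # admit the job
            if reorders == k:
                break                                        # switch to Option 2
            # Option 1: treat the failed job as critical, move it to the front.
            order = [failed] + [j for j in order if j is not failed]
            reorders += 1
    return None


if __name__ == "__main__":
    existing = [Job("A", 2.0, 2.0), Job("B", 3.0, 10.0)]
    result = admit(existing, Job("N", 1.0, 4.0))
    print([j.name for j in result] if result else "rejected")   # ['A', 'N', 'B']
```

In the example, placing N first makes A miss its deadline, so A is pushed to the front as the critical job and the retry succeeds.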
Working of the QoPS Algorithm
[Figure: step-by-step placement of a new job J13 into an existing schedule of jobs J1 to J12. Max. Violations Allowed = 2; the Current Violations counter steps through 0, 1, 2.]

Simulation Approach
[Figure: simulation pipeline with the boxes CTC/SDSC Trace; Load Variation (Duplication/Expansion); Deadline Calculator; Deadline-based Trace; and the QoPS, MSB, MRT, and EASY Simulations.]

Trace Generation
§ Many job logs are available, but with no associated deadlines
§ Synthetic Deadline Generation
  – Generate a schedule for the job trace using EASY
  – For any job J, let T be its turnaround time in this schedule
  – Deadline for J = Arrival Time + max(runtime, (1 - SF) x T)
  – SF here is the "Stringency Factor" (0 < SF < 1), not the Slack Factor
    § Values near 0 give the least stringent deadlines and values near 1 the most stringent
§ Some jobs might not come with deadlines
  – Very lax deadlines to prevent starvation
  – If T is the current expected turnaround time,
    § Deadline = Arrival Time + max(24 hrs, R x T)
  – R is the "Relaxation Factor" of the schedule

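Both deadline formulas on this slide translate directly into code. In this sketch the function names are hypothetical and all times are assumed to be in hours, so the 24-hour floor for non-deadline jobs becomes `24.0`.

```python
def synthetic_deadline(arrival: float, run_time: float, turnaround: float,
                       stringency: float) -> float:
    """Deadline = Arrival Time + max(runtime, (1 - SF) x T).

    `turnaround` is the job's turnaround time T in the EASY schedule.
    """
    if not 0.0 < stringency < 1.0:
        raise ValueError("stringency factor must satisfy 0 < SF < 1")
    return arrival + max(run_time, (1.0 - stringency) * turnaround)


def lax_deadline(arrival: float, expected_turnaround: float,
                 relaxation: float, floor_hours: float = 24.0) -> float:
    """Deadline = Arrival Time + max(24 hrs, R x T) for non-deadline jobs."""
    return arrival + max(floor_hours, relaxation * expected_turnaround)


if __name__ == "__main__":
    # Job arriving at t=0 with a 2-hour run time and 10-hour EASY turnaround.
    print(synthetic_deadline(0.0, 2.0, 10.0, 0.5))   # 5.0
    print(synthetic_deadline(0.0, 2.0, 10.0, 0.9))   # 2.0 (run time is the floor)
    print(lax_deadline(0.0, 10.0, 4.0))              # 40.0
    print(lax_deadline(0.0, 2.0, 4.0))               # 24.0 (24-hour floor)
```

The `max` with the run time keeps a generated deadline from being shorter than the job itself, however stringent SF is.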