Scheduling in HPC Resource Management Systems: Queuing vs. Planning
Matthias Hovestadt, Odej Kao, Axel Keller, and Achim Streit
2003 Job Scheduling Strategies for Parallel Processing (JSSPP) Workshop
Jerry Chou, 8/29/2005

Outline
- Background
- Queuing and Planning Systems
- Advanced Planning Functions
- Example: Computing Center Software
- Conclusion
- Discussion

Background
- HPC systems are operated by resource management systems (RMS) based on the queuing approach
  - PBS, SGE, LoadLeveler, etc.
- Grid middleware emerges between resource management systems and applications
  - Globus, vgES, etc.
- High-level functions (co-allocation) need features from the RMS
  - Advanced reservation, quality of service
- These features are hard to realize with a queuing-based RMS because it only considers present resource usage
- => This paper proposes planning systems to close the gap

Big Picture
[Figure: layered stack, top to bottom]
- Application
- Grid Middleware (Globus, vgES): co-allocation
- RMS (PBS), RMS (LoadLeveler), RMS (SGE), RMS (Condor): QoS, advanced reservation
- Resources

Queuing and Planning Systems
- Queuing Systems
- Planning Systems
- Queuing vs. Planning Systems

Queuing Systems
- Queues have different limits on the resource requests
  - Number of resources requested
  - Execution time
  - Interactive/batch jobs
- Jobs are sorted by schedule policy in the queue
  - The highest-priority request is the queue head
- If more than one queue head can be started, further criteria are needed, such as queue priority
- If no queue head can be started, the idle resources may be utilized with backfilling
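
The queue-head and backfilling rules above can be illustrated with a minimal sketch. The `Job` fields, the "larger number = higher priority" policy, and the function name `start_jobs` are assumptions made for this example and do not come from the paper. The sketch also ignores run times, so a backfilled job may delay the blocked queue head; backfilling variants that avoid this need runtime estimates.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    nodes: int      # number of nodes requested
    priority: int   # larger value = higher priority (assumed policy)


def start_jobs(queue, free_nodes):
    """Return the names of the jobs started now on `free_nodes` idle nodes."""
    # Sort by the schedule policy; the first element is the queue head.
    pending = sorted(queue, key=lambda job: -job.priority)
    started = []

    # Start queue heads for as long as the current head fits.
    while pending and pending[0].nodes <= free_nodes:
        job = pending.pop(0)
        free_nodes -= job.nodes
        started.append(job.name)

    # The head (if any) is blocked: backfill later jobs into the idle nodes.
    for job in pending[1:]:
        if job.nodes <= free_nodes:
            free_nodes -= job.nodes
            started.append(job.name)
    return started


queue = [Job("A", 8, 3), Job("B", 2, 2), Job("C", 4, 1)]
print(start_jobs(queue, free_nodes=6))
```

With 6 idle nodes the head A (8 nodes) cannot start, so B and C are backfilled.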

Planning Systems - Replanning
- Requested
  - Start time
  - Estimated run time
- When
  - A new request is submitted
  - A running request ends before its estimated end time
- How
  - Delete all non-reservations from the schedule
  - Sort non-reservations according to the schedule policy
  - Arrange reservations into the schedule
  - Insert non-reservations into the schedule at the earliest possible start time
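
The "How" steps above can be sketched for a single machine with a fixed number of nodes. This is an illustration under stated assumptions, not the CCS implementation: the `Request` type, the helper names, integer time units, and the use of list order as the schedule policy (FCFS) are all hypothetical, and reservations are assumed not to conflict with each other.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Request:
    name: str
    nodes: int
    duration: int                      # estimated run time
    fixed_start: Optional[int] = None  # set only for reservations


def usage_at(placed, t):
    """Nodes in use at time t; placed holds (start, end, nodes) tuples."""
    return sum(n for (s, e, n) in placed if s <= t < e)


def fits(placed, t, duration, nodes, total):
    # Usage only rises at start times, so checking t and every start
    # inside (t, t + duration) covers the whole interval.
    points = [t] + [s for (s, _, _) in placed if t < s < t + duration]
    return all(usage_at(placed, p) + nodes <= total for p in points)


def earliest_start(placed, now, duration, nodes, total):
    # A job can only become startable now or when a placed job ends.
    candidates = sorted({now} | {e for (_, e, _) in placed if e > now})
    for t in candidates:
        if fits(placed, t, duration, nodes, total):
            return t
    raise ValueError("request needs more nodes than the machine has")


def replan(requests, now, total_nodes):
    """Return {request name: planned start time}."""
    reservations = [r for r in requests if r.fixed_start is not None]
    others = [r for r in requests if r.fixed_start is None]  # assumed FCFS order
    placed, plan = [], {}
    for r in reservations:             # reservations keep their start time
        placed.append((r.fixed_start, r.fixed_start + r.duration, r.nodes))
        plan[r.name] = r.fixed_start
    for r in others:                   # everything else goes as early as possible
        t = earliest_start(placed, now, r.duration, r.nodes, total_nodes)
        placed.append((t, t + r.duration, r.nodes))
        plan[r.name] = t
    return plan


requests = [
    Request("R", nodes=4, duration=2, fixed_start=3),
    Request("A", nodes=2, duration=5),
    Request("B", nodes=2, duration=3),
]
print(replan(requests, now=0, total_nodes=4))
```

In this example A cannot finish before the reservation R begins, so it is planned after R, while the shorter B fits into the gap before R. This is how a planning system gets backfilling implicitly.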

Queuing vs. Planning Systems

                                     Queuing           Planning
Time frame                           Present           Present and future
Submission of resource requests      Insert in queue   Replanning
Assignment of proposed start time    No                All requests
Runtime estimates                    Not necessary     Yes
Reservation                          Not possible      Yes
Backfilling                          Optional          Yes

Advanced Planning Functions
- Requesting Resources
- Dynamic Aspects
- Service Level Agreements

Requesting Resources
- Diffuse requests
  - Give a range: "need 32~128 CPUs"
  - Let the RMS optimize: "need as many nodes as possible"
- Negotiation
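
One simple way to read a diffuse request is "grant the largest CPU count inside the requested range that is currently free". The sketch below assumes exactly that rule; the function name and parameters are hypothetical, and a real RMS may optimize over start times as well.

```python
from typing import Optional


def resolve_diffuse(free_cpus: int,
                    min_cpus: int = 1,
                    max_cpus: Optional[int] = None) -> Optional[int]:
    """Return the CPU count to grant, or None if the minimum cannot be met.

    max_cpus=None models "as many nodes as possible".
    """
    if free_cpus < min_cpus:
        return None
    return free_cpus if max_cpus is None else min(free_cpus, max_cpus)


print(resolve_diffuse(free_cpus=100, min_cpus=32, max_cpus=128))
print(resolve_diffuse(free_cpus=16, min_cpus=32, max_cpus=128))
```

The first call grants all 100 free CPUs because that lies within 32~128; the second returns None because fewer than 32 CPUs are free.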

Dynamic Aspects
- Variable Reservations
  - Make a reservation ASAP
  - Different from reserved jobs:
    - No fixed start time
  - Different from non-reserved jobs:
    - Never planned later than its first planned start time
- Resource Reclaiming
  - Replace requested resources at run time
- Automatic Duration Extension
  - Extend the runtime of jobs while they are running
  - How long can it be extended?
  - How many times can it be extended?
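
The two open questions for automatic duration extension (how long, how many times) map naturally onto two limits. The sketch below is only an illustration: the `RunningJob` type, the `try_extend` function, and the default limits are assumptions, and a real planner would also have to check that the extension does not collide with later planned requests.

```python
from dataclasses import dataclass


@dataclass
class RunningJob:
    planned_end: int          # planned end time, in minutes
    extensions_used: int = 0  # number of extensions granted so far
    extra_granted: int = 0    # total extra minutes granted so far


def try_extend(job, extra, max_extensions=2, max_total_extra=60):
    """Extend the job by `extra` minutes if both limits allow it."""
    if job.extensions_used >= max_extensions:
        return False          # "how many times" limit reached
    if job.extra_granted + extra > max_total_extra:
        return False          # "how long" limit would be exceeded
    job.planned_end += extra
    job.extensions_used += 1
    job.extra_granted += extra
    return True


job = RunningJob(planned_end=100)
print(try_extend(job, 30), job.planned_end)
print(try_extend(job, 40), job.planned_end)
```

The first request is granted; the second is refused because 30 + 40 minutes would exceed the assumed 60-minute total.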

Dynamic Aspects (Cont.)
- Automatic Restart
  - It can utilize short time slots in the schedule
- Space sharing "Cycle Stealing"
  - Run as a background job to steal resources in a space-sharing system (like Condor)
- Deployment Servers
  - RMS plans both the requested resources and the time to reconfigure the hardware

Service Level Agreements (SLA)
- An SLA has to be considered not only in the scheduling process but also during runtime
- At runtime the scheduler is not responsible for measuring the fulfillment of the SLA, but for providing all granted resources

Computing Center Software (CCS)
- Architecture
  - User Interface (UI): provides a single access point to one or more systems
  - Access Manager (AM): manages the user interfaces and is responsible for authentication, authorization, and accounting
  - Planning Manager (PM): plans the user requests onto the machine
  - Machine Manager (MM): provides machine-specific features
  - Island Manager (IM): provides CCS internal services and watchdog facilities to keep the island in a stable condition

Process Flow
[Flowchart]
- User: specifies the expected duration of each request
- PM: re-plans the schedule from the requests
  - Fix-time request: reserves resources for a given time
  - Var-time request: can move to an earlier time slot when replanning
- MM: maps the schedule to machines
  - Verifies whether the schedule can be realized with the available hardware
    - Yes: done
    - No: finds an alternative time and sends a conflict list to the PM
- Can the PM accept the conflict list?
  - Yes: done
  - No: the PM re-plans
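
The PM/MM interaction can be sketched as a small decision function. Everything here is a hypothetical simplification for illustration, not the CCS interface: a schedule is reduced to `{request: start time}`, the hardware check is reduced to "the mapped hardware is usable from time t", and the PM's acceptance rule is an assumed "latest acceptable start" per request.

```python
def mm_verify(schedule, usable_from):
    """MM step: return a conflict list {request: alternative start time}.

    usable_from[name] is the earliest time the hardware mapped to that
    request is usable (an assumed stand-in for the hardware check).
    """
    return {name: usable_from[name]
            for name, start in schedule.items()
            if usable_from.get(name, start) > start}


def pm_accepts(conflicts, latest_start):
    """PM step: accept if every alternative respects the request's limit."""
    return all(alt <= latest_start.get(name, alt)
               for name, alt in conflicts.items())


def process(schedule, usable_from, latest_start):
    conflicts = mm_verify(schedule, usable_from)
    if not conflicts:
        return "done", schedule                    # schedule is realizable
    if pm_accepts(conflicts, latest_start):
        return "done", {**schedule, **conflicts}   # take the MM's alternatives
    return "replan", schedule                      # PM has to plan again


schedule = {"a": 0, "b": 10}
print(process(schedule, usable_from={"a": 0, "b": 15}, latest_start={"b": 20}))
print(process(schedule, usable_from={"a": 0, "b": 15}, latest_start={"b": 12}))
```

In the first call the MM moves b to time 15 and the PM accepts; in the second the alternative is too late for b, so the result asks for a new plan.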

Conclusion
- Classifies and compares queuing systems with planning systems
- Presents possible advanced planning functionality
- The aim of the paper is to show the benefit of planning systems for managing HPC machines

Discussion
- Do planning systems solve all the problems?
  - What if most jobs want to run ASAP?
  - What if the runtime is not estimated precisely?
- How do queuing systems and planning systems compare in performance and utilization?
  - If you are a resource provider, will you use it?
- What features could be provided by vgES?
  - Diffuse requests
  - Resource reclaiming
  - Variable reservation
  - Negotiation