University Dortmund Tutorial: Grid Resource Management and Scheduling


University of Dortmund
Tutorial: Grid Resource Management and Scheduling
Euro-Par 2004, Pisa, Italy
Ramin Yahyapour, CEI, University of Dortmund


Agenda
• Background on Resource Management and Scheduling
• Transition to Grid RM and Grid Scheduling
• Job scheduling and resource management on HPC resources: current state of the art
• Future of GRMS: requirements, outlook for upcoming GRMS


Introduction
• We all know what “the Grid” is… one of the many definitions: “Resource sharing & coordinated problem solving in dynamic, multi-institutional virtual organizations” (Ian Foster)
  - however, the actual scope of “the Grid” is still quite controversial
• Many people consider High Performance Computing (HPC) the main Grid application
  - today’s Grids are mostly Computational Grids or Data Grids with HPC resources as building blocks
  - thus, Grid resource management is closely related to resource management on HPC resources (our starting point)
  - we will return to a broader Grid scope and its implications later


Resource Management on HPC Resources
• HPC resources are usually parallel computers or large-scale clusters
• The local resource management system (RMS) for such resources includes:
  - configuration management
  - monitoring of machine state
  - job management
• There is no standard for this resource management; several different proprietary solutions are in use
• Examples of job management systems: PBS, LSF, NQS, LoadLeveler, Condor


HPC Management Architecture in General
[Diagram: a master server hosts the control service/job master and the resource and job monitoring and management services, which manage the compute resources (processing nodes); each compute node runs a resource/job monitor.]


Computational Job
• A job is a computational task
  - that requires processing capabilities (e.g. 64 nodes)
  - and is subject to constraints (e.g. a specific other job must finish before this job starts)
• The job information is provided by the user:
  - resource requirements: CPU architecture, number of nodes, speed; memory size per CPU; software libraries, licenses; I/O capabilities
  - job description
  - additional constraints and preferences
• The format of the job description is not standardized, but usually very similar across systems


Example: PBS Job Description
• Simple job script: the whole job file is a shell script; the information for the RMS is given in comments

#!/bin/csh
# resource limits: allocate needed nodes
#PBS -l nodes=1
# resource limits: amount of memory and CPU time ([[h:]m:]s)
#PBS -l mem=256mb
#PBS -l cput=2:00
# path/filename for standard output
#PBS -o master:/mypath/myjob.out
./my-task

• the actual job is started in the script


Job Submission
• The user “submits” the job to the RMS, e.g. by issuing “qsub jobscript.pbs”
• The user can control the job:
  - qsub: submit
  - qstat: poll status information
  - qdel: cancel job
• It is the task of the resource management system to start the job on the required resources
• The current system state is taken into account
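On a PBS-style system these commands can also be driven from a script. A minimal, hedged sketch in Python: only the qsub/-l conventions follow the PBS usage shown above; the helper names and defaults are my own.

```python
import subprocess

def build_qsub_command(script, nodes=1, mem=None, cput=None):
    """Assemble a qsub invocation; resource limits go into -l options,
    mirroring the #PBS directives of the example job script."""
    limits = [f"nodes={nodes}"]
    if mem:
        limits.append(f"mem={mem}")
    if cput:
        limits.append(f"cput={cput}")
    return ["qsub", "-l", ",".join(limits), script]

def submit(script, **limits):
    # On a real PBS installation this returns the job id printed by qsub;
    # qstat and qdel could be wrapped the same way.
    result = subprocess.run(build_qsub_command(script, **limits),
                            capture_output=True, text=True, check=True)
    return result.stdout.strip()
```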


PBS Structure
[Diagram: “qsub jobscript” goes to the job submission interface of the management server; the scheduler and the job & resource monitor on the server drive job execution on the processing nodes.]


Execution Alternatives
Time sharing:
• The local scheduler starts multiple processes per physical CPU with the goal of increasing resource utilization (multi-tasking)
• The scheduler may also suspend jobs to keep the system load under control (preemption)
Space sharing:
• The job uses the requested resources exclusively; no other job is allocated to the same set of CPUs
  - the job has to be queued until sufficient resources are free


Job Classifications
• Batch jobs vs. interactive jobs
  - batch jobs are queued until execution
  - interactive jobs need immediate resource allocation
• Parallel vs. sequential jobs
  - a parallel job requires several processing nodes at the same time
• The majority of HPC installations are used to run batch jobs in space-sharing mode:
  - a job is not influenced by other co-allocated jobs
  - the assigned processors, node memory, caches etc. are exclusively available to a single job
  - overhead for context switches is minimized
  - these are important aspects for parallel applications


Preemption
• A job is preempted by interrupting its current execution
  - the job might be put on hold on a CPU set and resumed later; the job stays resident on those nodes (consuming memory)
  - alternatively, a checkpoint is written and the job is migrated to another resource where it is restarted later
• Preemption can be useful to reallocate resources due to new job submissions (e.g. with higher priority) or if a job runs longer than expected


Job Scheduling
• A job is assigned to resources through a scheduling process, which is responsible for:
  - identifying available resources
  - matching job requirements to resources
  - making decisions about job ordering and priorities
• HPC resources are typically subject to high utilization; therefore, resources are not immediately available and jobs are queued for future execution
  - the time until execution is often quite long (many production systems have an average delay until execution of more than an hour)
  - jobs may run for a long time (several hours, days or weeks)


Typical Scheduling Objectives
• Minimizing the average weighted response time: AWRT = Σj wj · (tj − rj) / Σj wj, where
  - rj: submission time of job j
  - tj: completion time of job j
  - wj: weight/priority of job j
• Maximize machine utilization / minimize idle time
  - these objectives conflict
  - the criterion is usually static for an installation and implicitly given by the scheduling algorithm
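The AWRT objective can be computed directly from the r, t and w values named on the slide; a small sketch:

```python
def awrt(jobs):
    """Average weighted response time: sum of w*(t - r) over the sum of w,
    for jobs given as (r, t, w) = (submission, completion, weight)."""
    total_weight = sum(w for _, _, w in jobs)
    return sum(w * (t - r) for r, t, w in jobs) / total_weight

# two jobs: response times 10 and 4, weights 1 and 2
jobs = [(0, 10, 1.0), (5, 9, 2.0)]
# → (1*10 + 2*4) / 3 = 6.0
```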


Job Steps
• A user job enters the local job queue; the scheduler and its strategy decide on the start time and resource allocation of the job
[Diagram: the Grid user submits a job description to the job execution management of the HPC machine; the scheduler maps jobs from the local job queue onto a schedule over time, and the job management of the nodes starts them.]


Scheduling Algorithms: FCFS
• Well known and very simple: First-Come First-Serve
• Jobs are started in order of submission
• Ad-hoc scheduling when resources become free again (no advance scheduling)
• Advantages:
  - simple to implement
  - easy to understand
  - fair to the users (the job queue represents the execution order)
  - does not require a priori knowledge about job lengths
• Problems:
  - performance can degrade severely; the overall utilization of the machine can suffer if highly parallel jobs occur, that is, if a significant share of the nodes is requested for a single job
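A discrete-event sketch of FCFS on a space-shared machine (all jobs are assumed to be queued at time 0 and to fit the machine; this simplification is mine, not part of any RMS):

```python
import heapq

def fcfs_schedule(jobs, total_nodes):
    """Start jobs strictly in submission order on a space-shared machine.
    jobs: list of (nodes, runtime); returns the start time of each job."""
    free = total_nodes
    running = []              # min-heap of (finish_time, nodes)
    now = 0
    starts = []
    for nodes, runtime in jobs:
        while free < nodes:   # wait for enough completions to fit the job
            finish, released = heapq.heappop(running)
            now, free = finish, free + released
        starts.append(now)
        free -= nodes
        heapq.heappush(running, (now + runtime, nodes))
    return starts

# On 4 nodes, the 1-node job starts only at t=8 although 2 nodes were idle
# from t=0: exactly the utilization problem described above.
# fcfs_schedule([(2, 5), (4, 3), (1, 2)], 4) → [0, 5, 8]
```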


FCFS Schedule
[Diagram: the scheduler takes jobs 1, 2, 3, 4, … from the job queue and places them in submission order on the time/processing-node schedule of the compute resource.]


Scheduling Algorithms: Backfilling
• Improvement over FCFS
• A job can be started before an earlier submitted job if it does not delay the first job in the queue
  - it may still cause delays for other jobs further down the queue
  - some fairness is still maintained
• Advantage: utilization is improved
• Information about the job execution length is needed
  - sometimes difficult to provide
  - user estimates are not necessarily accurate
• Jobs are usually terminated after exceeding their allocated execution time; otherwise users might deliberately underestimate the job length to get an earlier start time
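A sketch of the EASY variant of this idea. It is deliberately simplified: a later job is backfilled only if it finishes before the reserved start of the queue head, whereas full EASY backfilling also admits jobs that run entirely on nodes outside the head's reservation. All jobs are assumed queued at t=0 and to fit the machine.

```python
import heapq

def easy_backfill(queue, total_nodes):
    """queue: list of (nodes, runtime) in submission order.
    Returns start times in submission order."""
    free, now = total_nodes, 0
    running = []                       # min-heap of (finish_time, nodes)
    starts = {}
    pending = list(range(len(queue)))
    while pending:
        # start the head of the queue as soon as it fits
        while pending and queue[pending[0]][0] <= free:
            i = pending.pop(0)
            n, rt = queue[i]
            starts[i], free = now, free - n
            heapq.heappush(running, (now + rt, n))
        if not pending:
            break
        # shadow time: earliest moment enough nodes are free for the head
        head_nodes = queue[pending[0]][0]
        f, shadow = free, now
        for finish, n in sorted(running):
            f, shadow = f + n, finish
            if f >= head_nodes:
                break
        # backfill later jobs that fit now and finish before the shadow time
        for i in pending[1:]:
            n, rt = queue[i]
            if n <= free and now + rt <= shadow:
                pending.remove(i)
                starts[i], free = now, free - n
                heapq.heappush(running, (now + rt, n))
        # advance to the next job completion
        finish, n = heapq.heappop(running)
        now, free = finish, free + n
    return [starts[i] for i in range(len(queue))]

# On 4 nodes, the 1-node job now backfills to t=0 without delaying
# the 4-node job: easy_backfill([(2, 5), (4, 3), (1, 2)], 4) → [0, 5, 0]
```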


Backfill Scheduling
• Job 3 is started before job 2, as it does not delay it
[Diagram: job 3 from the queue is placed into a gap in the schedule ahead of job 2.]


Backfill Scheduling
• However, if a job finishes earlier than expected, backfilling causes delays that otherwise would not occur
  - hence the need for accurate job length information (difficult to obtain)
[Diagram: a running job finishes earlier than estimated; the backfilled job now blocks job 2 from an earlier start.]


Job Execution Manager
• After the scheduling process, the RMS is responsible for the job execution; it
  - sets up the execution environment for the job,
  - starts the job,
  - monitors the job state,
  - cleans up after execution (copying output files etc.), and
  - notifies the user (e.g. by sending email)


Scheduling Options
• Parallel job scheduling algorithms are well studied; performance is usually acceptable
• Real installations often have additional requirements rather than a need for more complex theoretical algorithms:
  - prioritization of jobs, users, or groups while maintaining fairness
  - partitioning of machines, e.g. interactive and development partitions vs. production batch partitions
  - combination of different queue characteristics
• For instance, the Maui scheduler is often deployed, as it is quite flexible in terms of prioritization, backfilling, fairness etc.


University Dortmund
Transition to Grid Resource Management and Scheduling
Current state of the art


Transition to the Grid
More resource types come into play:
• Resources are any kind of entity, service or capability to perform a specific task
  - processing nodes, memory, storage, networks, experimental devices, instruments
  - data, software, licenses
  - people
• The task/job/activity can also have a broader meaning
  - a job may involve different resources and consist of several activities in a workflow with corresponding dependencies
• The resources are distributed and may belong to different administrative domains
• HPC is still the key application for Grids. Consequently, the main resources in a Grid are the previously considered HPC machines with their local RMS


Implications for Grid Resource Management
• Several security-related issues have to be considered: authentication, authorization, accounting
  - who has access to a certain resource?
  - what information can be exposed to whom?
• There is a lack of global information:
  - which resources are available for an activity, and when?
• The resources are quite heterogeneous:
  - different RMSs in use
  - individual access and usage paradigms
  - administrative policies have to be considered


Scope of Grids
[Diagram: Cluster Grid, Enterprise Grid, Global Grid. Source: Ian Foster]


Resource Management Layer
A Grid resource management system consists of:
• Local resource management system (resource layer)
  - basic resource management unit
  - provides a standard interface for using remote resources
  - e.g. GRAM, etc.
• Global resource management system (collective layer)
  - coordinates all local resource management systems within multiple or distributed Virtual Organizations (VOs)
  - provides high-level functionality to use all resources efficiently: job submission, resource discovery and selection, scheduling, co-allocation, job monitoring, etc.
  - e.g. meta-scheduler, resource broker, etc.


Grid Middleware
[Diagram (source: Ian Foster) comparing the Grid protocol architecture to the Internet protocol architecture: Application; Collective (“coordination of several resources”: infrastructure services, application services); Resource (“add resource”: negotiate access, control access and utilization); Connectivity (“communication with internal resource functions and services”); Fabric (“control local execution”). These map onto the Internet stack’s Application, Transport, Internet, and Link layers.]


Grid Middleware (2)
[Diagram: the user/application talks to higher-level services (e.g. a resource broker), which build on core Grid infrastructure services (information services, monitoring services, security services) in the Grid middleware; the local resource management layer (a Grid resource manager on top of PBS, LSF, …) controls the resource.]


Globus Grid Middleware
• Globus Toolkit
  - common source for Grid middleware
  - GT 2; GT 3 (Web/Grid-Service-based); GT 4 (WSRF-based)
• GRAM is responsible for providing a service for a given job specification that can:
  - create an environment for a job
  - stage files to/from the environment
  - submit a job to a local scheduler
  - monitor a job
  - send job state change notifications
  - stream a job’s stdout/stderr during execution


Globus Job Execution
• The job is described in the resource specification language (RSL)
• Discover a job service for execution:
  - Job Manager in Globus 2.x (GT 2)
  - Master Managed Job Factory Service (MMJFS) in Globus 3.x (GT 3)
• Alternatively, choose a Grid scheduler for job distribution:
  - the Grid scheduler selects a job service and forwards the job to it
  - a Grid scheduler is not part of Globus
• The job service prepares the job for submission to the local scheduling system
• If necessary, file stage-in is performed, e.g. using the GASS service
• The job is submitted to the local scheduling system
• If necessary, file stage-out is performed after the job finishes


Globus GT 2 Execution
[Diagram: the user/application sends RSL to a resource broker, which queries MDS and passes specialized RSL on for resource allocation; GRAM submits to PBS, LSF, … on the resource.]


RSL
• Grid jobs are described in the resource specification language (RSL)
• RSL version 1 is used in GT 2
• It has an LDAP-filter-like syntax that supports boolean expressions. Example:

& (executable = a.out)
  (directory = /home/nobody)
  (arguments = arg1 "arg2")
  (count = 1)
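For illustration, the conjunction syntax above is simple enough to generate programmatically. make_rsl is a hypothetical helper, not part of the Globus toolkit; it only mimics the `&(attribute = value)…` shape shown in the example.

```python
def make_rsl(**attrs):
    """Assemble a GT2-style RSL conjunction from attribute/value pairs,
    quoting values that contain spaces (a sketch of the syntax only)."""
    def fmt(value):
        parts = value if isinstance(value, (list, tuple)) else [value]
        return " ".join(f'"{p}"' if " " in str(p) else str(p) for p in parts)
    return "&" + "".join(f"({k} = {fmt(v)})" for k, v in attrs.items())

rsl = make_rsl(executable="a.out", directory="/home/nobody",
               arguments=["arg1", "arg 2"], count=1)
```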


Globus Job States
[State diagram with the states: pending, stage-in, active, suspended, stage-out, done, failed.]


Globus GT 3
• With the transition to Web/Grid services, the job management becomes:
  - the Master Managed Job Factory Service (MMJFS)
  - the Managed Job Factory Service (MJFS)
  - the Managed Job Service (MJS)
• The client contacts the MMJFS, which informs the MJFS to create an MJS for the job
• The MJS takes care of managing the job actions:
  - interacting with the local scheduler
  - file staging
  - storing job status


Globus GT 3 Job Execution
[Diagram: the user/application contacts the Master Managed Job Factory Service, which uses the Managed Job Factory Service to create a Managed Job Service; a File Streaming Factory Service, File Streaming Service and a Resource Information Provider Service support it; the MJS talks to the local scheduler.]
• Globus as a toolkit does not perform scheduling or automatic resource selection


Example: Extending the Globus Architecture at KAIST
[Diagram (source: Jin-Soo Kim): a client uses a job submission service and a job monitoring service; a resource selection service draws on a resource information service (fed by resource preference and information providers), a resource reservation service and a scheduling service; the job manager service (MJS) submits to the local resource manager (PBS), observed by the local resource monitoring service (RIPS).]


Job Description with RSL 2
• Version 2 of RSL is XML-based
• Two namespaces are used:
  - rsl: for basic types such as int, string, path, url
  - gram: for the elements of a job

(with GNS = “http://www.globus.org/namespaces”)

<?xml version="1.0" encoding="UTF-8"?>
<rsl:rsl
    xmlns:rsl="GNS/2003/04/rsl"
    xmlns:gram="GNS/2003/04/rsl/gram"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="GNS/2003/04/rsl ./schema/base/gram/rsl.xsd
                        GNS/2003/04/rsl/gram ./schema/base/gram_rsl.xsd">
  <gram:job>
    <gram:executable><rsl:path>
      <rsl:stringElement value="/bin/a.out"/>
    </rsl:path></gram:executable>
  </gram:job>
</rsl:rsl>


RSL 2 Attributes
• <count> (type rsl:integerType): number of processes to run (default is 1)
• <hostCount> (type rsl:integerType): on SMP multi-computers, the number of nodes to distribute the “count” processes across; count/hostCount = number of processes per host
• <queue> (type rsl:stringType): queue into which to submit the job
• <maxWallTime> (type rsl:longType): maximum wall clock runtime in minutes
• <maxCpuTime> (type rsl:longType): maximum CPU runtime in minutes
• <maxTime> (type rsl:longType): only applies if the above are not used; maximum wall clock or CPU runtime (scheduler’s choice) in minutes


Job Submission Tools
• GT 3 provides the Java class GramClient
• GT 2.x: command line programs for job submission
  - globus-job-run: interactive jobs
  - globus-job-submit: batch jobs
  - globusrun: takes RSL as input


Globus 2 Job Client Interface
A simple job submission requiring 2 nodes:

globus-job-run -np 2 -s myprog arg1 arg2

A multirequest specifies multiple resources for a job:

globus-job-run -dumprsl -: host1 /bin/uname -a -: host2 /bin/uname -a

+ ( &(resourceManagerContact="host1")
    (subjobStartType=strict-barrier)
    (label="subjob 0")
    (executable="/bin/uname")
    (arguments="-a") )
  ( &(resourceManagerContact="host2")
    (subjobStartType=strict-barrier)
    (label="subjob 1")
    (executable="/bin/uname")
    (arguments="-a") )


Globus 2 Job Client Interface
• The full flexibility of RSL is available through the command line tool globusrun
• Support for file staging: executable and stdin/stdout
• Example:

globusrun -o -r hpc1.acme.com/jobmanager-pbs
  '&(executable=$(HOME)/a.out) (jobtype=single) (queue=time-shared)'


Problem: Job Submission Descriptions Differ
The deliverables of the GGF working group JSDL:
• A specification for an abstract standard Job Submission Description Language (JSDL) that is independent of language bindings, including:
  - the JSDL feature set and attribute semantics,
  - the definition of the relationships between attributes, and
  - the range of attribute values.
• A normative XML Schema corresponding to the JSDL specification.
• A document of translation tables to and from the scheduling languages of a set of popular batch systems, for both the job requirement and resource description attributes of those languages that are relevant to JSDL.


JSDL Attribute Categories
The job attribute categories include:
• Job identity attributes: ID, owner, group, project, type, etc.
• Job resource attributes: hardware, software, including applications, Web and Grid services, etc.
• Job environment attributes: environment variables, argument lists, etc.
• Job data attributes: databases, files, data formats; staging, replication, caching, and disk requirements, etc.
• Job scheduling attributes: start and end times, duration, immediate dependencies, etc.
• Job security attributes: authentication, authorisation, data encryption, etc.


Problem: Resource Management Systems Differ Across Each Component
An application’s path runs through task definition, submit, staging, the scheduler, and task analysis. Comparison of three systems (source: Hrabri Rajic):

System | Interface | Execution Environment | Platform Mix
LSF | API plus batch utilities via “LSF scripts” | user: local disk exported; system: remote initialized (option) | Unix, Windows
Grid Engine | GDI API interface plus command line interface | system: remote initialized, with SGE local variables exported | Unix only
PBS | API (script option), batch utilities via “PBS scripts” | system: remote initialized, with PBS local variables exported | Unix only


GGF-WG DRMAA
GGF working group “Distributed Resource Management Application API”. From the charter:
• Develop an API specification for the submission and control of jobs to one or more Distributed Resource Management (DRM) systems.
• The scope of this specification is all the high-level functionality necessary for an application to consign a job to a DRM system, including common operations on jobs like termination or suspension.
• The objective is to facilitate the direct interfacing of applications to today's DRM systems by application builders, portal builders, and Independent Software Vendors (ISVs).


DRMAA State Diagram
The remote job can be in the following states (source: Hrabri Rajic):
• system hold
• user hold
• system and user hold simultaneously
• queued active
• system suspended
• user suspended
• system and user suspended simultaneously
• running
• finished (un)successfully
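The state list can be modelled as a simple enumeration; this sketch is illustrative only, since real DRMAA language bindings define their own constants and transition rules.

```python
from enum import Enum, auto

class DrmaaJobState(Enum):
    """The nine remote-job states listed on the slide."""
    SYSTEM_HOLD = auto()
    USER_HOLD = auto()
    SYSTEM_AND_USER_HOLD = auto()
    QUEUED_ACTIVE = auto()
    SYSTEM_SUSPENDED = auto()
    USER_SUSPENDED = auto()
    SYSTEM_AND_USER_SUSPENDED = auto()
    RUNNING = auto()
    FINISHED = auto()

def is_held(state):
    """True for the three hold variants, which keep a queued job from running."""
    return state in {DrmaaJobState.SYSTEM_HOLD, DrmaaJobState.USER_HOLD,
                     DrmaaJobState.SYSTEM_AND_USER_HOLD}
```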


Example: Condor-G
• Condor-G is a Condor system enhanced to manage Globus jobs. It provides two main features:
  - Globus universe: an interface for submitting, queuing and monitoring jobs that use Globus resources
  - GlideIn: a system for the efficient execution of jobs on remote Globus resources
• Condor-G runs as a “personal Condor” system:
  - daemons run as non-privileged user processes
  - each user runs her/his own Condor-G


Condor-G: GlideIn
• Globus is used to run the Condor daemons on Grid resources
  - the Condor daemons run as a Globus-managed job; the GRAM service starts the daemons rather than the Condor jobs
• When the resources run these GlideIn jobs, they join the personal Condor pool
• These daemons can be used to launch a job from Condor-G on a Globus resource
• Jobs are submitted as Condor jobs, and they are matched and run on the Grid resources
  - the daemons receive jobs from the user’s Condor queue
  - this combines the benefits of Globus and Condor


Using GlideIn
[Diagram (source: ANL/USC ISI): 600 Condor jobs sit in the Condor-G schedd; the GridManager talks to the JobManager on the Grid resource, where LSF runs glide-ins (startd) that report back to the collector.]


Example: DAGMan
Directed Acyclic Graph Manager (source: Miron Livny)
• DAGMan allows you to specify the dependencies between your Condor-G jobs, so it can manage them automatically for you (e.g., “don’t run job B until job A has completed successfully”)
• A DAG is defined by a .dag file, listing each of its nodes and their dependencies. For the diamond DAG A → {B, C} → D:

# diamond.dag
Job A a.sub
Job B b.sub
Job C c.sub
Job D d.sub
Parent A Child B C
Parent B C Child D

• each node runs the Condor-G job specified by its accompanying Condor submit file
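The dependency handling DAGMan performs for such a DAG boils down to releasing a job once all of its parents have completed, i.e. a topological ordering. A sketch using Kahn's algorithm (the dict-of-parents input format is my own, not DAGMan's):

```python
from collections import deque

def run_order(parents):
    """Return one valid execution order for a DAG given as
    {node: [parent, ...]}; a node becomes ready when all parents are done."""
    children = {n: [] for n in parents}
    indegree = {n: len(ps) for n, ps in parents.items()}
    for node, ps in parents.items():
        for p in ps:
            children[p].append(node)
    ready = deque(sorted(n for n, d in indegree.items() if d == 0))
    order = []
    while ready:
        n = ready.popleft()
        order.append(n)
        for c in children[n]:
            indegree[c] -= 1
            if indegree[c] == 0:
                ready.append(c)
    return order

diamond = {"A": [], "B": ["A"], "C": ["A"], "D": ["B", "C"]}
# → ['A', 'B', 'C', 'D']: D runs only after both B and C have finished
```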


University Dortmund
Grid Scheduling
How to select resources in the Grid?


Different Levels of Scheduling
• Resource-level scheduler
  - low-level scheduler, local resource manager
  - a scheduler close to the resource, controlling a supercomputer, cluster, or network of workstations on the same local area network
  - examples: OpenPBS, PBS Pro, LSF, SGE
• Enterprise-level scheduler
  - scheduling across multiple local schedulers belonging to the same organization
  - examples: PBS Pro peer scheduling, LSF MultiCluster
• Grid-level scheduler
  - also known as super-scheduler, broker, community scheduler
  - discovers resources that can meet a job’s requirements
  - schedules across lower-level schedulers


Grid-Level Scheduler
• Discovers and selects the appropriate resource(s) for a job
• If the selected resources are under the control of several local schedulers, a meta-scheduling action is performed
• Architecture:
  - centralized: all lower-level schedulers are under the control of a single Grid scheduler; not realistic in global Grids
  - distributed: lower-level schedulers are under the control of several Grid scheduler components; a local scheduler may receive jobs from several components of the Grid scheduler


Grid Scheduling
[Diagram: the Grid user submits to the Grid scheduler, which forwards jobs to the local scheduler, job queue and schedule of machine 1, 2 or 3.]


Activities of a Grid Scheduler
• GGF document: “10 Actions of Super Scheduling” (GFD-I.4)
[Diagram; source: Jennifer Schopf]


Grid Scheduling
• A Grid scheduler allows the user to specify the required resources and environment of the job without having to indicate the exact location of the resources
• A Grid scheduler answers the question: to which local resource manager(s) should this job be submitted?
• Answering this question is hard:
  - resources may dynamically join and leave a computational Grid
  - not all currently unused resources are available to Grid jobs: resource owner policies such as “maximum number of Grid jobs allowed”
  - it is hard to predict how long jobs will wait in a queue


Select a Resource for Execution
• Most systems do not provide advance information about future job execution
  - user information is not accurate, as mentioned before
  - new jobs arrive that may surpass current queue entries due to higher priority
• A Grid scheduler might consider the current queue situation, but this does not give reliable information about future executions: a job may wait for a long time in a short queue while it would have been executed earlier on another system
• Available information:
  - the Grid information service gives the state of the resources and possibly authorization information
  - prediction heuristics: estimate a job’s wait time for a given resource, based on the current state and the job’s requirements


Selection Criteria
• Distribute jobs in order to balance load across resources
  - not suitable for large-scale Grids with different providers
• Data affinity: run the job on the resource where the data is located
• Use heuristics to estimate job execution time
• Best-fit: select the set of resources with the smallest capabilities and capacities that can still meet the job’s requirements
• Quality of service of a resource or its local resource management system
  - what features does the local RMS have?
  - can they be controlled from the Grid scheduler?
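The best-fit criterion can be sketched as follows; the resource fields (nodes, mem_gb) and the pool contents are illustrative assumptions, not taken from any real broker.

```python
def best_fit(resources, need_nodes, need_mem_gb):
    """Among resources that satisfy the job's requirements, pick the one
    with the least surplus capacity; None if nothing qualifies."""
    candidates = [r for r in resources
                  if r["nodes"] >= need_nodes and r["mem_gb"] >= need_mem_gb]
    if not candidates:
        return None
    # smallest leftover capacity wins (nodes first, then memory)
    return min(candidates,
               key=lambda r: (r["nodes"] - need_nodes,
                              r["mem_gb"] - need_mem_gb))

pool = [
    {"name": "big",   "nodes": 128, "mem_gb": 512},
    {"name": "mid",   "nodes": 32,  "mem_gb": 64},
    {"name": "small", "nodes": 8,   "mem_gb": 16},
]
# "mid" is chosen: the smallest resource that still meets the requirements
choice = best_fit(pool, need_nodes=16, need_mem_gb=32)
```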


Scheduling Attributes
A working group in the Global Grid Forum to “define the attributes of a lower-level scheduling instance that can be exploited by a higher-level scheduling instance”. The following attributes have been defined:
• Attributes of allocation properties
  - Revocation of an allocation: the local scheduler reserves the right to withdraw a given allocation.
  - Guaranteed completion time of allocation: a deadline for job completion is provided by the local scheduler.
  - Guaranteed number of attempts to complete a job: the local scheduler will retry a given job task, e.g. useful for data transfer actions.
  - Allocations run-to-completion: a job is not preempted once it has been started.
  - Exclusive allocations: a job has exclusive access to the given resources, e.g. no time-sharing is performed.
  - Malleable allocations: the given resource set may change during runtime, e.g. a computational job will gain or lose processors (moldable job).


Scheduling Attributes (2)
• Attributes of available information
  - Access to tentative schedule: the local scheduler exposes its schedule of future allocations. Option: only the projected start time of a specified allocation is available. Option: only partial information on the current schedule is available.
  - Exclusive control: the local scheduler is exclusively in charge of the resources; no other jobs can appear on the resources.
  - Event notification: the local scheduler provides an event subscription service.
• Attributes for manipulating allocation execution
  - Preemption
  - Checkpointing
  - Migration
  - Restart


Scheduling Attributes (3)
• Attributes for requesting resources
  - Allocation offers: the local system can provide an interface to request offers for an allocation.
  - Allocation cost or objective information: the local scheduler can provide cost or objective information.
  - Advance reservation: allocations can be reserved in advance.
  - Requirement for providing maximum allocation length in advance: the higher-level scheduler must provide a maximum job execution length.
  - Deallocation policy: a policy applies to the allocation that must be met for it to stay valid.
  - Remote co-scheduling: a schedule can be generated by a higher-level instance and imposed on the local scheduler.
  - Consideration of job dependencies: the local scheduler can deal with dependency information of jobs, e.g. for workflows.


CSF – Community Scheduler Framework
• An open-source implementation of an OGSA-based meta-scheduler for VOs
  - supports the emerging WS-Agreement spec
  - supports GT GRAM
• Fills in gaps in the existing resource management picture
• Contributed by Platform to the Globus Toolkit
• Extensible, open-source framework for implementing meta-schedulers
• Provides basic protocols and interfaces to help resources work together in heterogeneous environments

CSF Architecture Metascheduler Plugin Platform LSF User Globus Toolkit User LSF Grid Service Hosting Environment Meta-Scheduler Global Information Service RIPS = Resource Information Provider Service Source: Chris Smith Job Service GRAM SGE RIPS GRAM PBS Reservation Service RIPS Queuing Service RM Adapter Platform LSF 64

Global Information Service Rsrc Info Req Cluster Info Req Data Store Req Data Load Req Global Information Service Index Service Cluster Registry SD SDProvider Aggregator Manager Rsrc, job rsv info RIPS Source: Chris Smith RIPS Rsv Info Job Info Data Storage Data Store Load DB 65

Support for Virtual Organizations User of VO “A” Virtual Organization “A” Organization 1 Source: Chris Smith User of VO “B” Community Scheduler Virtual Organization “B” Organization 3 Organization 2 66

CSF Grid Services n Job Service: è creates, monitors and controls compute jobs n Reservation Service: è guarantees resources are available for running a job n Queuing Service: è provides a service where administrators can customize and define scheduling policies at the VO level and/or at the different resource manager levels è defines an API for plug-in schedulers n RM Adapter Service: è provides a Grid service interface that bridges the Grid service protocol and resource managers (LSF, PBS, SGE, Condor and other RMs) 67

GT 3 Job Submission / Architecture MMJFS = Master Managed Job Factory Service MJS = Managed Job Service Blue indicates a Grid Service hosted in a GT 3 container MJS for LSF MMJFS RIPS Site A – MMJFS on node 1 managed-job-globusrun MJS for PBS MMJFS RIPS Site B – MMJFS on node 2 MJS for SGE Index Service SGE MMJFS RIPS Site C – MMJFS on node 3 Source: Chris Smith 68

GT 3 + CSF Architecture RM Adapter for LSF Queuing Service LSF RIPS Job Service Reservation Service Site A RM Adapter for PBS Index Service RIPS Site B Virtual Organization MMJFS/MJS SGE RIPS Source: Chris Smith Site C 69

Queue Service n In CSF, Job Service instances are “submitted” to the Queue Service for dispatch to a resource manager. n The Queue Service provides a plug-in API for extending the scheduling algorithms provided by default with CSF. n The Queue Service is responsible for: è loading and validating configuration information è loading all configured scheduler plugins è calling the plugin API functions l schedInit() after loading the plugin successfully l schedOrder() when a new job is submitted l schedMatch() during the scheduling cycle l schedPost() before the scheduling cycle ends, and after scheduling decisions are sent to the job service instances 70
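The four hooks can be sketched as a minimal plug-in interface. This is an illustrative Python sketch, not the real CSF (Java-based) API; the job and CPU-slot model is invented for the example.

```python
# Illustrative sketch (not the actual CSF API): a queue service driving a
# scheduler plugin through the four hooks named on the slide.

class FCFSPlugin:
    """A minimal plugin: first-come-first-served ordering, first-fit matching."""

    def sched_init(self, config):
        self.config = config                   # loaded configuration info

    def sched_order(self, queue, new_job):
        queue.append(new_job)                  # FCFS: keep arrival order
        return queue

    def sched_match(self, queue, free_cpus):
        decisions = []
        for job in list(queue):
            if job["cpus"] <= free_cpus:       # first fit on the free CPUs
                free_cpus -= job["cpus"]
                queue.remove(job)
                decisions.append(job["id"])
        return decisions

    def sched_post(self, decisions):
        return {"dispatched": decisions}       # handed to job service instances


class QueueService:
    def __init__(self, plugin, config=None):
        self.plugin = plugin
        self.queue = []
        plugin.sched_init(config or {})

    def submit(self, job):
        self.plugin.sched_order(self.queue, job)

    def scheduling_cycle(self, free_cpus):
        decisions = self.plugin.sched_match(self.queue, free_cpus)
        return self.plugin.sched_post(decisions)
```

A different plugin class could implement, say, priority ordering, without any change to the queue service itself, which is the point of the plug-in API.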

Example: Project GridLab - GRMS Authorization System Information Services Data Management Resource Discovery File Transfer Unit BROKER Execution Unit Adaptive Jobs Queue Job Receiver Monitoring SLA Negotiation GRMS Scheduler Resource Reservation Workflow Manager Prediction Unit Application Manager GLOBUS, other Local Resources (Managers) Source: Jarek Nabrzyski 71

Anticipated Features n Reliable and predictable delivery of a service n Quality of service for a job service: è Reliable job submission: two-phase commit è Predictable start and end time of the job è Advance reservation assures start time and throughput n Fault tolerance/recovery: è Migrate the job to another resource before the fault occurs: the job continues è After the fault: the job is restarted è Rerun the job on the same resource after repair n Allocate multiple resources for a job 72

Co-allocation n It is often required that several resources are used for a single job. è that is, a scheduler has to assure that all resources are available when needed l in parallel (e.g. visualization and processing) l with time dependencies (e.g. a workflow) n The task is especially difficult if the resources belong to different administrative domains. è The actual allocation time must be known for co-allocation, or è the different local resource management systems must synchronize with each other (wait for availability of all resources) 73
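The core difficulty can be illustrated with a toy availability check: a co-allocation is only feasible if one common time window is free on all required resources. Below is a greedy Python sketch with made-up availability windows; it scans the windows of the first resource and intersects greedily, so it is a sketch of the idea rather than an exhaustive search.

```python
# Sketch: find a common free window across all resources for co-allocation.
# Availability data is hypothetical example input, not from any real system.

def intersect(w1, w2):
    """Overlap of two (start, end) windows, or None if they do not overlap."""
    start, end = max(w1[0], w2[0]), min(w1[1], w2[1])
    return (start, end) if start < end else None

def first_overlap(candidate, windows):
    """First window in `windows` that overlaps `candidate`, shrunk to the overlap."""
    for w in windows:
        iv = intersect(candidate, w)
        if iv:
            return iv
    return None

def common_window(availabilities, duration):
    """availabilities: one list of free (start, end) windows per resource.
    Returns a window of at least `duration` that is free on ALL resources."""
    for w in availabilities[0]:
        candidate = w
        for windows in availabilities[1:]:
            candidate = first_overlap(candidate, windows)
            if candidate is None:
                break
        if candidate and candidate[1] - candidate[0] >= duration:
            return candidate
    return None
```

For example, with `[[(8, 12), (14, 18)], [(9, 10), (15, 20)], [(9, 17)]]` and a required duration of 2, the only workable slot is `(15, 17)`; without advance reservations, no such cross-site guarantee can be computed at all.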

Example Multi-Site Job Execution Grid-Scheduler n è time Multi-Site Job Schedule Job-Queue Machine 1 Machine 2 Machine 3 A job uses several resources at different sites in parallel. Network communication is an issue. 74

Advance Reservation n Co-allocation and other applications require a priori information about the precise resource availability n With the concept of advance reservation, the resource provider guarantees a specified resource allocation è includes a two- or three-phase commit for agreeing on the reservation n Implementations: è GARA/DUROC/SNAP provide interfaces for Globus to create advance reservations è implementations for network QoS are available l setup of a dedicated bandwidth between endpoints 75
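A minimal sketch of the two-phase commit idea, with invented class and method names (no relation to the actual GARA/SNAP interfaces): phase one tentatively holds the slot, phase two makes it binding.

```python
# Sketch of two-phase commit for advance reservations (names are invented).

class ReservationProvider:
    def __init__(self):
        self.held = {}        # id -> (start, end), tentative holds
        self.committed = {}   # id -> (start, end), binding reservations

    def _conflicts(self, start, end):
        booked = list(self.held.values()) + list(self.committed.values())
        return any(s < end and start < e for s, e in booked)

    def prepare(self, rid, start, end):
        """Phase 1: tentatively hold the slot; it may still be cancelled."""
        if self._conflicts(start, end):
            return False
        self.held[rid] = (start, end)
        return True

    def commit(self, rid):
        """Phase 2: the hold becomes a guaranteed allocation."""
        self.committed[rid] = self.held.pop(rid)

    def cancel(self, rid):
        """Abort: release the tentative hold."""
        self.held.pop(rid, None)
```

The hold in phase one is what lets a Grid scheduler collect tentative reservations from several providers and only commit the combination once all of them succeeded (extended to three phases when an extra confirmation round is needed).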

Limitations of current Grid RMS n The interaction between local scheduling and higher-level Grid scheduling is currently a one-way communication è current local schedulers are not optimized for Grid use è limited information is available about future job execution è a site is usually selected by a Grid scheduler and the job enters the remote queue n The decision about job placement is inefficient è the actual job execution time is usually not known è co-allocation is a problem as many systems do not provide advance reservation 76

Example of Grid Scheduling Decision Making Where to put the Grid job? Grid User Grid-Scheduler time 40 jobs running 80 jobs queued 5 jobs running 2 jobs queued 15 jobs running 20 jobs queued Schedule Job-Queue Machine 1 Machine 2 Machine 3 77

Available Information from the Local Schedulers n Decision making is difficult for the Grid scheduler è limited information about local schedulers is available è information may not be reliable n Possible information: è queue length, running jobs è detailed information about the queued jobs l execution length, process requirements, … è tentative schedule of future job executions n This information is often not technically provided by the local scheduler n In addition, this information may be subject to privacy concerns! 78

Consequence n n n Consider a workflow with 3 short steps (e. g. 1 minute each) that depend on each other Assume available machines with an average queue length of 1 hour. The Grid scheduler can only submit the subsequent step if the previous job step is finished. Result: è The completion time of the workflow may be larger than 3 hours (compared to 3 minutes of execution time) è Current Grids are suitable for simple jobs, but still quite inefficient in handling more complex applications Need for better coordination of higher- and lower-level scheduling! 79
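The slide's numbers can be checked with trivial arithmetic: with three dependent steps, each waiting behind an average one-hour queue, the workflow takes about three hours for three minutes of actual computation.

```python
# Back-of-the-envelope check of the slide's numbers (all values from the text):
# three dependent 1-minute steps, each waiting behind a ~60-minute queue.

def workflow_completion(steps, run_min, avg_wait_min):
    # each step can only be submitted once the previous step has finished,
    # so every step pays the full average queue wait again
    return steps * (avg_wait_min + run_min)

total = workflow_completion(3, 1, 60)   # 183 minutes, i.e. over 3 hours
pure_runtime = 3 * 1                    # versus 3 minutes of execution time
```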

University Dortmund GRMS in Next Generation Grids Outlook on future Grid Resource Management and Scheduling

Example Grid Scenario Remote Center Compute Resources WAN Transfer Reads and Generates TB of Data LAN/WAN Transfer Assume a data-intensive simulation that should be visualized and steered during runtime! Visualization 81

Resource Request of a Simple Grid Job n A specified architecture with l 48 processing nodes, l 1 GB of available memory, and l a specified licensed software package l for 1 hour between 8 am and 6 pm of the following day l Time must be known in advance. n A specific visualization device during program execution n Minimum bandwidth between the VR device and the main computer during program execution n Input: a specified data set from a data repository n A cost limit of at most 4 € l preference of cheaper job execution over an earlier execution è actually a pretty “simple” example (no complex workflows) 82
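For illustration, the request above could be written down as a structured description. The field names below are invented for the example and do not correspond to any particular Grid job description language.

```python
# Hypothetical structured form of the slide's resource request.
# All field names are illustrative, not a real job description schema.

request = {
    "compute": {
        "architecture": "specified-arch",
        "nodes": 48,
        "memory_gb": 1,
        "software_license": "specified-package",
        "duration_h": 1,
        "window": {"earliest": "next-day 08:00", "latest": "next-day 18:00"},
    },
    "co_allocations": [
        {"type": "visualization-device", "during": "execution"},
        {"type": "network", "min_bandwidth": True,
         "between": ["visualization-device", "compute"], "during": "execution"},
    ],
    "input_data": {"dataset": "specified-data-set", "source": "data-repository"},
    "budget": {"max_cost_eur": 4, "objective": "cheaper-over-earlier"},
}
```

Even this "simple" job already mixes compute, network, data and cost constraints, which is exactly why a single queue-based description is not enough.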

Use-Case Example: Coordinated Simulation and Visualization Expected output of a Grid scheduler: resources time Data Access Storing Data Network 1 Data Transfer Computer 1 Network 2 Computer 2 Software License Storage Network 3 VR-Cave Loading Data Parallel Computation Providing Data Communication for Computation Parallel Computation Reservations are necessary! Software Usage Data Storage Communication for Visualization 83

Need for a Grid Scheduling Architecture Grid. User/Application Information Services Grid. Scheduler Monitoring Services Security Services Grid Middleware Schedule Compute Resources Grid Middleware Local RMS time Local RMS Grid Middleware Local RMS Accounting/Billing Other Grid Services time Grid Middleware Schedule Data Resources Network Resources Other Resources/ Services 84

Required Services/Components n Relevant to Grid scheduling è è è Information Service Job/Workflow Description Requirement Description Resource Discovery Reservation Monitoring/Notification Job Execution Security Accounting/Billing Data Management Local RMS 85

Service Oriented Architectures n Services are used to abstract all resources and functionalities. n Concept of OGSI and WSRF è using Web Services, SOAP, XML to implement the services è the OGSI idea of Grid Services is implemented in GT 3 è transition to WSRF with GT 4 n Core services for building a Grid are discussed in the Open Grid Services Architecture (OGSA) 86

Open Grid Services Architecture Users in Problem Domain X Application & Integration Technology for Problem Domain X Generic Virtual Service Access and Integration Layer Job Submission Brokering Registry Banking Workflow Authorisation OGSA Structured Data Integration Data Transport Resource Usage Transformation Structured Data Access OGSI: Interface to Grid Infrastructure Web Services: Basic Functionality Compute, Data & Storage Resources Distributed Structured Data Relational XML Virtual Integration Architecture Semi-structured - 87

OGSA Outlook Data Catalog Data Provision Data Integration Virtual Organization Policy + Agreement Data Access Context Services Data Services Application Content Manager Workflow Manager Workload Manager Broker Status + Problem Information Logging Monitoring Determination. Event Services Execution Job & Job Planning Workflow Manager Service Management Reservation Service Container Resource Management Deploy + Services Configuration Provisioning Service WS Infrastructure WS-RF Notification. Distributed (OGSI) Services Management Security Services Authentication Authorization Delegation Firewall Transition 88

OGSA Execution Planning “Demand” “Supply” Workload Mgmt. Resource Mgmt. Framework Environment Framework Mgmt. User/Job Proxies Policies Primary Interaction Job Resource Reservation Factory Information Provider Factory Dependency management CMM Meta - Interaction Resource Allocation Resource Provisioning (or Binding) Optimizing Framework Workload Optimizing Framework Scheduling Queuing Services Resource Optimizing Framework Capacity Management Workload Optimization Workload Post Balancing Resource Placement Resource – Workload Admission Control (Resources) Optimal Mapping Quality of Service (Resources) Workload Models (History/Prediction) Workload Orchestration Resource Selection Context (e. g. VO) Admission Control (Workload) SLA Management (Workload) Represents one or more OGSA services 89

Functional Requirements for Grid Scheduling Functional Requirements: è Cooperation between different resource providers è Interaction with local resource management systems è Support for reservations and service level agreements è Orchestration of coordinated resource allocations è Automatic handling of accounting and billing è Distributed monitoring è Failure transparency 90

What are Basic Blocks for a Grid Scheduling Architecture? Information Service Scheduling Service static & scheduled/forecasted Query for resources Reservation Job Supervisor Service Data Management Service Network Management Service Data Manager Data-Resources Maintain information Network Manager Management System Network Management System Compute/ Storage /Visualization etc Data Network Accounting and Billing Service Compute Manager Maintain information Scheduling-relevant Interfaces of Basic Blocks are still to be defined! Network-Resources 91

Information Service Resource Discovery Relevant for Grid Scheduling: n Access to static and dynamic information n Dynamic information includes data about planned or forecasted future events è e.g.: existing reservations, scheduled tasks, future availabilities è need for anonymous and limited information (privacy concerns) n Information about all resource types è including e.g. data and network l future reservations, data transfers etc. 92

Job/Workflow Description Requirement Description n Information about the job specifics (what is the job) and job requirements (what is required for the job), including data access and creation n Need for a common workflow description è e.g. a DAG formulation è include static and dynamic dependencies è need for the ability to extract workflow information to schedule a whole workflow in advance 93
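A DAG workflow can be represented simply as a dependency map. The following sketch (with hypothetical job names) derives a valid execution order and detects cycles, which is the kind of information a Grid scheduler needs in order to plan a whole workflow in advance.

```python
# Sketch: a workflow as a DAG of job steps, plus a topological order a
# scheduler could use to plan all steps ahead of time.

def topological_order(deps):
    """deps: job -> set of jobs it depends on.  Returns a valid run order."""
    order, done = [], set()

    def visit(job, path=()):
        if job in done:
            return
        if job in path:                      # dependency chain loops back
            raise ValueError("cycle in workflow")
        for d in deps.get(job, set()):
            visit(d, path + (job,))
        done.add(job)
        order.append(job)

    for job in deps:
        visit(job)
    return order
```

With a static DAG like this, a scheduler can try to reserve resources for the later steps before the earlier ones have finished, instead of re-entering a queue after every step.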

Reservation Management Agreement and Negotiation n Interaction between scheduling instances, between resource/agreement providers and agreement initiators (higher-level scheduler) è access to tentative information necessary è negotiations might take very long è individual scheduling objectives to be considered è probably market-oriented and economic scheduling needed n Need for combining agreements from different providers è coordinate complex resource requests or workflows n Maintain different negotiations at the same time è probably several levels of negotiations, agreement commitment and reservation 94

Accounting and Billing n Interaction with budget information n Charging for allocations, reservations; preliminary allocation of budgets n Concepts for reliable authorization of Grid schedulers to spend money on behalf of the user n Refunding in terms of resource/SLA failure, re-scheduling etc. n Reliable monitoring and accounting è required for tracing whether a party fulfilled an agreement 95

Monitoring Services n Monitoring of è resource conditions è agreements è schedules è program execution è SLA conformance è workflow è … n Monitoring must be reliable as it is part of accountability è failure or fulfillment of a service/resource provider must be clearly identifiable 96

Conclusions for Grid Scheduling Grids ultimately require coordinated scheduling services. n Support for different scheduling instances è è n For arbitrary resources è è n not only computing resources, also data, storage, network, software etc. Support for co-allocation and reservation è n different local management systems different scheduling algorithms/strategies necessary for coordinated grid usage (see data, network, software, storage) Different scheduling objectives è cost, quality, other 97

Scheduling Model Using a Brokerage/Trading strategy: Submit Grid Job Description Discover Resources Consider individual user policies Coordinate Allocations Higher-level scheduling Select Offers Collect Offers Query for Allocation Offers Generate Allocation Offer Lower-level scheduling Analyze Query Consider individual owner policies 98

Properties of Multi-Level Scheduling Model n Multi-level scheduling must support different RM systems and strategies. n Providers can enforce individual policies in generating resource offers. n Users receive resource allocations optimized to their individual objectives. n Different higher-level scheduling strategies can be applied. n Multiple levels of scheduling instances are possible. n Support for fault-tolerant and load-balanced services. 99

Negotiation in Grids n Multilevel Grid scheduling architecture è Lower level local scheduling instance l è Higher level Grid scheduling instance l n Resource selection and coordination (Static) Interface definition between both instances è è è n Implementation of owner policies Different types of resources Different local scheduling systems with different properties Different owner policies (Dynamic) Communication between both instances è è Resource discovery Job monitoring 100

Using Service Level Agreements n The mapping of jobs to resources can be abstracted using the concept of Service Level Agreements (SLAs) (Czajkowski, Foster, Kesselman & Tuecke) n SLA: a contract negotiated between è resource provider, e.g. local scheduler è resource consumer, e.g. grid scheduler, application n SLAs provide a uniform approach for the client to è specify resource and QoS requirements, è while hiding from the client details about the resources, such as queue names and current workload 101

Service Level Agreement Types n Resource SLA (RSLA) è A promise of resource availability l Client must utilize the promise in subsequent SLAs l Advance reservation is an RSLA n Task SLA (TSLA) è A promise to perform a task l Complex task requirements l May reference an RSLA n Binding SLA (BSLA) è Binds a resource capability to a TSLA l May reference an RSLA (i.e. a reservation) l May be created lazily to provision the task l Allows complex resource arrangements 102
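The three SLA types and how they reference each other can be sketched as plain data structures. The fields below are a simplification for illustration, not a protocol definition.

```python
# Illustrative dataclasses for the three SLA types (after Czajkowski et al.);
# the fields are a simplification invented for this example.

from dataclasses import dataclass
from typing import Optional

@dataclass
class RSLA:                        # promise of resource availability
    resource: str
    start: int
    end: int

@dataclass
class TSLA:                        # promise to perform a task
    task: str
    rsla: Optional[RSLA] = None    # may reference a reservation

@dataclass
class BSLA:                        # binds a resource capability to a TSLA
    tsla: TSLA
    rsla: RSLA                     # possibly created lazily to provision the task
```

A TSLA that carries an RSLA pins the task to reserved resources; a TSLA without one leaves the provisioning open until a BSLA binds it later.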

Agreement-Based Negotiation n A client (application) submits a task to a Grid scheduler è The client negotiates a TSLA for the task with the Grid Scheduler n In order to provision the TSLA, the Grid Scheduler may obtain an RSLA with the Grid resource or may use a preexisting RSLA that the Grid scheduler has negotiated speculatively è A TSLA that refers to an RSLA assures the job gets the reserved resources at a specified time è A TSLA without an RSLA tells little about when the resources will be available to the job 103

Agreement-Based Negotiation (2) n The job starts execution on the resource according to the TSLA and the RSLA n For an existing TSLA, the Grid Scheduler may obtain additional RSLAs è An RSLA is negotiated by the Grid Scheduler with the Resource è A BSLA binds this RSLA to the corresponding TSLA n BSLAs allow resources to be provisioned dynamically that are either è not needed for the whole duration of the task or è not known completely (e.g., the time at which a resource will be needed) before submitting the task 104

Example of Agreement Mapping n User/ Application TSLA 2 TSLA 1 Grid Scheduler RSLA 1 n n The Grid Scheduler receives requests for two agreements. It negotiates with the resources the RSLA 1 and RSLA 2, and in parallel with the agreement initiators about the corresponding TSLA 1 and TSLA 2. RSLA 2 Grid Resource Manager Resource 105

GGF: GRAAP-WG n Goal: Defining Web-Service-based protocols for negotiation and agreement management n WS-Agreement Protocol: 106

Towards Grid Scheduling Methods: n Support for individual scheduling objectives and policies è Multi-criteria scheduling models è Economic scheduling methods for Grids n Architectural requirements: è Generic job description è Negotiation interface between higher- and lower-level scheduler è Economic management services è Workflow management è Integration of data and network management 107

Grid Scheduling Strategies Current approach: n Extension of job scheduling for parallel computers. n Resource discovery and load-distribution to a remote resource n Usually batch job scheduling model on remote machine But actually required for Grid scheduling is: n Co-allocation and coordination of different resource allocations for a Grid job n Instantaneous ad-hoc allocation is not always suitable This complex task involves: è è Cooperation between different resource providers Interaction with local resource management systems Support for reservations and service level agreements Orchestration of coordinated resources allocations 108

User Objective Local computing typically has: è A given scheduling objective, such as minimization of response time è Use of batch queuing strategies è Simple scheduling algorithms: FCFS, Backfilling Grid Computing requires: è Individual scheduling objectives l better resources l faster execution l cheaper execution è More complex objective functions apply for individual Grid jobs! 109
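The difference between the two simple strategies named above can be shown in a few lines. This sketch ignores runtimes and reservations (real EASY backfilling also protects the projected start time of the first blocked job), so it only illustrates the queueing behavior.

```python
# Minimal sketch of FCFS vs. (simplified) backfilling dispatch.
# Jobs are hypothetical dicts {'id': ..., 'cpus': ...}.

def dispatch(queue, free_cpus, backfill=False):
    """Return ids of jobs started now; mutates `queue`."""
    started = []
    for job in list(queue):
        if job["cpus"] <= free_cpus:
            free_cpus -= job["cpus"]
            queue.remove(job)
            started.append(job["id"])
        elif not backfill:
            break      # FCFS: the blocked head of the queue blocks everyone
    return started
```

With 4 free CPUs and a queue of an 8-CPU job followed by a 2-CPU job, FCFS starts nothing, while backfilling lets the small job run in the meantime.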

Provider/Owner Objective Local computing typically has: è è Single scheduling objective for the whole system: e. g. minimization of average weighted response time or high utilization/job throughput In Grid Computing: è Individual policies must be considered: l l l è è access policy, priority policy, accounting policy, and other More complex objective functions apply for individual resource allocations! User and owner policies/objectives may be subject to privacy considerations! 110

Grid Economics – Different Business Models n Cost model è è n Individual scheduling objective functions è è è n Use of a resource Reservation of a resource User and owner objective functions Formulation of an objective function Integration of the function in a scheduling algorithm Resource selection è The scheduling instances act as broker l Collection and evaluation of resource offers 111

Scheduling Objectives in the Grid n In contrast to local computing, there is no general scheduling objective anymore è minimizing response time è minimizing cost è tradeoff between quality, cost, response time etc. n Cost and different service qualities come into play è the user will introduce individual objectives è the Grid can be seen as a market where resources are competing alternatives n Similarly, the resource provider has individual scheduling policies n Problem: è the different policies and objectives must be integrated in the scheduling process è different objectives require different scheduling strategies è part of the policies may not be suitable for public exposure (e.g. different pricing or quality for certain user groups) 112

Grid Scheduling Algorithms n Due to the mentioned requirements in Grids, it is not to be expected that a single scheduling algorithm or strategy is suitable for all problems. n Therefore, there is need for an infrastructure that è allows the integration of different scheduling algorithms è lets the individual objectives and policies be included è keeps resource control at the participating service providers n Transition into a market-oriented Grid scheduling model 113

Economic Scheduling n Market-oriented approaches are a suitable way to implement the interaction of different scheduling layers è agents in the Grid market can implement different policies and strategies è negotiations and agreements link the different strategies together è participating sites stay autonomous n Need for suitable scheduling algorithms and strategies for creating and selecting offers è need for creating Pareto-optimal scheduling solutions n Performance relies highly on the available information è negotiation can be a hard task if many potential providers are available. 114

Economic Scheduling (2) n Several possibilities for market models: 1. auctions of resources/services 2. auctions of jobs n Offer-request mechanisms support: è inclusion of different cost models, price determination è individual objective/utility functions for optimization goals n Market-oriented algorithms are considered: è robust è flexible in case of errors è simple to adapt n But: markets can have unforeseeable dynamics 115

Problem: Offer Creation Job t t 4 t 3 t 2 R 1 R 2 R 3 R 4 R 5 R 6 R 7 R 8 t 1 t 0 116

Offer Creation (2) Offer 1 t End of current schedule R 1 R 2 R 3 R 4 R 5 R 6 R 7 R 8 117

Offer Creation (3) Offer 2 t End of current schedule R 1 R 2 R 3 R 4 R 5 R 6 R 7 R 8 118

Offer Creation (4) Offer 3 New end of current schedule t End of current schedule R 1 R 2 R 3 R 4 R 5 R 6 R 7 R 8 119
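The offer-creation steps pictured above amount to searching the provider's current schedule for gaps that fit the request. A brute-force Python sketch over discrete time steps, with all numbers invented for the example:

```python
# Sketch: generate allocation offers by scanning the current schedule for
# the earliest gaps that fit the requested CPUs and duration.

def make_offers(busy, total_cpus, need_cpus, duration, horizon, max_offers=3):
    """busy: list of (start, end, cpus) already scheduled.
    Returns up to `max_offers` start times at which `need_cpus` CPUs
    are free for `duration` consecutive time steps."""
    offers = []
    for start in range(horizon):
        window = range(start, start + duration)
        if all(total_cpus - sum(c for s, e, c in busy if s <= t < e) >= need_cpus
               for t in window):
            offers.append(start)
            if len(offers) == max_offers:
                break
    return offers
```

A real provider would price each offer differently (an earlier slot at the end of the schedule versus a cheap gap), which is exactly what the utility evaluation on the next slide trades off.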

Evaluate Offers n Evaluation with utility functions l A utility function is a mathematical representation of a user’s preference l The utility function may be complex and contain several different criteria l Example using response time (or delay time) and price: 120
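As a hedged example of such a utility function, here is a weighted linear combination of delay and price; the linear form and the weights are assumptions for illustration, not the formula from the original slide.

```python
# Illustrative utility: negate a weighted sum of delay and price, so that
# cheaper/faster offers score higher.  Weights and form are assumptions.

def utility(offer, w_time=1.0, w_price=1.0):
    return -(w_time * offer["delay"] + w_price * offer["price"])

def select_offer(offers, **weights):
    """Pick the offer with the highest utility under the given weights."""
    return max(offers, key=lambda o: utility(o, **weights))
```

Changing the weights shifts the choice: a price-sensitive user ends up with a different allocation than a response-time-sensitive one, even over the same set of offers.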

Optimization Space [diagram: allocation offers A 1, A 2, A 3 plotted against price and latency, with an arrow indicating improved utility] n The utility depends on the user and provider objectives 121

Example: NWIRE n Following are some slides to illustrate the scheduling process in the NWIRE project. n NWIRE models a Grid infrastructure without central services è a P2P scheduling model is employed where the Grid Schedulers/Managers that reside on each local site are queried for offers n A Grid scheduler asks other known schedulers è similar to P2P systems, no Grid manager knows the whole Grid but keeps a list of “known other Grid managers” è those schedulers can forward the request to other schedulers è the scope and depth of the query can be parameterized n Individual objective functions are used to evaluate the utility for a user and the resource provider è the scheduler selects the allocation offer with the highest utility value è the whole model represents a Grid market and auction system 122

Evaluation of Economic Approach n Real workload data of Grids is not yet available. However, workloads of single HPC installations are available. Example of evaluation results: n considered: traces from the Cornell Theory Center (430-node IBM RS/6000 SP) n 4 workloads adapted from the real traces, each 10000 jobs n Assumed synthetic Grid machine configurations:

Name   Configuration         Largest Machine
m64    4*64 + 6*32 + 8*8     64
m128   4*128                 128
m256   2*256                 256
m384   1*384 + 1*64 + 4*16   384
m512   1*512                 512

123

Scheduling Process 1) Client sends a request Requirements Job Attributes Objective Function Grid Manager Application Resource Manager 124

Example Request

KEY "Hops" {VALUE "HOPS" {2}}
KEY "MaxOfferNumber" {VALUE "MaxOfferNumber" {10}}
KEY "MinMulti-Site" {VALUE "MinMulti-Site" {10}}
. . .
KEY "Utility" {
  ELEMENT 1 {
    CONDITION {((OperatingSystem EQ "Linux") && (NumberOfProcessors >= 8))}
    VALUE "UtilityValue" {-EndTime}
    VALUE "RunTime" {5}
  }
  . . .
  ELEMENT 2 {
    CONDITION {. . . }
    VALUE "UtilityValue" {-JobCost}
  }
  . . .
}

125

Scheduling Process 3) Limitations of request direction and depth Request Grid Manager Request 2) Query for other offers Resource Manager n Request Requirements Job Attributes Objective Function Grid Manager Application n Request Resource Manager Scheduler requests offers from other Grid Managers/Schedulers. Search depth and scope can be parameterized. 126

Scheduling Process Request Grid Manager Request Offer Attributes Objective Function Grid Manager Application 4) Allocation; maximization of the objectives Resource Manager n Selection of the allocation according to the utility value n Returned offers are collected. 127

Scheduling Process 5) User and Resource Managers are informed about the allocations [diagram: the Grid Manager returns a schedule (Allocation_1, Allocation_2, Allocation_3) to the Application] n The user can be informed directly about the schedule for his execution. n The Grid Manager may re-schedule its allocations to optimize the result further. 128

Offer Creation [animation, slides 129-136: jobs A-D are placed step by step into a schedule, shown as a chart of time (Zeit) over resources 1-7] 129-136

AWRT for Economic Scheduling Compared to Conservative Backfilling [chart: regions in which conventional algorithms are better vs. regions in which the market-oriented approach is better] 137
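AWRT here denotes the average weighted response time; in this line of work each job is commonly weighted by its resource consumption (processors times run time). A minimal sketch of that metric under this weighting convention, with illustrative field names:

```python
# Sketch of the AWRT metric, assuming the common resource-consumption
# weighting w_j = procs_j * (end_j - start_j). Job fields are illustrative.

def awrt(jobs):
    """Average weighted response time:
    sum of w_j * (completion - submission) / sum of w_j."""
    total_weight = 0.0
    weighted_response = 0.0
    for job in jobs:
        weight = job["procs"] * (job["end"] - job["start"])
        total_weight += weight
        weighted_response += weight * (job["end"] - job["submit"])
    return weighted_response / total_weight

jobs = [
    {"procs": 64, "submit": 0, "start": 10, "end": 110},
    {"procs": 8, "submit": 5, "start": 5, "end": 55},
]
print(awrt(jobs))  # about 106.47
```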

Utilization [chart: regions in which the market-oriented approach is better vs. regions in which conventional algorithms are better] 138

AWRT for different Machine Utility Functions 139

Results on Economic Models n Economic models provide results in the range of conventional algorithms in terms of AWRT n Economic methods leave much more flexibility in defining the desired resources n Problems of site autonomy, heterogeneous resources and individual owner and user preferences are solved n Advantage: reservation of resources ahead of time n Further analysis needed: è usage of other utility functions 140

Extending Resource Types n Data is a key resource in the Grid è remote job execution will in most cases include the consideration of data è data is a resource that can be managed: l replication, caching l pre-fetching, post-fetching n The requirements on network information and QoS features will increase è scheduling and planning of data and resources requires information about available network bandwidth è how long does a reliable data transfer take? è is sufficient bandwidth available between two resources (e.g. between a visualization device and the compute resource)? n Both resource types, as well as others, will need suitable management and scheduling features è negotiation and agreement protocols already provide an infrastructure 141

Data and Network Scheduling Most new resource types can be included via individual lower-level resource management systems. Additional considerations: n Data management è select resources according to data availability è but data can be moved if necessary! è coordinate data transfers and storage allocation n Network management è consider advance reservation of bandwidth or SLAs è network resources usually depend on the selection of other resources! è problem: no general model for network SLAs 142

Data Management n Access to information about the location of data sets n Information about transfer costs n Scheduling of data transfers and data availability è optimize data transfers with regard to available network bandwidth and storage space n Coordination with network or other resources n Similarities with general Grid scheduling: è access to similar services è similar tasks to execute è interaction necessary 143
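One concrete form of "information about transfer costs" is estimating transfer time from data-set size and available bandwidth, then picking the cheapest physical replica. A hypothetical sketch — the sites and bandwidth figures are made up:

```python
# Illustrative sketch: choose among physical replicas by estimated
# transfer time (size / available bandwidth). Not an actual replica
# management API; sites and bandwidths are invented.

def pick_replica(size_gb, replicas):
    """Return the (site, bandwidth_gb_per_s) pair with the smallest
    estimated transfer time for a data set of the given size."""
    return min(replicas, key=lambda r: size_gb / r[1])

replicas = [("site-a", 0.1), ("site-b", 1.0), ("site-c", 0.5)]  # GB/s
print(pick_replica(50, replicas))  # ('site-b', 1.0)
```

A real scheduler would combine this estimate with storage availability and the network reservations discussed on the previous slide.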

Functional View of Grid Data Management [diagram of components: Application; Metadata Service (location based on data attributes); Replica Location Service (location of one or more physical replicas); Information Services (state of Grid resources, performance measurements and predictions); Planner (data location, replica selection, selection of compute and storage nodes); Executor (initiates data transfers and computations); Security and Policy; Data Movement; Data Access; Compute Resources; Storage Resources] Source: Carl Kesselman 144

Replica Location Service in Context n The Replica Location Service is one component in a layered data management architecture n Provides a simple, distributed registry of mappings n Consistency management is provided by higher-level services Source: Carl Kesselman 145

Example of a Scheduling Process Scheduling Service: 1. receives job description 2. queries Information Service for static resource information 3. prioritizes and pre-selects resources 4. queries for dynamic information about resource availability 5. queries Data and Network Management Services 6. generates schedule for job 7. reserves allocation if possible, otherwise selects another allocation 8. delegates job monitoring to Job Supervisor. Job Supervisor/Network and Data Management: service, monitor and initiate allocation. Example run: 40 resources of the requested type are found; 12 resources are selected; 8 resources are available; network and data dependencies are detected; the utility function is evaluated; the 6th tried allocation is confirmed; data/network are provided and the job is started. 146
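The numbered steps can be condensed into a pipeline sketch: static matching, prioritization, dynamic filtering, then trying reservations in utility order until one succeeds. All services below are stubbed with plain functions; the names and data layout are illustrative assumptions, not an actual scheduling-service API:

```python
# Hedged sketch of the scheduling steps: static match (2-3), dynamic
# availability filter (4), and reservation attempts in utility order (6-7).

def schedule(job, resources, is_available, utility, try_reserve):
    # steps 2-3: static match against the job's requirements, prioritized
    candidates = [r for r in resources if r["type"] == job["type"]]
    candidates.sort(key=utility, reverse=True)
    # step 4: keep only resources that are dynamically available
    candidates = [r for r in candidates if is_available(r)]
    # steps 6-7: try allocations until one reservation succeeds
    for resource in candidates:
        if try_reserve(resource):
            return resource
    return None

resources = [{"name": "r%d" % i, "type": "cluster", "speed": i}
             for i in range(5)]
busy = {"r3", "r4"}   # filtered out at step 4
taken = {"r2"}        # reservation fails at step 7
picked = schedule({"type": "cluster"}, resources,
                  is_available=lambda r: r["name"] not in busy,
                  utility=lambda r: r["speed"],
                  try_reserve=lambda r: r["name"] not in taken)
print(picked["name"])  # r1
```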

Use-Case Example: Coordinated Simulation and Visualization Expected output of a Grid scheduler: [Gantt-style diagram over time, one row per resource: Data Access (loading, providing and storing data), Network 1 (data transfer), Computer 1 and Computer 2 (parallel computation), Network 2 (communication for computation), Software License (software usage), Storage (data storage), Network 3 (communication for visualization), VR-Cave] 147

Re-Scheduling n Reconsidering a schedule with already-made agreements may be a good idea from time to time è because the resource situation may have changed è or the workload situation has changed n Optimization of the schedule can only work within the bounds of made agreements and reservations è given guarantees must be observed n The schedulers can try to maximize the utility values of the overall schedule è a Grid scheduler may negotiate with other resource providers in order to get better agreements; it may cancel previous agreements è a local scheduler may optimize the local allocations to improve the schedule 148
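The constraint that re-scheduling must observe given guarantees can be sketched as a simple partition: allocations backed by confirmed agreements stay fixed, and only the rest may be re-optimized. The data layout here is illustrative:

```python
# Sketch of the re-scheduling bound: split a schedule into allocations
# that are fixed by agreements and ones the scheduler may still move.

def reschedulable(schedule):
    """Return (fixed, movable): allocations backed by agreements
    must be observed; the rest are free to optimize."""
    fixed = [a for a in schedule if a["agreed"]]
    movable = [a for a in schedule if not a["agreed"]]
    return fixed, movable

schedule = [
    {"job": "j1", "agreed": True},
    {"job": "j2", "agreed": False},
    {"job": "j3", "agreed": True},
]
fixed, movable = reschedulable(schedule)
print([a["job"] for a in movable])  # ['j2']
```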

Activities n Core service infrastructure è OGSI/WSRF è OGSA n GGF hosts several groups in the area of Grid scheduling and resource management. Examples: è WG Scheduling Attributes (finished) è WG Distributed Resource Management Application API (active) è WG Grid Resource Allocation Agreement Protocol (active) è WG Grid Economic Services Architecture (active) è RG Grid Scheduling Architecture (active) 149

Conclusion n Resource management and scheduling is a key service in a Next Generation Grid. è In a large Grid the user cannot handle this task. è Nor is the orchestration of resources a provider task. n System integration is complex but vital. è The local systems must be enabled to interact with the Grid. è Providing sufficient information, exposing services for negotiation n Basic research is still required in this area. è No ready-to-implement solution is available. New concepts are necessary. è Current efforts provide the basic Grid infrastructure. n Higher-level services such as Grid scheduling are still lacking. è Future RMS will provide extensible negotiation interfaces è Grid scheduling will include the coordination of different resources 150

Further Outlook Commercial interest in the Grid may have a stronger focus on non-HPC-related aspects n eCommerce solutions è dynamically migrating application server load n Inter-operation between companies è supply-chain management è requesting distributed resources under certain constraints n Enterprise Application Integration (EAI) è dynamically discover and use available services n The problems tackled by Grid RMS are common to many application fields. è Grid solutions may become a key technology infrastructure è or solutions from other areas (e.g. the Web Service community) may surpass Grid efforts? 151

References n Book: "Grid Resource Management: State of the Art and Future Trends", co-editors Jarek Nabrzyski, Jennifer M. Schopf, and Jan Weglarz, Kluwer Publishing, 2004 n PBS, PBS Pro: www.openpbs.org and www.pbspro.com n LSF, CSF: www.platform.com n Globus: www.globus.org n Global Grid Forum: www.ggf.org, see SRM area 152

Contact: ramin.yahyapour@udo.edu Computer Engineering Institute University of Dortmund 44221 Dortmund, Germany http://www-ds.e-technik.uni-dortmund.de 153