Workload Management WP: Status and next steps
Massimo Sgaravatto, INFN Padova


Where we are
  • CMS-HLT use case (Monte Carlo production and reconstruction) analyzed in terms of GRID requirements and GRID tools availability
  • Discussions with Globus team and Condor team
  • Definition of a prototype architecture of workload management system
      - Use of Globus and Condor mechanisms
      - But major developments needed

Prototype workload management system architecture
[Architecture diagram. Components: Master, Grid Information Service (GIS), Condor-G, Globus GRAM, Local Resource Management Systems (CONDOR, LSF, PBS), Farms, Site 1, Site 2, Site 3. Labels: Submit jobs, Resource Discovery, Info, condor_submit (Globus Universe).]
  • Master chooses in which Globus resources the jobs must be submitted
  • Condor-G able to provide a reliable/crashproof job submission service
  • Globus GRAM as uniform interface to different local resource management systems

Where we are
  • Evaluating the existing components (D1.1) and “putting together” the various building blocks
  • Evaluation of Globus
      - Collaboration with WP1 of INFN-GRID project (Evaluation of the Globus toolkit): http://www.infn.it/globus
      - Evaluation of Globus GRAM
          - GRAM as uniform interface to different underlying resource management systems
          - Evaluation of RSL
          - “Cooperation” between GRAM and GIS
  • Evaluation of Condor-G
      - The current implementation is a prototype
          - It works, but some problems must be solved
      - Globus + Condor-G tested with a real CMS MC production
          - Many memory leaks found in the Globus jobmanager!!!
          - Fixes (provided by Francesco Prelz) submitted to Globus team
          - Feedback received only on the bugs in the GAA and GSS modules (new fixes “merged” with the original ones)

First deliverables
  • Month 3: Report on current technology (report) D1.1
  • Month 6: Definition of architecture for scheduling, resource management, security and job description (report) D1.2
  • Month 9: Components and documentation for the 1st release: initial workload management system (prototype) D1.3

Proposed work plan
  • Let’s continue the implementation of the proposed prototype
      - Evaluation of current technologies (Globus, Condor) (D1.1)
      - Functionalities for the 1st release
      - First release
  • We can propose the functionalities that could be implemented
  • “Negotiation” in the ATF
      - To understand if these functionalities “address” the proposed use cases
      - To understand if our module can be “plugged” together with the other “pieces”
      - To understand if the other WPs can provide the required (by WP1) functionalities

Proposed functionalities for the 1st release
  • First version of job description language (JDL)
  • First version of broker (master), that decides where to submit the jobs
  • Job submission service
  • First version of logging and bookkeeping services
  • First user interface

Job Description Language (JDL)
  • Used when the job is submitted, to specify
      - The application
      - The input data set
          - File? Collection of files? “Logical” or “physical” names?
          - Needs to be discussed with WP2, WP8, ATF
      - Where the output data must be saved
      - (Required and preferred) resources
      - Info for bookkeeping
      - …???
  • Prototype: Condor ClassAds
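
To make the list above concrete, here is a minimal Python sketch of a job description held as a ClassAd-style set of attributes. It is only an illustration: the slides do not define the JDL syntax, so every attribute name (Executable, InputData, OutputLocation, Requirements, Rank, Bookkeeping) and every value is hypothetical, and the code does not use the Condor ClassAd library.

```python
# Hypothetical job description modelled as a flat attribute set, in the
# spirit of a ClassAd. None of these attribute names come from the slides.
job_description = {
    "Executable": "cms_mc_production",        # the application (assumed name)
    "InputData": ["lfn:example_dataset"],     # input data set, logical name assumed
    "OutputLocation": "site1:/data/out",      # where the output must be saved
    # Required resources: a predicate every candidate farm must satisfy.
    "Requirements": lambda farm: farm["Arch"] == "INTEL" and farm["ScratchGB"] >= 10,
    # Preferred resources: a score, higher is better.
    "Rank": lambda farm: farm["FreeCPUs"],
    # Free-form info for bookkeeping.
    "Bookkeeping": {"ProductionId": "example-001"},
}

# Fields this sketch treats as mandatory (an assumption, not a JDL rule).
REQUIRED_FIELDS = ("Executable", "InputData", "OutputLocation", "Requirements")


def missing_fields(description):
    """Return the mandatory fields that the description does not define."""
    return [name for name in REQUIRED_FIELDS if name not in description]
```

Keeping Requirements and Rank as separate entries mirrors the “required and preferred” split on the slide: one filters farms, the other orders the farms that pass.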

Broker/Master
  • Choice of resource (farm) where to submit job
      - Input: JDL expression
      - Output: computing resource choice
      - Published resource access lists (gridmap files in the Globus-based prototype) are checked as a first step in the resource match-making

Broker/Master
  • The “accessible” computing resources are matched with the job request according to:
      - Availability of the requested input data set
          - In the 1st release the broker will have to choose a resource where this input data set is already available (we are not going to “trigger” the replication of the input data set)
      - Availability of the appropriate application “sandbox”
          - It could be necessary to “copy” and install this sandbox if not already available in the executing farm (“code migration”) (in the 1st release???)
      - Queue characteristics and status (architecture, etc…) vs. job requests
          - Let’s start with a few, simple parameters
      - Availability of the requested amount of scratch space
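
The matching criteria can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the prototype's code: the dictionary keys, the in-memory access list standing in for the gridmap file check, and the final tie-break on free CPUs are all hypothetical, and the real prototype is meant to use the Condor matchmaking library instead.

```python
def choose_resource(job, user, farms):
    """Return the name of the best matching farm, or None if nothing matches."""
    candidates = []
    for farm in farms:
        # First step: access list check (stands in for the gridmap file lookup).
        if user not in farm["allowed_users"]:
            continue
        # Input data must already be at the farm: no replication is triggered.
        if not set(job["input_data"]) <= set(farm["datasets"]):
            continue
        # The application sandbox must already be installed (no code migration).
        if job["sandbox"] not in farm["sandboxes"]:
            continue
        # A few simple queue parameters versus the job request.
        if farm["arch"] != job["arch"]:
            continue
        # Enough scratch space for the job.
        if farm["scratch_gb"] < job["scratch_gb"]:
            continue
        candidates.append(farm)
    if not candidates:
        return None
    # Tie-break on free CPUs; this ranking rule is an assumption of the sketch.
    return max(candidates, key=lambda f: f["free_cpus"])["name"]


# Hypothetical example data.
example_farms = [
    {"name": "site1", "allowed_users": {"alice"}, "datasets": {"ds1"},
     "sandboxes": {"cms"}, "arch": "INTEL", "scratch_gb": 50, "free_cpus": 4},
    {"name": "site2", "allowed_users": {"alice"}, "datasets": {"ds1", "ds2"},
     "sandboxes": {"cms"}, "arch": "INTEL", "scratch_gb": 30, "free_cpus": 10},
]
example_job = {"input_data": ["ds1"], "sandbox": "cms", "arch": "INTEL", "scratch_gb": 20}
print(choose_resource(example_job, "alice", example_farms))
```

Every criterion is a hard filter here, which matches the 1st-release restriction that data and sandbox must already be in place at the chosen farm.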

Broker/Master
  • We assume that all the information needed by the broker is “published” in one “Grid Information Space” (GIS in the Globus-based prototype) by the other WPs
  • Prototype: Condor matchmaking library
      - Match between the info published in the GIS and the ClassAds defined in the JDL
      - A “translator” GIS attributes → ClassAds is necessary
          - Some work already done by Globus team???
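
A translator of this kind could look roughly like the Python sketch below. The GIS attribute names, the ClassAd attribute names and the mapping between them are all invented for illustration; the slides do not list the attributes actually published in the GIS, and the output format only imitates the bracketed ClassAd notation rather than using the Condor library.

```python
# Hypothetical mapping from GIS attribute names to ClassAd attribute names.
GIS_TO_CLASSAD = {
    "cpu-architecture": "Arch",
    "free-cpus": "FreeCPUs",
    "scratch-space-gb": "ScratchGB",
}

# ClassAd attributes whose values should be converted to integers.
NUMERIC = {"FreeCPUs", "ScratchGB"}


def gis_to_classad(gis_entry):
    """Translate one GIS entry (string values) into a ClassAd-like dict."""
    ad = {}
    for gis_name, value in gis_entry.items():
        name = GIS_TO_CLASSAD.get(gis_name)
        if name is None:
            continue  # attributes without a known mapping are dropped
        ad[name] = int(value) if name in NUMERIC else value
    return ad


def format_classad(ad):
    """Render the dict in a bracketed, ClassAd-like text form."""
    parts = []
    for name in sorted(ad):
        value = ad[name]
        rendered = str(value) if isinstance(value, int) else '"%s"' % value
        parts.append("%s = %s" % (name, rendered))
    return "[ " + "; ".join(parts) + " ]"
```

For example, `format_classad(gis_to_classad({"cpu-architecture": "INTEL", "free-cpus": "12"}))` produces a resource description that a matchmaker could compare against the job's requirements.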

Job submission service
  • Input: job to submit + computing resource choice (provided by broker)
  • Reliable, fault tolerant, crash proof service
      - Reliability in the executing machines is up to WP4
  • Prototype: Condor-G
      - Submission of jobs to Globus resources (farms)
      - New implementation of Condor-G (+ new Globus job manager) available soon
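
As a rough illustration of the hand-off from broker to Condor-G, the Python sketch below renders a submit description for condor_submit in the Globus universe. It assumes the keywords `universe = globus` and `globusscheduler = <gatekeeper>/jobmanager-<lrms>`; these should be checked against the Condor-G version in use, since the slides do not show a submit file. The function name, host name and file names are hypothetical.

```python
def render_submit_description(executable, gatekeeper, jobmanager, log_file):
    """Build the text of a Globus-universe submit description (assumed keywords)."""
    lines = [
        "universe = globus",
        # The broker's resource choice: gatekeeper host plus local job manager.
        "globusscheduler = %s/jobmanager-%s" % (gatekeeper, jobmanager),
        "executable = %s" % executable,
        # Condor's user log records job events for this submission.
        "log = %s" % log_file,
        "queue",
    ]
    return "\n".join(lines) + "\n"


# Hypothetical usage: the broker chose an LSF farm behind gk.site1.example.org.
print(render_submit_description("mc_prod.sh", "gk.site1.example.org", "lsf", "mc_prod.log"))
```

The resulting text would be written to a file and passed to condor_submit; the reliability of the submission itself is then Condor-G's responsibility.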

“Code” migration
  • Not easy at all!!!
      - Necessary to “install” a complex run time environment in the target farm
      - A STRONG collaboration with WP8 (and WP4) is necessary to define an “application sandbox” that can easily be installed in one farm and doesn’t “conflict” with other sandboxes
      - Use of “application repositories”???
          - When an application must be installed on one farm, the sandbox is downloaded from such a repository

Bookkeeping
  • Necessary to “record” for each job
      - Submitting user identity
      - Input data
      - Output data
      - Status of processing
      - Where and when the processing has been done
      - Other bookkeeping info specified in the JDL
      - …???
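
One possible shape for such a per-job record is sketched below in Python. The field names, the status values and the use of plain strings for timestamps are assumptions made for the example; the slides only list what must be recorded, not how.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class BookkeepingRecord:
    """Per-job bookkeeping entry; all field names are hypothetical."""
    user_identity: str                      # submitting user identity
    input_data: List[str]                   # input data
    output_data: List[str] = field(default_factory=list)   # output data
    status: str = "SUBMITTED"               # status of processing
    site: Optional[str] = None              # where the processing was done
    started_at: Optional[str] = None        # when the processing was done
    extra: Dict[str, str] = field(default_factory=dict)    # info from the JDL

    def mark_running(self, site, started_at):
        """Record where and when processing started."""
        self.site = site
        self.started_at = started_at
        self.status = "RUNNING"

    def mark_done(self, output_data):
        """Record the produced output and close the job."""
        self.output_data = list(output_data)
        self.status = "DONE"
```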

Logging
  • Necessary to keep track of the significant events that occurred in the system
      - Requests by users
      - Computing resource choice (by broker)
      - Submission to resource
      - …???
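
A minimal Python sketch of such an event log follows. The event type names, the in-memory storage and the string timestamps are assumptions for illustration only; a real logging service would need persistent storage and a format agreed with the other WPs.

```python
# Event types taken from the three cases listed on the slide; names are assumed.
EVENT_TYPES = ("USER_REQUEST", "RESOURCE_CHOSEN", "SUBMITTED")


class EventLog:
    """Append-only, in-memory log of significant events per job."""

    def __init__(self):
        self._events = []

    def record(self, job_id, event_type, timestamp, **details):
        """Append one event; unknown event types are rejected."""
        if event_type not in EVENT_TYPES:
            raise ValueError("unknown event type: %s" % event_type)
        self._events.append({
            "job_id": job_id,
            "type": event_type,
            "timestamp": timestamp,
            "details": details,
        })

    def history(self, job_id):
        """Return the events of one job in the order they were recorded."""
        return [e for e in self._events if e["job_id"] == job_id]
```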

User Interface
  • Job management
      - Job submission
      - Job removal
      - Job status monitoring
  • Access to bookkeeping info
  • Access to logging info
  • …???