Enabling Grids for E-sciencE
Job Submission
Fokke Dijkstra, RuG/SARA
Grid tutorial, Utrecht, September 2007
www.eu-egee.org
EGEE-II INFSO-RI-031688
EGEE and gLite are registered trademarks

Contents
• The LCG Workload Management System (WMS) in gLite
• Job Submission to EGEE
  – Job Preparation
  – A simple example & Job Lifecycle
  – Job Description Language (JDL)
  – Job Submission & Monitoring
  – Some more advanced topics

WMS?

The LCG WMS
• The user submits jobs via the Workload Management System (WMS)
• The goal of the WMS is distributed scheduling and resource management in a Grid environment
• What does it allow Grid users to do?
  – To submit their jobs
  – To execute them
  – To get information about their status
  – To retrieve their output
• The WMS tries to
  – Optimize the usage of resources
  – Execute user jobs as fast as possible

WMS components
[Figure: the components of the WMS: User Interface (UI) with the JDL, Resource Broker (RB), Information System (BDII), LCG File Catalog (LFC), Logging & Bookkeeping (LB), Job Submission Service (JSS), Computing Element (CE), Storage Element (SE)]

Job Preparation
• You need to provide
  – A complete (enough) job description
    § What program?
    § What data?
    § Any requirements on OS, installed software, ...?
  – Possibly a program
    § You're submitting into unknown territory! Program portably!
    § Don't rely on hard-coded paths or special locations; the program you send may not even be in $HOME!
  – Perhaps some input data
  – Perhaps instructions on what to do with the output

How to Write a Job Description
• Here is a minimal job description (call it hello.jdl)
    Executable = "/bin/echo";
    Arguments = "Goedemiddag";
    StdError = "stderr.log";
    StdOutput = "stdout.log";
    OutputSandbox = {"stderr.log", "stdout.log"};
• In this description we
  – Specified the program to run and its arguments
  – Directed the standard error and output streams to files
  – Listed the files to be returned in the output sandbox

Job Submission Example
• The user issues voms-proxy-init
  – Enters the certificate's pass phrase
  – Receives a valid Globus proxy
• The user issues edg-job-submit mytest.jdl and gets back from the system a unique Job Identifier (JobId)
• The user issues edg-job-status <JobId> to get logging information about the current status of the job
• When the “OutputReady” status is reached, the user can issue edg-job-get-output <JobId>; the system returns the name of the temporary directory on the UI machine where the job output can be found

Submitting it
$ voms-proxy-init --voms tutor
Your identity: /O=edgtutorial/O=users/O=rug/OU=rc/CN=Fokke Dijkstra
Enter GRID pass phrase:
Creating temporary proxy .... Done
Contacting mu4.matrix.sara.nl:30007 [/O=dutchgrid/O=hosts/OU=sara.nl/CN=mu4.matrix.sara.nl] "tutor" Done
Creating proxy ... Done
Your proxy is valid until Mon Sep 11 23:22:12 2006
$ edg-job-submit hello.jdl
Selected Virtual Organisation name (from UI conf file): tutor
Connecting to host mu3.matrix.sara.nl, port 7772
Logging to host mu3.matrix.sara.nl, port 9002
****************************************
JOB SUBMIT OUTCOME
The job has been successfully submitted to the Network Server.
Use edg-job-status command to check job current status.
Your job identifier (edg_jobId) is:
- https://mu3.matrix.sara.nl:9000/Nz6PWWJCjtT7YY3PJWDu5Q
****************************************
(The URL printed above is the JobId.)

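Scripts that drive several jobs need the JobId from the submit output. The following is a minimal Python sketch, assuming the output keeps the layout of the session above (the identifier on its own line after a leading "- "). The helper name extract_job_id is hypothetical and not part of the UI tools.

```python
import re

# Sample text in the layout of the edg-job-submit session shown above.
SUBMIT_OUTPUT = """\
Your job identifier (edg_jobId) is:
 - https://mu3.matrix.sara.nl:9000/Nz6PWWJCjtT7YY3PJWDu5Q
"""

def extract_job_id(text):
    """Return the first JobId URL found in submit output, or None."""
    # Assumption: the JobId is an https URL alone on a line after "- ".
    match = re.search(r"^\s*-\s*(https://\S+)\s*$", text, re.MULTILINE)
    return match.group(1) if match else None

print(extract_job_id(SUBMIT_OUTPUT))
```

Where it is enough to store the identifier in a file, the -o option of edg-job-submit avoids parsing altogether.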
A Job Submission Example
[Figure: WMS component diagram (UI, RB, IS, LFC, LB, JSS, CE, SE). Highlighted at this step: the JDL and Input Sandbox at the UI, the Job Submit Event, and the job status reported by Logging & Bookkeeping. Job status so far: submitted, waiting]

Checking the status
$ edg-job-status https://mu3.matrix.sara.nl:9000/Nz6PWWJCjtT7YY3PJWDu5Q
*******************************
BOOKKEEPING INFORMATION:
Status info for the Job : https://mu3.matrix.sara.nl:9000/Nz6PWWJCjtT7YY3PJWDu5Q
Current Status: Done (Success)
Exit code: 0
Status Reason: Job terminated successfully
Destination: mu6.matrix.sara.nl:2119/jobmanager-pbs-long
reached on: Tue Jun 1 08:14:25 2004
*******************************

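The bookkeeping output is a list of "Key: value" lines, which makes it easy to read from a script. This is a rough Python sketch, assuming the layout shown above stays stable. The function name parse_status is hypothetical, and the sample text is simply the session above.

```python
# Sample text in the layout of the edg-job-status session shown above.
STATUS_OUTPUT = """\
BOOKKEEPING INFORMATION:
Status info for the Job : https://mu3.matrix.sara.nl:9000/Nz6PWWJCjtT7YY3PJWDu5Q
Current Status: Done (Success)
Exit code: 0
Status Reason: Job terminated successfully
Destination: mu6.matrix.sara.nl:2119/jobmanager-pbs-long
reached on: Tue Jun 1 08:14:25 2004
"""

def parse_status(text):
    """Split 'Key: value' lines into a dict; lines without a value are ignored."""
    fields = {}
    for line in text.splitlines():
        # Split at the first colon only, so URLs and times stay intact.
        key, sep, value = line.partition(":")
        if sep and value.strip():
            fields[key.strip()] = value.strip()
    return fields

info = parse_status(STATUS_OUTPUT)
print(info["Current Status"], "| exit code", info["Exit code"])
```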
A Job Submission Example
[Figure: WMS component diagram (UI, RB, IS, LFC, LB, JSS, CE, SE). Highlighted at this step: the Input Sandbox, the BrokerInfo, the Output Sandbox, and the job status reported by Logging & Bookkeeping. Job status so far: submitted, waiting, ready, scheduled, running, done, outputready]

Getting the Output
$ edg-job-get-output https://mu3.matrix.sara.nl:9000/Nz6PWWJCjtT7YY3PJWDu5Q
Retrieving files from host: mu3.matrix.sara.nl ( for https://mu3.matrix.sara.nl:9000/Nz6PWWJCjtT7YY3PJWDu5Q )
****************************************
JOB GET OUTPUT OUTCOME
Output sandbox files for the job:
- https://mu3.matrix.sara.nl:9000/Nz6PWWJCjtT7YY3PJWDu5Q
have been successfully retrieved and stored in the directory:
/tmp/jobOutput/fokke_Nz6PWWJCjtT7YY3PJWDu5Q
****************************************
$ cat /tmp/jobOutput/fokke_Nz6PWWJCjtT7YY3PJWDu5Q/stdout.log
Goedemiddag

A Job Submission Example
[Figure: WMS component diagram (UI, RB, IS, LFC, LB, JSS, CE, SE). Highlighted at this step: the Output Sandbox returned to the UI. Job status: submitted, waiting, ready, scheduled, running, done, outputready, cleared]

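The status names in the diagrams form a simple sequence for a job that succeeds. The sketch below writes that sequence down in Python. It is only an illustration of the successful path: failure states such as aborted or cancelled jobs are left out, and the function names are hypothetical.

```python
# Status names in the order a successful job passes through them.
LIFECYCLE = ["submitted", "waiting", "ready", "scheduled",
             "running", "done", "outputready", "cleared"]

def next_status(current):
    """Return the status that follows `current`, or None after 'cleared'."""
    i = LIFECYCLE.index(current)
    return LIFECYCLE[i + 1] if i + 1 < len(LIFECYCLE) else None

def output_retrievable(status):
    """True once the output sandbox can be fetched (status 'outputready')."""
    return status == "outputready"

print(" -> ".join(LIFECYCLE))
```

A polling script would call edg-job-status until output_retrievable holds and only then run edg-job-get-output.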
Job Description Language (JDL)
• Based upon Condor's CLASSified ADvertisement language (ClassAd)
• ClassAd is an extensible language
• A job description is a sequence of attributes ((key, value) pairs) separated by semicolons
    Executable = "/bin/echo";
    Arguments = "Goedemiddag";
    StdError = "stderr.log";
    StdOutput = "stdout.log";
    OutputSandbox = {"stderr.log", "stdout.log"};

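Because a JDL file is just a list of "Name = value;" pairs, it can be generated from a script. The following Python sketch renders a dictionary in that form. The helpers jdl_value and to_jdl are hypothetical names, and the sketch only covers strings, numbers and lists: every string is quoted, so expressions such as Requirements or Rank would need separate handling.

```python
def jdl_value(value):
    """Format a Python value as a JDL literal (string, number or list)."""
    if isinstance(value, (list, tuple)):
        return "{" + ", ".join(jdl_value(v) for v in value) + "}"
    if isinstance(value, str):
        return '"' + value + '"'
    return str(value)

def to_jdl(attributes):
    """Render a dict as 'Name = value;' lines, one attribute per line."""
    return "\n".join(f"{name} = {jdl_value(value)};"
                     for name, value in attributes.items())

hello = {
    "Executable": "/bin/echo",
    "Arguments": "Goedemiddag",
    "StdError": "stderr.log",
    "StdOutput": "stdout.log",
    "OutputSandbox": ["stderr.log", "stdout.log"],
}
print(to_jdl(hello))
```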
Types of Attributes
• The supported attributes are grouped in two categories:
  – Job attributes
    § Define the job itself
  – Resource attributes
    § Taken into account by the RB when carrying out the matchmaking algorithm
    § Computing resource attributes
        Used by the user to build the Requirements and/or Rank expressions
        Have to be prefixed with “other.”
    § Data and storage resource attributes
        Input data to process, SE where to store output data, protocols spoken by the application when accessing SEs

Job Definition Attributes
• Executable (mandatory)
  – The command name
• Arguments (optional)
  – Job command line arguments
• StdInput, StdOutput, StdError (optional)
  – Standard input/output/error of the job
• Environment (optional)
  – List of environment settings
• InputSandbox (optional)
  – List of files on the UI local disk needed by the job for running
  – The listed files are staged from the UI to the remote CE
• OutputSandbox (optional)
  – List of files, generated by the job, which have to be retrieved

Resource Attributes
• Requirements
  – Job requirements on computing resources
  – Specified using attributes of resources published in the Information System
  – If not specified, the default value defined in the UI configuration file is used
    § Default: other.GlueCEStateStatus == "Production" (the resource must be in Production state)
• Rank
  – Expresses preference (how to rank resources that have already met the Requirements expression)
  – Specified using attributes of resources published in the Information System
  – If not specified, the default value defined in the UI configuration file is used
    § Default: other.GlueCEStateFreeCPUs (the CE with the highest number of free CPUs is preferred)

“Data” Attributes
• InputData (optional)
  – Refers to data used as input by the job (these data are published in the Replica Catalog and stored in the SEs)
  – Given as PFNs and/or LFNs
• DataAccessProtocol (mandatory if InputData is specified)
  – The protocol, or list of protocols, that the application can use to access InputData on a given SE
• OutputSE (optional)
  – The hostname of the output SE
  – The RB uses it to choose a CE that is compatible with the job and is close to the SE
• OutputData (optional)
  – Output data that will be registered at the end of the job

Example JDL File
    Executable = "gridTest";
    StdError = "stderr.log";
    StdOutput = "stdout.log";
    InputSandbox = {"/home/joda/test/gridTest"};
    OutputSandbox = {"stderr.log", "stdout.log"};
    InputData = "lfn:/grid/tutor/testbed0-00019";
    DataAccessProtocol = "gridftp";
    Requirements = other.Architecture == "INTEL" &&
                   other.OpSys == "CentOS" &&
                   other.FreeCpus >= 4;
    Rank = other.GlueHostBenchmarkSF00;

Job Submission
• edg-job-submit [--vo <VO>] [-r <res_id>] [-n <user e-mail address>] [-c <config file>] [-o <output file>] <job.jdl>
    -r    the job is submitted by the RB directly to the computing element identified by <res_id>
    -c    the configuration file <config file> is used by the UI instead of the standard configuration file
    -o    the generated edg_jobId is written to <output file>
          Useful for other commands, e.g.: edg-job-status -i <input file> (or edg-job-status <edg_jobId>)
    -i    (edg-job-status) the status information for the edg_jobIds contained in <input file> is displayed
    --vo  the VO under which the job will be run

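When jobs are submitted from a script, it helps to build the command line in one place. Below is a small Python sketch that assembles an edg-job-submit argument list from the options listed above. The function name submit_command and the file name jobids.txt are hypothetical. The sketch only builds the list and does not run anything.

```python
def submit_command(jdl_file, resource=None, config=None, output=None, vo=None):
    """Build the argument list for edg-job-submit from the options above."""
    argv = ["edg-job-submit"]
    if vo:
        argv += ["--vo", vo]          # VO under which the job will run
    if resource:
        argv += ["-r", resource]      # direct submission to one CE
    if config:
        argv += ["-c", config]        # alternative UI configuration file
    if output:
        argv += ["-o", output]        # file that receives the edg_jobId
    argv.append(jdl_file)
    return argv

print(" ".join(submit_command("hello.jdl", output="jobids.txt")))
```

On a UI machine the returned list could be passed to subprocess.run. The file written by -o can then be fed to edg-job-status -i.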
Other WMS UI Commands
• edg-job-list-match
  – Lists the resources matching a job description
  – Performs the matchmaking without submitting the job
• edg-job-cancel
  – Cancels a given job
• edg-job-status
  – Displays the status of the job
• edg-job-get-output
  – Returns the job output (the OutputSandbox files) to the user
• edg-job-get-logging-info
  – Displays logging information about submitted jobs (all the events “pushed” by the various components of the WMS)
  – Very useful for debugging

WMS Match Making
• The RB is the core component of the WMS
• It has to find the most suitable computing resource (CE) on which the job will be executed
• It interacts with the Data Management service and the Information System, which supply the RB with all the information required to resolve the matches
• The CE chosen by the RB has to match the job requirements (e.g. runtime environment, data access requirements, and so on)
• If two or more CEs satisfy all the requirements, the one with the best Rank is chosen

Direct Job Submission
• The RB has to deal with three possible scenarios
Scenario 1: Direct Job Submission
    § The job is scheduled on a given CE (specified in the edg-job-submit command via the -r option)
    § The RB doesn't perform any matchmaking
    § Take care if InputData is specified: without matchmaking, nothing ensures that the chosen CE is close to the data

Brokered Job Submission, No InputData
Scenario 2: Job Submission without data-access Requirements
    § Neither the CE nor input data are specified
    § The RB starts the matchmaking algorithm, which consists of two phases:
      • Requirements check (the RB contacts the IS to check which CEs satisfy all the requirements)
      • Rank computation (if more than one CE satisfies the job requirements, the CE with the best rank is chosen by the RB)

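The two phases can be illustrated with a few lines of Python. This is a simplified sketch, not the RB implementation: the CE records and their field names are made up for the example, and the requirements and rank are passed in as plain functions instead of JDL expressions.

```python
# Hypothetical CE records; on a real grid these come from the Information System.
CES = [
    {"id": "ce-a", "status": "Production", "free_cpus": 4},
    {"id": "ce-b", "status": "Production", "free_cpus": 12},
    {"id": "ce-c", "status": "Closed", "free_cpus": 40},
]

def matchmake(ces, requirements, rank):
    """Phase 1: keep CEs meeting the requirements. Phase 2: pick the best rank."""
    candidates = [ce for ce in ces if requirements(ce)]
    if not candidates:
        return None                      # no CE matches the job
    return max(candidates, key=rank)     # a higher rank value is better

best = matchmake(
    CES,
    requirements=lambda ce: ce["status"] == "Production",
    rank=lambda ce: ce["free_cpus"],
)
print(best["id"])
```

Here ce-c has the most free CPUs but fails the requirements check, so ce-b is chosen. This mirrors the rule that Rank is only evaluated for resources that already satisfy Requirements.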
Brokered Job Submission, Grid Data
Scenario 3: Job Submission with data-access Requirements
    § The CE is not specified in the JDL, but input data are
    § The RB contacts the Data Management service to find out which SEs have copies of the requested input data sets
    § The RB makes a best-effort match between
      • Computing resources for which the user is authorized
      • SEs “nearby” which can provide the requested data sets via the requested transfer protocol
      • Any optional output SE specified in the job description
    § The RB strategy consists of submitting jobs close to the data!
    § The two main phases of the matchmaking algorithm remain unchanged:
      • Requirements check
      • Rank computation
    § The matchmaking is only performed for CEs satisfying the data-access requirements (i.e. which are close to the data)

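The effect of the data-access restriction can be sketched in Python as an extra filter in front of the ranking step. Everything in this example is hypothetical: the replica table, the CE to SE "closeness" table and the free-CPU numbers stand in for what the Data Management service and the Information System would report.

```python
# Hypothetical topology: which SEs hold a replica, and which SEs are close to each CE.
REPLICAS = {"lfn:/grid/tutor/testbed0-00019": {"se-1", "se-3"}}
CLOSE_SES = {"ce-a": {"se-1"}, "ce-b": {"se-2"}, "ce-c": {"se-3"}}
FREE_CPUS = {"ce-a": 2, "ce-b": 50, "ce-c": 8}

def data_aware_match(lfn, authorized_ces):
    """Restrict to CEs close to a replica of `lfn`, then rank by free CPUs."""
    holders = REPLICAS.get(lfn, set())
    # Keep only CEs that have at least one close SE holding the data.
    near_data = [ce for ce in authorized_ces if CLOSE_SES[ce] & holders]
    if not near_data:
        return None
    return max(near_data, key=lambda ce: FREE_CPUS[ce])

print(data_aware_match("lfn:/grid/tutor/testbed0-00019", ["ce-a", "ce-b", "ce-c"]))
```

In this toy setup ce-b has by far the most free CPUs, but it is never ranked because no nearby SE holds the data. The choice falls on ce-c.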
Proxy Renewal
• Why?
  – To avoid job failure because the job outlived the validity of the initial proxy
• The WMS supports an automatic proxy renewal mechanism, as long as the user credentials are handled by a proxy server
  1. Create a proxy using voms-proxy-init
  2. Register this proxy with the MyProxy server using
       myproxy-init -s <server> [-t <cred> -c <proxy>] -d -n
     server is the server address (e.g. px.matrix.sara.nl)
     cred is the number of hours the proxy should be valid on the server
     proxy is the number of hours renewed proxies should be valid
  3. Short-term proxies can then be used to start jobs, using the grid-proxy-init -hours <hours> command
  4. The proxy is automatically renewed by the WMS, without user intervention, for the whole lifetime of the job

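Whether renewal matters for a given job comes down to simple date arithmetic. The Python sketch below compares an expected runtime with the proxy expiry time. The function name outlives_proxy and the start time are hypothetical, and time spent waiting in a queue is ignored, so a real estimate should be more pessimistic.

```python
from datetime import datetime, timedelta

def outlives_proxy(proxy_expiry, expected_runtime, now):
    """True if a job started at `now` would still run after the proxy expires."""
    return now + expected_runtime > proxy_expiry

# Expiry as printed by voms-proxy-init ("Your proxy is valid until ...").
expiry = datetime(2006, 9, 11, 23, 22, 12)
# Hypothetical start time, twelve hours before the proxy expires.
start = datetime(2006, 9, 11, 11, 22, 12)

print(outlives_proxy(expiry, timedelta(hours=36), start))  # long job
print(outlives_proxy(expiry, timedelta(hours=2), start))   # short job
```

For the 36 hour job the check is true, which is the case where registering the proxy with a MyProxy server pays off.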
MPI jobs
• MPI
  – Message passing
  – Link with a parallel library
  – Run on multiple processors
• gLite
  – Limited support
  – Some sites can run MPI jobs
• JobType
    JobType = "MPICH";
    NodeNumber = 8;
  – Adds MPICH support as a requirement
  – The executable is run in parallel on 8 CPUs

Other JobTypes
• Interactive
  – StdOutput, StdInput and StdError are forwarded to the user
  – By default in an X window
  – Other tools
• Checkpointable
  – Job must save checkpoints
  – Checkpoints can be retrieved
  – Not fully supported yet

Further Information
• The gLite User Guide! http://glite.web.cern.ch/glite/documentation/default.asp
• ClassAd https://www.cs.wisc.edu/condor/classad/
• SARA Grid pages http://www.sara.nl/userinfo/grid/