Workload Management Workpackage
Massimo Sgaravatto
INFN Padova

Overview
- Goal: define and implement a suitable architecture for distributed scheduling and resource management in a GRID environment
- Large heterogeneous environment
  - PC farms and not supercomputers used in HEP
  - Large numbers (thousands) of independent users in many different sites
  - Different applications with different requirements
    - HEP Monte Carlo productions, reconstructions and production analyses
      - “Scheduled” activities
      - Goal: throughput maximization
    - HEP individual physics analyses
      - “Chaotic”, non-predictable activities
      - Goal: latency minimization
    - …

Overview
- Many challenging issues:
  - Optimizing the choice of execution location based on the availability of data, computation and network resources
  - Optimal co-allocation and advance reservation of CPU, data, network
  - Uniform interface to possibly different local resource management systems
  - Priorities, policies on resource usage
  - Reliability
  - Fault tolerance
  - Scalability
  - …
- INFN responsibility in DataGrid

Tasks
- Job resource specification and job description
  - Method to define and publish the resources required by a job
  - Job control language (submission language, API, GUI)
- Partitioning programs for parallel execution
  - “Decomposition” of single jobs into multiple, “smaller” jobs that can be executed in parallel
  - Exploitation of task and data parallelism
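The “decomposition” of a single job into smaller jobs can be illustrated with a minimal sketch. It assumes a job that processes a fixed number of independent events (as in a Monte Carlo production) and splits it into sub-jobs by event range. The names `SubJob` and `split_job` and the event counts are hypothetical, chosen only for illustration; they do not come from the workpackage.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class SubJob:
    """One slice of a larger job (hypothetical structure)."""
    index: int        # position of the sub-job in the sequence
    first_event: int  # first event number handled by this sub-job
    n_events: int     # number of events handled by this sub-job


def split_job(total_events: int, events_per_subjob: int) -> List[SubJob]:
    """Split a job of `total_events` independent events into sub-jobs
    of at most `events_per_subjob` events each (data parallelism)."""
    if total_events < 0 or events_per_subjob <= 0:
        raise ValueError("total_events must be >= 0 and events_per_subjob > 0")
    subjobs: List[SubJob] = []
    first = 0
    while first < total_events:
        n = min(events_per_subjob, total_events - first)
        subjobs.append(SubJob(index=len(subjobs), first_event=first, n_events=n))
        first += n
    return subjobs


if __name__ == "__main__":
    # Assumed example: 1000 events in chunks of 300 gives 4 sub-jobs,
    # the last one covering the remaining 100 events.
    for subjob in split_job(1000, 300):
        print(subjob)
```

Each sub-job can then be submitted independently. A split like this only works when events do not depend on each other, which is the assumption made here.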

Tasks
- Scheduling
  - Definition and implementation of scheduling policies to find the best match between job requirements and available resources
  - Co-allocation and advance reservation
- Resource management
- Services
  - Bookkeeping, accounting, logging, authentication, authorization
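Finding the “best match between job requirements and available resources” can be sketched as a two-step selection: filter the resources that satisfy the job's requirements, then rank the survivors. This is only an illustration, loosely in the spirit of matchmaking; the function `best_match`, the attribute names and the farm descriptions are all assumptions, not part of any actual scheduler.

```python
from typing import Any, Callable, Dict, List, Optional

Resource = Dict[str, Any]  # a resource described as attribute/value pairs


def best_match(
    requirements: Callable[[Resource], bool],
    rank: Callable[[Resource], float],
    resources: List[Resource],
) -> Optional[Resource]:
    """Return the highest-ranked resource satisfying the requirements,
    or None when no resource qualifies."""
    candidates = [r for r in resources if requirements(r)]
    if not candidates:
        return None
    return max(candidates, key=rank)


# Hypothetical farm descriptions, invented for this example.
FARMS: List[Resource] = [
    {"name": "farm-a", "lrms": "LSF", "os": "linux", "free_cpus": 12},
    {"name": "farm-b", "lrms": "PBS", "os": "linux", "free_cpus": 40},
    {"name": "farm-c", "lrms": "Condor", "os": "solaris", "free_cpus": 64},
]


def linux_with_10_cpus(resource: Resource) -> bool:
    """Job requirement: a Linux farm with at least 10 free CPUs."""
    return resource["os"] == "linux" and resource["free_cpus"] >= 10


def more_free_cpus(resource: Resource) -> float:
    """Job preference: the more free CPUs, the better."""
    return resource["free_cpus"]


if __name__ == "__main__":
    chosen = best_match(linux_with_10_cpus, more_free_cpus, FARMS)
    print(chosen["name"] if chosen else "no match")  # farm-b
```

A real policy would also have to handle co-allocation, advance reservation and usage priorities, none of which this sketch attempts.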

Effort breakdown in man-months (mm)

Partner    Funded  Unfunded  Total
INFN          216       184    400
DATAMAT       108         0    108
CESnet         72        72    144
PPARC           0        18     18
Total         396       274    670
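The figures on this slide can be checked for internal consistency. The short sketch below assumes the reading in which each partner has a funded and an unfunded figure, with per-partner and overall totals. That assignment of numbers to columns is a reconstruction of the flattened slide, not something the slide states explicitly.

```python
# Funded and unfunded effort per partner, in man-months,
# as read from the slide (column assignment is a reconstruction).
EFFORT = {
    "INFN": (216, 184),
    "DATAMAT": (108, 0),
    "CESnet": (72, 72),
    "PPARC": (0, 18),
}

# Totals quoted on the slide.
QUOTED_PARTNER_TOTALS = {"INFN": 400, "DATAMAT": 108, "CESnet": 144, "PPARC": 18}
QUOTED_FUNDED, QUOTED_UNFUNDED, QUOTED_TOTAL = 396, 274, 670

funded = sum(f for f, _ in EFFORT.values())
unfunded = sum(u for _, u in EFFORT.values())
partner_totals = {name: f + u for name, (f, u) in EFFORT.items()}

if __name__ == "__main__":
    print("funded:", funded, "unfunded:", unfunded, "total:", funded + unfunded)
    print("row totals consistent:", partner_totals == QUOTED_PARTNER_TOTALS)
    print("column totals consistent:",
          (funded, unfunded, funded + unfunded)
          == (QUOTED_FUNDED, QUOTED_UNFUNDED, QUOTED_TOTAL))
```

Under this reading every row and column adds up to the quoted totals, which supports the reconstruction.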

Workload Management in the INFN-GRID project
- Integration, adaptation and deployment of middleware developed within the DataGrid project
  - GRID software must enable physicists to run their jobs using all the available GRID resources in a “transparent” way
- HEP applications classified in 3 different “classes”, with incremental level of complexity
  - Workload management system for Monte Carlo productions
    - Goal: throughput maximization
    - Implementation strategy: code migration (moving the application where the processing will be performed)
  - Workload management system for data reconstruction and production analysis
    - Goal: throughput maximization
    - Implementation strategy: code migration + data migration (moving the data where the processing will be performed, and collecting the outputs in a central repository)
  - Workload management system for individual physics analysis
    - “Chaotic” processing
    - Goal: latency minimization
    - Implementation strategy: code migration + data migration + remote data access (accessing data remotely) for client/server applications

First Activities and Results
- CMS-HLT use case (Monte Carlo production) analyzed in terms of GRID requirements and GRID tools availability
- Discussions with Globus team and Condor team
  - Good and productive collaborations already in place
- Definition of a possible high throughput workload management system architecture
  - Use of Globus and Condor mechanisms
  - But major developments needed

High throughput workload management system architecture (simplified design)
[Figure: architecture diagram. Labels recovered from the diagram, top to bottom:]
- Submit jobs (using ClassAds) to the Master
- Master: resource discovery through the Grid Information Service (GIS), plus other info
  - GIS: information on characteristics and status of local resources
  - Master chooses in which Globus resources the jobs must be submitted
- condor_submit (Globus Universe) from the Master to Condor-G
  - Condor-G able to provide reliability
  - Use of Condor tools for job monitoring, logging, …
- globusrun from Condor-G to Globus GRAM at Site 1, Site 2, Site 3
  - Globus GRAM as uniform interface to different local resource management systems
- Local Resource Management Systems (CONDOR, LSF, PBS) in front of the farms
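The submission step in this design (condor_submit in the Globus universe, once the Master has chosen a Globus resource) could be driven by a generated submit description. The sketch below renders such a file as text. The keywords `universe = globus` and `globusscheduler` are assumed to be the ones used by Condor-G of that period and should be checked against the installed Condor manual. The helper name, the host name and the file names are hypothetical.

```python
def render_submit_description(executable: str, gram_contact: str, n_jobs: int = 1) -> str:
    """Render a Condor-G style submit description as text.

    `gram_contact` is the Globus GRAM contact chosen by the Master,
    for example "<host>/jobmanager-lsf" (assumed naming convention).
    """
    if n_jobs < 1:
        raise ValueError("n_jobs must be at least 1")
    lines = [
        "universe = globus",                  # assumed Condor-G keyword
        f"globusscheduler = {gram_contact}",  # assumed Condor-G keyword
        f"executable = {executable}",
        "output = job.$(Process).out",        # one output file per queued job
        "error = job.$(Process).err",
        "log = job.log",                      # log file read by Condor tools
        f"queue {n_jobs}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical GRAM contact; the host name is invented.
    print(render_submit_description("/bin/hostname",
                                    "farm.site1.example.org/jobmanager-lsf",
                                    n_jobs=3))
```

The generated text would be written to a file and passed to condor_submit. Choosing `gram_contact` is the Master's job and is not shown here.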

First Activities and Results
- Ongoing activities in putting together the various building blocks
  - Globus deployment
    - Installed on ~35 hosts in 11 different sites
    - INFNGRID distribution toolkit to make Globus deployment easier and more automatic
    - INFN customizations
      - User and host certificates signed by INFN CA
      - Preliminary architecture for GIS

GIS Architecture (test phase)
[Figure: GIIS/GRIS hierarchy. Labels recovered from the diagram:]
- Legend: Implemented; Implemented using INFNGRID distribution; To be implemented
- Top Level INFN GIIS: dc=infn, dc=it, o=grid
- Bologna GIIS: Dc=bo, Dc=infn, dc=it, o=grid
- Milano GIIS: Dc=mi, Dc=infn, dc=it, o=grid
- INFN ATLAS GIIS: Exp=atlas, o=grid
- GRIS
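The hierarchy in this diagram follows the LDAP naming of the nodes: a site GIIS sits under the GIIS whose distinguished name (DN) is a suffix of its own. A minimal sketch of that relationship, using the DNs from the slide, is given below. The helper names are hypothetical, and the parser is deliberately naive (it splits on commas and does not handle escaped characters).

```python
from typing import List, Optional, Tuple

RDN = Tuple[str, str]  # (attribute name, value)


def parse_dn(dn: str) -> List[RDN]:
    """Split a DN into (attribute, value) pairs. Attribute names are
    lower-cased, since LDAP treats them as case-insensitive.
    Naive: escaped commas or equals signs are not supported."""
    parts: List[RDN] = []
    for rdn in dn.split(","):
        key, _, value = rdn.strip().partition("=")
        parts.append((key.strip().lower(), value.strip()))
    return parts


def format_dn(parts: List[RDN]) -> str:
    return ", ".join(f"{key}={value}" for key, value in parts)


def parent_dn(dn: str) -> Optional[str]:
    """DN of the entry one level up, or None for a single-component DN."""
    parts = parse_dn(dn)
    return format_dn(parts[1:]) if len(parts) > 1 else None


def is_under(child: str, ancestor: str) -> bool:
    """True when `ancestor` is a suffix of `child` (or equal to it)."""
    c, a = parse_dn(child), parse_dn(ancestor)
    return len(c) >= len(a) and c[len(c) - len(a):] == a


if __name__ == "__main__":
    top = "dc=infn, dc=it, o=grid"
    print(parent_dn("Dc=bo, Dc=infn, dc=it, o=grid"))  # dc=infn, dc=it, o=grid
    print(is_under("Dc=mi, Dc=infn, dc=it, o=grid", top))  # True
    print(is_under("Exp=atlas, o=grid", top))  # False
```

By this test the Bologna and Milano entries fall under the top-level INFN GIIS, while the ATLAS entry belongs to a separate branch under o=grid.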

First Activities and Results
- Evaluation of Globus GRAM
  - Tests with job submissions on remote resources
    - Globus GRAM as uniform interface to different underlying resource management systems (LSF, Condor, PBS)
    - Problems related to scalability and fault tolerance (Globus jobmanager)
  - Evaluation of Globus RSL as uniform language to describe resources
    - More flexibility is required (Condor ClassAds model)
  - “Cooperation” between GRAM and GIS
    - The information on characteristics and status of local resources and on jobs is not enough
    - As local resources we must consider Farms and not the single workstations
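For readers unfamiliar with RSL, a job request is a conjunction of attribute=value relations, written as `&(attribute=value)(attribute=value)…`. The sketch below builds such a string from a dictionary. It covers only this simplest form and is an illustration, not a complete RSL implementation. The helper name `to_rsl` is hypothetical, and values containing double quotes are rejected rather than escaped, to avoid guessing at the quoting rules.

```python
from typing import Any, Dict


def to_rsl(attributes: Dict[str, Any]) -> str:
    """Render a flat dictionary as a conjunctive RSL request string.

    Values containing whitespace are wrapped in double quotes; values
    containing a double quote are rejected (simplification)."""
    relations = []
    for name, value in attributes.items():
        text = str(value)
        if '"' in text:
            raise ValueError(f"unsupported character in value for {name!r}")
        if any(ch.isspace() for ch in text):
            text = f'"{text}"'
        relations.append(f"({name}={text})")
    return "&" + "".join(relations)


if __name__ == "__main__":
    request = to_rsl({"executable": "/bin/hostname", "count": 4})
    print(request)  # &(executable=/bin/hostname)(count=4)
```

A flat list of relations like this says what the job needs but cannot express preferences or conditions over resource attributes. That is the kind of flexibility the slide attributes to the Condor ClassAds model.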

First Activities and Results
- Evaluation of Condor-G
  - It works, but some problems must be fixed:
    - Errors are very difficult to understand
    - Problems with log files
    - Problems with scalability on the submitting machine
  - Condor-G is not able to provide fault tolerance and robustness (because Globus doesn’t provide these features)
    - Fault tolerance only on the submitting side
  - Condor team is already working to fix some of these problems

Some next steps
- Tests with real applications and real environments
  - CMS fall production
- Evaluation of the new Condor-G implementation (when ready)
- GIS – ClassAds translator (Globus team?)
- Face the problems!!
- Master development!!!

Other info
- http://www.infn.it/grid