Globus activities within INFN
Massimo Sgaravatto, INFN Padova, for the INFN Globus group (globus@infn.it)
Globus activities within INFN
- WP “Installation and Evaluation of the Globus Toolkit” of the INFN-GRID Project
- Goal: evaluate the Globus toolkit as a GRID framework providing basic services
  - Which services can be useful?
  - What is necessary to integrate/modify?
  - What is missing?
- Duration: 6 months
  - Results of this first evaluation used to plan future activities
Tasks
- Security
- Information Service
- Resource Manager
- Globus deployment
- Data Access and Migration
- Fault Monitoring
- Execution Environment Management
Status
- Globus installed on ~35 machines in 11 sites
[Figure: map of INFN sites. Labels: TRENTO, TORINO, UDINE, MILANO, PADOVA, LNL, PAVIA, TRIESTE, FERRARA, GENOVA, PARMA, CNAF, BOLOGNA, PISA, FIRENZE, S. Piero, PERUGIA, LNGS, ROMA, L’AQUILA, ROMA 2, LNF, SASSARI, NAPOLI, BARI, LECCE, SALERNO, CAGLIARI, COSENZA, PALERMO, CATANIA, LNS]
Security (GSI)
- Already done:
  - Evaluation of the Globus security architecture
    - We like the general architecture, but:
      - Granting local "identities" based only on certificate subjects allows the existence of multiple valid certificates for the same subject
      - Authentication library not in sync with OpenSSL development
      - Cryptic diagnostics (e.g. "certificate chain too long" when the CA policy check fails)
  - Globus certificates (for hosts and users) signed by INFN certification authority
Security (GSI)
- To do:
  - Definition and implementation of architecture of CAs
    - Up to task force of the DataGrid project
  - Make certificate requests easier
  - Periodic update of CRL
  - “Management” of grid-mapfile updates
    - E.g.: a certain Globus resource must be available to all members of a specific physics group
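The grid-mapfile update for a physics group could be sketched roughly as follows. This is a minimal Python sketch, assuming the usual grid-mapfile line format (a quoted certificate subject followed by a local account name). The function name, the member subjects and the shared local account `cms001` are hypothetical and only serve as illustration.

```python
def grid_mapfile_lines(subjects, local_account):
    """Return one grid-mapfile line per certificate subject.

    Each line maps a quoted certificate subject to a local account.
    Duplicated subjects are dropped and the output order is stable.
    """
    lines = []
    for subject in sorted(set(subjects)):
        lines.append(f'"{subject}" {local_account}')
    return lines


# Hypothetical members of one physics group, mapped to a hypothetical account.
cms_members = [
    "/C=IT/O=INFN/OU=Padova/CN=Example User One",
    "/C=IT/O=INFN/OU=Bologna/CN=Example User Two",
]

for line in grid_mapfile_lines(cms_members, "cms001"):
    print(line)
```

Regenerating the whole file from a group membership list, rather than editing entries by hand, is one possible way to keep a resource available to all members of a group.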
Information Service (GIS)
- Already done:
  - INFN MDS server serving Globus 1.1.1 and 1.1.2 installations
    - Many problems using the “default” American MDS server
  - Definition and implementation of test architecture of GIS (for Globus 1.1.3)
  - Web interface for browsing
GIS Architecture (test phase)
[Figure: GIS hierarchy. Legend: Implemented; Implemented using INFNGRID distribution; To be implemented]
- Top Level INFN GIIS: dc=infn, dc=it, o=grid
  - GIIS Bologna: dc=bo, dc=infn, dc=it, o=grid
  - GIIS Milano: dc=mi, dc=infn, dc=it, o=grid
- INFN ATLAS GIIS: exp=atlas, o=grid
- GRIS
Information Service (GIS)
- To do:
  - Netscape LDAP server as Top level INFN GIIS
  - Tests on performance and scalability
    - Results used to define and implement the GIS architecture
  - Review the information gathered from the various machines and published in the GIS
  - Other tools and interfaces for Grid users and administrators
Resource Management (GRAM)
- Already done:
  - Job submission tests using Globus tools (globusrun, globus-job-submit)
    - GRAM as uniform interface to different underlying resource management systems (LSF, Condor, PBS)
  - Some bugs found and fixed
    - Standard output and error for vanilla Condor jobs
    - globus-job-status
    - …
  - Some bugs can be solved without major re-design and/or reimplementation:
    - For LSF the RSL parameter (count=x) is translated into: bsub -n x …
    - Should be: bsub … (x times)
  - Two major problems:
    - Scalability
    - Fault tolerance
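The LSF translation issue can be illustrated with a small sketch. The Python below is not Globus code: the function names are hypothetical, and only `bsub` and its `-n` option come from the slide. It contrasts the reported translation of (count=x), one submission asking for x slots, with the requested one, x independent submissions.

```python
def bsub_current(count, command):
    """Translation reported in the slide: a single bsub asking for `count` slots."""
    return [["bsub", "-n", str(count)] + command]


def bsub_expected(count, command):
    """Translation the slide asks for: `count` independent bsub submissions."""
    return [["bsub"] + command for _ in range(count)]


job = ["/diskCms/startcmsim.sh"]
print(bsub_current(3, job))   # one command line
print(bsub_expected(3, job))  # three command lines
```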
Globus GRAM Architecture
[Figure: client pc1, Globus front-end machine pc2 running the jobmanager, which hands the job to LSF/Condor/PBS/…]
    pc1% globusrun -b -r pc2.pd.infn.it/jobmanager-xyz -f file.rsl
    file.rsl:
      & (executable=/diskCms/startcmsim.sh)
        (stdin=/diskCms/PythiaOut/filename)
        (stdout=/diskCms/Cmsim/filename)
        (count=1)
Scalability
- One jobmanager for each globusrun
- If I want to submit 1000 jobs?
  - 1000 globusrun
  - 1000 jobmanagers running in the front-end machine!
    % globusrun -b -r pc2.infn.it/jobmanager-xyz -f file.rsl
    file.rsl:
      & (executable=/diskCms/startcmsim.sh)
        (stdin=/diskCms/PythiaOut/filename)
        (stdout=/diskCms/CmsimOut/filename)
        (count=1000)
- It is not possible to specify in the RSL file 1000 different input files and 1000 different output files …
  - $(Process) in Condor
- Problems with job monitoring (globus-job-status)
- Therefore (count=x) with x>1 not very useful!
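What (count=x) cannot express, a distinct input and output file per job, can be sketched by generating one RSL string per job. This Python sketch is only an illustration: the function names and the `input.<i>` / `output.<i>` naming scheme are assumptions, loosely modelled on what `$(Process)` provides in Condor.

```python
def rsl_for_job(index, executable, in_dir, out_dir):
    """Build one RSL string whose input/output file names are unique to this job."""
    return (
        f"& (executable={executable})"
        f" (stdin={in_dir}/input.{index})"
        f" (stdout={out_dir}/output.{index})"
        " (count=1)"
    )


def rsl_batch(n, executable, in_dir, out_dir):
    """Build n RSL strings, one per job, indexed from 0 to n-1."""
    return [rsl_for_job(i, executable, in_dir, out_dir) for i in range(n)]


for rsl in rsl_batch(3, "/diskCms/startcmsim.sh", "/diskCms/PythiaOut", "/diskCms/CmsimOut"):
    print(rsl)
```

Each generated RSL string still needs its own globusrun, and therefore its own jobmanager, so this does not remove the scalability problem; it only makes the per-job file names explicit.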
Fault tolerance
- The jobmanager is not persistent
- If the jobmanager can’t be contacted, Globus assumes that the job(s) have been completed
- Example of problem
  - Submission of n jobs on a cluster managed by a local resource management system
  - Reboot of the front-end machine
  - The jobmanager(s) don’t restart
    - Orphan jobs
    - Globus assumes that the jobs have been successfully completed
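The failure mode can be modelled in a few lines. The Python below is a toy model, not the Globus implementation: the state names and both functions are hypothetical. The first function mirrors the behaviour described in the slide, the second shows one conceivable safer alternative in which an unreachable jobmanager yields an unknown state.

```python
from enum import Enum


class JobState(Enum):
    PENDING = "pending"
    RUNNING = "running"
    DONE = "done"
    UNKNOWN = "unknown"


def infer_state_as_described(jobmanager_reachable, reported_state):
    """Behaviour described in the slide: no jobmanager is read as 'job done'."""
    if not jobmanager_reachable:
        return JobState.DONE
    return reported_state


def infer_state_safer(jobmanager_reachable, reported_state):
    """Hypothetical alternative: stay UNKNOWN until the local system is queried."""
    if not jobmanager_reachable:
        return JobState.UNKNOWN
    return reported_state


# After a front-end reboot the jobmanager is gone while the job still runs.
print(infer_state_as_described(False, JobState.RUNNING))
print(infer_state_safer(False, JobState.RUNNING))
```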
Resource Management (GRAM)
- Already done:
  - Submission of Condor jobs to Globus resources (Condor-G and GlideIn mechanisms)
  - Evaluation of RSL as uniform language to specify resources
    - The RSL syntax model seems suitable to define even complicated resource specification expressions
    - The common set of RSL attributes is often not sufficient
      - The attributes not belonging to the common set are ignored
    - More flexibility is required
      - Resource administrators should be allowed to define new attributes and users should be allowed to use them in resource specification expressions (Condor Class-Ads model)
    - Same language to describe the offered resources and the requested resources (Condor Class-Ads model) seems a better approach
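The idea of one language for both offered and requested resources can be illustrated with a toy matcher. This Python sketch is not the Condor Class-Ads implementation or syntax: the dictionary layout, the attribute names and the administrator-defined `site_tag` attribute are all hypothetical. It only shows that both sides carry attributes plus requirements, and that a match needs each side to accept the other.

```python
def matches(request, offer):
    """Symmetric match: each side's requirements must accept the other's attributes."""
    return (
        request["requirements"](offer["attributes"])
        and offer["requirements"](request["attributes"])
    )


# Offered resource; site_tag stands for an attribute defined by the local administrator.
offer = {
    "attributes": {"os": "linux", "memory_mb": 512, "site_tag": "cms-farm"},
    "requirements": lambda job: job.get("owner_group") == "cms",
}

# Requested resource; the user refers to the administrator-defined attribute.
request = {
    "attributes": {"owner_group": "cms"},
    "requirements": lambda res: res.get("memory_mb", 0) >= 256
    and res.get("site_tag") == "cms-farm",
}

print(matches(request, offer))
```

Because unknown attributes are simply looked up with `get`, a new attribute can be introduced by an administrator without changing the matcher, which is the flexibility the slide asks for.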
Resource Management (GRAM)
- Already done:
  - “Cooperation” between GRAM and GIS
    - The information on characteristics and status of local resources and on jobs is not enough
      - As local resources we must consider farms and not the single workstations
      - Other information (e.g. total and available CPU power) needed
    - The default schema must be integrated with other info provided by the underlying resource management systems or by specific agents
GRAM & Condor & GIS
GRAM & LSF & GIS
- Must be fixed
Jobs & GIS
- Info on Globus jobs published in the GIS:
  - User
    - Subject of certificate
    - Local user name
  - RSL string
  - Globus job id
  - LSF/Condor/… job id
  - Status: Run/Pending/…
Resource Management (GRAM)
- To do:
  - Tests with GRAM API
  - Tests with real applications and real environments (CMS fall production)
    - Already started
    - Memory leak in the job manager?
  - Solve the problems
  - Identify a set of useful attributes of a Condor pool, LSF cluster, PBS cluster that should be reported to the GIS, and integrate the default schema
    - Let’s start with information provided by the underlying resource management system
    - Second step: specific agents
Globus deployment
- Tools to enable local administrators to deploy the GRID software (now Globus 1.1.3 and related packages: OpenLDAP, …)
  - Reduce complexity and manpower necessary for installation
  - Decrease errors during installations
  - Collect bug fixes
  - Include INFN customizations
    - Certificates (for hosts and users) signed by INFN CA
      - … but user certificates signed by Globus CA are accepted as well
    - Preliminary architecture for GIS
First step (July 2000)
- Software distribution available on AFS
  - Fixes for bugs found during first Globus evaluations included
- INFNGRID installation guide
  - Instructions for INFN customizations included
- Scripts to make certain steps (e.g. postinstall operations) automatic
Second step (now)
- Pre-compiled distribution (available now for Linux Red Hat 6.1): INFNGRID 1.1
- Script for installation and deployment: infngrid-install
  - Users decide to use INFN customizations or “standard” setup
    Would you like the INFN setup (Y/N) ?
    (1) Copy INFNGRID tar files from /afs/infn.it/project/infngrid/1.1/Linux to download dir
    (2) Decompress and untar INFNGRID distribution files in install dir
    (3) Configure INFNGRID software
    (4) Globus Setup
    (5) Configure GRAM services Condor and LSF
    (6) Globus local deploy
    (7) GIIS Configuration
    ==========================
Second step
- Script for post install operations: globus-root-setup
    (1) Modify system files and reactivate the inetd daemon
    (2) Change owner to root of certain files for tighter security
    (3) Modify system wide login files
    (4) Start/restart Globus now
    (5) Configure gsi-wuftpd and restart the inetd daemon
- Installation instructions for special environments (configuration of client machines, shared install-directory) included
- List of included bug fixes
- Status
  - Tests performed in different environments (INFN, CERN, FNAL)
  - “Officially” released
  - Available to DATAGRID partners
Next steps
- Configuration of PBS as local resource management system: 1.2
- Support for Solaris 2.6: 1.2
  - We don’t plan (at least now) to support other platforms
- Improvement of current non-pre-compiled distribution
  - Possible use of infngrid-install script for both pre-compiled and non-pre-compiled distribution
- “Unattended” installation
- Management of updates
- Inclusion of GDMP: 1.2
- Inclusion of other GRID software packages?
- Other work will be “triggered” by local administrators and users
Data Management
- Already done:
  - Preliminary tests with GASS and gsiftp
- To do:
  - Tests with GlobusFTP and Replica Catalog Software (Globus Data Grid Alpha Release 2)
GARA
[Figure: test setup. Labels: CISCO 7500; CISCO 7200; sunlab3; sunlab2; VC 100 Mbps; FE; Client; Server; GARA API; GARA Network Resource Manager]
- Preliminary tests considering both network and CPU advance reservation
Other tasks
- Fault Monitoring (HBM)
  - Evaluation of HBM for fault detection (for “system” and “user” processes)
  - Data collectors (implementing automatic recovery mechanisms)
  - … but the HBM package is not seeing active development
- Execution Environment Management (GEM)
  - Evaluation of GEM as service for code migration
  - … but the GEM service now provides only limited capabilities (executable staging)
Other info
- http://www.pd.infn.it/~sgaravat/INFN-GRID/Globus