INFN-GRID production grid, Cristina Vistoli, INFN-CNAF

INFN-GRID production grid
Cristina Vistoli, INFN-CNAF Bologna
INFN-Grid Workshop, 25-27 October 2004, Bari

Summary
• INFN-GRID Release;
• Resources, Supported VOs;
• Services;
• Basic tests before joining the grid;
• Certification and periodic tests activity;
• Calendar and Ticketing System;
• Certification queue;
• GridAT (Grid Application Test).

INFN-GRID Release
• INFN-GRID is a customized release of LCG:
  - All resources are fully managed via LCFGng;
  - INFN-GRID does not support middleware installation without LCFGng.
• The INFN-GRID 2.2.0 release is based on the official LCG-2.2.0 and is 100% compatible with it.

INFN-GRID Release
• Main differences between LCG 2.2.0 and INFN-GRID 2.2.0:
  - Added support for DAG jobs;
  - Added support for AFS on the WorkerNodes;
  - Added support for MPI jobs via home synchronisation with ssh;
  - Documented installation of WNs on a private network.
• Added full VOMS support:
  - The INFNGRID and CDF VOs are completely managed via the VOMS server.

INFN-GRID: Resources and supported VOs
[Table of resources per site and supported VOs not recovered. Footnote: (**) Hyperthreaded]

CPU versus VO
[Figure: CPU versus VO, not recovered]

INFN-GRID: Production Grid service
• RB-BDII scope: Italian Grid
• Service resources are open to all supported VOs
• NEW! DAG Resource Broker/UI: prod-rb-01.pd.infn.it

EGEE/LCG: Production Grid services
• RB-BDII scope: all European resources
• EGEE/LCG RB/UI with DAG
• Service resources are open to all VOs supported by INFN-GRID and EGEE/LCG
• RB: egee-rb-01.cnaf.infn.it (supports the BIOMED VO)

Upgrade/Installation activity
• Testing whether "the grid is working" is not so easy;
• Certification activity in INFN-GRID can be classified into four levels:
  - Local tests by the local resource center managers;
  - Certification tests by the CMT;
  - Monitor tests by the CMT;
  - The fourth level, certification on demand, performed by both the CMT and the application teams.

Certification activity – TEST ZONE
• The Central Management Team (CMT) is responsible for certifying the resource centers: it checks the functionality of a site before the site joins the production grid.
• Although all certification jobs are VO independent, the INFNGRID VO is used to run them.
• In particular, the following are checked:
  - GIIS information consistency;
  - Local job submission (LRMS);
  - Grid submission with Globus (globus-job-run);
  - Grid submission with the ResourceBroker;
  - ReplicaManager functionalities;
  - MPI functionalities.
• In order to certify a site, the CMT uses dedicated grid services:
  - RB & BDII: gridit-cert-rb.cnaf.infn.it
• In this way we avoid having an uncertified site in the production grid services.
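The checklist above can be read as an all-or-nothing gate: a site is certified only if every check passes. The following is a minimal Python sketch of that idea, not the CMT's actual tooling. The check names follow the slide; `certify_site` and the probe callables are hypothetical stand-ins for the real tests (which would submit jobs through the dedicated RB and BDII).

```python
# Check names follow the slide; everything else here is illustrative.
CHECKS = [
    "GIIS information consistency",
    "Local job submission (LRMS)",
    "Grid submission with Globus",
    "Grid submission with the ResourceBroker",
    "ReplicaManager functionalities",
    "MPI functionalities",
]


def certify_site(site, probes):
    """Run every check against a site.

    `probes` maps a check name to a callable taking the site name and
    returning True on success (hypothetical stand-ins for real tests).
    Returns (certified, failed_check_names).
    """
    failed = []
    for name in CHECKS:
        probe = probes.get(name)
        # A check with no probe is treated as failed, so a site cannot
        # be certified by accident when a test is missing.
        if probe is None or not probe(site):
            failed.append(name)
    return (not failed, failed)


if __name__ == "__main__":
    # Example: every probe passes except MPI.
    probes = {name: (lambda site: True) for name in CHECKS}
    probes["MPI functionalities"] = lambda site: False
    certified, failed = certify_site("example-site", probes)
    print(certified, failed)
```

Collecting all failures instead of stopping at the first one gives the site manager a complete report from a single certification run.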

Periodic test
• We periodically submit certification jobs to the sites in order to proactively find troubles before users find them.
• The CMT and the system managers can announce downtime of their resources via the web by inserting a "Downtime advice".
• The Calendar shows a snapshot of the Production Service status.

Ticketing system
• The INFN-GRID ticketing system is used:
  - by users to ask questions or to report troubles;
  - by system managers to communicate about common grid tasks (e.g. upgrading to a new grid release);
  - by the CMT to notify a system manager of a problem.
• Support Groups are "helper" groups; they exist to resolve the obvious problems arising with the growth of the grid:
  - Support Grid Services Group (RB, RLS, VOMS, GridICE, etc.);
  - Support VO Services Group (one for each VO);
  - Support VO Applications Group (one for each VO);
  - Support Site Group (one for each site).
• Operative Groups:
  - Operative Central Management Team (CMT);
  - Operative Release & Deployment Team.
Users -> Create a ticket
Supporters/Operatives -> Open the ticket
Users and/or Supporters/Operatives -> Update an open ticket
Supporters/Operatives -> Close the ticket
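The four workflow steps above amount to a small state machine with role checks. The sketch below is one way to model it in Python and is not taken from the actual ticketing system; the state names, the `Ticket` class and the `apply` method are all hypothetical.

```python
from enum import Enum


class State(Enum):
    CREATED = "created"
    OPEN = "open"
    CLOSED = "closed"


USER = "user"
STAFF = "supporter/operative"

# action -> (roles allowed to perform it, state required, resulting state)
TRANSITIONS = {
    "open": ({STAFF}, State.CREATED, State.OPEN),
    "update": ({USER, STAFF}, State.OPEN, State.OPEN),
    "close": ({STAFF}, State.OPEN, State.CLOSED),
}


class Ticket:
    def __init__(self, summary):
        # Creating the ticket is the user's step in the workflow.
        self.summary = summary
        self.state = State.CREATED
        self.history = [("create", USER)]

    def apply(self, action, role):
        """Perform an action if the role and the current state allow it."""
        roles, required, result = TRANSITIONS[action]
        if role not in roles:
            raise PermissionError(f"{role} may not {action} a ticket")
        if self.state is not required:
            raise ValueError(
                f"cannot {action} a ticket in state {self.state.value}"
            )
        self.state = result
        self.history.append((action, role))


if __name__ == "__main__":
    ticket = Ticket("jobs fail on the cert queue")
    ticket.apply("open", STAFF)
    ticket.apply("update", USER)
    ticket.apply("close", STAFF)
    print(ticket.state.value, ticket.history)
```

Keeping the rules in one table makes it easy to see that only supporters and operatives can open or close a ticket, while anyone involved can update it once it is open.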

Why a "cert" queue?
• A CE could exist in many BDIIs with different purposes (EGEE, LCG, VO specific).
• After a site upgrade, as soon as the queues were opened, many jobs arrived from anywhere at an uncertified (and insecure) site, making its full certification impossible.
• To avoid this, all sites joining INFN-GRID have a cert queue (with both PBS and LSF):
  - High priority queue;
  - Open only to the INFNGRID VO;
  - With a low max CPU time (10 minutes);
  - After site installation/upgrade, only the cert queue is opened;
  - After the certification tests by the CMT, all the other queues are opened.
• In addition, this way all periodic test jobs submitted by the ROC to the cert queue always have a higher priority than the other jobs.
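The cert queue policy above can be summarised in a few lines of Python. This is an illustrative model only, not PBS or LSF configuration: the `Queue` class and both functions are hypothetical, and only the VO name and the 10-minute CPU limit come from the slide.

```python
from dataclasses import dataclass

CERT_VO = "infngrid"          # the cert queue is open only to this VO
CERT_MAX_CPU_MINUTES = 10     # low max CPU time stated on the slide


@dataclass(frozen=True)
class Queue:
    name: str
    is_cert: bool = False


def open_queues(queues, certified):
    """Return the names of the queues that should be open.

    Before certification only the cert queue is open; once the CMT
    certification tests have passed, every queue is opened.
    """
    return [q.name for q in queues if q.is_cert or certified]


def cert_queue_accepts(vo, requested_cpu_minutes):
    """Admission rule for the cert queue: right VO and a short job."""
    return (
        vo.lower() == CERT_VO
        and requested_cpu_minutes <= CERT_MAX_CPU_MINUTES
    )


if __name__ == "__main__":
    # Queue names are made up for the example.
    queues = [Queue("cert", is_cert=True), Queue("short"), Queue("long")]
    print(open_queues(queues, certified=False))
    print(open_queues(queues, certified=True))
    print(cert_queue_accepts("INFNGRID", 5))
```

In a real deployment these rules would be expressed as queue attributes in the batch system (priority, access control, CPU time limit) rather than in application code.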

BDII - ROC setup
• All sites certified by the ROC team using the test zone are added to the INFN-GRID production BDII, accessible via web.
• Each ROC should create, manage and publish via web the BDII configuration for its region:
  - Similar to http://grid-it.cnaf.infn.it/fileadmin/bdii/gridit-bdiiupdate.conf
• The ROC is "authoritative" for its BDII: it holds the master copy of the CEs and SEs of its region.
  - Operations related to the ROC resource centers are reflected in the BDII content (scheduled downtime, planned upgrade, site certification failure).
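To illustrate how operations could be reflected in the BDII content, here is a hedged Python sketch that regenerates a region's site list while leaving out sites that are in downtime or have failed certification. It assumes a file format of one `SITE-NAME ldap-URL` pair per line with `#` comment lines; the slide does not show the format of the referenced file, so treat this as an assumption. The site names, hostnames and function names are made up.

```python
def parse_bdii_conf(text):
    """Parse 'NAME URL' lines into a dict, skipping blanks and comments.

    The line format is an assumption, not taken from the slides.
    """
    sites = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        name, url = line.split(None, 1)
        sites[name] = url.strip()
    return sites


def render_bdii_conf(sites, excluded):
    """Write the site list back, dropping excluded sites.

    `excluded` holds the names of sites in scheduled downtime, under
    upgrade, or failing certification.
    """
    lines = [
        f"{name} {url}"
        for name, url in sorted(sites.items())
        if name not in excluded
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical sites and hosts, for illustration only.
    master = """\
# region master copy
SITE-A ldap://ce.site-a.example.org:2135/mds-vo-name=SITE-A,o=grid
SITE-B ldap://ce.site-b.example.org:2135/mds-vo-name=SITE-B,o=grid
"""
    sites = parse_bdii_conf(master)
    print(render_bdii_conf(sites, excluded={"SITE-B"}))
```

Publishing the rendered file on the web would let the production BDII pick up the change, which matches the idea of the ROC owning the master copy for its region.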

GridAT - Grid Application Test
The main goal of GridAT is to provide a general and flexible framework for VO application tests in a grid system. It allows a grid site to be tested from the VO viewpoint. Results are stored in a central database and can be browsed on a web page, so GridAT will also be used for certification and test activity.

Ongoing activities
• Support system: integration into EGEE and distributed support coverage
• Evolution of GridICE for job monitoring, application monitoring and SLA monitoring; configuring notifications is urgent
• Integration of DGAS into INFN-GRID; administration of the accounting system
• Porting of INFN-GRID to SL: new installation and configuration system
• Operation support for the EGEE/LCG infrastructure, in rotation among IT/CERN/UK/FR
• Training: basic and advanced courses
• Extension of the infrastructure to non-INFN sites: Spaci, Enea, etc.
• Policy administration
• Pre-production service to define the migration plan to gLite
• Middleware certification testbed
• Operational requirements for the middleware

Open issues
• Interaction with the VOs and the users
• Interaction with EGEE/LCG NA4, JRAx, etc.
• Resource allocation policy

Useful links
• INFN: http://grid-it.cnaf.infn.it/
• INFN test and certification: http://grid-it.cnaf.infn.it/index.php?sitetest&type=1
• INFN GridICE: http://grid-it.cnaf.infn.it/index.php?grisview&type=1
• INFN Production Grid Support: http://grid-it.cnaf.infn.it/index.php?id=51&type=1
• Contact:
  - grid-manager@infn.it
  - Grid-release@infn.it
  - Ticket for operational issues