Service Availability Monitor tests for ATLAS
• Current Status
• Tests in development
• To Do
Alessandro Di Girolamo, CERN IT/PSS-ED
4 Dec 2007
SAM Critical Tests: Current Status
Now running the standard OPS tests using ATLAS credentials (i.e. the original SAM tests run under the ATLAS VO)
• List of sites from GOCDB
• SE & SRM:
  § put: lcg-cr using the cern-prod LFC, files in the SAM test directory
  § get: lcg-cp from the site to the SAM UI
  § del: lcg-del, to clean the catalog and the storage
• CE
  § Check CA RPMs version
  § Job submission tests on a WN
  § VO swdir (sw installation directory)
• LFC
  § lfc-ls, lfc-mkdir
• FTS
  § glite-transfer-channel-list, Information System configuration and publication
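The put/get/del cycle of the SE & SRM test can be sketched as a sequence of three command lines. This is only an illustration, not the actual SAM sensor code: the function name, the SE host, the LFN and the local paths are hypothetical, and the options shown (`--vo`, `-d`, `-l`, `-a`) are the commonly used lcg_util ones, which should be checked against the installed man pages. The LFC to use is normally selected through the `LFC_HOST` environment variable.

```python
def put_get_del_commands(se_host, lfn, local_src, local_dst, vo="atlas"):
    """Build the three command lines of a put/get/del cycle.

    Nothing is executed here; the caller decides how to run the
    commands (for example with subprocess.run) and how to map the
    exit codes to a test result.
    """
    return [
        # put: copy a local file to the SE and register it in the catalog
        ["lcg-cr", "--vo", vo, "-d", se_host, "-l", lfn, f"file://{local_src}"],
        # get: copy the registered file back to the machine running the test
        ["lcg-cp", "--vo", vo, lfn, f"file://{local_dst}"],
        # del: remove all replicas and the catalog entry
        ["lcg-del", "--vo", vo, "-a", lfn],
    ]


# Hypothetical values, for illustration only.
commands = put_get_del_commands(
    se_host="srm.site-a.example",
    lfn="lfn:/grid/atlas/sam-test/testfile",
    local_src="/tmp/sam-in.txt",
    local_dst="/tmp/sam-out.txt",
)
for command in commands:
    print(" ".join(command))
```

Keeping the command construction separate from the execution makes the sequence easy to inspect and to dry-run on a UI without touching any storage.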
Work in progress
We are developing and testing ATLAS-specific SAM tests in order to:
• monitor the availability of the ATLAS critical Site Services
• verify the correct installation and the proper functioning of the ATLAS software on each site
SE & SRM & CE endpoints definition: intersection between GOCDB and TiersOfATLAS (the ATLAS-specific site configuration file with the Cloud Model)
  § different services and endpoints might need to be tested using different VOMS credentials
  § ATLAS endpoints and paths must be explicitly tested (e.g. the /dq2 area)
  § the LFC of the Cloud (residing in the T1) is used
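The endpoint definition above is a set intersection. A minimal sketch, assuming both sources have already been reduced to plain lists of host names (the function name and the example hosts are hypothetical, and the real GOCDB and TiersOfATLAS formats are not modelled here):

```python
def select_endpoints(gocdb_endpoints, toa_endpoints):
    """Return the endpoints listed in both GOCDB and TiersOfATLAS.

    Host names are compared case-insensitively and without
    surrounding whitespace, since the two sources may not spell
    them identically.
    """
    def normalize(host):
        return host.strip().lower()

    gocdb = {normalize(host) for host in gocdb_endpoints}
    toa = {normalize(host) for host in toa_endpoints}
    return sorted(gocdb & toa)


# Hypothetical host names, for illustration only.
gocdb_hosts = ["srm.site-a.example", "ce.site-a.example", "srm.site-b.example"]
toa_hosts = ["SRM.site-a.example ", "srm.site-c.example"]

selected = select_endpoints(gocdb_hosts, toa_hosts)
print(selected)  # ['srm.site-a.example']
```

Endpoints known to only one of the two sources are dropped, so a site that ATLAS uses but that is missing from GOCDB would not be tested under this rule.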
Development: Tests and Alarms
• SE & SRM (run centrally from the SAM UI):
  – put: lcg-cr with the Cloud LFC, with and without using BDII information
  – get: lcg-cp
• CE (job submitted to each ATLAS CE):
  – keep running a large part of the OPS suite
  – for ATLAS Tier 1 and Tier 2:
    • Check the presence of the required version of the ATLAS sw
    • Compile and execute a real analysis job based on a sample dataset
    • Test put/get to local storage via native protocols (dccp, rfcp …)
Alarm system:
• SE / SRM / CE tests failing: site contact persons will be alerted via the SAM Alarm System (mail and/or SMS)
• Grid Services (FTS, LFC etc.) tests failing: alarms to
  § the Service responsible
  § the ATLAS dedicated services (DDM, etc.) that use those services
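The alarm routing rule above can be written down as a small lookup. This is a sketch under stated assumptions, not the SAM Alarm System itself: the function, its arguments and the example addresses are hypothetical, and the Grid Services set lists only FTS and LFC because those are the two the slide names.

```python
SITE_SERVICES = {"SE", "SRM", "CE"}
GRID_SERVICES = {"FTS", "LFC"}  # the slide says "FTS, LFC etc."


def alarm_recipients(service_type, site_contacts, service_responsible, dependent_services):
    """Return who should be alerted when a test of service_type fails."""
    if service_type in SITE_SERVICES:
        # Site-level failure: alert the site contact persons.
        return list(site_contacts)
    if service_type in GRID_SERVICES:
        # Grid service failure: alert the service responsible and the
        # ATLAS services that depend on it.
        return [service_responsible] + list(dependent_services)
    # Unknown service type: nobody is alerted in this sketch.
    return []


# Hypothetical recipients, for illustration only.
site_alarm = alarm_recipients("SRM", ["site-admin@site-a.example"], "fts-support@example", ["ddm-ops@example"])
grid_alarm = alarm_recipients("FTS", ["site-admin@site-a.example"], "fts-support@example", ["ddm-ops@example"])
print(site_alarm)  # ['site-admin@site-a.example']
print(grid_alarm)  # ['fts-support@example', 'ddm-ops@example']
```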
Reliability & Availability results
SAM Critical Tests not reliable for:
– France: BDII configuration (the ATLAS endpoint should be set explicitly)
– NDGF/BNL: different service setup
SAM Critical Tests, failures in the last months:
– FZK: real SRM failures; problems under investigation with the site responsible
– SARA: (mainly) unscheduled network problems
To Do
• The new ATLAS-specific tests (now running in pre-production) will be more realistic for the Experiment
• Improve the completeness of the monitoring information
  § Information across TiersOfATLAS, GOCDB and BDII
  § ATLAS Cloud topology view
  § Integration with Ganga Robot and other ATLAS tools
  § Integration with the ATLAS dashboard
Backup slides
• …
SAM ATLAS SE (SRM) tests
All SRM endpoints (v1 and v2) can be considered as SEs:
• SE tests are sent to the list of SRM endpoints resulting from the intersection of ToA & GOCDB
SAM results on Gridmap
Topology:
• Possibility to include the ATLAS Cloud view
• Possibility to change the metric used for the site size
The collaboration with the Gridmap developers has already started.
Thanks to CERN openlab / EDS
Other SAM tests
Many more non-critical tests are running
Site Availability: T0/T1
Site Services Availability (Site Services X = CE, SRM):
• Down: if all services of type X of a site are Down
• Ok: if all services of type X are Ok
• Degraded: if some services of type X are Ok and others are Down
Site BDII: Ok or Down, taken from the status of the site BDII instance
Site Availability: the AND of each single Site Services Availability
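The definitions above can be sketched in a few lines. Two points are assumptions, not taken from the slide: the AND over three-valued statuses is read as "the worst status wins" (Ok only if everything is Ok, Down as soon as one service type is Down, Degraded otherwise), and a service type with no instances is rejected rather than given a status. All names are hypothetical.

```python
from enum import IntEnum


class Status(IntEnum):
    # Ordered so that min() picks the worst status.
    DOWN = 0
    DEGRADED = 1
    OK = 2


def service_type_status(instances_ok):
    """Status of one service type (CE or SRM) from its instances.

    instances_ok is a list of booleans, True meaning the instance is Ok.
    """
    if not instances_ok:
        raise ValueError("no instances given for this service type")
    if all(instances_ok):
        return Status.OK
    if not any(instances_ok):
        return Status.DOWN
    return Status.DEGRADED


def site_availability(ce_ok, srm_ok, bdii_ok):
    """Combine CE, SRM and site BDII statuses (assumed: worst status wins)."""
    statuses = [
        service_type_status(ce_ok),
        service_type_status(srm_ok),
        Status.OK if bdii_ok else Status.DOWN,  # the site BDII is only Ok or Down
    ]
    return min(statuses)


print(site_availability([True, True], [True], True).name)    # OK
print(site_availability([True, False], [True], True).name)   # DEGRADED
print(site_availability([True], [False, False], True).name)  # DOWN
```

A stricter reading of the AND (anything other than all Ok counts as Down) would only change the last line of `site_availability`.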
Site Availability: one example
Storage Space Monitor via SAM
A specific SAM test could be sent to the VOBOXes to check the storage disk space, as is already done for the IT cloud
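What such a disk space check could look like is sketched below. It only inspects a locally mounted path with the Python standard library; the function name, the 10% threshold and the "ok"/"error" labels are assumptions for illustration, and the test actually used for the IT cloud is not described in these slides.

```python
import shutil


def check_free_space(path, min_free_fraction=0.10):
    """Compare the free fraction of the filesystem holding path to a threshold.

    Returns a (status, free_fraction) pair, where status is "ok" or "error".
    """
    usage = shutil.disk_usage(path)
    # Guard against filesystems that report a total size of zero.
    free_fraction = usage.free / usage.total if usage.total else 0.0
    status = "ok" if free_fraction >= min_free_fraction else "error"
    return status, free_fraction


status, free_fraction = check_free_space(".")
print(f"{status}: {free_fraction:.1%} free")
```

A check of remote storage space would need to query the storage system itself instead of a local mount point.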