SEE-GRID-SCI Operations Procedures and Tools
www.see-grid-sci.eu

Regional SEE-GRID-SCI Training for Site Administrators
Institute of Physics Belgrade, March 5-6, 2009

Antun Balaz
Institute of Physics Belgrade, Serbia
antun@scl.rs

The SEE-GRID-SCI initiative is co-funded by the European Commission under the FP7 Research Infrastructures contract no. 211338

Overview

• SEE-GRID operational and monitoring tools (and their relation to EGEE tools)
  § HGSM/GOCDB
  § Helpdesk/GGUS
  § BBmSAM/SAM
  § GStat
  § Nagios/CIC portal
  § Accounting portal
• Downtime procedures
• Upgrade procedures
• Grid-Operator-On-Duty (GOOD)
• Service Level Agreement (SLA)

Operational & monitoring tools

[Diagram: HELPDESK, GSTAT (Taiwan), VOMS, SAM, BBmSAM, BDII, Accounting, NAGIOS, R-GMA, HGSM]

HGSM/GOCDB (1)

HGSM/GOCDB (2)

HGSM/GOCDB (3)

• Static database containing all relevant data about all SEE-GRID and AEGIS sites
• Must be kept synchronized with the real situation
  § All sheets must be properly updated:
    - Site Info
    - Contacts
    - Site Nodes
    - Downtimes
  § XML dumps: the easiest way to apply changes is to download an XML dump of the data, edit it appropriately, and then upload the new XML file; this also makes it possible to keep backups
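
The download, edit, upload cycle for XML dumps can be partly scripted. The following is a minimal sketch of the local edit step only, assuming the dump has already been downloaded by hand. The element and attribute names (`site`, `name`, and a child element per field) are hypothetical placeholders, not the real HGSM schema, and must be adapted to an actual dump.

```python
import shutil
import xml.etree.ElementTree as ET
from pathlib import Path


def update_site_field(dump_path, site_name, field, new_value):
    """Back up an XML dump, then set one field of one site in place.

    Assumption: sites are stored as <site name="..."> elements with one
    child element per field. This layout is hypothetical; inspect a real
    HGSM dump and adjust the names before using this.
    """
    dump_path = Path(dump_path)
    backup_path = dump_path.with_name(dump_path.name + ".bak")
    shutil.copy2(dump_path, backup_path)  # keep the untouched dump as a backup

    tree = ET.parse(str(dump_path))
    for site in tree.getroot().iter("site"):
        if site.get("name") == site_name:
            node = site.find(field)
            if node is None:
                node = ET.SubElement(site, field)  # create the field if absent
            node.text = new_value
            tree.write(str(dump_path), encoding="utf-8", xml_declaration=True)
            return backup_path
    raise KeyError(f"site {site_name!r} not found in {dump_path}")
```

The edited file would then be uploaded through the HGSM web interface as described on the slide; the `.bak` copy serves as the backup.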

HGSM/GOCDB (4)

• The essential fields in HGSM:
  § GIIS URL
  § Monitoring: Yes
  § Status: certified
  § Type: seegrid_production, seegrid_certified, egee_production
  § Site Commitments
  § Contacts and administrators
• All fields must have correct values!
• URL: https://hgsm.grid.org.tr/
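
A quick self-check of the essential fields can be sketched as below. The dictionary keys are illustrative names chosen for this example, not HGSM's internal field names, and treating "Yes" and "certified" as the expected values is an assumption based on how the list above is written.

```python
# Hypothetical key names for the essential fields listed on the slide.
REQUIRED_FIELDS = (
    "giis_url", "monitoring", "status", "type", "site_commitments", "contacts",
)
# Assumption: these are the values a production site is expected to show.
EXPECTED_VALUES = {"monitoring": "Yes", "status": "certified"}
ALLOWED_TYPES = {"seegrid_production", "seegrid_certified", "egee_production"}


def check_site_record(record):
    """Return a list of human-readable problems found in a site record."""
    problems = []
    for field in REQUIRED_FIELDS:
        if not record.get(field):
            problems.append(f"missing or empty field: {field}")
    for field, expected in EXPECTED_VALUES.items():
        value = record.get(field)
        if value and value != expected:
            problems.append(f"{field} is {value!r}, expected {expected!r}")
    site_type = record.get("type")
    if site_type and site_type not in ALLOWED_TYPES:
        problems.append(f"unknown type: {site_type!r}")
    return problems
```

An empty list means the record passed these basic checks; it does not prove that the values match the real state of the site.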

Helpdesk/GGUS (1)

Helpdesk/GGUS (2)

Helpdesk/GGUS (3)

• Central reference point for tracking all operational and user problems
• Identified problems are reported through the Helpdesk and assigned to the appropriate supporters
• If problems cannot be solved within the SEE-GRID community, they are propagated to other projects/initiatives/support systems (e.g. GGUS)
• URL: https://helpdesk.see-grid.eu/

BBmSAM/SAM

BBmSAM History

BBmSAM

• Portal that provides access to the database of SAM test results
• Central tool for identification of operational problems
• Should be checked by each site admin on a daily basis
• Should be used to troubleshoot problems
• Also provides SLA figures
• URL: https://c01.grid.etfbl.net/

GStat (1)

GStat (2)

• Central tool for monitoring the information system of the SEE-GRID infrastructure
• Provides useful data
• Identifies problems with sites
• Should be checked by each site admin on a daily basis and used for troubleshooting
  § Useful ldapsearch commands can be found on GStat pages!
• URL: http://goc.grid.sinica.edu.tw/gstat/seegrid/
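
As an illustration of the kind of ldapsearch query used when troubleshooting the information system, the sketch below only builds the command line and does not run it. The host name and site name are placeholders. Port 2170 and the bases `mds-vo-name=<SITE>,o=grid` (site BDII) and `mds-vo-name=local,o=grid` (top-level BDII) are the usual gLite conventions and are assumed here; the commands shown on the GStat pages remain the reference.

```python
import shlex


def bdii_query_command(host, site_name=None, port=2170,
                       ldap_filter="(objectClass=GlueSite)"):
    """Build an anonymous ldapsearch command for a site or top-level BDII.

    With site_name set, the query targets a site BDII; without it, the
    top-level BDII base is used. Nothing is executed here.
    """
    if site_name:
        base = f"mds-vo-name={site_name},o=grid"
    else:
        base = "mds-vo-name=local,o=grid"
    return [
        "ldapsearch", "-x", "-LLL",          # simple bind, plain LDIF output
        "-H", f"ldap://{host}:{port}",       # BDII endpoint
        "-b", base,                          # search base
        ldap_filter,
    ]


# Placeholder host and site name; print a copy-and-paste ready command.
print(shlex.join(bdii_query_command("bdii.example.org", "EXAMPLE-SITE")))
```

If the command returns no entries for a site, that usually points to the "No info published" class of problems rather than to a job submission issue.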

Nagios/CIC portal (1)

Nagios/CIC portal (2)

Nagios/CIC portal (3)

• Collection of alarms raised by various tools
• The aim is to integrate all the tools and make the life of site admins and infrastructure managers easier
• In the future, automatic creation of Helpdesk tickets will be implemented
• URL: https://portal.ipp.acad.bg:7443/seegridnagios/

Accounting portal (1)

• Accounting by site
• Accounting by countries and institutions
• Accounting by applications

EGEE Accounting portal

Accounting portal (2)

• Collects the accounting data from all SEE-GRID and AEGIS sites through the APEL accounting publisher developed by the project
• Provides aggregated accounting data by site, country, institution, and application
• Each site must publish its accounting data properly
• URL: https://gserv1.ipp.acad.bg:8443/Welcome/

Downtime procedures

• Downtimes must be announced well in advance (1 week is a reasonable time)
  § There will always be downtimes due to hardware and other failures that cannot be anticipated
• All downtimes must be entered properly in HGSM
  § That way they are not counted against the site’s availability
• In addition, all downtimes must be broadcast by email to the GIM, APP and appropriate VO mailing lists
• Downtime should not exceed 10% of the total time (monthly, quarterly)
  § If it does, an explanation must be provided
  § If the explanation is not accepted by the project management, SA1 claims will be rejected
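
The 10% limit can be checked with simple interval arithmetic. The sketch below assumes that downtimes are given as non-overlapping (start, end) pairs and that the period is a calendar month or quarter; the function names are illustrative.

```python
from datetime import datetime


def downtime_fraction(downtimes, period_start, period_end):
    """Fraction of the period covered by the given (start, end) downtimes.

    Downtimes are clipped to the period; overlapping downtimes would be
    double counted, so they are assumed not to overlap.
    """
    total = (period_end - period_start).total_seconds()
    down = 0.0
    for start, end in downtimes:
        clipped_start = max(start, period_start)
        clipped_end = min(end, period_end)
        if clipped_end > clipped_start:
            down += (clipped_end - clipped_start).total_seconds()
    return down / total


def exceeds_downtime_limit(downtimes, period_start, period_end, limit=0.10):
    """True if the downtime share is above the limit (10% by default)."""
    return downtime_fraction(downtimes, period_start, period_end) > limit
```

For example, March has 744 hours, so a single 72-hour downtime (about 9.7%) stays within the limit, while 96 hours (about 12.9%) would require an explanation.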

Upgrade procedures

• All upgrades/updates are announced over the GIM list
• The broadcasts contain links to further upgrade instructions for each Grid service
  § Site admins should carefully examine them before performing the update!
• In addition, possible SEE-GRID-specific instructions are given in the e-mail
• For especially important updates/changes, tickets are created for each site
• Some upgrades/updates may require downtimes
• OS updates must be installed regularly to minimize security risks

Grid-Operator-On-Duty (GOOD)

• Rotating shifts on a weekly basis
  § Each country’s GIM is responsible for monitoring sites during his/her shift
  § Tickets are submitted to sites with problems, according to the status of sites in various monitoring tools (BBmSAM, GStat, Nagios, Accounting portal, etc.)
  § Older tickets that are not resolved are escalated
  § Support is given to sites that cannot resolve earlier identified operational problems
  § User tickets are assigned to the appropriate supporters
  § Wiki documentation is updated, or new wiki pages created if necessary
• URLs:
  § http://wiki.egee-see.org/index.php/SG_GOOD
  § http://wiki.egee-see.org/index.php/SG_Helpdesk_tickets
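
A weekly rotation of this kind can be generated as a simple round-robin. This is only a sketch: the actual GOOD shift order is defined on the wiki pages above, and the country names used here are placeholders.

```python
from datetime import date, timedelta
from itertools import cycle, islice


def good_schedule(countries, first_monday, weeks):
    """Return (week_start, country) pairs, cycling through the countries.

    The round-robin order is an assumption for illustration; the real
    shift plan may skip or swap weeks.
    """
    rotation = islice(cycle(countries), weeks)
    return [
        (first_monday + timedelta(weeks=index), country)
        for index, country in enumerate(rotation)
    ]


# Placeholder country names; March 2, 2009 was a Monday.
for week_start, country in good_schedule(
        ["Country A", "Country B", "Country C"], date(2009, 3, 2), 4):
    print(week_start.isoformat(), country)
```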

Usual problems and links to (possible) solutions

• BDII
  § Site BDII (GIIS) or top-level BDII is unreachable
    http://faq.twgrid.org/faq/index.php?action=artikel&cat=14&id=11&artlang=en
  § No info published
    http://goc.grid.sinica.edu.tw/gocwiki/No_data_published_by_top_level_BDII
• CA
  § CA version test failed with error message: This CA is an old one and time allowed to upgrade is over
    http://grid-deployment.web.cern.ch/grid-deployment/lcg2CAlist.html
• CE (Computing Element)
  § Job submission failed with error message: Brokerhelper: Cannot plan. No compatible resources
    http://goc.grid.sinica.edu.tw/gocwiki/Brokerhelper%3A_Cannot_plan._No_compatible_resources
  § Job submission failed with error message: Got a job held event, reason: Unspecified gridmanager error
    http://goc.grid.sinica.edu.tw/gocwiki/Unspecified_gridmanager_error
  § Job submission failed with error message: Cannot read JobWrapper output, both from Condor and from Maradona
    http://goc.grid.sinica.edu.tw/gocwiki/Cannot_read_JobWrapper_output%2e%2e%2e
  § Job submission failed with error message: 7 authentication failed
    http://goc.grid.sinica.edu.tw/gocwiki/7_authentication_failed
  § Job submission failed with error message: 10 data transfer to the server failed
    http://goc.grid.sinica.edu.tw/gocwiki/10_data_transfer_to_the_server_failed
  § 4444 Waiting jobs in the GRIS
    http://goc.grid.sinica.edu.tw/gocwiki/4444_Waiting_jobs_in_the_GRIS
• SE (Storage Element)
  § File copy and registration failed with error message: 535-FTPD GSSAPI error: GSS Major Status: General failure
    http://goc.grid.sinica.edu.tw/gocwiki/535_535FTPD_GSSAPI_error%3A_GSS_Major_Status%3A_General_failure

Service Level Agreement (SLA)

• Old URL: http://wiki.egee-see.org/index.php/SG_SLA
• The change in the current SLA is that the required availability is 80%, and that availability is calculated on a 3-hour basis rather than on a daily basis
• BBmSAM portal provides SLA figures
• Sites not fully conforming to the SLA will have reduced funding
• Sites with availability <50% will be uncertified
• Sites fully conforming to the SLA will be put into seegrid_certified status and become visible to the whole SEE region (i.e. not only SEE-GRID, but also EGEE-SEE etc.)
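
The availability thresholds above can be illustrated with a short calculation. This sketch assumes that each 3-hour slot is simply marked as passed or failed and that availability is the share of passed slots; the official figures are the ones published by the BBmSAM portal, and full SLA conformance may involve criteria beyond availability.

```python
def availability(slot_results):
    """Share of 3-hour slots in which the site passed its tests.

    slot_results is a sequence of booleans, one per 3-hour slot
    (8 slots per day). How a slot is judged is not modelled here.
    """
    if not slot_results:
        raise ValueError("no slots to evaluate")
    return sum(1 for passed in slot_results if passed) / len(slot_results)


def sla_outcome(avail, required=0.80, decertify_below=0.50):
    """Map an availability figure to the consequence named in the SLA.

    Assumption: exactly meeting the required value counts as conforming.
    """
    if avail < decertify_below:
        return "uncertified"
    if avail < required:
        return "reduced funding"
    return "seegrid_certified"


# A 30-day month has 240 slots; 200 passed slots give about 83% availability.
month = [True] * 200 + [False] * 40
print(sla_outcome(availability(month)))
```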