WLCG Service Report Jamie.Shiers@cern.ch ~~~ WLCG Management Board, 13th July 2010

WLCG Operations Report – Summary

KPI                        Status                Comment
GGUS tickets               Numerous real alarms  Drill-down on real alarms
Site Usability             Minor issues          Drill-down to be provided
SIRs & Change assessments  Several SIRs          …and quite a few pending…

VO      User  Team  Alarm  Total
ALICE      3     0      0      3
ATLAS     30    70      7    107
CMS        6     5      1     12
LHCb       4    25      1     30
Totals    43   100      9    152

The response to alarms – expert intervention & problem resolution – continues to be (well) within targets. Should we rather establish a metric related to the frequency and nature of such alarms? (Want to see progress – even if slow…)

[Site availability plots, annotated with the item numbers 0.1, 1.1, 1.2, 1.3, 1.4, 3.1 and 4.2 used in the analysis that follows]

Analysis of the availability plots

COMMON FOR ALL THE EXPERIMENTS
0.1 CERN-PROD: CASTOR-related problem exporting data from T0; all attempts to write to and read from T0 were timing out. The problem was traced to very high levels of logging. Logging daemon reset.

ATLAS
1.1 NDGF-T1: Scheduled downtime from 1200 Hrs to 1400 Hrs. Upgrade of dCache on head nodes as well as OS patching. Aiming to keep actual outage much shorter, if all goes well.
1.2 Taiwan-LCG2: Temporary SRMv2 test timeout.
1.3 INFN-T1: SRM test terminated due to temporary communication error.
1.4 NIKHEF: SRM was overloaded with ls operations by a biomed user. Other users got timeouts. Fixed by asking the user to stop.

ALICE
Nothing to report.

CMS
3.1 KIT: Problem with CMS head nodes for dCache, down for about 3 hours. H/W failure.

LHCb
4.1 CERN-PROD: SAM tests have been failing against CERN for a week because a diskserver in the lhcbuser pool used for the tests has a filesystem problem.
4.2 CNAF: LHCb storage out due to network (switch) failures. Fixed early morning around.

[Site availability plots, annotated with the item numbers 0.1, 0.2, 1.1, 1.2, 1.3, 2.1, 2.2, 3.1 and 4.1 used in the analysis that follows]

Analysis of the availability plots

COMMON FOR ALL THE EXPERIMENTS
0.1 RAL-LCG2: Unscheduled outage. Site in downtime due to a site-wide networking issue.
0.2 FZK-LCG2: GridKa had a complete power failure. Compute node down till Monday.

ATLAS
1.1 INFN-T1: Temporary test failures.
1.2 TAIWAN-LCG2: Temporary test failures.
1.3 RAL-LCG2: Some problems with the ATLAS s/w server at the end of the week and into the weekend.

ALICE
2.1 FZK-LCG2: Momentary VOBOX-Proxy-Registration test failure.
2.2 NIKHEF: alice-box-proxyrenewal service test failed.

CMS
3.1 KIT: Temporary test failures.
3.2 CERN: Problems with srm-cern, which caused transfers to CERN to fail.

LHCb
4.1 NIKHEF: SRM outage. Extended until Monday; difficult to pinpoint and reproduce. Vendor suspects a firmware issue.

GGUS summary (2 weeks)

VO      User  Team  Alarm  Total
ALICE      3     0      0      3
ATLAS     30    70      7    107
CMS        6     5      1     12
LHCb       4    25      1     30
Totals    43   100      9    152

[Bar chart: total tickets per VO (Total ALICE, Total ATLAS, Total CMS, Total LHCb), vertical axis 0 to 120]
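The row and column totals in the GGUS summary can be cross-checked with a few lines of Python. This is only a sketch: the counts are copied from the table above, while the variable names (`tickets`, `per_vo`, `per_type`, `atlas_share`) are illustrative and not part of any GGUS tooling.

```python
# Ticket counts per VO, copied from the GGUS summary table:
# (user, team, alarm) tickets over the two-week reporting period.
tickets = {
    "ALICE": (3, 0, 0),
    "ATLAS": (30, 70, 7),
    "CMS": (6, 5, 1),
    "LHCb": (4, 25, 1),
}

# Row totals: all tickets opened by each VO.
per_vo = {vo: sum(counts) for vo, counts in tickets.items()}

# Column totals: user, team and alarm tickets summed across VOs.
per_type = tuple(sum(column) for column in zip(*tickets.values()))

grand_total = sum(per_vo.values())

# Fraction of all tickets that came from ATLAS.
atlas_share = per_vo["ATLAS"] / grand_total

print(per_vo, per_type, grand_total, f"{atlas_share:.0%}")
```

The recomputed totals agree with the table, and they show ATLAS accounting for roughly 70% of the tickets in the period.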

Support-related events since last MB
• The SIR by KIT for the 2010/05/12 .de DNS incident is still pending. Details in savannah:114518
• Prolonged infrastructure downtimes should IOHO be included as part of “WLCG prolonged downtime strategies” (WLCG T1SCM)
• The cases of failing GGUS email notifications to SARA and from CERN have now been traced to parsing scripts in both locations and fixed. Successful ALARM test tickets GGUS:59769 and GGUS:59775 confirm this. Details in savannah:115137
• The GridKa cooling system failure incident of 2010/07/10 requires a SIR.

10/3/2020 WLCG MB Report WLCG Service Report

ATLAS ALARM->CERN CASTOR

What time UTC     What happened
2010/06/28 05:00  GGUS ALARM ticket opened, automatic email notification to atlas-operator-alarm@cern.ch AND automatic assignment to ROC_CERN
2010/06/28 05:05  Service mgr working on the problem.
2010/06/28 08:07  Problem traced down to excessive logging information. Service mgr puts ticket ‘solved’.
2010/06/28 09:26  Submitter puts ticket to status ‘verified’.

• https://gus.fzk.de/ws/ticket_info.php?ticket=59441

CMS ALARM->CERN AFS

What time UTC     What happened
2010/06/29 18:45  GGUS ALARM ticket opened, automatic email notification to cms-operator-alarm@cern.ch AND automatic assignment to ROC_CERN
2010/06/29 18:49  Operator contacts Service mgr on the problem.
2010/06/29 21:43  Problem traced down to an AFS disk array failure. Service mgr makes a reset and puts ticket ‘solved’.
2010/06/29 22:21  Submitter puts ticket to status ‘verified’.

• https://gus.fzk.de/ws/ticket_info.php?ticket=59547

LHCB ALARM->INFN-T1 STORM

What time UTC     What happened
2010/07/02 08:09  GGUS ALARM ticket opened, automatic email notification to t1-alamrs@cnaf.infn.it AND automatic assignment to ROC_Italy.
2010/07/02 09:40  Submitter says the problem went away. Reason was an unavailable 10 Gbit link between 2 gridftp servers.
2010/07/02 14:11  Supporter puts ticket to status ‘solved’.
2010/07/08 10:36  Submitter puts ticket to status ‘verified’.

• https://gus.fzk.de/ws/ticket_info.php?ticket=59643

ATLAS ALARM->CERN CASTOR SRM

What time UTC     What happened
2010/07/07 21:34  GGUS ALARM ticket opened, automatic email notification to atlas-operator-alarm@cern.ch AND automatic assignment to ROC_CERN
2010/07/07 21:59  Submitter reports service degradation. Operator contacts service mgr on the problem.
2010/07/07 22:01  Service mgr confirms, in the ticket, on-going problem investigation.
2010/07/07 22:50  Developer finds a process stuck due to a rsyslog bug. Process restarted. Service mgr puts ticket to status ‘solved’.

• https://gus.fzk.de/ws/ticket_info.php?ticket=59848

ATLAS ALARM->CERN CASTOR SRM

What time UTC     What happened
2010/07/08 12:32  GGUS ALARM ticket opened, automatic email notification to atlas-operator-alarm@cern.ch AND automatic assignment to ROC_CERN
2010/07/08 12:45  Submitter reports T0 data export problems. Same as GGUS:59848, 59850. Operator contacts the service piquet.
2010/07/08 12:49  Service mgr investigating.
2010/07/08 12:59  … discovery of a rsyslog bug and its config. change as per above-mentioned tickets.
2010/07/08 15:07  Service mgr puts the ticket to status ‘solved’.
2010/07/08 16:34  [ Problems with GGUS<->Remedy ticket exchange to be followed up ]

• https://gus.fzk.de/ws/ticket_info.php?ticket=59850
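The five alarm timelines above give an "opened" and a "solved" timestamp for each ticket, so the elapsed time per alarm can be computed directly. The sketch below assumes that "time to solve" means the interval between those two entries; the names `alarms`, `minutes_between` and `time_to_solve` are illustrative. For GGUS:59643 the submitter reported the problem gone at 09:40, well before the ticket was set to ‘solved’, so that figure overstates the actual disruption.

```python
from datetime import datetime

FMT = "%Y/%m/%d %H:%M"  # timestamp layout used in the timelines (UTC)

# (ticket, opened, solved), copied from the alarm timelines.
alarms = [
    ("GGUS:59441", "2010/06/28 05:00", "2010/06/28 08:07"),
    ("GGUS:59547", "2010/06/29 18:45", "2010/06/29 21:43"),
    ("GGUS:59643", "2010/07/02 08:09", "2010/07/02 14:11"),
    ("GGUS:59848", "2010/07/07 21:34", "2010/07/07 22:50"),
    ("GGUS:59850", "2010/07/08 12:32", "2010/07/08 15:07"),
]


def minutes_between(start: str, end: str) -> int:
    """Whole minutes elapsed between two timeline timestamps."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return int(delta.total_seconds() // 60)


# Minutes from ticket creation to status 'solved', per ticket.
time_to_solve = {
    ticket: minutes_between(opened, solved)
    for ticket, opened, solved in alarms
}

for ticket, minutes in time_to_solve.items():
    print(f"{ticket}: {minutes // 60}h{minutes % 60:02d}")
```

Under that assumption every alarm in the period was set to ‘solved’ within about six hours of being opened, and four of the five within about three hours.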

Alarm Summary

Date   Site  Service
28/06  CERN  CASTOR
29/06  CERN  AFS
02/07  CNAF  StoRM
07/07  CERN  CASTOR SRM
08/07  CERN  CASTOR SRM

Site    Service Area    Frequency
CERN    Data / Storage  4
CNAF    Data / Storage  1
TOTALS  Data / Storage  100%
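A frequency-and-nature view of the alarms, as the summary table gives by hand, can be derived from the list of alarms itself. The following is a minimal sketch: the alarm list is copied from the table above, and the `SERVICE_AREA` mapping is an assumption that simply mirrors the slide's classification of every listed service as "Data / Storage".

```python
from collections import Counter

# (date, site, service), copied from the alarm summary.
alarms = [
    ("28/06", "CERN", "CASTOR"),
    ("29/06", "CERN", "AFS"),
    ("02/07", "CNAF", "StoRM"),
    ("07/07", "CERN", "CASTOR SRM"),
    ("08/07", "CERN", "CASTOR SRM"),
]

# Assumed mapping of service to service area, following the slide,
# which places all of these alarms under "Data / Storage".
SERVICE_AREA = {
    "CASTOR": "Data / Storage",
    "AFS": "Data / Storage",
    "StoRM": "Data / Storage",
    "CASTOR SRM": "Data / Storage",
}

# Number of alarms per site.
per_site = Counter(site for _date, site, _service in alarms)

# Number of alarms per service area, and each area's share in percent.
per_area = Counter(SERVICE_AREA[service] for _date, _site, service in alarms)
area_share = {area: 100 * n / len(alarms) for area, n in per_area.items()}

print(dict(per_site), area_share)
```

Keeping the raw alarm list as the input means the same tally could be repeated for each reporting period to see whether the frequency changes over time.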

Summary
• Good response to GGUS alarms continues – frequency high (but bearable in the short term?) for support staff as well as for users…
• No significant reduction can be expected without an analysis of where the most impact could be achieved – and change – which comes with risk
• Good match between Site Usability plots and problems reported through daily meetings

Workshop Actions - Draft
• SIR template and MoU-based wording to categorize service degradation / downtimes;
• Monitoring;
• Prolonged site (service) downtimes;
• Squid “as a WLCG service”
• None of these are new items – most have been discussed explicitly at WLCG T1SCM meetings earlier this year
• Need to review summary slides to ensure that the list is exhaustive and to prioritize – including matching against EGI InSPIRE SA3 manpower (now largely in place or agreed)
• Also proposed to hold daily WLCG operations meetings – chaired e.g. by a Tier1 – when CERN is closed (Jeune Genevois etc.)