WLCG Service Report
Jamie.Shiers@cern.ch
~~~
WLCG Management Board, 29th June 2010

WLCG Operations Report – Summary

KPI                        Status            Comment
GGUS tickets               “normal”          Drill-down on real alarms; comment on tests.
Site Usability             Minor issues(?)   Drill-down to be provided
SIRs & Change assessments  Several SIRs      …and quite a few pending…

VO      User  Team  Alarm  Total
ALICE      4     0      3      7
ATLAS     37    80      6    123
CMS        8     8      3     19
LHCb       7    49      2     58
Totals    56   137     14    207

Site Usability – 3 Weeks

GGUS summary (3 weeks)

VO      User  Team  Alarm  Total
ALICE      4     0      3      7
ATLAS     37    80      6    123
CMS        8     8      3     19
LHCb       7    49      2     58
Totals    56   137     14    207

[Chart: total GGUS tickets per VO (Total ALICE, Total ATLAS, Total CMS, Total LHCb); vertical axis 0 to 120]
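
The row and column totals in the GGUS summary can be cross-checked with a few lines of Python. This is only a sanity-check sketch: the counts are copied from the table above, while the names `tickets`, `row_totals` and `column_totals` are illustrative and not part of any GGUS tooling.

```python
# Ticket counts per VO, copied from the GGUS summary table (3 weeks).
tickets = {
    "ALICE": {"User": 4, "Team": 0, "Alarm": 3},
    "ATLAS": {"User": 37, "Team": 80, "Alarm": 6},
    "CMS": {"User": 8, "Team": 8, "Alarm": 3},
    "LHCb": {"User": 7, "Team": 49, "Alarm": 2},
}


def row_totals(table):
    """Total number of tickets per VO (the 'Total' column)."""
    return {vo: sum(counts.values()) for vo, counts in table.items()}


def column_totals(table):
    """Total number of tickets per ticket type (the 'Totals' row)."""
    totals = {}
    for counts in table.values():
        for kind, n in counts.items():
            totals[kind] = totals.get(kind, 0) + n
    return totals


per_vo = row_totals(tickets)
per_type = column_totals(tickets)
grand_total = sum(per_vo.values())

for vo, total in per_vo.items():
    print(vo, total)
print("Totals", per_type, grand_total)
```

The computed values can be compared directly against the 'Total' column and the 'Totals' row of the table.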

Alarm tickets

• There were test ALARM tickets with the Tier 0 and all Tier 1s, as per the monthly action from the GGUS release, which was on June 23rd this time. Email notifications signed with the new GGUS DN were also tested. Problems:
  • At first ALARMs were signed with the wrong certificate.
  • NDGF never replied to the ALARM.
  • Some email notifications came with a long delay.
• The SIR by KIT for the 2010/05/12 .de DNS incident is still pending. Details in https://savannah.cern.ch/support/?114518
• GGUS couldn’t send Direct Site Notification on 21 and 24 June due to data unavailability in GOCDB4. Protective measures taken in GGUS. Details in https://savannah.cern.ch/support/?115297
• A hot-spare VO Box should be available at all times.

LHCB ALARM->CERN LFC

What time (UTC)    What happened
2010/06/18 9:54    GGUS ALARM ticket opened, automatic email notification to lhcb-operator-alarm@cern.ch AND automatic assignment to ROC_CERN
2010/06/18 10:03   Site operator acknowledges and emails fts.support
2010/06/18 10:22   Still debugging for 2 hrs. Problem traced down to a CA/CRL refresh.
2010/06/18 10:22   Submitter (with support privileges) puts ticket ‘solved’. Vomscert was not updated. Low-value ‘solution’ content in the relevant PRMS ticket.

• https://gus.fzk.de/ws/ticket_info.php?ticket=59174

ATLAS ALARM->CERN CASTOR

What time          What happened
2010/06/22 13:33   GGUS ALARM ticket opened, automatic email notification to atlas-operator-alarm@cern.ch AND automatic assignment to ROC_CERN
2010/06/22 13:44   Service expert working on the problem.
2010/06/22 14:40   Problem traced down to an LDAP overload. LDAP restarted.
2010/06/22 14:47   Service manager puts ticket ‘solved’. Vomscert was not updated. Is the ‘solution’ content in the relevant PRMS ticket important for FAQs or Knowledge Base?

• https://gus.fzk.de/ws/ticket_info.php?ticket=59269

ATLAS ALARM->SARA STORE

What time          What happened
2010/06/27 15:47   GGUS ALARM ticket opened, automatic email notification to grid.support@sara.nl AND automatic assignment to NGI_NL.
2010/06/27 21:41   Site manager working on the problem.
2010/06/27 22:35   Problem traced down to a dead storage node and a dCache bug. Service manager writes a quick fix and puts ticket ‘solved’.
2010/06/27 23:34   Submitter happy.

NB: This ticket was re-submitted as https://gus.fzk.de/ws/ticket_info.php?ticket=59435 because the site claims it got no notification. Follow-up in https://savannah.cern.ch/support/?115137

• https://gus.fzk.de/ws/ticket_info.php?ticket=59433
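
To compare the three alarm timelines above, the delay from ticket opening to the first recorded response and to 'solved' can be computed from the listed timestamps. The sketch below is illustrative only: it assumes that the first timeline entry after the opening counts as the response, and the names `alarms` and `minutes_between` are hypothetical helpers, not part of GGUS.

```python
from datetime import datetime

FMT = "%Y/%m/%d %H:%M"

# (label, opened, first recorded response, marked 'solved'),
# copied from the three alarm timelines in this report.
alarms = [
    ("LHCb -> CERN LFC", "2010/06/18 09:54", "2010/06/18 10:03", "2010/06/18 10:22"),
    ("ATLAS -> CERN CASTOR", "2010/06/22 13:33", "2010/06/22 13:44", "2010/06/22 14:47"),
    ("ATLAS -> SARA STORE", "2010/06/27 15:47", "2010/06/27 21:41", "2010/06/27 22:35"),
]


def minutes_between(start, end):
    """Whole minutes elapsed between two timestamps in FMT format."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return int(delta.total_seconds() // 60)


# Map each alarm to (minutes to first response, minutes to 'solved').
response_times = {
    label: (minutes_between(opened, responded), minutes_between(opened, solved))
    for label, opened, responded, solved in alarms
}

for label, (to_response, to_solved) in response_times.items():
    print(f"{label}: response after {to_response} min, solved after {to_solved} min")
```

On these figures the two CERN alarms received a response within minutes, while the SARA alarm waited several hours, which is consistent with the site's claim that it got no notification.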

Conclusions – Alarms

• Experiments have used GGUS alarms (too?) sparingly
• The fact that there are regular alarms means that there are weaknesses in the service / procedures (IMHO)
  • e.g. could some of the recent alarms have been avoided? (Y)
  • Posit: an alarm is when you need “a fireman”
  • Or maybe an ambulance…
• Current alarm chain is not sufficiently reliable
  • Would you wait several hours if you really needed a pompier (fireman)?
• Should we really accept alarms as being the norm – at least have some metric for seeing their frequency decrease?
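
One possible form for such a metric is alarms per week per reporting period, tracked over successive reports. The sketch below is an assumption about what the metric could look like, not an agreed WLCG definition. The only data used is the 14 alarm tickets over 3 weeks from this report (a count that includes the test alarms), and the function names are hypothetical.

```python
def alarms_per_week(alarm_count, weeks):
    """Average number of alarm tickets per week in one reporting period."""
    if weeks <= 0:
        raise ValueError("weeks must be positive")
    return alarm_count / weeks


def is_decreasing(rates):
    """True if every period's rate is strictly lower than the previous one."""
    return all(later < earlier for earlier, later in zip(rates, rates[1:]))


# This report: 14 alarm tickets (test alarms included) over 3 weeks.
current_rate = alarms_per_week(14, 3)
print(f"{current_rate:.2f} alarms per week")
```

Appending the rate from each subsequent report to a list and passing it to `is_decreasing` would show whether the frequency is actually going down. Excluding test alarms from the count would be a sensible refinement.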

Summary

• The alarm mechanism is clearly necessary
  • Still some problems seen with end-to-end notification
• SIRs & other medium-long term problems: need follow-up – there are quite a few complex(?) problems that never seem to get fully debugged
  • Either requiring expertise from multiple people and/or multiple sites
• We will review the overall SIR repository at the WLCG Collaboration workshop in London, and drill-down on one specific multi-site, multi-provider case