WLCG Service Report
Andrea.Valassi@cern.ch
WLCG Management Board, 19th April 2011

Introduction
• Reduced activity due to technical stop
• Real beam again since Friday
• One Service Incident Report received:
  • IN2P3 power cut on April 8th (SIR)
• Five GGUS ALARMS for ATLAS
  • At CERN, RAL and Taiwan (see details in the following slides)
• Other notable issues reported at the daily meetings
  • Danish tape system 9-day outage caused data access problems for ATLAS
  • HI DESD data corruption for ATLAS at ASGC; will be reprocessed at ASGC
  • ATLAS conditions Oracle Streams replication stopped for CNAF, NDGF, SARA
  • Geometry replication added for all other sites
  • Oracle at CERN patched to stop PromptRecoInjector crashes
  • Many experts in Vilnius for EGI UF April 11-14

GGUS summary (2 weeks)

VO      User  Team  Alarm  Total
ALICE      7     0      0      7
ATLAS     18   101      5    124
CMS        3     3      0      6
LHCb       2    19      0     21
Totals    30   123      5    158

[Chart: total GGUS tickets over time for ALICE, ATLAS, CMS and LHCb; figure not reproduced]
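As a quick consistency check on the summary table above, the per-VO and per-type totals can be recomputed from the individual counts. This is only an illustrative sketch: the dictionary layout and variable names are assumptions made here, and only the numbers come from the slide.

```python
# Ticket counts per VO as (User, Team, Alarm), copied from the summary table.
tickets = {
    "ALICE": (7, 0, 0),
    "ATLAS": (18, 101, 5),
    "CMS": (3, 3, 0),
    "LHCb": (2, 19, 0),
}

# Row totals: all ticket types for one VO.
row_totals = {vo: sum(counts) for vo, counts in tickets.items()}

# Column totals: one ticket type summed over all VOs.
user_total, team_total, alarm_total = (sum(col) for col in zip(*tickets.values()))

# Grand total over every VO and ticket type.
grand_total = sum(row_totals.values())

print(row_totals)
print(user_total, team_total, alarm_total, grand_total)
```

The recomputed values can be compared against the "Total" column and the "Totals" row of the table.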

Support-related events since last MB
• There were 5 real ALARM tickets since the 2011/04/05 MB (2 weeks), all submitted by ATLAS. Notified sites were:
  • CERN
  • RAL
  • Taiwan
• Is it desirable for Taiwan to have registered in GOCdb the same email for normal contact and for emergencies (ALARMS)? Here is the relevant GOCdb entry.
Details follow…

ATLAS ALARM->CERN AFS/LSF PROBLEM
GGUS:69320

What time UTC     What happened
2011/04/04 09:04  SATURDAY. GGUS ALARM ticket, automatic email notification to atlas-operator-alarm@cern.ch AND automatic assignment to ROC_CERN. Automatic SNOW ticket creation successful.
2011/04/04 09:15  Exchange between submitter & LSF expert. No jobs can run on any batch node. Error: “no token”. AFS expert contacted. NB! Service mgrs reacted before the operators contacted them.
2011/04/04 09:21  Operator acknowledges and records in the GGUS ticket that the AFS service managers were contacted.
2011/04/04 09:52  Service manager comments in the ticket that the problem is gone.
2011/04/04 10:13  ‘solved’. Reason was an update package with a wrong config file entry, which prevented the Kerberos servers from providing tokens.
2011/04/04 11:20  Submitter puts the ticket to status ‘verified’.
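The response times in a ticket timeline such as the one above can be derived by subtracting the opening timestamp from each later entry. The following is a minimal sketch using the timestamps of GGUS:69320; the event labels and variable names are shorthand introduced here for illustration and are not GGUS fields.

```python
from datetime import datetime

FMT = "%Y/%m/%d %H:%M"

# (timestamp in UTC, short label) taken from the ticket timeline above.
events = [
    ("2011/04/04 09:04", "ALARM ticket opened"),
    ("2011/04/04 09:15", "experts exchanging in ticket"),
    ("2011/04/04 09:21", "operator acknowledgement"),
    ("2011/04/04 09:52", "problem reported gone"),
    ("2011/04/04 10:13", "solved"),
    ("2011/04/04 11:20", "verified"),
]

opened = datetime.strptime(events[0][0], FMT)

# Minutes elapsed between ticket creation and each later event.
elapsed_min = {
    label: int((datetime.strptime(ts, FMT) - opened).total_seconds() // 60)
    for ts, label in events[1:]
}

for label, minutes in elapsed_min.items():
    print(f"{label}: {minutes} min after the ALARM")
```

The same calculation applies to the other ALARM timelines in this report by swapping in their timestamps.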

ATLAS ALARM->CERN CASTOR CAN’T FIND FILE ON DISK POOL
GGUS:69631

What time UTC     What happened
2011/04/12 10:13  GGUS ALARM ticket, automatic email notification to atlas-operator-alarm@cern.ch AND automatic assignment to ROC_CERN. Automatic SNOW ticket creation successful.
2011/04/12 10:26  Ticket set ‘in progress’ by the service mgr. NB! Service mgrs reacted before the operators contacted them.
2011/04/12 10:28  Operator acknowledges and records in the GGUS ticket that the CASTOR piquet was contacted.
2011/04/12 10:40  Expert pastes a Remedy ITCM ticket extract showing the node is down for maintenance, drops the GGUS ticket priority, comments that the ALARM is probably unjustified and sets it to ‘solved’.
2011/04/12 10:49  Submitter sets the ticket to ‘verified’. Asks where the info on node maintenance was published.

ATLAS ALARM->CERN CASTOR ERRORS IN DATA RETRIEVAL
GGUS:69626

What time UTC     What happened
2011/04/12 08:35  GGUS ALARM ticket, automatic email notification to atlas-operator-alarm@cern.ch AND automatic assignment to ROC_CERN. Automatic SNOW ticket creation successful.
2011/04/12 08:39  Ticket set ‘in progress’ by the service mgr. NB! Service mgrs reacted before the operators contacted them.
2011/04/12 08:42  Operator acknowledges and records in the GGUS ticket that the CASTOR piquet was called.
2011/04/12 08:43  Service mgr reports that the server where the files reside is down for maintenance.
2011/04/12 09:34  Service mgr sets ticket to ‘solved’.
2011/04/12 11:44  Submitter sets the ticket to status ‘verified’.

ATLAS ALARM->RAL SRM CONNECT
GGUS:69726

What time UTC     What happened
2011/04/15 07:23  GGUS ALARM ticket, automatic email notification to lcg-alarm@gridpp.rl.ac.uk AND automatic assignment to ROC_UK/Ireland.
2011/04/15 07:26  Submitter records related TEAM ticket GGUS:69721. Why didn’t they upgrade that to ALARM?
2011/04/15 07:39  5 simultaneous entries in the ticket diary. Service mgr records in the ticket that investigation has started. ATLAS SRM appears to run normally. Ticket bounces back and forth between status ‘assigned’ and ‘in progress’.
2011/04/15 08:09  High load observed on ATLAS SRM db. Service put in downtime for investigation.
2011/04/15 08:26  A (late!) automatic acknowledgment of the ALARM originating from oncall@gridpp.rl.ac.uk
2011/04/15 09:09  Ticket assignment to lcg-support@gridpp.rl.ac.uk entered, then removed, then re-entered on Mon 18th.
2011/04/18 09:26  Still ‘in progress’. Site was in downtime for ~2 hrs on Fri 15th for investigation. SRM db was in trouble due to the robot certificate!? Service under-performing.

ATLAS ALARM->TAIWAN ERRORS IN DATA EXPORT FROM THE TIER 0
GGUS:69743

What time UTC     What happened
2011/04/15 16:32  GGUS TEAM ticket, automatic email notification to asgc-t1-op@lists.grid.sinica.edu.tw AND automatic assignment to ROC_Asia/Pacific AND submitter puts atlas-support-cloud-tw@cern.ch in Cc.
2011/04/15 16:53  More info by the submitter. Many and various errors on SCRATCHDISK, DATADISK, SRM version, site/host id.
2011/04/15 16:56  Site admin sets the ticket ‘in progress’ without comment.
2011/04/15 17:09  Another TEAMer records diagnosis of bad FTS config. Ticket upgrade to ALARM. Notification email is THE SAME!!!??? asgc-t1-op@lists.grid.sinica.edu.tw
2011/04/15 17:39  Site mgr reports this is due to their SRM BDII problem.
2011/04/15 18:50  Submitter confirms problem is gone. ‘solved’ and ‘verified’ on 2011/04/16 (SATURDAY) at 16:06 and 16:18 respectively.

[Availability plots; figure not reproduced. The numeric labels on the plots refer to the items in the analysis that follows.]

Analysis of the availability plots: Week of 04/04/2011

ATLAS
1.1 RAL-LCG2. Problems with site BDII (2 out of the 3 machines behind the DNS alias were not reporting anything). Fixed by site.
1.2 TAIWAN-LCG2. Test errors after the site downtime, related to SRMV2 space-tokens. Solved by VO contact by fixing the tests.
1.3 TAIWAN-LCG2. New test errors with SRMV2 space-tokens. Being investigated by VO contact: it seems the tests need to be changed to match the new space-tokens (the old space-tokens have probably been switched off by the site).

ALICE
2.1 RAL-LCG2. Problems with site BDII (2 out of the 3 machines behind the DNS alias were not reporting anything). Fixed by site.
2.2 INFN-T1. Problem being investigated. This was due to the NGI renaming.
2.3 INFN-T1. Problem being investigated. It looks like no tests were sent, and monitoring did not set any availability.
2.4 RAL-LCG2. Scheduled downtime at WARNING level: ‘System at risk during power work in building hosting networking equipment’ (from 9.00 AM 09/04/2011 to 5 PM 10/04/2011).

CMS
3.1 T1_UK_RAL. Same problem as for the other VOs (1.1, 2.1, 4.1), but for CMS the site availability was still OK.
3.2 T1_UK_RAL. Scheduled downtime at WARNING level: ‘System at risk during power work in building hosting networking equipment’ (from 9.00 AM 09/04/2011 to 5 PM 10/04/2011).

LHCb
4.1 LCG.GRIDKA.de. One of the dCache pools was down, causing a general slowdown of dCache performance.
4.2 LCG.RAL.uk. Problems with site BDII (2 out of the 3 machines behind the DNS alias were not reporting anything). Fixed by site.
4.3 LCG.RAL.uk. Scheduled downtime at WARNING level: ‘System at risk during power work in building hosting networking equipment’ (from 9.00 AM 09/04/2011 to 5 PM 10/04/2011).

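[Availability plots; figure not reproduced. The numeric labels on the plots refer to the items in the analysis that follows.]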

Analysis of the availability plots: Week of 11/04/2011

ATLAS
1.1 TAIWAN-LCG2 (GREEN BOX). Test errors with SRMV2 space-tokens. VO contact (Alessandro) fixed the space token in the test profiles and the problem was resolved from 13:00 12/04/2011.
1.2 RAL-LCG2. SRM tests reporting status unknown (03:00 to 16:00 15/04/2011). Unscheduled OUTAGE “SRM-ATLAS not functioning. Problem under investigation.” (01:00 to 11:35 15/04/2011). Unscheduled AT RISK “At Risk on ATLAS SRM following problems on Friday.” (11:35 15/04/2011 to 11:00 18/04/2011). GGUS:69726.

ALICE
2.1 INFN-T1 (GREEN BOX). CREAMCE tests not sent due to transition from Italian ROC to Italian NGI. VO contact (Maarten) fixed the tests and the problem was resolved from 15:00 11/04/2011.
2.2 RAL-LCG2. CREAMCE tests failing with “330 min timeout for the job exceeded” (18:00 13/04/2011 to 11:00 15/04/2011). VO contact (Maarten) states that the timeout should be 11 hours and will be investigated / corrected. He explains that the bigger issue is that ALICE cannot distinguish SAM jobs and normal user jobs by their credentials, leading to timeouts as SAM jobs wait in queues.

CMS
Nothing to report.

LHCb
Nothing to report.

Conclusions
• Business as usual, quieter during the technical stop
• One additional issue (not discussed at the daily meetings):
  • CMS data taking affected on Friday by the number of users of the MATLAB license
  • One example of a “hidden dependency”: are there any others?