
WLCG ‘Weekly’ Service Report
Harry.Renshall@cern.ch
~~~
WLCG F2F Management Board, 9 September 2008


Introduction
• The ‘weekly’ report is now back to one week (the MB's normal schedule).
• Last week (1 to 7 September):
  • Notes from the daily meetings can be found at https://twiki.cern.ch/twiki/bin/view/LCG/WLCGOperationsMeetings (some additional info from CERN C5 reports & other sources).
  • Increased regular local participation from FIO and GD.
  • Systematic remote participation by BNL, RAL and PIC, and occasional participation from CNAF, GRIF and STRASBOURG.


Site Reports (1/2)
• CERN:
  • Large number of software changes during the week, with rolling kernel upgrades on all Quattor-managed machines, but mostly transparent.
  • The worker-node upgrades resulted in an LSF problem whereby the appearance of a new system directory stopped LSF from knowing the CPU speed and memory attributes of a node, delaying the scheduling of some jobs. A patch was received the same day.
  • There was a scheduled SRM intervention on 3 September, when all CERN SRM2 endpoints were down for one hour, from 10:00 to 11:00.
  • There was an LSF hiccup during the daily reconfiguration on 4 September, when it failed to read a local configuration file and many jobs were put into the lost+found queue. Manual intervention put the jobs back in their correct queues, and a possible timeout-window problem has been identified inside the LSF reconfiguration (a re-queuing sketch follows this list).
  • The long-standing CASTOR 2 bulk-insert ‘Oracle constraint violation’ problem has been fixed for all instances over the last week, with configuration changes first to the ATLAS/LHCb RAC and then to the RAC hosting the common name server. A post-mortem analysis is in preparation. Also had several other, unrelated CASTOR Oracle DB problems during the week, with a few hours of downtime.
• RAL:
• ASGC:
  • Upgraded to CASTOR 2.1.7 and so was warned of the RAL problems. Seeing some 32/64-bit related problems.
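The manual recovery from the lost+found incident amounts to switching the stranded jobs back to their proper queues. Below is a minimal sketch of that idea, driving the standard LSF commands bjobs and bswitch from Python. The queue names, the single destination queue and the plain-text output parsing are illustrative assumptions, not the actual CERN batch configuration; in reality each job would be returned to the queue it was originally submitted to.

```python
# Sketch only: re-queue jobs stranded in a catch-all LSF queue.
# Queue names below are assumptions for illustration.
import subprocess

LOST_QUEUE = "lost_and_found"   # assumed name of the catch-all queue
TARGET_QUEUE = "grid_prod"      # assumed destination queue (illustrative)

def jobs_in_queue(queue):
    """List job IDs currently sitting in the given LSF queue (parses plain bjobs output)."""
    result = subprocess.run(["bjobs", "-u", "all", "-q", queue],
                            capture_output=True, text=True)
    job_ids = []
    for line in result.stdout.splitlines():
        fields = line.split()
        if fields and fields[0].isdigit():   # skip the header line and blanks
            job_ids.append(fields[0])
    return job_ids

def switch_queue(job_ids, destination):
    """Move each job to the destination queue with bswitch."""
    for job_id in job_ids:
        subprocess.run(["bswitch", destination, job_id], check=True)

if __name__ == "__main__":
    stranded = jobs_in_queue(LOST_QUEUE)
    print(f"Re-queuing {len(stranded)} jobs from '{LOST_QUEUE}' to '{TARGET_QUEUE}'")
    switch_queue(stranded, TARGET_QUEUE)
```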


Site Reports (2/2)
• BNL:
  • On Friday 29 August at 7:01 am, data transfer failures to several sites were reported by automated service probes at BNL, indicating network connectivity problems at the OPN level. BNL later received a message from the USLHCnet NOC at 1:24 pm confirming an outage of the circuit provided by Colt. A backup circuit was believed to be in place but was not operational.
  • Around 8:30 pm, while data replication from/to BNL was progressing well, connectivity issues were observed between hosts at CERN connected via the OPN and the PanDA servers/services running at BNL. At the time there were no problems reaching the PanDA services at BNL from CERN hosts that are not routed via the OPN (e.g. lxplus). As a temporary workaround, policy-based routing was re-enabled in the BNL firewall (a generic sketch of this technique follows this list). The issue was followed up with priority at BNL and CERN and, as a result, some configuration changes have been made at BNL. (Update at meeting: problems understood, but fixes not yet in place.)
• CNAF:
  • GGUS alarm ticket routing was found to have been incomplete since 17 July. Fortunately only one real alarm ticket had been raised before this was corrected (on 4 September), and that ticket had in any case already been solved.
  • Have performed the Oracle July patch upgrades and plan to upgrade to CASTOR 2.1.7 this week (!), with a downtime for all VOs. (Update at meeting: postponed till next week.)
• General:
  • gLite 3.1 Update 30 was released. It includes the fix for the VDT limitation of 10 proxy delegations, making WMS proxy renewal usable again by LHCb and ALICE.
  • Update 30 contained a new version of gfal/lcg_utils that was incompatible with gLite 3.0 top-level BDIIs. Advisory EGEE broadcasts to upgrade, or to point to other BDIIs, were made. (Update at meeting: the gLite 3.0 BDII had already been announced as no longer supported.)
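Policy-based routing of the kind BNL re-enabled as a workaround is, on a Linux router, typically expressed with iproute2 rules that send traffic from selected sources through an alternative routing table. The sketch below is a generic illustration of that technique only, not the actual BNL firewall configuration; the subnet, gateway, interface and table number are all hypothetical.

```python
# Generic policy-based routing illustration (iproute2), run as root.
# All addresses and names are hypothetical placeholders.
import subprocess

SERVICE_SUBNET = "192.0.2.0/24"   # hypothetical subnet hosting the services
ALT_TABLE = "100"                 # routing table used for the detour
ALT_GATEWAY = "198.51.100.1"      # hypothetical non-OPN gateway
ALT_DEVICE = "eth1"               # hypothetical interface towards that gateway

def run(cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# 1. Give the alternative table a default route via the non-OPN gateway.
run(["ip", "route", "add", "default", "via", ALT_GATEWAY, "dev", ALT_DEVICE,
     "table", ALT_TABLE])

# 2. Make packets sourced from the service subnet consult that table first.
run(["ip", "rule", "add", "from", SERVICE_SUBNET, "table", ALT_TABLE,
     "priority", "1000"])

# 3. Flush the route cache so the new policy takes effect immediately.
run(["ip", "route", "flush", "cache"])
```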


Experiment reports (1/3)
• ALICE:
  • On 1 September there was an unscheduled power cut in the ALICE server room (bat. 30?) which lasted beyond their local UPS protection time. ALICE production, which depends on the AliEn central services hosted there, stopped for 2 hours.
  • A new, fully backwards-compatible release of AliEn has been made. It is already installed at CERN and some smaller sites, and makes subsequent upgrades of the VOBoxes much easier.
• LHCb:
  • Ongoing data access problems at NL-T1, needing daily restarts of services.
  • Have issued GGUS tickets requesting that space tokens missing at some Tier-1 sites be urgently set up.


Experiment reports (2/3)
• CMS:
  • Magnet running at 3 Tesla over the end-of-August weekend, taking cosmics. Migration of data to the CAF was found to be failing. It was debugged, but the backlog took 2-3 days to recover.
  • On 2 September the DEFAULT service class of the CASTORCMS instance was temporarily unavailable. This is understood to have been caused by a single user filling all the scheduling slots available in the castorcms default service class. They may be able to use cmscaf instead; this is being addressed.
  • The CAF is heavily used: CASTOR pools/subclusters on cmscaf reached a peak of 3.8 GB/s at 2 am CERN time.
  • Performed the last midweek two-day (Wednesday + Thursday) global cosmics run before beam.
  • Requested a customised elog service to be run by IT (initial setup being done by GS). Their elogs over the last two weeks show that most problems at Tier-1s are solved within 24 hours, but this is not the case for Tier-2s (a sketch of extracting such resolution times follows this list).
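The CMS observation about resolution times is the kind of figure that can be extracted directly from elog timestamps. The sketch below shows the bookkeeping with a hypothetical, simplified record format of (tier, opened, closed) tuples; it is not the actual elog schema or the CMS analysis.

```python
# Sketch: fraction of problems resolved within 24 hours, per tier.
# The entries below are illustrative data only, not real elog records.
from collections import defaultdict
from datetime import datetime, timedelta

entries = [
    # (tier, opened, closed)
    ("T1", datetime(2008, 9, 1, 9, 0),  datetime(2008, 9, 1, 17, 30)),
    ("T1", datetime(2008, 9, 2, 14, 0), datetime(2008, 9, 3, 10, 0)),
    ("T2", datetime(2008, 9, 1, 8, 0),  datetime(2008, 9, 4, 12, 0)),
    ("T2", datetime(2008, 9, 3, 11, 0), datetime(2008, 9, 3, 20, 0)),
]

within_24h = defaultdict(int)
total = defaultdict(int)
for tier, opened, closed in entries:
    total[tier] += 1
    if closed - opened <= timedelta(hours=24):
        within_24h[tier] += 1

for tier in sorted(total):
    frac = within_24h[tier] / total[tier]
    print(f"{tier}: {within_24h[tier]}/{total[tier]} problems resolved within 24 h ({frac:.0%})")
```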


Experiment reports (3/3)
• ATLAS:
  • CASTOR services at RAL resumed early in the week, but 4000 files were permanently lost due to an earlier, unrelated disk failure.
  • Known problems exporting data to NL-T1, where the storage setup is thought to be out of balance between disk and tape. (Update at meeting: a further 700 TB of disk has now passed acceptance at SARA and is being prepared for use.)
  • Daily conference calls have started with ASGC to resolve the problem of lack of disk space due to a large volume of ‘dark’ (i.e. not catalogued) data (a sketch of identifying dark data follows this list).
  • Overnight on the 3rd, monitoring in the dashboard for Tier-0 and data export stopped under a huge load of callbacks and timeouts on the server side. It was fixed during the day; I do not know whether it was site-services, DDM or dashboard related.
  • Started a programme to simulate prestaging, reprocessing job load and use of the conditions database by running a large reprocessing test of FDR and cosmics data.
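For the ASGC ‘dark’ data issue, one common way to identify such files is to compare a dump of the storage element namespace against a dump of the experiment's file catalogue and flag everything present on storage but absent from the catalogue. The sketch below illustrates only that comparison; the input file names and formats are assumptions, not the actual ATLAS/ASGC procedure.

```python
# Sketch: compare a storage namespace dump with a catalogue dump to find
# 'dark' (uncatalogued) files. Input files and their format are assumed:
# one path per line in each dump.

def load_paths(filename):
    """Read one path per line into a set, skipping blank lines."""
    with open(filename) as f:
        return {line.strip() for line in f if line.strip()}

storage_dump = load_paths("storage_namespace_dump.txt")   # assumed SE namespace dump
catalogue_dump = load_paths("catalogue_dump.txt")         # assumed catalogue dump

dark = storage_dump - catalogue_dump   # on disk but not catalogued ('dark')
lost = catalogue_dump - storage_dump   # catalogued but missing from storage

print(f"{len(dark)} dark files (candidates for cleanup to recover space)")
print(f"{len(lost)} catalogued files missing from storage")
for path in sorted(dark)[:20]:
    print("DARK:", path)
```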


Summary
(Too) many software upgrades, and we do not see this slowing down. Many miscellaneous failures, which will also continue. Maintaining good-quality LCG services over the next months is going to require constant vigilance and will be labour intensive.