WLCG Service Report
Olof.Barring@cern.ch ~~~ WLCG Management Board, 7th July 2009
Introduction
• Quiet week again
• Decreasing participation
• No alarm tickets
• Incidents leading to post-mortem
  • ATLAS PVSS2COOL post-mortem
  • FZK posted a post-mortem explaining their tape problems during STEP'09
• RAL scheduled downtime for move to new Data Centre
• ASGC recovering?
Decreasing participation (STEP'09)
GGUS summary

  VO       User   Team   Alarm   Total
  ALICE       2      1       0       3
  ATLAS       9     13       0      22
  CMS         3      0       0       3
  LHCb        1     21       0      22
  Totals     15     35       0      50
LHCb Team tickets drifting up?
• Jobs failed or aborted at Tier-2: 8 tickets (5 of these 8 still open, all others closed)
• gLite WMS issues at Tier-1 (temporary): 5
• Data transfers to Tier-1 failing (disk full): 1
• Software area files root-owned: 1
• CE marked down but accepting jobs: 1
• Nothing really unusual
PVSS2COOL incident 27-6 (1/3)
Incident report and affected services:
• On Sunday afternoon 27-6, Viatcheslav Khomutnikov (Slava) from ATLAS reported to the Physics DB service that the online reconstruction was stopped because of an error returned by the PVSS2COOL application (on the ATLAS offline DB). The error started appearing on Saturday (26-6) evening.
PVSS2COOL incident 27-6 (2/3)
Issue analysis and actions taken:
• The error stack reported by ATLAS indicated that the error was generated by a 'drop table' operation being blocked by the custom trigger set up by ATLAS to prevent 'unwanted' segment drops. The trigger has been operational for several months. This information was fed back by the Physics DB service to ATLAS on Sunday evening.
• On Monday morning ATLAS still reported the blocking issue; on further investigation they could not find which table the application (PVSS2COOL) was trying to drop (and thereby causing the blocking error), as the issue appeared in a block of code responsible for inserting data.
• The Physics DB service, in collaboration with ATLAS DBAs, then ran log mining of the failed drop operation and found that the application was indeed trying to drop segments in the recycle bin of the schema owner (ATLAS_COOLOFL_DCS).
• Further investigation with SQL trace by the DBAs showed that Oracle attempted to drop objects in the recycle bin when PVSS2COOL wanted to bulk insert data. This operation was blocked by the custom ATLAS trigger that blocks drops in production, hence the error message originally reported.
• Metalink note 265253.1 further clarified that the issue was a side effect of the expected behaviour of Oracle's space reclamation process.
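The recycle-bin behaviour described above can be inspected directly in Oracle. As a minimal sketch (the schema name is from the report; the dictionary view and PURGE statement are standard Oracle, not taken from the incident itself):

```sql
-- Dropped objects still occupying space in the recycle bin of the
-- schema owner; Oracle may purge these implicitly to reclaim space,
-- which is the behaviour that surfaced during the bulk insert.
SELECT object_name, original_name, droptime, space
  FROM dba_recyclebin
 WHERE owner = 'ATLAS_COOLOFL_DCS';

-- Emptying the recycle bin explicitly (as the schema owner) avoids the
-- implicit purge-on-insert. Note that in this incident even the implicit
-- purge was blocked by the custom 'block drop' trigger, so an explicit
-- purge would presumably need the trigger disabled or an exception added.
PURGE RECYCLEBIN;
```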
PVSS2COOL incident 27-6 (3/3)
Issue resolution and expected follow-up:
• In the evening of 29-6, Physics DB support, in collaboration with ATLAS DBAs, extended the datafile of the PVSS2COOL application to circumvent this space reclamation issue. ATLAS has reported that this fixed the problem.
• Further discussions on the role of the recycle bin and on possible improvements to ATLAS's 'block drop' trigger are in progress to avoid recurrences.
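The actual ATLAS 'block drop' trigger code is not included in the report; a hypothetical sketch of the common Oracle pattern for such a trigger (trigger name and error text invented for illustration) could look like:

```sql
-- Hypothetical sketch only: a schema-level DDL trigger that raises an
-- error on any DROP in the production schema. A trigger of this shape
-- also fires when Oracle implicitly purges recycle-bin segments during
-- space reclamation, producing the interaction described above.
CREATE OR REPLACE TRIGGER block_drop_trg
  BEFORE DROP ON ATLAS_COOLOFL_DCS.SCHEMA
BEGIN
  RAISE_APPLICATION_ERROR(-20001,
    'DROP of ' || ORA_DICT_OBJ_NAME || ' is blocked in production');
END;
/
```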
FZK tape problems during STEP'09
• Jos posted a post-mortem analysis of the tape problems seen at FZK during STEP'09:
  https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/SIR_storage_FZK_GridKa.pdf
• Too long to fit here, but in summary:
  • Before STEP'09: an update to fix a minor problem in the tape library manager resulted in stability problems. Possible cause: SAN or library configuration. Both were tried and the problem disappeared, but which one was the root cause?
  • First week of STEP'09: the second SAN had reduced connectivity to dCache pools, not enough for CMS and ATLAS at the same time, so CMS was asked not to use tape. Many problems: hardware (disk, library, tape drives) and software (TSM).
  • Second week of STEP'09: adding two more dedicated stager hosts resulted in better stability. Finally getting stable rates of 100-150 MB/s.
RAL scheduled downtime for DC move
• Friday 3/7: reported still on schedule for restoring CASTOR and Batch on Monday 6/7
• Despite presumably hectic activity with equipment moves, RAL continued to attend the daily conf call
• Planning and detailed progress reported at:
  http://www.gridpp.rl.ac.uk/blog/category/r89-migration

  "R89 Migration: Friday 3rd July. Posted by Andrew Sansum, 12:00. Our last dash towards restoration of the production service is under way. All racks of disk servers have now had a first-pass check. The faults list currently stands at 11 servers, although some of these may well be trivial. We expect to provide a large number of disk servers to the CASTOR team later today."
ASGC instabilities
• ATLAS reported instabilities at the beginning of the week
  • Monday: functional tests worked, but still some problems with Tier-1/Tier-2 transfers
  • Another unscheduled downtime (recabling of CASTOR disk servers)
• CMS allowed a full week's grace period for ASGC to recover from all its problems
  • No new tickets, and open tickets put on hold
  • Resume on Monday 6/7
• Both ATLAS and CMS specific site tests changed from red to green during the week
• Friday 3/7: Gang reports that tape drives and servers are online
Summary
• Daily meeting attendance is degrading (holidays?)
• No new serious site issues
• RAL's long downtime for the DC move is progressing to plan (Tuesday report: RAL back, apart from CASTOR ATLAS; some network instability)
• Tape problems at FZK during STEP'09 understood
• ASGC is recovering?