WLCG ‘Weekly’ Service Report ~~~ Harry.Renshall@cern.ch ~~~ WLCG Management Board, 22 July 2008



Introduction
• This ‘weekly’ report covers two weeks (MB summer schedule).
• Last week (7 to 12 July):
  • Tuesday: MB F2F
  • Wednesday: GDB, C-RSG
  • Friday: OB
• This week (14 to 20 July):
  • Monday: CMS CRUZET 3 cosmic ray run finished
• Notes from the daily meetings can be found at: https://twiki.cern.ch/twiki/bin/view/LCG/WLCGOperationsMeetings
• (Some additional information comes from CERN C5 reports and other sources.)


C-RSG
• All reviewers have had one or more meetings with their experiments and are filling in a common (but adaptable) template leading to the 2009 resource requirements. Executive summary per experiment:
  • ALICE: email exchanges and a teleconference have taken place and ALICE have completed their template, but follow-up is needed.
  • ATLAS: only one reviewer was available; a first iteration of the template was done but only partially completed. The second reviewer is now active.
  • CMS: template fully complete (it maps directly onto the CMS computing model). Heavy-ion running is not being reviewed at this time (it is separately funded outside of CERN).
  • LHCb: full information was given to enable the template to be adapted and completed.
• The group notes that they will have to renormalise the resulting experiment numbers to a common set of assumptions on the LHC running conditions.
• The plan is to report on the scrutiny of the validity of the 2009 resource requests in August. The CSO has agreed that these can already be made public, though more detail may be added for the November C-RRB. In future years there may be a C-RRB in the summer to review the Scrutiny Group reports for the following year, given the need to start hardware procurement well in advance of need.
• The group also had a report on the results of the Common Computing Readiness Challenge at its fourth meeting. The group will meet in August to finalise the 2009 reports and then decide the dates of one or more autumn meetings once they see how the LHC is performing, bearing in mind that they have to report finally to the C-RRB meeting of 11 November.


OB
• The OB heard an LCG project status report from I. Bird, a CCRC’08 post-mortem report from myself (including a SWOT analysis) and a report from myself on the procedural progress of the C-RSG. The weaknesses are seen as:
  • Some of the services – including but not limited to storage / data management – are still not sufficiently robust.
  • Communication is still an issue / concern. This requires work / attention from everybody – it is not a one-way flow.
  • Not all activities (e.g. reprocessing, chaotic end-user analysis) were fully demonstrated even in May, nor was there sufficient overlap between all experiments (and all activities).
• The main threat perceived by the WLCG management is that of falling back from reliable service mode into “fire-fighting” at the first sign of serious problems. However, a consistent message is being given that experiments, sites and WLCG are ‘more or less’ ready for the expected 2008 data taking, although constant attention will be needed at all levels.


Site Reports (1/2)
• CNAF:
  • 10 July: submitted a post-mortem on the recent power and network switch problems. Full services were reported running by 19 July.
• BNL:
  • 7 July: the primary link to TRIUMF failed due to an outage in the Seattle area, and failover to the secondary route via the CERN OPN did not come up. The workaround was to turn off the primary interface at BNL or TRIUMF, but a proper solution is still being worked on.
  • 9 July: a storage server network connection failure took some time to solve, with various components being changed. It left some ATLAS files inaccessible.
  • 14 July: the inaccessible-file problem was understood and put down to a problem introduced by dCache patch level 8. Files which for some reason failed to transfer out of BNL were left pinned by dCache; an SRM transfer first tries to pin a file and gives up when it cannot, although other access methods still work. The workaround is to periodically look for such pinned files and unpin them (see the sketch after this slide); there is no long-term solution yet. Sites have been alerted, but the problem is probably now also being seen at IN2P3 after their p8 upgrade.
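The BNL workaround is only described at a high level above; a minimal sketch of what such a periodic unpin job could look like is shown below, in Python. The helper commands list_stale_pins and release_pin, the age threshold and the check interval are all hypothetical stand-ins for whatever dCache admin-interface calls and site policy actually apply, so this illustrates the idea of the workaround rather than BNL's actual script.

#!/usr/bin/env python3
"""Periodic clean-up of stale dCache pins left behind by failed SRM transfers.

Illustrative sketch only: list_stale_pins and release_pin are hypothetical
site-specific wrappers around the dCache admin interface (not real dCache
commands), and the thresholds below are arbitrary.
"""

import subprocess
import time

MAX_PIN_AGE_SECONDS = 6 * 3600    # pins older than this are assumed to belong to failed transfers
CHECK_INTERVAL_SECONDS = 1800     # how often the clean-up loop runs


def list_stale_pins(max_age_seconds):
    """Return the PNFS IDs of pins older than max_age_seconds (hypothetical helper)."""
    result = subprocess.run(
        ["list_stale_pins", "--older-than", str(max_age_seconds)],
        capture_output=True, text=True, check=True,
    )
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


def release_pin(pnfs_id):
    """Release a single pin (hypothetical helper wrapping the admin interface)."""
    subprocess.run(["release_pin", pnfs_id], check=True)


def main():
    # Loop forever: find pins left behind by transfers that already failed
    # and release them so that the files become accessible again.
    while True:
        for pnfs_id in list_stale_pins(MAX_PIN_AGE_SECONDS):
            release_pin(pnfs_id)
            print("unpinned", pnfs_id)
        time.sleep(CHECK_INTERVAL_SECONDS)


if __name__ == "__main__":
    main()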


Site Reports (2/2)
• FZK:
  • 19 July at about 19:20 a major network router failed and almost all services were affected. Some services were up again on Sunday but some are still degraded or unavailable (as of 13:00 Monday); in particular, some dCache pool nodes are not yet available. The site is working on it and a post-mortem analysis will follow.
• General:
  • 17 July: GGUS conducted the first service verification of the Tier 1 site operator alarm ticket procedure. The failures of the procedure at NDGF and CERN are understood and are being fixed.


Experiment reports (1/3)
• LHCb:
  • DC06 simulation is running smoothly under DIRAC3, but reconstruction and stripping tests are still ongoing, so there is no official date yet for the start of DC06.
• ALICE:
  • Production was hit by myproxy problems – see the post-mortem at https://twiki.cern.ch/twiki/bin/view/FIOgroup/ScLCGPxOperations
  • Working on the integration of the CREAM-CE with AliEn.


Experiment reports (2/3)
• CMS:
  • The CRUZET 3 cosmics run took place from 7 to 14 July. It was quite a good experience, more mature in terms of data handling in general. Reconstruction submissions to all Tier 1s are ongoing. CMS are preparing for the next global cosmics exercise in the second half of August, but expect weekly cosmics data tests on Wednesday and Thursday.
  • Work has been finalised on the P5 -> CERN transfer system, and a repacker replay has been running since 17 July, i.e. redoing the repack for the CRUZET-3 data. Plans: next Monday CMS will start more replays with some real T0 prompt-reconstruction testing.
  • CMS expect a centrally triggered large transfer load of many CSA07 MC datasets to the CMS T2s, as a needed step to complete the migration of user analysis to T2 sites. Each T2 should expect to be asked to host a fraction of ~30 TB of those datasets.
  • CMS have a CASTOR directory of 2.3 million files of 160 KB each (about 370 GB in total) which are webcam dumps and have gone to tape. They are looking at deleting them and at stopping fresh ones from being written (a clean-up sketch follows this slide).
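The webcam-dump clean-up is only mentioned above as an intention; a minimal sketch of what such a bulk deletion could look like is shown below, assuming the standard CASTOR name-server client tools nsls and nsrm and using a placeholder directory path (the real path and the procedure CMS eventually chose are not given here).

#!/usr/bin/env python3
"""Sketch of a bulk CASTOR namespace clean-up for the webcam-dump files.

Illustration only: the directory path is a placeholder and the sketch assumes
the CASTOR name-server client tools nsls and nsrm are available; it is not
the clean-up CMS actually performed.
"""

import subprocess

# Placeholder: the real CASTOR directory is not given in the report.
WEBCAM_DIR = "/castor/cern.ch/cms/PLACEHOLDER/webcam-dumps"


def list_entries(directory):
    """List the entries of a CASTOR directory via nsls."""
    result = subprocess.run(
        ["nsls", directory], capture_output=True, text=True, check=True
    )
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


def main():
    entries = list_entries(WEBCAM_DIR)
    print("found %d entries under %s" % (len(entries), WEBCAM_DIR))
    for name in entries:
        # nsrm removes the file's entry from the CASTOR name server.
        subprocess.run(["nsrm", "%s/%s" % (WEBCAM_DIR, name)], check=True)


if __name__ == "__main__":
    main()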


Experiment reports (3/3)
• ATLAS:
  • CERN CASTORATLAS was upgraded to 2.1.7-10 on 14 July to avoid a fatal data-size overflow problem.
  • ATLAS taking cosmics with test triggers resulted in some very large datasets being (successfully) distributed to BNL.
  • ATLAS are now running cosmics at weekends. On 20 July the ATLAS CERN site services got stuck, but the resulting T0 to T1 catch-up when the services were restarted on Monday morning reached an impressive 2.5 GB/s. Clearly more process-monitoring alarms are needed.
  • ATLAS workflow management bookkeeping needs process-level access to their elog instance (via an elog API call); this is about to be made available after a security analysis.

Summary
• Solid progress on many fronts.
