CMS Service Challenge 4 and beyond
Stefano Belforte, INFN Trieste
CMS SC4 etc., July 5, 2006
CMS SC4 targets/milestones
- CMS has set targets for SC4 since the beginning of 2006
  - Communicated them to all our sites
  - Posted on Twiki, Web, Agendas, mails…
- Aimed to smoothly roll in a WLCG service that CMS can use in October 2006 to test the computing flow at 25% of 2008 scale (CSA06)
- Establish early in the year a baseline of non-demanding continuous file transfers (20 MB/sec per site, disk-only). Avoid the SC3 syndrome.
- Targets for June
  - Have all CMS Tier 1 and most Tier 2 on board
  - Demonstrate data transfers at realistic rates (tape at T1)
    - 150 MB/sec out of CERN total (600 MB/sec in 2008)
  - Ramp up job submission rate over WLCG to 25K jobs/day
    - 90% submission efficiency
    - double to 50K jobs/day by October (200K target in 2008)
  - 1 MB/sec/execution-slot from disk to CPU
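The relation between the 2006 targets and the 2008 figures quoted above can be checked with a few lines of arithmetic. The sketch below is only illustrative: the function names are made up for this example, and the daily-volume figure assumes a perfectly sustained rate and 1 TB = 10^6 MB.

```python
# Targets quoted on the slide, paired with the 2008 full-scale figures.
TARGETS = {
    "CERN export rate (MB/s)": (150, 600),
    "job submission, June (jobs/day)": (25_000, 200_000),
    "job submission, October (jobs/day)": (50_000, 200_000),
}

SECONDS_PER_DAY = 24 * 60 * 60


def fraction_of_2008(target, full_scale):
    """Return a 2006 target as a fraction of the corresponding 2008 figure."""
    return target / full_scale


def daily_volume_tb(rate_mb_per_s):
    """Volume moved in one day at a sustained rate, in TB.

    Assumption: 1 TB = 10**6 MB and the rate is held for the full 24 hours.
    """
    return rate_mb_per_s * SECONDS_PER_DAY / 1_000_000


for name, (target, full_scale) in TARGETS.items():
    print(f"{name}: {fraction_of_2008(target, full_scale):.1%} of 2008")

# The 20 MB/sec per-site baseline expressed as a daily volume.
print(f"20 MB/s baseline: {daily_volume_tb(20):.3f} TB/day per site")
```

The June job submission target is half of the 25% scale; the October target and the CERN export rate both sit at 25%.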
Spring summary
- Could not establish a working baseline in the spring
  - in spite of starting activity in early April
  - srmCopy, srm push/pull, Castor 2, WLCG tests, FTS, gLite 3.0 …
- Used the time to get ready for SC4:
  - FTS integrated in PhEDEx
  - Develop job submission/monitoring tools
  - Integrate in new CMS software framework
- Started SC4 at beginning of June
June summary
- Steady progress, but not there yet
- Finally have all Tier 1's and 20 Tier 2's (including prospective ones) on board: run one CMS simulation job, transfer 100 GB of CMS data, run one CMS analysis job on local data
- Transfers: a few channels work, mostly below target rate
  - much better on OSG side
  - Ad hoc effort started in coordination with WLCG, daily meetings, several CMS persons involved
- Job submission rate: reached what seems to be the LCG RB scale limit (5K jobs/day/RB)
- 90% job success
  - OK for WMS (OSG and EGEE)
  - OK for single sites at certain times
  - Not achieved yet on all sites for the same day (not even all T1's)
- Not started measuring disk/CPU throughput
T0 to T1 transfers
[figure]
Job submission rates
- Imperial College WMS monitoring lists jobs by RB each day for all VOs
  - http://gridportal.hep.ph.ic.ac.uk/rtm/reports.html
- Can hardly find an RB with more than 5K jobs/day
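If the observed ceiling of about 5K jobs/day per RB is taken at face value, the number of RBs needed for each submission target follows directly. This is a rough sketch under that assumption; `rbs_needed` is a name invented for the example, and it ignores load imbalance between RBs.

```python
import math

# Apparent per-RB limit reported in this talk (jobs/day).
JOBS_PER_DAY_PER_RB = 5_000


def rbs_needed(jobs_per_day, per_rb=JOBS_PER_DAY_PER_RB):
    """Minimum number of RBs to sustain a daily job rate, assuming even load."""
    return math.ceil(jobs_per_day / per_rb)


for label, rate in [("June target", 25_000),
                    ("October target", 50_000),
                    ("2008 target", 200_000)]:
    print(f"{label}: {rate} jobs/day -> {rbs_needed(rate)} RBs")
```

Under this assumption the October target alone already calls for 10 RBs.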
Job success rate by site: two typical days
[figure: job success by site for two days, with the annotations below]
- 75% success overall
- Jobs not sent to CE yet, no LB news (overloaded RB)
- Sites go in/out of BDII and RB cannot match
- Two sites with middleware problems
- CNAF lost cooling
- Random failures at manageable level
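A per-day summary of this kind can be computed from per-site job counts. The sketch below uses invented site names and counts purely for illustration (they are not the data shown on the slide) and checks each site against the 90% target.

```python
def summarize_day(day_counts, target=0.90):
    """Return (overall success rate, sorted list of sites below target).

    day_counts maps a site name to (succeeded, submitted).
    A site with no submitted jobs is reported as below target,
    since it did not demonstrate the required success rate.
    """
    total_ok = sum(ok for ok, _ in day_counts.values())
    total = sum(n for _, n in day_counts.values())
    below = sorted(
        site for site, (ok, n) in day_counts.items()
        if n == 0 or ok / n < target
    )
    overall = total_ok / total if total else None
    return overall, below


# Hypothetical counts, NOT taken from the talk.
example_day = {
    "site_a": (95, 100),
    "site_b": (40, 100),
    "site_c": (90, 100),
}

overall, below = summarize_day(example_day)
print(f"overall success: {overall:.0%}, below target: {below}")
```

A single badly failing site is enough to pull the overall rate well under 90% even when the others meet the target, which is why the per-site view matters.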
Sites issues
- We find sites in general responsive to CMS needs and problems
  - Thank you!!!
- gLite 3.0 not really there yet as a production service
  - FTS servers/channels not configured/exercised using CMS end points. Data transfer is an end-to-end activity.
  - Hints of configuration problems in gLite 3.0 LCG flavor CE
    - All CMS jobs fail with Maradona error on selected CE's
      - other VOs OK, SFT all green
      - a couple of sites kept failing all jobs for more than 2 weeks
      - a 3rd site apparently went back to LCG 2_7_0 CE after a few days
- Not using the same infrastructure as WLCG throughput tests yet
  - E.g. Castor 1 instead of Castor 2 at Tier 1's
    - Could not get a clean schedule for Castor 2 availability for CMS
  - Want to be on final production infrastructure asap
    - Not be connected to something that in your opinion will not work
Summer plans
- Keep up basic SC4 activity:
  - File transfers up to target, then stay there
  - Job submission at ~12K jobs/day now, 10K jobs/day demonstrated on OSG using Condor-G earlier on
- Add more complexity
  - Simulate 1M events/day in July/August to prepare for CSA06
  - Commission gLite WMS (RB, maybe also CE) to reach 50K jobs/day without 10 RB's
  - Add calibration/conditions data remote access
  - Test data serving throughput at sites (disk to CPU)
  - Allocate CPU resources separately for production and analysis jobs. Grid should not be a global FIFO queue
- It will be a busy summer
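Two of the numbers in these plans translate into simple rates. The sketch below converts the 1M events/day simulation goal into an average rate in Hz, and estimates the disk-to-CPU serving rate a site would need under the 1 MB/sec per execution slot target stated earlier. The 200-slot site is a hypothetical example, and both function names are invented here.

```python
SECONDS_PER_DAY = 86_400


def events_per_day_to_hz(events_per_day):
    """Average event rate in Hz for a given daily event count."""
    return events_per_day / SECONDS_PER_DAY


def site_serving_rate_mb_s(execution_slots, mb_per_s_per_slot=1.0):
    """Aggregate disk-to-CPU rate (MB/s) if every slot reads at the target rate."""
    return execution_slots * mb_per_s_per_slot


print(f"1M events/day = {events_per_day_to_hz(1_000_000):.1f} Hz on average")

# Hypothetical site size, used only to illustrate the scaling.
print(f"200 slots need {site_serving_rate_mb_s(200):.0f} MB/s from disk")
```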
After the Summer
- CSA06 in October 2006: from CERN disk to Tier 0 to Tier 1 and Tier 2, demonstrating reconstruction, analysis, calibration, reprocessing at 25% of 2008 scale
  - 35-40 Hz
  - Over a month
  - Would like to try higher throughput
- Well captured in Harry Renshall's twiki
  - https://twiki.cern.ch/twiki/bin/view/LCG/SC4ExperimentPlans
- Ramp up to 2008 pledges
  - We caution sites against planning for a large increase in capacity in a short time. Every x2 is a challenge
  - True for Tier 1 and Tier 2
  - We look forward to stress testing sites as soon as resources are available
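To get a feel for the CSA06 scale, the quoted 35-40 Hz can be turned into a total event count and a full-scale equivalent rate. This is a back-of-the-envelope sketch: it assumes a 30-day month of uninterrupted running, which the talk does not state, and the function names are illustrative.

```python
SECONDS_PER_DAY = 86_400


def events_in_period(rate_hz, days=30):
    """Events processed at a constant rate; assumes continuous running."""
    return rate_hz * SECONDS_PER_DAY * days


def full_scale_rate(rate_hz, scale_fraction=0.25):
    """Rate implied at 100% scale if rate_hz corresponds to scale_fraction."""
    return rate_hz / scale_fraction


for hz in (35, 40):
    print(f"{hz} Hz for 30 days: {events_in_period(hz):,} events "
          f"(full scale: {full_scale_rate(hz):.0f} Hz)")
```

Any downtime during the month reduces the total in proportion, so these counts are upper bounds for the stated rates.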