First test of the PoC

Caveats
• I am not a developer ;)
• I was also a beta tester of Crab3+WMA in 2011; I restarted testing it ~2 weeks ago to have a 1-to-1 comparison
• The first 2 weeks of the PoC test were mainly
  – Finding a problem
  – Communicating with the developers
  – Getting a new version
  – Trying again
• I simply skip this part, which is ok; I speak about the results after all the fixes

What I tested (with both)
• A complicated workflow: the official (V)H->bb analysis step 1 (see https://twiki.cern.ch/twiki/bin/view/CMS/VHbbAnalysisNewCode#NtupleV42_CMSSW_5_3_3_patch2), which takes ~2 hours just to compile
  – Indeed ISB ~ 45 MB, with 56 user compiled libraries
• Running on dataset /DoubleElectron/Run2012B-PromptReco-v1/AOD
  – 40 LS/job -> ~1200 jobs, a couple of hours each

Where I tested
• CRAB3/Panda: test is restricted to a few sites (FNAL, Pisa, DESY, …)
  – The sample is indeed just at FNAL and Pisa among the PoC sites
• CRAB3/WMA: 8 T2s available, some of poor quality (T2_RU_*)
• Always used Pisa as storage site

Moreover
• PoC is not expected to provide full Crab3 functionality, just (as in the email I got)
  – Submit
  – Resubmit
  – Kill
  – Status
  – Getoutput
  – Getlog
• So I stick to these also for Crab3/WMA (i.e. I do not do DBS publication)

Configs

Panda

    from WMCore.Configuration import Configuration
    import os
    from datetime import datetime

    config = Configuration()

    config.section_("General")
    config.General.serverUrl = 'poc3test.cern.ch'
    config.General.ufccacheUrl = 'cmsweb-testbed.cern.ch'

    config.section_("JobType")
    config.JobType.pluginName = 'Analysis'
    config.JobType.psetName = 'patData.py'

    config.section_("Data")
    config.Data.inputDataset = '/DoubleElectron/Run2012B-PromptReco-v1/AOD'
    config.Data.publishDataName = os.path.basename(os.path.abspath('.')) + "_tom"
    config.Data.lumiMask = 'Lumi.json'
    config.Data.publishDbsUrl = "https://cmsdbsprod.cern.ch:8443/cms_dbs_ph_analysis_02_writer/servlet/DBSServlet"
    config.Data.splitting = 'LumiBased'
    config.Data.unitsPerJob = 40

    config.section_("User")
    config.User.email = ''

    config.section_("Site")
    config.Site.storageSite = 'T2_IT_Pisa'

WMA

    from WMCore.Configuration import Configuration
    import os

    config = Configuration()

    config.section_("General")
    config.General.requestName = 'request_name2'
    config.General.serverUrl = 'crab3-test.cern.ch'
    config.General.ufccacheUrl = 'cmsweb.cern.ch'

    config.section_("JobType")
    config.JobType.pluginName = 'Analysis'
    config.JobType.psetName = 'patData.py'

    config.section_("Data")
    config.Data.inputDataset = '/DoubleElectron/Run2012B-PromptReco-v1/AOD'
    config.Data.splitting = 'LumiBased'
    config.Data.unitsPerJob = 40
    config.Data.lumiMask = 'Lumi.json'

    config.section_("User")
    config.User.email = ''

    config.section_("Site")
    config.Site.storageSite = 'T2_IT_Pisa'
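To make the 'LumiBased' splitting with 40 units per job concrete, here is a minimal sketch of how a lumi mask could be cut into jobs. It is only an illustration and not the splitter used by either system: the function names and the example mask are hypothetical, the mask is assumed to have the usual {"run": [[first, last], ...]} layout, and a real splitter most likely takes more into account (for example which files hold which lumi sections).

```python
import json


def count_lumis(lumi_mask):
    """Count lumi sections in a mask of the form {"run": [[first, last], ...]}."""
    return sum(last - first + 1
               for ranges in lumi_mask.values()
               for first, last in ranges)


def split_by_lumis(lumi_mask, units_per_job):
    """Yield jobs as lists of (run, lumi) pairs, at most units_per_job per job."""
    job = []
    for run in sorted(lumi_mask, key=int):
        for first, last in lumi_mask[run]:
            for lumi in range(first, last + 1):
                job.append((int(run), lumi))
                if len(job) == units_per_job:
                    yield job
                    job = []
    if job:
        # Last job holds whatever is left over.
        yield job


# Hypothetical mask for illustration, NOT the Lumi.json used in the test.
example_mask = json.loads('{"194050": [[1, 100]], "194051": [[5, 29]]}')
jobs = list(split_by_lumis(example_mask, units_per_job=40))
print(count_lumis(example_mask), len(jobs), [len(j) for j in jobs])
# prints: 125 4 [40, 40, 40, 5]
```

By the same arithmetic, ~1200 jobs of 40 LS each correspond to a mask of roughly 48,000 lumi sections.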

Soon after submit

    bash-3.2$ crab status -t crab_20121127_113729 -i
    Registering user credentials
    Task name: tboccali_crab_20121127_113729_121127_103859
    Panda url: http://panda.cern.ch/server/pandamon/query?job=*&jobsetID=19&user=Tommaso%20Boccali
    Details: running 0.78 % (10/1279)
             activated 99.22 % (1269/1279)
    Information per site are not available.
    Log file is /afs/cern.ch/work/b/boccalio/PoC/CMSSW_5_3_3_patch2/src/VHbbAnalysis/HbbAnalyzer/test/PoCTests/crab_20121127_113729/crab.log

No information per site, link to monitoring present

    bash-3.2$ crab status -t crab_request_name2 -i
    Registering user credentials
    Task Status: running
    Using 7 site(s):
    Jobs Details: submitted 100.00 % ( running 44.31 % pending 55.69 % )
    T2_US_Florida: submitted 14.58 %
    T2_FR_GRIF_IRFU: submitted 14.58 %
    T2_RU_JINR: submitted 14.58 %
    T2_UK_London_IC: submitted 12.54 %
    T2_FR_GRIF_LLR: submitted 14.58 %
    T2_IT_Pisa: submitted 14.58 %
    T2_ES_IFCA: submitted 14.58 %
    Log file is /afs/cern.ch/work/b/boccalio/PoC/CMSSW_5_3_3_patch2/src/VHbbAnalysis/HbbAnalyzer/test/Crab3Tests/crab_request_name2/crab.log

(no link to dashboard?) – one has to find it by hand
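If one wants to follow a task from a script instead of by eye, the "state NN.NN % (n/N)" entries of the first status output can be picked up with a regular expression. This is a rough sketch that assumes the output keeps exactly that textual shape; the helper name is hypothetical and nothing here is part of the crab client.

```python
import re

# Matches entries such as "running 0.78 % (10/1279)".
STATE_ENTRY = re.compile(r"(\w+)\s+([\d.]+)\s*%\s*\((\d+)/(\d+)\)")


def parse_states(text):
    """Return {state: (count, total)} for every 'state NN.NN % (n/N)' entry."""
    return {m.group(1): (int(m.group(3)), int(m.group(4)))
            for m in STATE_ENTRY.finditer(text)}


# The two lines of interest, copied from the status output above.
status_text = """Details: running 0.78 % (10/1279)
activated 99.22 % (1269/1279)"""

states = parse_states(status_text)
for state, (count, total) in states.items():
    # Recompute the percentage from the raw counts as a cross-check.
    print(f"{state}: {count}/{total} = {100.0 * count / total:.2f} %")
# prints:
# running: 10/1279 = 0.78 %
# activated: 1269/1279 = 99.22 %
```

The second output reports percentages only, without job counts, so this pattern does not apply to it as is.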

Few Considerations
• Let’s start from the obvious: with both systems I reached 100% done, with some “resubmit” (site problems)
• Feature: with Panda a resubmit is a second task (with a second web page)… I am not used to it, but it is not a critical issue (you just need to get used to it)

ASO
• It worked flawlessly in both cases
• Nothing more to say I guess …
• (I did not even need to look into the ASO monitoring)
• You can get the files before ASO has operated (I guess lcg-cp is used, …)

Issues with Panda
• Kill did not work for me; I understood it was simply a timeout to be set to a different threshold, and I did not check further

Is resubmit working fine?
• In both cases, it was for me
• Caveat: the PoC-enabled sites are generally good/very good. No chance to test a massive failure scenario

Let’s go straight to the point
• Up to here, the executive summary could be:
• “Limiting the scenario to what the PoC is supposed to allow me to do, PANDA performs at least as well as WMA”
• (again, this _after_ the two weeks of initial testing)

What is different
• Panda Monitoring seems by far better than what we are used to

Dashboard/WMA… (as usual)

…Plus WMStats
Some debugging info added, but not that much (where is the WN name? where is the LSF id?)

Features we usually do not have
• All the logs (pilots + stderr + stdout) are on the web
  – All: not only snippets for failed jobs
  – I guess ph support would love it, instead of asking to upload logs
  – Support can get all the info from the web, no need to ask the (maybe not too skilled) user
  – Snippets are not ok in general: a failure can depend on a bad environment variable … which cannot be seen from the snippet alone
• There is a link PILOT <-> LSF id!
• This I considered lost since we left gLite, and it is a MAJOR help to debug strange problems (like WNs acting as black holes)
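As an illustration of why knowing the WN behind each job helps, here is a small sketch that flags "black hole" worker nodes, i.e. nodes where most jobs fail almost immediately. Everything in it is an assumption made for the example: the record layout (worker node, success flag, wall time), the thresholds, and the node names are hypothetical, and it is not something Panda or WMA provides.

```python
from collections import defaultdict


def find_black_holes(jobs, min_jobs=5, max_fail_seconds=120, fail_fraction=0.8):
    """Flag worker nodes where most jobs fail within max_fail_seconds.

    jobs: iterable of (worker_node, succeeded, wall_seconds) tuples.
    All thresholds are illustrative guesses, not tuned values.
    """
    per_node = defaultdict(list)
    for node, succeeded, wall_seconds in jobs:
        per_node[node].append((succeeded, wall_seconds))

    suspects = []
    for node, records in per_node.items():
        if len(records) < min_jobs:
            continue  # too few jobs to say anything about this node
        fast_failures = sum(1 for ok, wall in records
                            if not ok and wall <= max_fail_seconds)
        if fast_failures / len(records) >= fail_fraction:
            suspects.append(node)
    return sorted(suspects)


# Hypothetical job records: wn001 behaves normally, wn042 eats jobs in seconds.
job_records = (
    [("wn001", True, 7200), ("wn001", True, 6900), ("wn001", False, 7000),
     ("wn001", True, 7100), ("wn001", True, 6800)]
    + [("wn042", False, 30)] * 6
    + [("wn042", True, 7000)]
)
print(find_black_holes(job_records))
# prints: ['wn042']
```

Without the WN name in the job information, the grouping step is impossible and such a node can only be found by asking the site.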

Pilot log WN LSF id

logs
Full logs uploaded to SE (full logs present, not just the snippets the system guessed to be interesting)

Other features I liked
• Panda seems user friendly when scheduling jobs: if you submit a task, even if your priority is very low, a few jobs are executed almost immediately, allowing you to spot broken workflows in advance
• It seems I can resubmit at any time (no need to wait for the task in cooloff …)
  – Is it because ACDC is not in the game? Is there anything we pay for this (side effects I am not aware of?)

Conclusions?
• As said, functionally both were doing what was asked
  – PANDA does not look at all behind
• I cannot speak about what is NOT supposed to be in the PoC (which is not a small subset)
• The major differences to me are
  – Monitoring: way better in PoC with full disclosure of all the info
  – The early prioritization of some jobs is a lot of help (goes far beyond a simple python sanity check)
  – You seem to be able to resubmit at any time, no cool off needed; this potentially cuts the time to process tails