Simulation (Monte Carlo) Production for CMS
Stefano Belforte (INFN Trieste)
WLCG Tier 2 workshop, CMS Simulation at Tier 2, June 12, 2006
Outline
- What Monte Carlo simulation is
- How it is done at CMS
- What the Tier 2 role is
- What services Tier 2s need to provide
- Tutorials later this week cover exactly how to set up and operate those services
Simulation Production
- Simulation (Monte Carlo) production means
  - Generate, simulate and reconstruct data that look like data coming from the detector, i.e. as realistic as possible, but entirely simulated
- Ideally:
  - Input is a configuration (a few KB), output is a dataset (TB)
  - Heavily CPU-bound activity
- In practice:
  - Simulation, reconstruction and the addition of multiple interactions (pileup) look more like analysis: data in, executable, data out
  - Some steps may be I/O bound
  - CMS will try to combine the steps into one flow and run CPU-bound jobs
- Expect a lot of CPU-bound jobs, but do not build a Tier 2 thinking only of CPU
  - A Tier 2 also needs to run analysis anyhow
Simulation at Tier 2
- In the CMS Computing Model and Computing-TDR, Tier 2s are the place where MC production is run
  - http://cmsdoc.cern.ch/cms/cpt/tdr/
- Tier 2s comply with this requirement by offering a computing service that the CMS MC Operation team uses to produce simulation datasets according to CMS policy (defined by the CMS physics groups)
- Tier 2s do not decide autonomously what data need to be produced
- There will be “user level MC”: small productions that satisfy the needs of a single user and do not go through physics group scheduling
  - Could still be managed centrally or run directly by that user
  - To a Tier 2 it will look the same: a bunch of grid jobs
- Tier 2s are not handed a list of jobs to run; they are used as service providers
- Tier 2s are “good grid sites that offer CMS-needed services”
Simulation Workflow
- Simulation is a grid-based activity
  - Jobs are submitted centrally
  - Expect 2-5 groups of operators to be able to keep the system busy and manage all production
  - Local submission is not forbidden, but uneconomical
- Automated tools will be used
  - Central request queue (ProductionManager)
  - Chunks of (thousands of) jobs are handed over to a few ProductionAgents that submit, track, bookkeep, store output at its final location and register it in CMS databases
- Each ProductionAgent will need a human being (or two)
  - Ideally one ProductionAgent can fill the grid
  - In practice the humans behind it will need to deal with operational failures on the grid, hence the need for more people and split competences
  - E.g. one group for OSG, one for EGEE
- Your customers will be remote users
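The hand-over from a central request queue to a few agents can be pictured with a minimal sketch. It is only an illustration under stated assumptions: the function names, the chunk size and the agent labels are hypothetical and do not describe the actual ProductionManager or ProductionAgent interfaces.

```python
def split_into_chunks(n_jobs, chunk_size):
    """Split a request for n_jobs into (first_job, last_job) index ranges."""
    return [(start, min(start + chunk_size, n_jobs) - 1)
            for start in range(0, n_jobs, chunk_size)]


def assign_chunks(chunks, agents):
    """Hand chunks to agents in round-robin order; returns {agent: [chunks]}."""
    assignment = {agent: [] for agent in agents}
    for i, chunk in enumerate(chunks):
        assignment[agents[i % len(agents)]].append(chunk)
    return assignment


# Hypothetical request: 10000 jobs, chunks of 3000, one agent per grid.
chunks = split_into_chunks(n_jobs=10000, chunk_size=3000)
plan = assign_chunks(chunks, ["agent_osg", "agent_egee"])
for agent, assigned in plan.items():
    print(agent, assigned)
```

Round-robin is used here only because it is the simplest policy to read; a real system would presumably weigh agents by the capacity they can reach.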
Simulation Dataflow
- Simulation is an expensive task (many CPU hours), so its output is precious
- Data will be stored at CMS Tier 1s
  - Backup on tape
  - May need to re-process with new reconstruction
  - May need to distribute to more T2s
  - Frees disk while keeping the data available from tape even years later
- But job output does not go directly from WN to T1:
  - Each WN output is stored locally
  - Small files are merged locally (another set of central jobs)
    - Heavily I/O bound!
  - A validation step is run locally
  - Large files are sent to T1
  - Sometimes the output is short-lived: it may stay a week at the T2, then get deleted
- Full integration in the CMS data management system is needed
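The local merge step described above can be sketched as a simple grouping of small output files into batches of a target size. This is an assumption-laden illustration: the file names, sizes and the 2000 MB target are made up, and the real merge jobs are defined centrally by CMS, not by this logic.

```python
def plan_merges(file_sizes_mb, target_mb):
    """Group files, in order, into batches of at most target_mb each.

    A single file larger than target_mb ends up in a batch of its own.
    """
    batches, current, current_size = [], [], 0
    for name, size in file_sizes_mb:
        # Close the current batch if adding this file would exceed the target.
        if current and current_size + size > target_mb:
            batches.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        batches.append(current)
    return batches


# Hypothetical worker-node outputs (name, size in MB).
small_files = [("out_1.root", 400), ("out_2.root", 700),
               ("out_3.root", 900), ("out_4.root", 300)]
batches = plan_merges(small_files, target_mb=2000)
print(batches)
```

Each batch corresponds to one merge job that reads all its inputs and writes one large file, which is why this step is dominated by I/O rather than CPU.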
Resource management
- In the CMS Computing Model, Tier 2s also support the majority of user analysis
  - Each Tier 2 is split more or less 50% analysis, 50% simulation
- Both activities can easily fill whatever resources are there
  - Simulation is planned and can fill queues with weeks' worth of jobs
  - Analysis by users is dynamic: "I need this by tomorrow"
- Need a way to partition resources
- Grid tools are not so developed here, especially on the EGEE side
  - You will need to help, at least initially
  - EGEE PreProduction sites will be asked to set up an initial test configuration this summer; additional volunteers will be welcome
- Simulation and analysis have a difficult coexistence
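One possible way to think about partitioning is a share-based split in which slots left idle by one activity are lent to the other. The sketch below is a toy model under that assumption; the function name, the slot count and the demand figures are hypothetical, and a real site would express this in its batch scheduler configuration rather than in code like this.

```python
def allocate_slots(total_slots, demand, shares):
    """Give each activity up to its share, then lend unused slots to others."""
    # First pass: each activity gets at most its nominal share.
    alloc = {a: min(demand[a], int(total_slots * shares[a])) for a in shares}
    spare = total_slots - sum(alloc.values())
    # Second pass: hand spare slots to activities that still have demand.
    for a in shares:
        extra = min(spare, demand[a] - alloc[a])
        alloc[a] += extra
        spare -= extra
    return alloc


# Hypothetical snapshot: a long simulation backlog, few analysis jobs.
alloc = allocate_slots(
    total_slots=200,
    demand={"simulation": 5000, "analysis": 60},
    shares={"simulation": 0.5, "analysis": 0.5},
)
print(alloc)
```

The difficulty noted above remains: once simulation has borrowed idle slots, a burst of analysis jobs has to wait for them to drain, so lending trades utilisation against responsiveness.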
Simulation Operation
- Division of responsibility
- CMS Operation team focuses on global issues:
  - CMS application: configuration, version, result
  - Validation of results
  - Bookkeeping in CMS data catalogs
  - Ranks sites as yes/no based on works/does-not-work
  - No need to know details of dCache or DPM
- Tier 2 operation team focuses on fabric issues:
  - Racks, network switches, file servers
  - Cooling, heating, power, repairs
  - Operating system, middleware installation, SFT
  - Local failure tracking/fixing
  - No need to know details of, e.g., CMS dataset provenance tracking
Tier 2 required grid services
- Pass all SiteFunctionalTests
- Working Computing Element
  - Good, solid, powerful batch farm
- Also provide good storage
  - Size
  - Performance
  - Services
- SRM access
- FTS channels (run on a server at some T1, but transfer is always a two-end responsibility)
- Good I/O needed (for those I/O-bound cases)
- Good network (1 Gbit/sec) capability to move data fast on the WAN while data are being read/written fast locally
- CMS nominal Tier 2 (200 CPU, 200 TB): not trivial to operate
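To get a feeling for the network figure quoted above, a back-of-the-envelope calculation may help. The sketch assumes decimal units (1 TB = 10^12 bytes) and a fully used link unless an efficiency factor is given; the function name and the efficiency parameter are illustrative assumptions, not measured values.

```python
def transfer_hours(size_tb, link_gbit_per_s, efficiency=1.0):
    """Hours needed to move size_tb terabytes over a link of the given speed."""
    bits = size_tb * 1e12 * 8                       # decimal terabytes to bits
    seconds = bits / (link_gbit_per_s * 1e9 * efficiency)
    return seconds / 3600


# One terabyte over a fully used 1 Gbit/s link.
print(round(transfer_hours(1, 1), 2), "hours")
# The nominal 200 TB of storage over the same link, expressed in days.
print(round(transfer_hours(200, 1) / 24, 1), "days")
```

Even at full line rate a terabyte takes a couple of hours, and sustained rates on a shared WAN are normally lower, which is why the storage must keep serving local jobs while transfers run.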
CMS services needed
- CMS software distribution available
  - Simple, but critical
  - Central installation, but problems must be fixed locally
- Integration in CMS Data Management
  - Local (Trivial) File Catalog
  - PhEDEx (CMS's own high-level dataset transfer tool, will work on top of FTS)
  - DLS client (tracks in the CMS catalog which data are at the site)
- Even simulation jobs will need to access the CMS calibration and conditions database (how to make a realistic simulation otherwise?)
  - Need to deploy a Squid cache
  - CMS database calls are mapped, inside the application, to http calls
  - Squid is an open-source, widely used, robust, high-performance, easy-to-install http cache server
- Details on CMS services during tutorials
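The idea behind a local, rule-based file catalog can be sketched as a mapping from a site-independent logical file name to a site-specific physical one. The following is only an illustration: the rule list, the host name se.example.org and the storage paths are hypothetical, and the actual catalog format and its setup are covered in the tutorials.

```python
import re

# Hypothetical rules: (protocol, pattern on the logical name, physical template).
RULES = [
    ("direct", r"^/store/(.*)$", r"/storage/cms/store/\1"),
    ("srm", r"^/store/(.*)$", r"srm://se.example.org/cms/store/\1"),
]


def lfn_to_pfn(lfn, protocol, rules=RULES):
    """Return the physical name for lfn using the first matching rule."""
    for rule_protocol, pattern, template in rules:
        if rule_protocol == protocol and re.match(pattern, lfn):
            return re.sub(pattern, template, lfn)
    raise LookupError(f"no rule for protocol {protocol!r} matches {lfn!r}")


local_path = lfn_to_pfn("/store/mc/sample/file.root", "direct")
remote_url = lfn_to_pfn("/store/mc/sample/file.root", "srm")
print(local_path)
print(remote_url)
```

Because the mapping is a handful of static rules, no database service has to run at the site for name resolution; the site only maintains the rules to match its own storage layout.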
Bringing it all together
- What are the really important things a Tier 2 has to do in order to support CMS Simulation operation?
- The answer is not:
  - Have lots of hardware
  - Even if that helps
- Next slide
What CMS Tier 2s really need to do
- Sites need to be up and running and care for themselves
- They provide MC capacity as a service to remote operators and must be aware that those operators cannot reach in and fix problems. Users cannot log in on your worker nodes to run ps/dbg/kill!
  - You have to help
  - Fixing the SW area
  - Spotting hung jobs, forwarding error logs and traces
  - Spotting/removing bad nodes
  - Monitor hardware and remove it before it breaks
- A good Tier 2 is active, responsive, attentive, proactive
  - Spot problems before the users do
  - Fix them before the users have problems
  - Study usage patterns, talk to users, understand needs and plan for the future, report problems to CMS Computing
- A Tier 2 provides services to CMS globally, but also to the local community, and reflects the obligations of the local community towards CMS
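Spotting hung jobs is one of the tasks above that lends itself to a simple automated check. A minimal sketch, assuming the batch system can report wall-clock and CPU time per job: the record layout, the job ids and both thresholds are hypothetical and would need tuning at each site.

```python
def find_hung_jobs(jobs, min_wall_s=3600, min_cpu_fraction=0.05):
    """Return ids of jobs that have run long enough yet used almost no CPU."""
    hung = []
    for job in jobs:
        if job["wall_s"] < min_wall_s:
            continue  # too young to judge
        if job["cpu_s"] / job["wall_s"] < min_cpu_fraction:
            hung.append(job["id"])
    return hung


# Hypothetical snapshot of running jobs (times in seconds).
running = [
    {"id": 101, "wall_s": 7200, "cpu_s": 6900},  # healthy, CPU bound
    {"id": 102, "wall_s": 7200, "cpu_s": 30},    # probably stuck
    {"id": 103, "wall_s": 600, "cpu_s": 5},      # just started
]
suspects = find_hung_jobs(running)
print(suspects)
```

A low CPU fraction is only a hint: I/O-bound merge jobs can legitimately look idle, so a flagged job deserves a look at its logs before it is killed.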
Optional Service
- Some sites may host a "production operation unit" and so need to run a ProductionAgent
- Or may decide to run a ProductionAgent for the convenience of local users
- Support will be available for running a ProductionAgent locally
  - See tutorials
Conclusion
- Tier 2s are vital to the CMS physics program
- Supporting CMS Simulation Production is easy:
  - Just be a good grid site with a few easy additional services
- But it is also important, and difficult:
  - Computers may stay up in unattended mode
  - But quality of service usually does not
  - Continuous, proactive, intelligent care is needed
- You are the ones who will make your site productive and efficient
  - No matter how fancy the tools CMS develops centrally
- A Tier 2 will succeed or fail based on the dedication of the people running it