Simulation in a Distributed Computing Environment

S. Guatelli (1), P. Mendez Lorenzo (2), J. Moscicki (2), M. G. Pia (1)
(1) INFN Genova, Italy
(2) CERN, Geneva, Switzerland
IEEE Nuclear Science Symposium, San Diego, 30 October – 4 November 2006

Speed of Monte Carlo simulation
Speed of execution is often a concern in Monte Carlo simulation
There is often a trade-off between the precision of the simulation and the speed of execution
Typical use cases
– Semi-interactive response: detector design, optimisation, oncological radiotherapy
– Very long execution time: high statistics simulation, high precision simulation
Methods for faster simulation response
– Fast simulation
– Variance reduction techniques (event biasing)
– Inverse Monte Carlo methods
– Parallelisation

Requirements
Architectural requirements
– Transparent execution in sequential/parallel mode
– Transparent execution on a PC farm and on the Grid
Semi-interactive simulation, e.g. Geant4 brachytherapy
– Execution time for 20 M events: 5 hours
– Goal: execution time ~ few minutes
High statistics simulation, e.g. Geant4 medical_linac
– Execution time for 10^9 events: ~10 days
– Goal: execution time ~ few hours
Reference: sequential mode on a Pentium IV, 3 GHz

Features of this study
Geant4 in a distributed computing environment
– Architecture
– Implications on Geant4 simulation applications
Environments
– PC farm
– GRID
Tests
– Test on a single CPU
– Test on a dedicated farm (60 CPUs)
– Test on a farm shared with other users
– Test on the GRID (LCG)
Real life use case
– Geant4 brachytherapy Advanced Example

Parallel simulation execution: local cluster / GRID
Both applications have the same computing model
– A job consists of a number of independent tasks which may be executed in parallel
– The result of each task is a small data packet (a few kB), which is merged as the job runs
In a local cluster
– Computing resources are used for parallel execution
– Input data for the job must be available on site
– Typically there is a shared file system and a queuing system
– The network is fast
GRID computing: resources from multiple computing centres
– Typically there is no shared file system
– (Parts of) the input data must be replicated at remote sites
– The network connection is slower than within a local cluster

DIANE
Strategy to minimise the cost of migrating a Geant4 simulation to a distributed environment
– Master-Worker architectural pattern
– http://cern.ch/DIANE
The Geant4 application developer is shielded from the complexity of the underlying technology via DIANE

Distributed Simulation
[Figure: UML Deployment Diagram for Geant4 applications]
– Interface class which binds together the Geant4 application and the Master-Worker framework
– Original Geant4 application: source code unmodified
– G4Simulation class responsible for managing the simulation: random number seeds, Geant4 initialisation, termination
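One of the responsibilities listed above is managing random number seeds, because independent tasks must not replay the same random sequence. The slides do not say how the seeds are chosen, so the following is only a minimal sketch of one possible scheme. The function name `task_seeds` and the master seed value are hypothetical and are not part of DIANE or Geant4.

```python
import random

def task_seeds(master_seed: int, n_tasks: int) -> list[int]:
    """Derive one distinct seed per task from a single master seed.

    Hypothetical helper: illustrates the idea only, not the scheme
    actually used by the G4Simulation class.
    """
    rng = random.Random(master_seed)
    # sample() draws without replacement, so no two tasks share a seed.
    return rng.sample(range(1, 2**31 - 1), n_tasks)

# 180 tasks, as in some of the tests reported in these slides.
seeds = task_seeds(master_seed=12345, n_tasks=180)
print(len(seeds), len(set(seeds)))
```

Deriving all seeds from one master seed keeps a job reproducible: rerunning with the same master seed gives every task the same seed again.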

Practical example: Geant4 simulation with analysis
Each task produces a file with histograms
The job result is the sum of the histograms produced by the tasks
Master-worker model
– client starts a job
– workers perform tasks and produce histograms
– master integrates the results
Distributed Processing for Geant4 Applications
– task = N events
– job = M tasks
– tasks may be executed in parallel
– tasks produce histograms/ntuples
– task output is automatically combined (add histograms, append ntuples)
Responsibilities
– Master steers the execution of the job, splits the job and merges the results
– Worker initializes the Geant4 application and executes macros
– Client gets the results
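The split-and-merge model described above can be sketched in a few lines. This is an illustration under simple assumptions, not DIANE code: histograms are represented as plain lists of bin contents with identical binning, and the names `split_job` and `merge_histograms` are hypothetical.

```python
def split_job(total_events: int, n_tasks: int) -> list[int]:
    """Split a job of total_events into n_tasks tasks of near-equal size."""
    base, extra = divmod(total_events, n_tasks)
    # The first `extra` tasks get one more event so the sizes sum exactly.
    return [base + 1 if i < extra else base for i in range(n_tasks)]

def merge_histograms(histograms: list[list[float]]) -> list[float]:
    """Add histograms bin by bin (all must share the same binning)."""
    return [sum(bins) for bins in zip(*histograms)]

# 20 M events in 180 tasks, the configuration used in several tests here.
events_per_task = split_job(20_000_000, 180)
print(sum(events_per_task), min(events_per_task), max(events_per_task))

# Three toy task outputs merged into the job result.
job_histogram = merge_histograms([[1, 0, 2], [0, 3, 1], [2, 2, 2]])
print(job_histogram)
```

Because the merge is a plain sum, the order in which task results arrive does not matter, which is what allows the master to merge them as the job runs.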

Overhead at initialisation/termination
Test on a single dedicated CPU (Intel® Pentium IV, 3.00 GHz)
Study of execution via DIANE w.r.t. sequential execution: run 1 event
– Standalone application: 4.6 ± 0.2 s
– Application via DIANE, simulation only: 8.8 ± 0.8 s
– Application via DIANE, with analysis integration: 9.5 ± 0.5 s
Overhead: ~ 5 s, negligible in a high statistics job
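The "~ 5 s" figure follows from the one-event timings, and its weight can be compared with the 16646 s sequential brachytherapy run reported later in these slides. The short calculation below is a sketch that assumes the overhead is a fixed cost per job, independent of the number of events; the slides do not state this explicitly.

```python
# One-event timings from the slide (central values only).
standalone_s = 4.6   # standalone application
via_diane_s = 9.5    # via DIANE, with analysis integration

# Assumption: the difference is a fixed initialisation/termination cost.
overhead_s = via_diane_s - standalone_s

# Sequential time of the 20 M event brachytherapy production (from the slides).
sequential_job_s = 16646.0
overhead_fraction = overhead_s / sequential_job_s

print(f"overhead: {overhead_s:.1f} s")
print(f"share of the sequential job: {100 * overhead_fraction:.3f} %")
```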

Farm: execution time and efficiency
Dedicated farm: 30 identical bi-processors (Pentium IV, 3 GHz)
– Thanks to Regional Operation Centre (ROC) Team, Taiwan
– Thanks to Hurng-Chun Lee (Academia Sinica Grid Computing Center, Taiwan)
Load balancing: optimisation of the number of tasks and workers

Optimizing the number of tasks
The job ends when all the tasks have been executed by the workers
If the job is split into a higher number of tasks, the chance that the workers finish their tasks at the same time is higher
Note: the overall time of the job is determined by the last worker to finish the last task
[Figure: example of a job that can be improved from a performance point of view; worker number vs. time (seconds)]
[Figure: example of good job balancing; worker number vs. time (seconds)]
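The effect of task granularity can be illustrated with a toy model in which each worker pulls the next task as soon as it is free. The worker speeds and the amount of work below are invented for illustration and are not measurements from this study. The function name `makespan` is hypothetical, and per-task overhead is deliberately ignored.

```python
import heapq

def makespan(total_work: float, n_tasks: int, speeds: list[float]) -> float:
    """Time at which the last task finishes when idle workers pull tasks."""
    task_work = total_work / n_tasks
    # Heap of (time the worker becomes free, worker index).
    free_at = [(0.0, w) for w in range(len(speeds))]
    heapq.heapify(free_at)
    finish = 0.0
    for _ in range(n_tasks):
        t, w = heapq.heappop(free_at)          # earliest available worker
        done = t + task_work / speeds[w]       # slower workers take longer
        finish = max(finish, done)
        heapq.heappush(free_at, (done, w))
    return finish

# Three full-speed workers and one running at half speed (assumed values).
speeds = [1.0, 1.0, 1.0, 0.5]
coarse = makespan(total_work=120.0, n_tasks=4, speeds=speeds)
fine = makespan(total_work=120.0, n_tasks=24, speeds=speeds)
print(coarse, fine)  # 60.0 35.0
```

With one task per worker, the job waits for the slow worker. With six times as many tasks, the fast workers absorb most of the work and all workers finish at nearly the same time. In a real job each task also carries a fixed cost, so the number of tasks cannot be increased without limit.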

Farm shared with other users
Real-life case: farm shared with other users
Execution in parallel mode on 5 workers of CERN LSF, with DIANE used as intermediate layer
Preliminary!
– The load of the cluster changes quickly in time
– The conditions of the test are not reproducible
– Highly variable performance

Results in real life use case
Required production of Brachytherapy: 20 M events
– Sequential mode: 16646 s (~ 4 h 38 min) on an Intel® Pentium IV, 3.00 GHz
– The same simulation runs in 5 min in parallel on 56 CPUs: appropriate for clinical usage
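These two timings give a rough speed-up and parallel efficiency. The sketch below takes the parallel time as exactly 300 s, which is an assumption: the slide quotes it only as about 5 minutes, so the resulting numbers are indicative at best.

```python
def speedup(t_sequential: float, t_parallel: float) -> float:
    """Ratio of sequential to parallel execution time."""
    return t_sequential / t_parallel

def efficiency(t_sequential: float, t_parallel: float, n_cpus: int) -> float:
    """Speed-up divided by the number of CPUs (1.0 would be ideal scaling)."""
    return speedup(t_sequential, t_parallel) / n_cpus

t_seq = 16646.0   # s, sequential run from the slide
t_par = 300.0     # s, "5 min" taken at face value (assumption)
n_cpus = 56

s = speedup(t_seq, t_par)
e = efficiency(t_seq, t_par, n_cpus)
print(f"speed-up ~ {s:.1f}, efficiency ~ {e:.2f}")
```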

How the GRID load changes
Execution time of Brachytherapy in two different conditions of the GRID
20 M events, 60 workers initialized, 360 tasks
[Figures: worker number vs. time (seconds) for the two runs]
Very different results!
– The load of the GRID changes quickly in time
– The conditions of the test are not reproducible

Test results (LCG) with/without DIANE
[Figure: execution on the GRID, without DIANE; worker number vs. time (seconds)]
[Figure: execution on the GRID through DIANE, 20 M events, 180 tasks, 30 workers; worker number vs. time (seconds)]
Without DIANE: 2 jobs not successful due to set-up problems of workers
Through DIANE: all the tasks are executed successfully on 22 workers

Farm/GRID execution
Brachytherapy application, 20 M events, 180 tasks
– Taipei cluster: 29 machines, 734 s ~ 12 minutes
– GRID: 27 machines, 1517 s ~ 25 minutes
Preliminary indication
The conditions are not reproducible

Lessons learned
DIANE as intermediate layer
– Transparency
– Good separation of the subsystems
– Good management of CPU resources
– Negligible overhead
Load balancing
– A relatively large number of tasks increases the efficiency of parallel execution
– Trade-off between optimisation of task splitting and overhead introduced
Controlled and real-life situations are quite different on a farm
– Need a dedicated farm for critical usage (e.g. hospital)
Grid
– Highly variable environment
– Not mature for critical usage yet
– Work in progress, details still to be understood quantitatively

Conclusions
General solution for Geant4 simulation in a distributed computing environment
– transparent sequential/parallel application
– transparent execution on a local farm or on the Grid
– user code is the same
Quantitative results
– on-going work to understand details
Acknowledgments to:
– M. Lamanna, L. Moneta, A. Pfeiffer (CERN)
– LCG teams at CERN
– Hurng-Chun Lee (ASGC, Taiwan)
– Regional Operation Centre Team of Taiwan