The High Performance Simulation Project: Status and short term plans
17th April 2013
Federico Carminati

SFT SoFTware Development for Experiments

Where are we now?
- Present status
  - Several investigations of possible alternatives for “extremely parallel, no lock” transport
  - Not much code written, several blackboards full
  - Some investigation on a simplified but fully vectorized model to prove vectorization gain
  - New design in preparation

Major points under discussion
- How to minimise locks and maximise local handling of particles
- How to handle hit and digit structures
- How to preserve the history of the particles
  - This point seems more difficult at the moment and it requires more design
- What is the possible speedup obtained by microparallelisation
- What are the bottlenecks and opportunities with parallel I/O

Current design
[Diagram. Labels: p array (thread local); Logical Volume; Transport; Logical Volume basket; p* array; Output particle store; Dispatcher thread]

Features
- Pros
  - Good parallel performance but…
  - Easy recording of particle history
  - Limited data movement
- Cons
  - Possible limited scalability with large number of cores
  - Non-locality of particles in memory
  - Difficult to introduce hits and digits while maintaining locality

Design under study
[Diagram. Labels: List of logical volumes; Logical Volume lv; List of baskets for lv; Input particle list (p array); Output particle list (p array); History; Hits; Sensitive volumes; Active event list; List of active events for lv; Event ev; Digits for lv and event ev]

Design under study
[Diagram. Same structure as the previous slide, with the threads added: Ev build thread (Events); Transport thread; Digitizer thread]

Design under study
[Diagram. Same structure as the previous slides, with two annotations: “Continuously rotated” and “Flushed at the end of event”]

Features
- Pros
  - Excellent potential locality
  - Easy to introduce hits and digits
- Cons
  - One more copy (but it is done in parallel)
  - More difficult to preserve particle history (it is non-local!) and introduce particle pruning

Processing flow I
- The transport thread takes particles from the input buffer and transports them until they stop, interact or exit from the volume
  - At this point they are inserted in the output particle buffer for further processing
  - If the LV is a sensitive detector, hits are generated and stored per LV basket
  - An LV basket history record is kept (we have no idea how for the moment, we need more blackboard work!)
- Input and output particle buffers are fixed-size structures, which can however evolve (be optimised) during simulation

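The transport step described on this slide can be sketched in a few lines. This is only an illustration under stated assumptions: the `Particle` record, the `toy_step` function and the use of plain Python lists in place of fixed-size buffers are all hypothetical, and the history record is left out because its design is still open.

```python
from dataclasses import dataclass


@dataclass
class Particle:
    event: int             # id of the event the particle belongs to
    energy: float          # remaining energy, arbitrary units
    status: str = "alive"  # "alive", "stopped", "interacted" or "exited"


def toy_step(p):
    """Hypothetical stand-in for one physics + geometry step.

    Deposits up to one energy unit and stops the particle when it runs out.
    """
    deposit = min(p.energy, 1.0)
    p.energy -= deposit
    if p.energy <= 0.0:
        p.status = "stopped"
    return deposit


def transport(input_buffer, output_buffer, hits, sensitive, step=toy_step):
    """Transport each particle until it is no longer alive in this volume."""
    for p in input_buffer:
        while p.status == "alive":
            deposit = step(p)
            if sensitive and deposit > 0.0:
                hits.append((p.event, deposit))  # hits stay with the LV basket
        output_buffer.append(p)  # handed on for further processing
```

The input buffer is deliberately left untouched here, so that a later pass can still inspect it before it is reused.
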
Design under study
[Diagram. Same structure as the previous slides, with the particle lists of a basket marked “✗ empty!” and “✔ full!”]

Processing flow II
- When an input particle buffer is exhausted
  - It is marked as such by the transport thread
  - No lock if it is kept assigned to the LV basket, but possible memory waste
  - Can be passed to a queue of “used baskets”, but this implies a lock
  - In case of a flag, the dispatcher thread has to scan all LVs -> all baskets -> all output buffers to know which ones are used, but this can be optimized
- Used buffers are scanned by the dispatcher thread, which updates a global track counter per event
  - -1 for each stopped “dead” particle
- And then they are declared “empty” to be reused
- The transport thread picks up another “ready” basket

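A minimal sketch of the flag-based variant of this flow, assuming (hypothetically) that a buffer carries a `state` flag and a list of `(event, status)` pairs, and that the per-event track counter is a `collections.Counter`. The nested loop is the unoptimized scan over all LVs, baskets and buffers mentioned on the slide.

```python
from collections import Counter

EMPTY, READY, USED = "empty", "ready", "used"


class Buffer:
    def __init__(self, particles=(), state=EMPTY):
        self.particles = list(particles)  # (event id, status) pairs
        self.state = state                # flag set by the transport thread


def recycle_used(logical_volumes, live_tracks):
    """Dispatcher pass: scan every buffer, account for dead particles and
    declare used buffers empty. Returns the number of buffers recycled."""
    recycled = 0
    for baskets in logical_volumes.values():
        for basket in baskets:
            for buf in basket:
                if buf.state != USED:
                    continue
                for event, status in buf.particles:
                    if status == "stopped":
                        live_tracks[event] -= 1  # -1 per dead particle
                buf.particles.clear()
                buf.state = EMPTY                # ready for reuse
                recycled += 1
    return recycled
```

No lock is taken in this sketch; it relies on the assumption that only the transport thread sets a buffer to `USED` and only the dispatcher resets it to `EMPTY`.
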
Processing flow III
- When an output particle buffer is full, it is marked as such
  - Again queue insertion or just a flag
  - In case of a flag, the dispatcher thread has to scan all LVs -> all baskets -> all output buffers to know which ones are full, but this can be optimized
- The transport thread picks another empty output buffer
- The dispatcher thread copies particles from the full output particle buffer to LV-specific input particle buffers
  - Increasing the global particle event counter
- When an input particle buffer is full, the dispatcher declares it “ready to be transported”

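The dispatcher copy step could look roughly like the sketch below. The names are hypothetical, particles are reduced to `(event, next_lv)` pairs, and the buffer capacity is an arbitrary illustrative value. It also assumes that every particle in a full output buffer still needs transport, which the slides do not state explicitly.

```python
def dispatch(output_buffer, filling, ready, event_counter, capacity=4):
    """Copy particles from a full output buffer into per-LV input buffers.

    output_buffer: list of (event, next_lv) pairs
    filling:       dict mapping LV name -> input buffer currently being filled
    ready:         list of (LV name, buffer) declared ready to be transported
    event_counter: dict mapping event id -> number of particles dispatched
    """
    for event, next_lv in output_buffer:
        buf = filling.setdefault(next_lv, [])
        buf.append((event, next_lv))
        event_counter[event] = event_counter.get(event, 0) + 1
        if len(buf) == capacity:
            ready.append((next_lv, buf))  # full: ready to be transported
            filling[next_lv] = []         # start a fresh input buffer
    output_buffer.clear()                 # output buffer is empty again
```
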
Design under study
[Diagram. Labels: Full buffer list; Empty buffer list; Ready buffer list; List of logical volumes; Logical Volume lv; List of baskets for lv; Input particle list (p array); Output particle list (p array)]

Processing flow IV
- Note an important point
- The LV basket structure has input and output particle buffers and hits and history buffers
- Input and output particle buffers are
  - Multi-event
  - Volatile, they get emptied and filled during transport of a single event
- Hits and history buffers are
  - Per event
  - Permanent during the transport of a single event
- A basket of an LV can be handled by different threads successively, each one with new input and output buffers
- …but all these threads will add to the hits and history data structures until the event is flushed

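The distinction between volatile and per-event storage can be made concrete with a small data-structure sketch. The class and method names are hypothetical; the only point illustrated is that particle buffers are swapped between passes while hits and history accumulate per event until that event is flushed.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Basket:
    # Volatile, multi-event: replaced each time a thread picks up the basket.
    input_particles: list = field(default_factory=list)
    output_particles: list = field(default_factory=list)
    # Per event, kept until the event is flushed: event id -> records.
    hits: dict = field(default_factory=lambda: defaultdict(list))
    history: dict = field(default_factory=lambda: defaultdict(list))

    def attach_buffers(self, new_input, new_output):
        """A new transport pass starts with fresh particle buffers."""
        self.input_particles = new_input
        self.output_particles = new_output

    def flush_event(self, event):
        """Hand over and drop everything stored for one finished event."""
        return self.hits.pop(event, []), self.history.pop(event, [])
```

For example, two successive passes on the same basket can each call `attach_buffers` with their own lists and append to `basket.hits[event]`; a later `flush_event(event)` returns the hits of both passes.
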
Processing flow V
- When an event is finished, the digitizer thread kicks in, scans all the hits in all the baskets of all the LVs and digitises them, inserting them in the LV event->digit structure
- When this is over, the event is built into the event structure (to be designed!) by the event builder thread
- After that, the history for this event is assembled by the same thread
- Then the event is output

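As a rough illustration of the digitizer scan, the sketch below assumes that each basket stores its hits as a mapping from event id to a list of energy deposits, and that digitisation is simply a sum per LV. Both are placeholders: the real hit and digit structures are not defined in these slides.

```python
def digitize_event(event, logical_volumes, digitize=sum):
    """Scan the hits of one finished event in all baskets of all LVs.

    logical_volumes: dict mapping LV name -> list of baskets, where each
                     basket is a dict mapping event id -> list of deposits
    Returns a dict mapping LV name -> digit for the given event.
    """
    digits = {}
    for lv, baskets in logical_volumes.items():
        hits = [h for basket in baskets for h in basket.get(event, [])]
        if hits:  # only LVs that actually recorded hits get a digit
            digits[lv] = digitize(hits)
    return digits
```
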
Questions?
- How many dispatcher, digitizer and event-builder threads?
  - Difficult to say, we need some more quantitative design work
  - Measurements with G4 simulations could help
- Transport thread numbers will have to adapt to the size of the simulation and of the detector
  - In ATLAS for instance 50% of the time is spent in 0.75% of the volumes
  - Threads could be distributed proportionally to the time spent in the different LVs

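One way to distribute a fixed number of transport threads proportionally to the time spent per LV is the largest-remainder method, sketched below. The method is a choice made for this illustration, not something the slides prescribe, and the LV names and time shares in the usage note are made up.

```python
def allocate_threads(time_share, n_threads):
    """Split n_threads across LVs in proportion to their time share.

    time_share: dict mapping LV name -> time spent (any consistent unit)
    Uses the largest-remainder method so the result sums to n_threads.
    """
    total = sum(time_share.values())
    quotas = {lv: n_threads * t / total for lv, t in time_share.items()}
    alloc = {lv: int(q) for lv, q in quotas.items()}  # integer part first
    leftover = n_threads - sum(alloc.values())
    # Hand the remaining threads to the LVs with the largest fractions.
    by_remainder = sorted(quotas, key=lambda lv: quotas[lv] - alloc[lv],
                          reverse=True)
    for lv in by_remainder[:leftover]:
        alloc[lv] += 1
    return alloc
```

With hypothetical shares of 50, 30 and 20 for three LVs and 8 threads, the exact quotas are 4.0, 2.4 and 1.6, so the third LV receives the one leftover thread. An LV with a very small share can end up with zero threads, which a real scheduler would have to handle separately.
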
Simple observation: HEP transport is mostly local!
- 50 per cent of the time spent in 50/7100 volumes
- Locality not exploited by the classical transportation approach
- Existing code very inefficient (0.6-0.8 IPC)
- Cache misses due to fragmented code

Questions?
- What about memory?
- Fortunately we do not have “that many” LVs

Detector   Physical volumes   Logical volumes
ALICE             4,354,735             4,764
ATLAS            29,046,966             7,143
CMS               1,166,318             1,537
LHCb             18,491,756               709

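The memory argument rests on the basket structures being attached to logical volumes rather than physical ones. The short calculation below uses the numbers from the table to show how many physical volumes share each logical volume on average; the function name is just for illustration.

```python
# (physical volumes, logical volumes) per detector, from the table above
VOLUMES = {
    "ALICE": (4_354_735, 4_764),
    "ATLAS": (29_046_966, 7_143),
    "CMS": (1_166_318, 1_537),
    "LHCb": (18_491_756, 709),
}


def physical_per_logical(volumes):
    """Average number of physical volumes placed per logical volume."""
    return {det: phys / log for det, (phys, log) in volumes.items()}


ratios = physical_per_logical(VOLUMES)
for det, ratio in ratios.items():
    print(f"{det:6s} {ratio:10.0f} physical volumes per logical volume")
```

Per-LV bookkeeping therefore scales with thousands of entries at most, not with the millions of physical volumes.
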
Grand strategy
[Diagram. Labels: Simulation job; Create vectors; Use vectors; Basic algorithms. Annotations: “We are concentrating here” and “But we should look also here”]

Short term tasks
- Continue the design work, essential before any more substantial implementation
  - This is the most important task at the moment
  - We have to evaluate the potential bottlenecks before starting the implementation
- Implement the new design and evaluate it against the first
- Demonstrate speedup of some chosen geometry routines
  - Both on x86 CPUs and GPUs
- Demonstrate speedup of some chosen physics methods
  - Particularly in the EM domain

Possible timeline
- Summer 2013
  - Implement a prototype according to the present design
  - Get essential numbers from G4 (to be defined!)
    - Total particles in a shower, profile of development of a shower in terms of multiplicity, locality of transport, etc.
  - Vectorize, GPU-ize, Phi-ize at least three geometry classes (simple, intermediate, hard)
  - Vectorize, GPU-ize, Phi-ize at least a couple of simplified EM methods (from G4?)
- Fall 2013
  - Interface the methods above to the prototype to realise a first prototype of vectorized transport

Thank you!