EicMC: EIC-specific Google Protocol Buffers Monte-Carlo file format

Alexander Kiselev
EIC R&D Software Consortium Meeting, BNL
February 9, 2017

Motivation

- Our October 2016 meeting:
  - Want to exchange MCEG files in a non-ROOT and non-ASCII format
  - Bring all existing EIC MC generator files to a “common denominator”
  -> suggestion: adapt the existing ProMC library to do the job
- Certain progress in this direction was made at the beginning:
  - Generator-neutral part is incorporated in the EicRoot framework (as an extra input file format for “pure” GEANT transport purposes)
  - EIC MCEG-specific info encoding in ProMC faced difficulties:
    - ProMC is primarily Pythia-oriented -> no elegant way to extend .proto files to maintain, say, MILOU-specific event-per-event variables
    - A few other small (and partly fake) issues identified with the ProMC format (floating-point precision, default 64k record limit, inefficient storage of typically small EIC events, external dependencies, etc.)

Google Protocol Buffers

- Active project, maintained and internally used by Google
- Long-term support guaranteed as long as Google is there

“Protocol buffers are Google's language-neutral, platform-neutral, extensible mechanism for serializing structured data – think XML, but smaller, faster, and simpler. You define how you want your data to be structured once, then you can use special generated source code to easily write and read your structured data to and from a variety of data streams and using a variety of languages.”

Message definition (.proto file):

    message MyMessage {
      string name = 1;          // field numbers identify fields on the wire
      int32 id = 2;
      repeated float data = 3;  // variable-length array
    }

Usage of the generated C++ class:

    message->set_name("Crap");
    message->set_id(777);
    message->add_data(3.14159);
    message->add_data(2.71828);
    message->SerializeToOstream(&stream);  // or such

Google Protocol Buffers

- Provide a certain degree of flexibility in data description:
  - All basic data types as well as nested messages (structures)
  - Some sort of unions and even STL maps
- Custom data definition language (.proto files) and a converter to C++ and other languages (resembles ADAMO, ROOT dictionaries, etc.)
- As long as simple rules are observed when extending the format:
  - Backward compatibility is maintained (add new variables to a .proto file, recompile -> new executable is still able to read old files with missing variables)
  - Forward compatibility is maintained (add new variables to a .proto file, create a new file in this “extended” format -> old executable is still able to read the part of new files which it is aware of)

-> yet looks like a step back towards the stone age compared to ROOT; fine

EicMC in brief

- Small standalone C++ library (~3k lines of code total)
- No external dependencies on the user (import) side
  - Except for the Google protobuf libraries, of course
  - The MCEG -> EicMC converter is realized through the eic-smear interface though, therefore ROOT is required
- Portability: tried the codes out on SL6 and OS X Mavericks
- ~512 MB max single event size is the only “real” built-in limit

-> the rest of the presentation will be “EicMC vs ProMC” in a “snapshot” fashion

Binary file layout

ProMC vs EicMC at a glance:
- ProMC: individually zipped event records in Google protobuf message format, with a top-level directory structure provided by the third-party library (with its own issues)
- EicMC: “native” stream of length-delimited event and service records (sparsification tables, direct access catalogues) in Google protobuf message format

ProMC:
- Event records can be decoded independently (so by definition no complications with direct access mode)

EicMC:
- Event records are independent from each other, but require extra information (sparsification tables) for decoding
- Optional compression using the respective flavor of Google protobuf stream is possible (and events can be merged together while zipping)
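
A “length-delimited” record stream can be sketched as follows. This is a minimal illustration of the usual protobuf convention (a base-128 varint length prefix followed by the serialized message bytes), not the EicMC code; the slides do not show the actual framing, and the helper names `AppendRecord` and `ReadRecords` are made up for this sketch. Payloads are plain strings standing in for serialized messages.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Append `value` as a base-128 varint: 7 payload bits per byte,
// most significant bit set on every byte except the last one.
void AppendVarint(std::string& out, std::uint64_t value) {
  while (value >= 0x80) {
    out.push_back(static_cast<char>((value & 0x7F) | 0x80));
    value >>= 7;
  }
  out.push_back(static_cast<char>(value));
}

// Read one varint starting at `pos` and advance `pos` past it.
std::uint64_t ReadVarint(const std::string& in, std::size_t& pos) {
  std::uint64_t value = 0;
  int shift = 0;
  while (true) {
    assert(pos < in.size() && shift < 64);  // truncated or oversized varint
    const std::uint8_t byte = static_cast<std::uint8_t>(in[pos++]);
    value |= static_cast<std::uint64_t>(byte & 0x7F) << shift;
    if ((byte & 0x80) == 0) return value;
    shift += 7;
  }
}

// Frame one serialized record: length prefix, then the payload bytes.
void AppendRecord(std::string& stream, const std::string& payload) {
  AppendVarint(stream, payload.size());
  stream += payload;
}

// Split a stream back into its records (hypothetical helper).
std::vector<std::string> ReadRecords(const std::string& stream) {
  std::vector<std::string> records;
  std::size_t pos = 0;
  while (pos < stream.size()) {
    const std::uint64_t length = ReadVarint(stream, pos);
    assert(length <= stream.size() - pos);  // record must fit in the stream
    records.push_back(stream.substr(pos, static_cast<std::size_t>(length)));
    pos += static_cast<std::size_t>(length);
  }
  return records;
}
```

With this framing a reader can skip a record it does not need by reading only the length prefix, which is what makes service records (tables, catalogues) cheap to interleave with events.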

Direct access to the event records

ProMC vs EicMC at a glance:
- ProMC: top-level “linear” directory structure and respective Skip() and Seek() calls provided by the third-party library
- EicMC: multi-dimensional direct access tables are injected in the event message stream as separate custom records

ProMC:
- Should be faster (linear catalogue structure with direct access to individual zipped event records) compared to the EicMC default mode

EicMC:
- Must be much slower in default mode (layered structure with direct access to the typically coalesced chunks of zipped event records)
- Should however be “infinitely” scalable
- If scalability and file size are of no concern, a fall-back a la ProMC mode can be imitated (individually zipped events and 1D addressing)
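
Why access to coalesced chunks costs more than access to individually zipped events can be seen in a small sketch. It assumes a single-layer catalogue (the slides mention multi-dimensional tables, which are not reproduced here); the types `ChunkEntry` and `Location` and the function `Locate` are hypothetical names for illustration only.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// One catalogue entry per zipped chunk (assumed layout, for illustration).
struct ChunkEntry {
  std::uint64_t first_event;  // index of the first event stored in the chunk
  std::uint64_t byte_offset;  // where the chunk starts in the file
};

// Result of a lookup: where to seek, and how many events to unpack and skip.
struct Location {
  std::uint64_t byte_offset;
  std::uint64_t events_to_skip;
};

// Find the chunk holding `event_id` in a catalogue sorted by first_event.
Location Locate(const std::vector<ChunkEntry>& catalogue, std::uint64_t event_id) {
  assert(!catalogue.empty() && event_id >= catalogue.front().first_event);
  // upper_bound returns the first chunk that starts after event_id;
  // the chunk we want is the one just before it.
  auto it = std::upper_bound(
      catalogue.begin(), catalogue.end(), event_id,
      [](std::uint64_t id, const ChunkEntry& entry) { return id < entry.first_event; });
  --it;
  return Location{it->byte_offset, event_id - it->first_event};
}
```

The `events_to_skip` field is the extra cost: the whole chunk has to be unzipped and the preceding events stepped over, whereas with one event per zipped record (the ProMC-like fall-back) it is always zero.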

Self-description (whatever it means)

ProMC vs EicMC at a glance:
- ProMC: relevant collection of .proto files can be included by hand as individual zipped records and can be retrieved later
- EicMC: base Record message structure matching the current library is automatically included in the file header

EicMC:
- Technically the Record structure is very similar to a .proto file
- There are user calls provided, which allow one of the following:
  - Build the message structure on the fly (reflection) and retrieve variables by name -> hardly of any practical use, but allows one to claim the “true” self-description feature of this file format
  - Dump a proper .proto file (which in addition to the Record message contains the gzip file header extension layout description), which can be used to compile a library with the message structure exactly matching this particular binary file

MCEG event records

- Common to all generators:
  - Momentum components
  - Vertex coordinates and time
  - Status, PDG, mother(s), daughter(s)
- Nasty part: event-per-event generator-specific variables
  - Hardcode them all as event sub-headers in the .proto file?
  - Use some sort of {tag, value} maps?

-> NB: in the “ideal” ROOT-based eic-smear world these are inherited C++ classes

Philosophy of MCEG info inclusion

ProMC vs EicMC at a glance:
- ProMC: create separate .proto files for different MC generators (see promc & nlo examples) and compile custom library version(s) accordingly
- EicMC: use an identical .proto file for all generators; generator-specific info for individual events is added via sparsified {tag, value} maps

    event->AddFloatValue("trueY", 0.95);

ProMC:
- This default implementation does not allow two different formats to be compiled in at once (which definitely limits the usability)
- Optional: add plain {tag, value} arrays on an event-per-event basis
  - Would be fine for the file header; for individual events must be pretty inefficient (?)

EicMC:
- Converter for all so far known DIS MCEG already implemented
  - Requires ROOT and eic-smear
- Both float and int64 values, as well as tagged arrays, can be packed
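
One way such a sparsified {tag, value} map could work is sketched below: tag strings are stored once in a per-file dictionary, and each event carries only compact (tag id, value) pairs for the variables it actually has. This is an assumption about the general idea, not the EicMC implementation; `TagDictionary` and `SketchEvent` are hypothetical, and this `AddFloatValue` takes an explicit dictionary argument that the real call does not.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Per-file dictionary: each tag string is stored once and mapped to a small id.
class TagDictionary {
 public:
  // Return the id of `tag`, registering it on first use.
  std::uint32_t IdFor(const std::string& tag) {
    auto it = ids_.find(tag);
    if (it != ids_.end()) return it->second;
    const std::uint32_t id = static_cast<std::uint32_t>(names_.size());
    ids_.emplace(tag, id);
    names_.push_back(tag);
    return id;
  }

  // Translate an id back to its tag string.
  const std::string& NameOf(std::uint32_t id) const {
    assert(id < names_.size());
    return names_[id];
  }

  std::size_t size() const { return names_.size(); }

 private:
  std::unordered_map<std::string, std::uint32_t> ids_;
  std::vector<std::string> names_;
};

// Stand-in for an event record: only (tag id, value) pairs are kept per event.
struct SketchEvent {
  std::vector<std::pair<std::uint32_t, double>> float_values;

  void AddFloatValue(TagDictionary& dict, const std::string& tag, double value) {
    float_values.emplace_back(dict.IdFor(tag), value);
  }
};
```

Compared with plain {tag, value} arrays repeated in every event, the string is paid for once per file, and events that do not define a variable store nothing for it.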

Floating point precision

ProMC vs EicMC at a glance:
- ProMC: momentum and coordinate values are stored as signed integers in units of user-specified resolution
- EicMC: both a la ProMC storage mode and double (single) precision are possible and can be selected via user calls when the file is created

EicMC:
- Double-precision floating point user interface
- … therefore 64-bit default storage mode for {px, py, pz; x, y, z, t}
  - unless the actually provided values are “by mistake” given in single precision (which can be checked easily), then stored in a 32-bit floats “basket”
- ProMC-like storage mode (“fixed” precision, say “keep momenta with precision up to 1 keV/c only”) is also possible …
  - … in which case values are stored in a variable-length 64-bit integers “basket”
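
The “fixed” precision mode can be illustrated with a short sketch: a value is divided by the chosen resolution, rounded to a signed 64-bit integer, and that integer is stored in variable-length form so that small values take few bytes. The zigzag mapping used here is the one protobuf applies to its `sint64` type; whether EicMC uses exactly this field type is an assumption, as are the function names and the choice of GeV/c as the unit.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Round `value` to an integer number of `resolution` steps.
std::int64_t Quantize(double value, double resolution) {
  assert(resolution > 0.0);
  return std::llround(value / resolution);
}

// Recover the (rounded) value from its integer representation.
double Dequantize(std::int64_t steps, double resolution) {
  return static_cast<double>(steps) * resolution;
}

// Zigzag mapping: 0, -1, 1, -2, ... -> 0, 1, 2, 3, ...
// so that small negative numbers also become small unsigned numbers.
// (Relies on arithmetic right shift of negative values, as protobuf does.)
std::uint64_t ZigZag(std::int64_t v) {
  return (static_cast<std::uint64_t>(v) << 1) ^ static_cast<std::uint64_t>(v >> 63);
}

// Inverse of ZigZag.
std::int64_t UnZigZag(std::uint64_t u) {
  return static_cast<std::int64_t>(u >> 1) ^ -static_cast<std::int64_t>(u & 1);
}

// Number of bytes a base-128 varint needs for `u` (7 payload bits per byte).
int VarintSize(std::uint64_t u) {
  int bytes = 1;
  while (u >= 0x80) {
    u >>= 7;
    ++bytes;
  }
  return bytes;
}
```

For example, with a resolution of 1 keV/c (1e-6 in GeV/c units) a momentum component of about -1.23 GeV/c becomes an integer of order 10^6 and fits in 4 bytes instead of the 8 bytes of a double, at the price of a rounding error of at most half the resolution.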

User interface

ProMC vs EicMC at a glance:
- ProMC: internal Google protobuf message structure is partly exposed to the end user
- EicMC: internal event structure is completely hidden from the end user

EicMC:
- Basically the whole collection of expected high-level calls is provided:
  - GetNextEvent()
  - event->GetParticleCount()
  - event->GetParticle(i)
  - particle->GetPx(), etc.
- … while the event is automatically unpacked from a protobuf message “in the background”

Packaging

ProMC vs EicMC at a glance:
- ProMC: provided with a local copy of the Google protocol buffer software as well as a local copy of the third-party zipping library, etc.
- EicMC: bare custom codes; expects the Google protocol buffer software (as well as optionally ROOT & eic-smear) to be pre-installed

EicMC:
- Can be changed of course; but that is the status as of today

Sparsification and compression

ProMC vs EicMC at a glance:
- ProMC: uses a relatively simple event message layout almost without pre-processing; lets zlib do the compression job
- EicMC: uses a somewhat over-complicated event message layout with heavy (optional) sparsification; zlib compression is also optional

EicMC:
- Can sparsify status code sequences, PDG entry sequences, 0.0 values (primary vertices in particular), duplicate (up to the sign) momentum component values, duplicate vertex coordinates, beam particles, etc.
- Configurable zlib compression of multi-event chunks is possible (and is the default mode) on top of this pre-processing

-> whether this complication is really needed remains a question; but it does not hurt (and an “easy” packing mode is still possible)
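
As an illustration of what sparsifying a status code sequence could look like, the sketch below uses plain run-length encoding. The slides do not say which scheme EicMC applies, so this is a stand-in technique chosen for simplicity, and the function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// A run: (status code, number of consecutive particles carrying it).
using Run = std::pair<int, std::uint32_t>;

// Collapse consecutive identical status codes into (code, count) runs.
std::vector<Run> RunLengthEncode(const std::vector<int>& codes) {
  std::vector<Run> runs;
  for (int code : codes) {
    if (!runs.empty() && runs.back().first == code) {
      ++runs.back().second;         // extend the current run
    } else {
      runs.emplace_back(code, 1u);  // start a new run
    }
  }
  return runs;
}

// Expand the runs back into the original per-particle sequence.
std::vector<int> RunLengthDecode(const std::vector<Run>& runs) {
  std::vector<int> codes;
  for (const Run& run : runs) {
    assert(run.second > 0);  // an encoder never emits empty runs
    codes.insert(codes.end(), run.second, run.first);
  }
  return codes;
}
```

Event records typically list a few beam and intermediate particles followed by a long block of final-state particles with the same status, so the run list is much shorter than the particle list; a general-purpose compressor like zlib then has less redundant input to work through.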

Performance

Ideally would like to benchmark ROOT vs ProMC vs EicMC

- Hard to compare apples to apples though:
  - Which floating point precision was used?
  - Was the file optimized for size or import (unpacking) speed?
  - Was the file optimized for sequential or direct access?
  - Does the user code access all variables of the event record or only a few?
  - Are MCEG-specific variables considered in the comparison or not?

EicMC (against ProMC, leave ROOT alone):
- Sparsification -> competitive unzipped file format flavor
  - Improves import speed at a cost of a certain file size increase
- Possibility to merge several (small) events in a single gzip record
  - Minimizes file size at a cost of direct access performance
- …

Next steps

- Finalize validation process
- Upload codes to GitLab
  - Optimize package configuration (CMake, etc.)?
- Include a few other converters (HepMC?) & usage examples
- Technically one can add other (non-MC) event types
- Tune for HepSim: file metadata, streaming, etc.
- Tune for GEANT (multi-threading, etc.)?