Navigation Requirements CMS View of OIDs and Refs











Vincenzo Innocente, Lassi Tuura (CMS)
Vincenzo Innocente, CERN/EP, Persistency RTAG

Driving Requirements
- Ability to locate a persistent object even if relocated or multiply located
  - Scattering: write in one file, relocate to many
  - Gathering: collect interesting objects in one file
- Solution should possibly address all kinds of data
  - Event data
  - Calibrations, conditions, geometry
  - Metadata themselves

Scenarios
- Production (simplify it)
  - Write all data from a single process in a single file
  - Split the data later according to a clustering strategy
- Access (transfer as little as possible)
  - Select and reprocess events from a large sample
  - Get a local persistent copy of just what is needed
  - Ensure full event navigability at every stage of analysis
  - Typical query: "Give me the closest (actually, in the fastest way) collection of tracks compatible with this configuration belonging to the events satisfying these criteria"

Event Model
- (Event) Data Product (100-1000 per event)
  - Chunk of (event) data managed as a single unit
    - Collection of Digis belonging to a part of a detector
    - Collection of RecObj (tracks, calo-clusters, jets) produced by a given algorithm
  - Currently identified by (its Objy OID and federation)
    - Event
    - ASCII string
    - "Metadata" describing how it was produced
    - Its "transient" type
    - Physical location
- Inter-Data-Product dependency tracked
  - Consistency ensured

Production 2002, Complexity
- Number of Regional Centers: 11
- Number of Computing Centers: 21
- Number of CPUs: ~1000
- Largest Local Center: 176 CPUs
- Number of Production Passes for each Dataset (including analysis group processing done by production): 6-8
- Number of Files: ~11,000
- Data Size (not including fz files from simulation); File Transfer by GDMP and by Perl scripts over scp/bbcp: 17 TB toward T1, 4 TB toward T2

HEP Data
- Environmental data
  - Detector and accelerator status
  - Calibrations, alignments
- Event-Collection Meta-Data (luminosity, selection criteria, …)
- …
- Event Data, User Data
[Diagram labels: User Tag (N-tuple), Collection Meta-Data, Event Collection, Event, Tracker Alignment, Tracks, Ecal calibration, Electrons]
Navigation is essential for an effective physics analysis.
Complexity requires coherent access mechanisms.

Re-Reconstruction & Clones
[Diagram labels: Production, Id 1; User, Id 2; RecEvent; Run and Config, Id 12; DigiEvent; Tracker; Ecal; Hcal; Local Replica; Ecal; (Partial) Re-reconstruction]

CMS Reconstructed Objects
- Reconstructed Objects produced by a given "algorithm" are managed by a Reconstructor.
- A Reconstructed Object (Track) is split into several independent persistent objects to allow their clustering according to their access patterns (physics analysis, reconstruction, detailed detector studies, etc.). The top-level object acts as a proxy.
- Intermediate reconstructed objects (RHits) are cached by value into the final objects.
- It is possible to recalibrate the aod (and generate a new "version") without modifying or copying the esd and rec.
[Diagram labels: RecEvent; Reconstructor; S-Track ("aod"); Track SecInfo ("esd"); Track Constituents ("rec"); Vector of RHits; calibration dependent; CPU intensive]

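The split of a Track into a top-level proxy plus independently stored parts can be illustrated with a minimal Python sketch. It is not the CMS implementation: the class names `STrack`, `TrackEsd`, `TrackRec`, the `pt` field, and the loader callbacks are all assumptions chosen for illustration. The point is only that the "aod" proxy stays small and touches the "esd" and "rec" parts on demand.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class TrackRec:
    """Hypothetical 'rec' part: constituents, with RHits cached by value."""
    rhits: List[float]


@dataclass
class TrackEsd:
    """Hypothetical 'esd' part: secondary track information."""
    sec_info: dict


@dataclass
class STrack:
    """Hypothetical 'aod' part, acting as a proxy for the other parts."""
    pt: float
    # Loader callbacks stand in for whatever mechanism reads the
    # separately clustered persistent objects (an assumption).
    _load_esd: Callable[[], TrackEsd]
    _load_rec: Callable[[], TrackRec]
    _esd: Optional[TrackEsd] = field(default=None, repr=False)
    _rec: Optional[TrackRec] = field(default=None, repr=False)

    def esd(self) -> TrackEsd:
        # Load the esd part only on first access, then reuse it.
        if self._esd is None:
            self._esd = self._load_esd()
        return self._esd

    def rec(self) -> TrackRec:
        # Load the rec part only on first access, then reuse it.
        if self._rec is None:
            self._rec = self._load_rec()
        return self._rec
```

An analysis that reads only `pt` never triggers either loader, which mirrors the motivation of clustering the parts by access pattern.
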
Raw Event
- RawData belonging to different "detectors" are clustered into different containers. The granularity will be adjusted to optimize I/O performance.
- An index at RawEvent level is used to avoid accessing all containers in search of a given RawData.
- The index is implemented as an ordered vector of pairs.
- RawData are identified by the corresponding ReadOut.
- A range index at RawData level could be used for fast random access in complex detectors.
[Diagram labels: RawEvent; Index; ReadOut; RawData; Vector of Digi]

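An index kept as an ordered vector of pairs can be searched by bisection rather than by scanning every container. The sketch below shows the idea in Python; the class name `RawEventIndex`, the integer ReadOut identifiers, and the string container names are assumptions for illustration, not taken from the slides.

```python
import bisect
from typing import Iterable, Optional, Tuple


class RawEventIndex:
    """Ordered vector of (readout_id, container) pairs (hypothetical layout)."""

    def __init__(self, entries: Iterable[Tuple[int, str]]):
        # Keep the pairs sorted by key so lookups can use binary search.
        self._pairs = sorted(entries)
        self._keys = [key for key, _ in self._pairs]

    def find(self, readout_id: int) -> Optional[str]:
        # Locate the first pair whose key is >= readout_id.
        i = bisect.bisect_left(self._keys, readout_id)
        if i < len(self._keys) and self._keys[i] == readout_id:
            return self._pairs[i][1]
        return None  # no RawData registered for this ReadOut


index = RawEventIndex([(30, "hcal"), (10, "tracker"), (20, "ecal")])
container = index.find(20)  # looks up the container for ReadOut 20
```

A lookup costs O(log n) comparisons, and the sorted vector is compact to store, which suits data that is written once and read many times.
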
An OID proposal
We propose to use an object identifier composed of three fields:
- Navigation-Scope (Sea??) identifier
  - Always implicit: explicit use is limited to cross-references among disjoint stores (for instance, event toward calibration)
  - Nothing prevents the use of a context for a dataset or even an event
  - The concrete implementation of the Sea is a file catalog
- Data Product Id (dp-id)
  - Unique and immutable identifier (in a given sea) of a data product
  - To simplify lookup in case of scattering-gathering, we suggest it includes a field identifying the logical file (lf-id)
    - When writing, one can easily stream all logical files into the same physical file
    - For a given sea, a physical file can map to multiple logical files
    - In small seas (lakes), such as a local replica of selected events, even an m-to-n mapping could be affordable
- Object index
  - Used to identify single objects in the data product
  - If the Data Product is WORM, indexing will work whatever data structure is used below a Data Product

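The three-field identifier can be pictured as a small immutable record. The following Python sketch is one possible reading of the proposal: the field names (`sea`, `lf_id`, `local_id`, `object_index`), their types, and the `in_sea` helper are assumptions, since the slides do not fix a layout.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DataProductId:
    """dp-id: unique and immutable within a given sea (hypothetical layout)."""
    lf_id: int     # field identifying the logical file
    local_id: int  # rest of the dp-id


@dataclass(frozen=True)
class Oid:
    """Object identifier composed of three fields."""
    dp_id: DataProductId
    object_index: int          # identifies a single object in the data product
    sea: Optional[str] = None  # navigation scope; None means implicit

    def in_sea(self, current_sea: str) -> str:
        # An explicit sea is only needed for cross-references among
        # disjoint stores; otherwise the current scope applies.
        return self.sea if self.sea is not None else current_sea


track_ref = Oid(DataProductId(lf_id=3, local_id=42), object_index=7)
calib_ref = Oid(DataProductId(lf_id=1, local_id=5), object_index=0,
                sea="calibration")
```

Here `track_ref` carries no sea and resolves in whatever scope is being navigated, while `calib_ref` names its sea explicitly, as an event-to-calibration reference would.
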
Data Product id resolution: a possible implementation
- The Sea is responsible for mapping an lf-id to a given strategy to resolve a data-product-id
  - The same dp-id can be resolved differently depending on which sea we are navigating
- Physical resolution strategy
  - Local mapping
    - The lf-id identifies a file (not necessarily the original one); the rest of the dp-id is used to look up in a table contained in the file itself
  - Global mapping
    - The lf-id identifies a file, the rest of the dp-id a physical location (Objy-like)
    - The lf-id identifies a table; the rest of the dp-id is used to look up in the table for the physical location of the data product
  - ….

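The resolution scheme can be sketched as a sea that maps each lf-id to a strategy function. This is a minimal Python illustration under stated assumptions: the names `Sea`, `local_mapping`, `global_direct`, `global_table`, the tuple form of a dp-id, and the `(file, offset)` form of a physical location are all hypothetical.

```python
from typing import Callable, Dict, Tuple

DpId = Tuple[int, int]      # (lf_id, rest of dp-id), hypothetical layout
Location = Tuple[str, int]  # (physical file, offset), hypothetical layout
Strategy = Callable[[int], Location]


class Sea:
    """Maps each lf-id to the strategy used to resolve a dp-id."""

    def __init__(self) -> None:
        self._strategies: Dict[int, Strategy] = {}

    def register(self, lf_id: int, strategy: Strategy) -> None:
        self._strategies[lf_id] = strategy

    def resolve(self, dp_id: DpId) -> Location:
        lf_id, rest = dp_id
        return self._strategies[lf_id](rest)


def local_mapping(file_name: str, table_in_file: Dict[int, int]) -> Strategy:
    # lf-id identifies a file; the rest is looked up in a table that
    # would be stored in the file itself (here a plain dict stands in).
    return lambda rest: (file_name, table_in_file[rest])


def global_direct(file_name: str) -> Strategy:
    # lf-id identifies a file; the rest is itself the physical location.
    return lambda rest: (file_name, rest)


def global_table(table: Dict[int, Location]) -> Strategy:
    # lf-id identifies a table; the rest is looked up in that table.
    return lambda rest: table[rest]


dp_id = (7, 3)
big_sea = Sea()
big_sea.register(7, global_table({3: ("/store/a.db", 4096)}))
lake = Sea()
lake.register(7, local_mapping("replica.db", {3: 128}))
```

With this setup, `big_sea.resolve(dp_id)` and `lake.resolve(dp_id)` return different locations for the same dp-id, which is the behaviour the slide describes for navigation in different seas.
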