Instrumentation of the SAM-Grid Gabriele Garzoglio CSC 426 Research Proposal
Overview ➢ Characteristics of the High Energy Physics Community • The SAM-Grid: enabling fully distributed analysis job processing • The Proposed Instrumentation
Characteristics of the work in High Energy Physics • High Energy Physics studies the fundamental interactions of Nature. • A few laboratories around the world each provide unique facilities (accelerators) to study particular aspects of the field: the collaborations are geographically distributed. • Experiments become more challenging and more expensive every decade: the collaborations are large groups of people. • The phenomena studied are statistical in nature and the events of interest are very rare: a lot of data/statistics is needed.
The Fermi National Accelerator Laboratory
The Nature of the Data
An example: the D0 Experiment
• Detector Data
  – 1,000 Channels
  – Event size 250 KB
  – Event rate ~50 Hz
  – On-line Data Rate 12 MB/s
  – Est. 2 year totals (incl. Processing and analysis):
    • 1 x 10^9 events
    • ~0.5 PB
• Monte Carlo Data (simulations)
  – 5 remote processing centers
  – Estimate ~300 TB in 2 years.
The D0 Collaboration • ~500 Physicists • 72 Institutions • 18 Countries
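The detector figures above can be cross-checked with a little arithmetic. The sketch below is only a back-of-envelope check: it assumes decimal units (1 MB = 1000 KB) and a 365-day year, and the "live fraction" it derives is an inference from the quoted numbers, not a figure given in the slides.

```python
# Back-of-envelope check of the D0 detector data figures quoted on the slide.
EVENT_SIZE_KB = 250      # from the slide
EVENT_RATE_HZ = 50       # from the slide (approximate)
TOTAL_EVENTS = 1e9       # from the slide: estimated 2-year total

KB_PER_MB = 1000         # assumption: decimal units
KB_PER_TB = 1000 ** 3    # assumption: decimal units
SECONDS_PER_YEAR = 365 * 24 * 3600

# On-line data rate implied by event size and event rate.
online_rate_mb_per_s = EVENT_SIZE_KB * EVENT_RATE_HZ / KB_PER_MB

# Raw detector volume implied by the 2-year event count.
raw_volume_tb = TOTAL_EVENTS * EVENT_SIZE_KB / KB_PER_TB

# Fraction of the 2 years during which events would have to be recorded
# at the full rate to reach the quoted event count (derived, not quoted).
live_fraction = TOTAL_EVENTS / (EVENT_RATE_HZ * 2 * SECONDS_PER_YEAR)

print(f"on-line rate: {online_rate_mb_per_s} MB/s")
print(f"raw volume:   {raw_volume_tb} TB")
print(f"live fraction: {live_fraction:.2f}")
```

Under these assumptions the rate comes out at 12.5 MB/s, in line with the quoted 12 MB/s, and the raw events account for 250 TB, roughly half of the ~0.5 PB total, which is plausible given that the total includes processing and analysis output.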
How can all of them work together? By using middleware for large distributed systems: the Grid
Overview ✓ Characteristics of the High Energy Physics Community ➢ The SAM-Grid: enabling fully distributed analysis job processing • The Proposed Instrumentation
The SAM-Grid Project • Mission: enable fully distributed computing for DZero and CDF • Strategy: enhance the distributed data handling system of the experiments (SAM), incorporating standard Grid tools and protocols, and developing new solutions for Grid computing (JIM) • Funds: the Particle Physics Data Grid (US) and GridPP (UK) • People: Computer scientists and Physicists from Fermilab and the collaborating Universities • History: SAM from 1997, JIM from end of 2001 • Schedule: CDF and DZero are running now! A prototype is running, scheduled for production in Spring 03; long-term deliverables in 2 years.
The Logistics
[Diagram. Labeled components: User Interface, Submission Client, Job Management, Match Making Service, Queuing System, Information Collector, Resource Selector, JOB, Execution Site #1 to Execution Site #n (Computing Element, Storage Element, Data Handling System, Grid Sensors)]
Overview ✓ Characteristics of the High Energy Physics Community ✓ The SAM-Grid: enabling fully distributed analysis job processing ➢ The Proposed Instrumentation
Why is this useful? The SAM-Grid is a complex system: instrumentation is of critical importance to • Troubleshoot the system – Production systems are maintained 24x7 – Ease user support – Find anomalies/bugs • Gather statistics – User data access patterns – Resource utilization – Global parameter optimization
Why is this challenging? • The SAM-Grid is composed of hundreds of servers, widely distributed geographically: what is a suitable architecture? • Servers have very diverse functionalities: is it possible to enable some form of uniform data access?
Current instrumentation… • The SAM System uses a global log service: every SAM Server records free-format events/messages • JIM V1 is under intense development: the current instrumentation is insufficient
…and its limitations • The current log server is centralized: for the SAM system alone it records 1 GB every few days. This does not scale. • Message transport is UDP-based: this scales with the number of reporting servers, but delivery and data integrity are not guaranteed. • The messages are not structured: data mining / presentation is non-trivial.
The direction (1) • The Coda distributed file system is a good example of a successful distributed architecture for instrumentation. [Diagram. Labeled components: Client, Server, Data Collector, Database, Data Log, Reaper, Off-Line Analyses]
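The components named on this slide can be sketched as a small pipeline. This is a minimal illustration, not the Coda implementation: the data flow (clients and servers report to a collector, the collector appends to a data log, a reaper moves the log into a database, off-line analyses query the database) is an assumption read off the diagram labels, and every class, function, table and host name below is hypothetical.

```python
import sqlite3
from collections import deque


class DataCollector:
    """Receives records from clients/servers and appends them to a data log."""

    def __init__(self):
        self.data_log = deque()  # stand-in for an on-disk log (assumption)

    def receive(self, source, payload):
        self.data_log.append((source, payload))


class Reaper:
    """Drains the collector's data log into the database."""

    def __init__(self, collector, db):
        self.collector = collector
        self.db = db
        db.execute(
            "CREATE TABLE IF NOT EXISTS events (source TEXT, payload TEXT)"
        )

    def reap(self):
        moved = 0
        while self.collector.data_log:
            source, payload = self.collector.data_log.popleft()
            self.db.execute(
                "INSERT INTO events VALUES (?, ?)", (source, payload)
            )
            moved += 1
        self.db.commit()
        return moved


def events_per_source(db):
    """An example off-line analysis: count the events reported per source."""
    rows = db.execute(
        "SELECT source, COUNT(*) FROM events GROUP BY source ORDER BY source"
    )
    return dict(rows.fetchall())


db = sqlite3.connect(":memory:")
collector = DataCollector()
reaper = Reaper(collector, db)

collector.receive("client-a", "open file")
collector.receive("server-b", "serve file")
collector.receive("client-a", "close file")

moved = reaper.reap()
summary = events_per_source(db)
print(moved, summary)
```

Decoupling the collector from the database through a log means reporting servers never wait on database writes, which is one way to address the scaling concern raised for the centralized log server.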
The direction (2) • The structure of the message should include: – the name of the client/server – the types of the client/server: various groupings may be meaningful, e.g. logistical, functional, logical, etc. – the location of the client/server – a global time stamp – an id code, related to the severity of the message
Rough time estimate • 1 FTE month to design the architecture and the message structure • 1 FTE month to implement basic messaging • 1 FTE month to study initial results • 1 FTE month to feed changes back into the message structure and implementation
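The message fields listed above can be sketched as a small record type. This is one possible shape, not a proposed final format: the field names, the severity codes, the free-text `text` field, the JSON encoding and the example values are all assumptions made for illustration.

```python
import json
import time
from dataclasses import dataclass, field, asdict
from enum import IntEnum


class Severity(IntEnum):
    """Hypothetical id codes; the slides only say the code relates to severity."""
    DEBUG = 10
    INFO = 20
    WARNING = 30
    ERROR = 40


@dataclass
class LogMessage:
    name: str            # name of the client/server
    types: dict          # grouping -> type, e.g. {"functional": "station"}
    location: str        # location of the client/server
    severity: Severity   # id code related to the severity of the message
    text: str            # message body (assumption: not listed on the slide)
    # Global time stamp: seconds since the Unix epoch, independent of time zone.
    timestamp: float = field(default_factory=time.time)

    def to_json(self):
        record = asdict(self)
        record["severity"] = int(self.severity)
        return json.dumps(record, sort_keys=True)

    @classmethod
    def from_json(cls, raw):
        record = json.loads(raw)
        record["severity"] = Severity(record["severity"])
        return cls(**record)


msg = LogMessage(
    name="example-station",                      # hypothetical server name
    types={"functional": "station", "logical": "d0"},
    location="example-site",                     # hypothetical location
    severity=Severity.WARNING,
    text="cache nearly full",
)
decoded = LogMessage.from_json(msg.to_json())
print(decoded.severity.name, decoded.types["functional"])
```

Because every message carries the same named fields, a collector can filter by severity or group by type without parsing free-format text, which is the data-mining difficulty noted for the current log service.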