Automated Grid Monitoring for LHCb Experiment through HammerCloud
Bradley Dice, Valentina Mancinelli
Project Overview
§ Use HammerCloud to…
  § Test LHCb data storage access
  § Ensure that new releases of user analysis programs function successfully
§ Why?
  § Temporarily disable sites with unreliable storage
  § Prioritize bug-fixing by most common problems
  § Keep the science moving!
Work falls into three categories: Front End, Back End, Grid Tests
Front End (User Interface)
§ Shows a list of current and past tests and offers management tools
§ Progress:
  § Added data visualizations to categorize errors and the sites they affect (right)
  § Cleaned up menu structures
  § Made job colors easier to understand
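The error categorization above can be illustrated with a minimal sketch. It is not the HammerCloud code: the job records, the field names (`site`, `status`, `error`), the site names, and the error labels are all hypothetical, chosen only to show how failed jobs might be grouped by error category and by the site they affect before being plotted.

```python
from collections import defaultdict


def errors_by_site(jobs):
    """Count failed jobs per error category and per site.

    `jobs` is an iterable of dicts with hypothetical keys
    'site', 'status' and 'error' (assumed, not the real schema).
    """
    table = defaultdict(lambda: defaultdict(int))
    for job in jobs:
        if job["status"] == "failed":
            table[job["error"]][job["site"]] += 1
    # Convert nested defaultdicts to plain dicts for display or serialization.
    return {error: dict(sites) for error, sites in table.items()}


# Hypothetical sample data, for illustration only.
jobs = [
    {"site": "SITE_A", "status": "failed", "error": "file access"},
    {"site": "SITE_A", "status": "failed", "error": "file access"},
    {"site": "SITE_B", "status": "failed", "error": "application crash"},
    {"site": "SITE_B", "status": "completed", "error": None},
]
summary = errors_by_site(jobs)
```

A table of this shape maps directly onto a stacked bar chart: one bar per error category, one segment per site.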
Back End (Test Manager)
§ Interfaces between Ganga (to submit grid jobs) and Django (to display data)
§ Progress:
  § HammerCloud sites automatically update to match the WLCG topology
  § Ganga jobs report back detailed information for analysis
  § The back end produces plots showing jobs by status: completed, running, scheduled, or failed (right)
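As a rough sketch of the data behind a jobs-by-status plot, the tally below counts jobs in each of the four states. The status strings and the function name are assumptions made for this example; the actual back end may label and store job states differently.

```python
from collections import Counter

# Assumed status labels, in the order they would appear on the plot.
STATUSES = ("completed", "running", "scheduled", "failed")


def status_counts(job_statuses):
    """Return the number of jobs in each known status, zero-filled."""
    counts = Counter(job_statuses)
    return {status: counts.get(status, 0) for status in STATUSES}


# Hypothetical statuses reported by one test's jobs.
reported = ["completed", "failed", "running", "completed", "completed"]
totals = status_counts(reported)
```

Zero-filling every status keeps the plot's categories stable from one test to the next, even when no job is in a given state.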
Grid Tests (Getting Results)
§ Detecting and classifying data access failures is the key purpose of HammerCloud
§ Progress:
  § A postprocessor has to detect whether files were accessed locally or pulled from another site (failover)
  § Failover detection is presently difficult. Ongoing collaboration with the Ganga developers should help resolve this challenge.
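The local-versus-failover decision can be sketched as follows, under a strong assumption: that the postprocessor already knows which site served each file. Obtaining that information is exactly the part described above as difficult, so this sketch covers only the final classification step. The function name, file names, and site names are hypothetical.

```python
def classify_accesses(job_site, file_sources):
    """Label each file access as 'local' or 'failover'.

    job_site     -- site where the job ran (hypothetical identifier)
    file_sources -- dict mapping file name to the site that served it
                    (assumed to be available; in practice this is the hard part)
    """
    return {
        name: "local" if source == job_site else "failover"
        for name, source in file_sources.items()
    }


# Hypothetical example: a job at SITE_A reading two files.
accesses = classify_accesses(
    "SITE_A",
    {"file_1.dst": "SITE_A", "file_2.dst": "SITE_B"},
)
used_failover = "failover" in accesses.values()
```

Classifying per file rather than per job keeps enough detail to tell a single missing replica apart from a storage element that is failing for every file.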
Future Steps
§ Retrieve more job information (metrics on CPU time, etc.)
§ Provide grid site status information to RSS (Resource Status System)
§ Create data visualizations requested by LHCb
§ Document code in TWiki for future developers