
Performance Monitoring
Claudio Grandi (INFN Bologna)
CMS Monitoring Meeting, 7 July 2011

Note
- All the following applies to jobs handled by WMAgent
- Features appear in the Tier-0 as it transitions to the WMCore/WMAgent code base

Real Time monitoring
- HLT-like real time monitoring is not achievable on the distributed infrastructure
- It is possible at the Tier-0
- The method used by Emilio on the HLT farm is relatively simple as far as producing the information is concerned, but the collection is specific to the HLT/DAQ system

Detailed monitoring
- Detailed monitoring (e.g. per-event metrics) should be managed in the event structure
- One possibility is to use the DQM data structures
- All possible reductions/aggregations of monitoring data must be done on the WN before storing them in the event
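As an illustration of reducing per-event data on the WN, the sketch below fills a fixed-bin histogram, so the stored size does not depend on the number of events. The class `FixedHistogram` and the sample values are hypothetical and are not taken from DQM or CMSSW; the actual DQM data structures may look quite different.

```python
from dataclasses import dataclass, field


@dataclass
class FixedHistogram:
    """Fixed-bin histogram (hypothetical helper): size is independent of event count."""
    low: float
    high: float
    nbins: int
    counts: list = field(default_factory=list)
    underflow: int = 0
    overflow: int = 0

    def __post_init__(self):
        # Start with empty bins unless counts were supplied explicitly.
        if not self.counts:
            self.counts = [0] * self.nbins

    def fill(self, value: float) -> None:
        """Add one per-event measurement to the matching bin."""
        if value < self.low:
            self.underflow += 1
        elif value >= self.high:
            self.overflow += 1
        else:
            width = (self.high - self.low) / self.nbins
            index = min(int((value - self.low) / width), self.nbins - 1)
            self.counts[index] += 1


# Hypothetical per-event times in seconds, reduced on the worker node.
hist = FixedHistogram(low=0.0, high=4.0, nbins=4)
for event_time in [0.5, 1.2, 1.9, 3.5, 7.0]:
    hist.fill(event_time)
```

Only `hist.counts` and the two overflow counters would then need to be stored, rather than one value per event.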

Per-job metrics
- Per-job metrics appear in the FJR and are then stored in the dashboard
- Correlation with other job properties is done through the dashboard interface
- Current metrics in WMAgent (a few could have been removed...):
  jobId, retryCount, taskId, stepName, PeakValueRss, PeakValueVsize, writeTotalMB, readPercentageOps, readAveragekB, readTotalMB, readNumOps, readCachePercentageOps, readMBSec, readMaxMSec, readTotalSecs, writeTotalSecs, TotalJobCPU, TotalEventCPU, AvgEventTime, MinEventCPU, MaxEventTime, TotalJobTime, MinEventTime, MaxEventCPU
- If more metrics are needed, we need to identify who writes them in the FJR
  - Implemented as a CMSSW Service or added to the current Timing & Memory checker services
  - Vincenzo proposed: VSS, RSS, PSS, CPU, mem/disk/network rates
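To make the question of who writes the metrics more concrete, here is a minimal sketch of collecting two of them from inside a process using the Python standard library `resource` module (Unix only). The function `collect_job_metrics` is hypothetical; the key names are borrowed from the list above, but the units and the dictionary layout are assumptions and do not describe the real FJR format or the CMSSW services.

```python
import resource
import sys


def collect_job_metrics():
    """Return peak RSS and CPU time of the current process (hypothetical helper)."""
    usage = resource.getrusage(resource.RUSAGE_SELF)
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return {
        "PeakValueRss": usage.ru_maxrss / divisor,  # MB (assumed unit)
        "TotalJobCPU": usage.ru_utime + usage.ru_stime,  # seconds, user + system
    }


metrics = collect_job_metrics()
```

Quantities such as PSS or network rates are not available through `getrusage` and would need another source.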

Time profiles
- Time profiles for some of the metrics could be useful
- Should be produced on demand
- Should be returned together with the job output (FJR?)
- Storing the time profiles in the dashboard is probably not doable... Is it?
- Do we need to develop visualization tools?
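One possible way to produce a time profile on demand is a background thread that samples a quantity at a fixed interval and keeps (elapsed time, value) pairs to return with the job output. The sketch below assumes a hypothetical `ProfileSampler` class and a dummy probe; it is not the WMAgent implementation.

```python
import threading
import time


class ProfileSampler(threading.Thread):
    """Sample probe() every `interval` seconds until stop() is called (hypothetical)."""

    def __init__(self, probe, interval):
        super().__init__(daemon=True)
        self.probe = probe
        self.interval = interval
        self.samples = []  # list of (seconds since start, value)
        self._stop_event = threading.Event()

    def run(self):
        start = time.monotonic()
        while True:
            self.samples.append((time.monotonic() - start, self.probe()))
            # wait() returns True as soon as stop() sets the event.
            if self._stop_event.wait(self.interval):
                break

    def stop(self):
        self._stop_event.set()
        self.join()


# Dummy probe standing in for a real quantity such as RSS.
counter = iter(range(1000))
sampler = ProfileSampler(probe=lambda: next(counter), interval=0.01)
sampler.start()
time.sleep(0.05)
sampler.stop()
```

The sampling interval bounds the size of the profile, which matters if the profile is shipped back inside the FJR.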

Modules occupancy
- Time spent in different modules could be handled in a way similar to the time profiles
- Something exists, but its verbosity needs to be reduced (to be checked)
- Store in the FJR? Handled by the dashboard?
- For the Tier-0, see what was said about real time monitoring

Alarms
- The same machinery used to produce the time profiles may be used to trigger alarms
  - Start threads that monitor the time evolution of critical quantities
- Coupling them with a messaging system could make it possible to communicate with the WMAgent and raise alarms to the users or the operators
- There is an open ticket in DMWM for that: https://svnweb.cern.ch/trac/CMSDMWM/ticket/1416
- Will be implemented in the WMAgent port of the Tier-0
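The alarm idea can be sketched as a watcher thread that polls a critical quantity and posts a message when it crosses a threshold. Here a standard library `queue.Queue` stands in for the messaging system; the function `watch`, the threshold, and the readings are all hypothetical and unrelated to the design in the DMWM ticket.

```python
import queue
import threading


def watch(probe, threshold, interval, alarms, stop_event):
    """Poll probe() and post one alarm message if it exceeds the threshold."""
    while not stop_event.is_set():
        value = probe()
        if value > threshold:
            alarms.put({"value": value, "threshold": threshold})
            return
        stop_event.wait(interval)


# Hypothetical memory readings in MB; the third one exceeds the limit.
readings = iter([100, 250, 900, 1200])
alarms = queue.Queue()  # stand-in for a real messaging system
stop_event = threading.Event()
watcher = threading.Thread(
    target=watch,
    args=(lambda: next(readings), 800, 0.01, alarms, stop_event),
    daemon=True,
)
watcher.start()
alarm = alarms.get(timeout=5)  # the consumer side, e.g. the agent
stop_event.set()
watcher.join()
```

In a real deployment the consumer would be a separate process, so the queue would be replaced by whatever transport the messaging system provides.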