INFNGRID Monitoring Group report Roberto Barbera INFN Catania
INFNGRID Monitoring Group report
Roberto Barbera (INFN Catania), Paolo Lo Re (INFN Napoli), Giuseppe Sava (INFN Catania), Gennaro Tortone (INFN Napoli)
Napoli, November 2002
Monitoring of grid elements (1/2)

Grid elements:
- Computing Element
- Worker Node
- Storage Element
- Resource Broker
- Replica Catalog
- Information Index
- […]

LOW LEVEL measurements:
- CPU load
- memory usage
- disk usage (per partition)
- network activity
- number of processes
- number of users (UI)
- …

SERVICE checks:
- gatekeeper
- gsiftp
- gris
- gdmp
- RB/LB
- …

“GRID” measurements:
- number of total CPUs
- number of free CPUs
- number of running jobs
- number of waiting jobs
- SE free disk space
- …
Monitoring of grid elements (2/2)

Sources of information:
- LOW LEVEL measurements -> plugins/sensors installed on each machine
- SERVICE checks -> sensors installed on monitoring server
- GRID measurements -> sensors installed on monitoring server

Aggregate information:
- per VO
- per site
- …
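The aggregation of GRID measurements per VO or per site can be sketched as follows. This is only an illustration, not the group's sensor code: the record fields (`site`, `vo`, `free_cpus`, `running_jobs`) and the sample values are hypothetical.

```python
from collections import defaultdict

def aggregate(records, key):
    """Sum every numeric measurement over the records that share the same key."""
    totals = defaultdict(lambda: defaultdict(int))
    for rec in records:
        group = rec[key]
        for name, value in rec.items():
            if isinstance(value, (int, float)):
                totals[group][name] += value
    return {group: dict(metrics) for group, metrics in totals.items()}

# Hypothetical sample records, one per (site, VO) pair.
SAMPLE = [
    {"site": "Catania", "vo": "alice", "free_cpus": 4, "running_jobs": 10},
    {"site": "Catania", "vo": "cms", "free_cpus": 2, "running_jobs": 3},
    {"site": "Napoli", "vo": "alice", "free_cpus": 1, "running_jobs": 7},
]

if __name__ == "__main__":
    print(aggregate(SAMPLE, "site"))  # totals per site
    print(aggregate(SAMPLE, "vo"))    # totals per VO
```

The same function serves both views because only the grouping key changes.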
The idea…

The idea was, and still is, to use Nagios:
- to view a “snapshot” of the status of GRID/Testbed resources, service availability and network measurements;
- to receive notifications of host or service faults;
- to view graphs of resource monitoring results or network measurements.

Nagios was the “official choice” of the INFN Testbed Technical Board for monitoring INFN Testbed 1.
Nagios (1/2)

Nagios (www.nagios.org) is an Open Source network monitoring tool developed by Ethan Galstad and designed to run under Linux. Some of its features include:
- simple plugin design that allows users to easily develop their own service checks
- monitoring of network services (FTP, HTTP, SSH, …)
- monitoring of host resources (CPU load, disk usage, …)
- ability to define a network host (or device) “hierarchy” using “parent” hosts, allowing Nagios to distinguish hosts that are down from hosts that are unreachable
- distributed monitoring: a “central Nagios server” obtains check results from one or more “Nagios distributed servers”
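To give an idea of how simple the plugin design is, here is a minimal sketch of a CPU load check. It relies on the Nagios plugin convention of one status line on standard output plus an exit code (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). The thresholds and the function names are hypothetical, and this is not one of the plugins used on the Testbed.

```python
import os

# Exit codes defined by the Nagios plugin convention.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL"}

def classify(load, warn, crit):
    """Map a load value to a Nagios state (thresholds are hypothetical)."""
    if load >= crit:
        return CRITICAL
    if load >= warn:
        return WARNING
    return OK

def check_load(warn=5.0, crit=10.0):
    """Return (exit_code, status_line) for the 1-minute load average."""
    try:
        load1 = os.getloadavg()[0]
    except (OSError, AttributeError):
        # The load average is not available on this platform.
        return UNKNOWN, "LOAD UNKNOWN - load average not available"
    state = classify(load1, warn, crit)
    return state, f"LOAD {LABELS[state]} - load1={load1:.2f}"

def main():
    """Print the status line and return the exit code for Nagios."""
    code, line = check_load()
    print(line)
    return code

# When installed as a plugin, the script would end with:
#     import sys; sys.exit(main())
```

Nagios only looks at the exit code and the printed line, so a check can be written in any language.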
Nagios (2/2)

- contact notifications when service or host problems occur (via email or a user-defined method)
- ability to define event handlers to be run during service or host events for “proactive” problem resolution
- logging mechanism and automatic log-file rotation
- optional plugins to send SNMP queries to hosts or network devices (routers, switches, …)
- web interface for viewing current network status, notifications, problem history, log file, …
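The distinction between “down” and “unreachable” hosts, which follows from the “parent” host hierarchy in the feature list, can be sketched as below. This is a simplified illustration of the idea, not Nagios code: a host that does not answer is “down” if it has no parents or at least one parent answers, and “unreachable” otherwise. The host names are hypothetical.

```python
def classify_hosts(parents, responds):
    """Classify each host as 'up', 'down' or 'unreachable'.

    parents:  host -> list of parent hosts (empty for hosts next to the server)
    responds: host -> True if the host answered its check
    """
    status = {}
    for host, host_parents in parents.items():
        if responds.get(host, False):
            status[host] = "up"
        elif not host_parents or any(responds.get(p, False) for p in host_parents):
            # The path to the host works, so the fault is on the host itself.
            status[host] = "down"
        else:
            # No parent answers: the host cannot be checked at all.
            status[host] = "unreachable"
    return status

# Hypothetical topology: monitoring server -> router -> switch -> ce01, se01
PARENTS = {"router": [], "switch": ["router"], "ce01": ["switch"], "se01": ["switch"]}
RESPONDS = {"router": True, "switch": False, "ce01": False, "se01": False}

if __name__ == "__main__":
    print(classify_hosts(PARENTS, RESPONDS))
```

In this example only the switch is reported as down, which points straight at the actual fault instead of producing one alarm for every host behind it.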
Addons developed by INFNGRID monitoring group

- graphs of resource monitoring results: we have developed a “wrapper” that parses the output of a plugin execution and inserts the monitoring values into an RRD (Round Robin Database, www.rrdtool.org). From the Nagios web interface, a user can view daily, weekly, monthly or yearly graphs for a selected resource
- “LDAP based” plugins: another thread of development is the implementation of plugins that “pull” information from an MDS server rather than from the resources themselves
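The wrapper idea can be sketched as follows. This is not the group's wrapper: it assumes that the plugin output carries `label=value` pairs after a `|` character (the Nagios performance data convention), and the file path and data source names are hypothetical. The sketch only builds the `rrdtool update` command line and does not run it.

```python
import re

# Matches label=value; an optional unit and ';'-separated thresholds are ignored.
PERF_ITEM = re.compile(r"(?P<label>[^\s=]+)=(?P<value>-?\d+(?:\.\d+)?)")

def parse_perfdata(plugin_output):
    """Return {label: float} from the part of the plugin output after '|'."""
    _, sep, perf = plugin_output.partition("|")
    if not sep:
        return {}
    return {m.group("label"): float(m.group("value")) for m in PERF_ITEM.finditer(perf)}

def rrd_update_command(rrd_path, values, order):
    """Build an 'rrdtool update' argument list; 'N' stands for the current time."""
    fields = ":".join(str(values[name]) for name in order)
    return ["rrdtool", "update", rrd_path, f"N:{fields}"]

if __name__ == "__main__":
    output = "LOAD OK - load average: 0.42, 0.35, 0.30|load1=0.42;5;10;0 load5=0.35;4;8;0 load15=0.30;3;6;0"
    values = parse_perfdata(output)
    # Hypothetical RRD file; 'order' must match the data source order in that file.
    print(rrd_update_command("/var/lib/rrd/ce01_load.rrd", values, ["load1", "load5", "load15"]))
```

The resulting list could be passed to `subprocess.run` on a machine where `rrdtool` is installed and the RRD file already exists.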
News

- Nagios and a new web interface were used during the WorldGRID demo at:
  - SuperComputing 2002 (Baltimore)
  - IST 2002 (Copenhagen)

WorldGRID is an intercontinental Testbed between the US and the EU (between the iVDGL-Trillium and DataTAG(EDT)-EDG projects).
And now… a short demo!