Use of Nagios in Central European ROC

Emir Imamagic
University Computing Centre (SRCE), Croatia

Grid Monitoring WG core group meeting / Use of Nagios in Central European ROC region

Overview
  • Motivation
  • Nagios
  • Grid monitoring with Nagios
      - Sensors
      - Configuration management
      - GOCDB integration
  • Demo slides
  • Future work

Motivation
  • Achieve better availability
      - getting notifications as soon as a problem appears
  • Simplify maintenance of grid resources
  • Complex sensor dependencies enable isolating the problem
      - only relevant notifications are issued
  • Report generation
      - availability, problem history
  • Visualization & management interface

Nagios

Nagios
  • Open source monitoring system
  • Widely used & actively developed
  • Host and service problem detection and recovery
  • Provides a set of basic plugins (sensors)
      - easy to develop custom sensors
  • No components required on monitored entities
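To illustrate why custom sensors are easy to write, here is a minimal sketch of one in Python. It relies on the standard Nagios plugin convention: the exit code carries the state (0 OK, 1 Warning, 2 Critical, 3 Unknown) and the first line of output is the status text. The function names and the TCP connect test are illustrative and are not taken from the presentation.

```python
import socket

# Exit codes defined by the Nagios plugin convention.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_tcp_port(host, port, timeout=5.0):
    """Return (exit_code, one-line message) for a TCP connect test."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return OK, f"TCP OK - {host}:{port} accepts connections"
    except socket.timeout:
        return CRITICAL, f"TCP CRITICAL - {host}:{port} timed out after {timeout}s"
    except OSError as exc:
        return CRITICAL, f"TCP CRITICAL - {host}:{port}: {exc}"


def main(argv):
    """Parse HOST and PORT, print the status line, return the exit code."""
    if len(argv) != 3 or not argv[2].isdigit():
        print("UNKNOWN - usage: check_port.py HOST PORT")
        return UNKNOWN
    code, message = check_tcp_port(argv[1], int(argv[2]))
    print(message)  # the first output line becomes the status text
    return code

# A real plugin script would finish with: sys.exit(main(sys.argv))
```

Nagios only needs the executable, its arguments and the exit code, which is why nothing has to be installed on the monitored host.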

Objects
  • Host
      - physical server, workstation
      - network device (e.g. switch, router)
      - other devices connected to the network
  • Service
      - service running on a host
      - metric associated with the host
  • A service must be associated with a host
  • Objects can be aggregated in groups

Sensor Execution
  • Per-object sensor arguments
      - adaptive monitoring
      - e.g. timeout
  • Per-object checking interval
      - each sensor has an individual check interval
      - normal vs. problem check interval
  • Per-object number of rechecks
      - determines state type
  • Advanced check scheduling
      - avoiding server overload

Notifications
  • Per-object configuration
      - list of contacts
      - notification period, states & repeat interval
      - used for authorization
  • Contact configuration
      - name and alias
      - host and service notification period, states & mechanism
      - email address
      - pager number
  • Notification escalations
      - if the problem is not solved, notifications escalate to the next contact levels
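The escalation idea can be sketched as a lookup from notification number to recipients. This is a simplified model, assuming an escalation level is described by the first and last notification number it covers (0 meaning no upper limit) and that matching escalation levels replace the object's default contacts. The class, function and contact names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Escalation:
    first_notification: int  # first notification number this level applies to
    last_notification: int   # last one; 0 means "no upper limit"
    contacts: tuple


def contacts_for(notification_number, default_contacts, escalations):
    """Pick who receives the n-th notification about an unresolved problem."""
    matched = [
        e for e in escalations
        if e.first_notification <= notification_number
        and (e.last_notification == 0
             or notification_number <= e.last_notification)
    ]
    if not matched:
        return set(default_contacts)
    # Every matching level is notified; overlapping levels are merged.
    return {c for e in matched for c in e.contacts}
```

With levels covering notifications 3 to 5 and 5 onwards, the first two notifications go to the default contacts, the fifth reaches both escalation levels, and later ones reach only the last level.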

States
  • Host states
      - Up, Down, Unreachable
  • Service states
      - Ok, Warning, Unknown, Critical
  • State types
      - soft
          * object has not been rechecked the specified number of times
      - hard
          * object has been rechecked the specified number of times
          * object recovers from problem state
          * causes notifications & event handlers
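The soft/hard distinction can be sketched as a small state tracker. This is a simplified model, not Nagios's actual scheduler: it assumes a recheck limit named `max_check_attempts`, treats every OK result as a hard state, and sends no repeat notifications while a hard problem persists.

```python
class ServiceState:
    """Tracks the soft/hard state type for one service (simplified)."""

    def __init__(self, max_check_attempts=3):
        self.max_check_attempts = max_check_attempts
        self.state = "OK"
        self.state_type = "HARD"
        self.attempt = 1

    def record(self, result):
        """Feed one check result; return True if a notification is due."""
        was_hard_problem = self.state != "OK" and self.state_type == "HARD"
        if result == "OK":
            # Recovery: notify only if the problem had reached a hard state.
            self.state, self.state_type, self.attempt = "OK", "HARD", 1
            return was_hard_problem
        if self.state == "OK":
            self.attempt = 1           # first failure after OK
        elif self.state_type == "SOFT":
            self.attempt += 1          # another recheck of a soft problem
        self.state = result
        if self.attempt >= self.max_check_attempts:
            self.state_type = "HARD"
            return not was_hard_problem  # notify once, on the soft-to-hard change
        self.state_type = "SOFT"
        return False
```

With three check attempts, two failed checks stay soft and silent, the third turns the problem hard and triggers a notification, and a later OK result triggers the recovery notification.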

Object Hierarchy
  • Implicit dependency
      - service depends on associated host
  • Host hierarchy
      - if parent is not OK, don't send notifications for children (hosts and services)
      - Unreachable state
      - e.g. router is parent for all hosts on a specific site
  • Service dependency
      - defines in which cases checks & notifications are performed
      - one host/service can depend on multiple hosts/services
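A minimal sketch of how a host hierarchy separates Down from Unreachable, assuming host check results and parent relations are available as plain dictionaries and that the parent graph has no cycles. The function names and the data layout are illustrative only.

```python
def classify_host(host, is_up, parents):
    """Return 'UP', 'DOWN' or 'UNREACHABLE' for `host`.

    is_up   : dict host -> bool, result of the host check
    parents : dict host -> list of parent hosts (empty for top-level hosts)
    """
    if is_up[host]:
        return "UP"
    host_parents = parents.get(host, [])
    if not host_parents:
        return "DOWN"
    # If at least one parent is UP, the path works and the host is at fault.
    if any(classify_host(p, is_up, parents) == "UP" for p in host_parents):
        return "DOWN"
    return "UNREACHABLE"


def should_notify(host, is_up, parents):
    """Only a host that is itself DOWN produces a notification here."""
    return classify_host(host, is_up, parents) == "DOWN"
```

When a site router fails, only the router is reported; the hosts behind it become Unreachable and stay silent.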

Dynamic Operations
  • Modifying monitoring & notification behavior
      - acknowledging problems
      - enabling/disabling notifications
      - enabling/disabling active checks
  • Executing sensors
      - individual service
      - all services on a single host
  • Scheduling downtimes
  • Achieved via web interface or pipeline
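Assuming "pipeline" here refers to the Nagios external command file (a named pipe read by the Nagios daemon), dynamic operations can be scripted by writing one line per command in the form `[timestamp] COMMAND;arg;arg`. The sketch below formats two such commands. The helper names, the flag values and the file path passed to `submit` are assumptions for illustration.

```python
import time


def external_command(name, *args, now=None):
    """Format one line for the Nagios external command file."""
    timestamp = int(time.time() if now is None else now)
    fields = ";".join(str(a) for a in args)
    return f"[{timestamp}] {name};{fields}\n"


def acknowledge_service(host, service, author, comment, now=None):
    # Field order: host; service; sticky; notify; persistent; author; comment.
    # The three flag values used here are illustrative.
    return external_command("ACKNOWLEDGE_SVC_PROBLEM",
                            host, service, 1, 1, 0, author, comment, now=now)


def schedule_host_downtime(host, start, end, author, comment, now=None):
    # Field order: host; start; end; fixed; trigger_id; duration; author; comment.
    return external_command("SCHEDULE_HOST_DOWNTIME",
                            host, start, end, 1, 0, end - start,
                            author, comment, now=now)


def submit(line, command_file):
    """Append a command line to the command file (a named pipe on a server)."""
    with open(command_file, "a") as pipe:
        pipe.write(line)
```

The location of the command file depends on the installation, so it is passed in rather than hard-coded.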

Web Interface
  • Viewing current information, history and reports
  • Performing dynamic operations
  • Generating reports
      - availability, problem trends & history
  • Supports authentication & authorization (AA)
      - per host/service authorization

Other Features
  • Event handling
      - enables automatic failure recovery
  • Active vs. passive checks
      - active: controlled by Nagios
      - passive: submitted by other systems or another Nagios instance
  • Distributed deployment
      - multiple Nagios servers
      - individual instance submits results as passive checks to the central server
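A passive check is a result computed elsewhere and handed to Nagios. One common route is the `PROCESS_SERVICE_CHECK_RESULT` external command, whose fields are host, service description, return code and plugin output. The sketch below only builds that command line; the function name and the choice to flatten multi-line output are assumptions, and delivery to the central server is left out.

```python
import time

RETURN_CODES = {"OK": 0, "WARNING": 1, "CRITICAL": 2, "UNKNOWN": 3}


def passive_service_result(host, service, status, output, now=None):
    """Build the external command that hands a finished result to Nagios."""
    timestamp = int(time.time() if now is None else now)
    code = RETURN_CODES[status]
    # Keep the plugin output on a single line, one command per line.
    one_line = " ".join(output.splitlines())
    return (f"[{timestamp}] PROCESS_SERVICE_CHECK_RESULT;"
            f"{host};{service};{code};{one_line}\n")
```

In a distributed deployment, each regional instance would produce lines like this for the central server instead of the central server running the checks itself.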

Grid monitoring with Nagios

History
  • CRO-GRID Infrastructure
      - since mid 2005
      - covers several grid middleware (Globus Toolkit Pre-WS & WS, UNICORE, NWS, etc.)
      - event handlers for automatic recovery
  • Monitoring Central European (CE) core services
      - since mid 2006
  • Monitoring all CE sites for 1st line support
      - since September 2006
  • Also used for certification
      - with forced check

Deployment
  • Centralized deployment
  • Single Nagios server deployed @ SRCE
      - URL: http://cs-egee.srce.hr/nagios
  • Monitoring statistics
      - 65 hosts
      - 480 services
  • Nagios server statistics (last month)

Supported Node Types

  Node type    Number of services
  BDII         1
  CE           8
  LFC          2
  MON          3
  PROX         2
  RB           7
  SE           9
  VOMS         4
  WMS          7

Nagios Basic Sensors

  Sensor        Used interval   Description
  check_ftp     15 min          checks FTP server; used for GridFTP ping
  check_http    15 min          checks HTTP server; used for checking Tomcat on MON and VOMS
  check_ldap    15 min          checks LDAP server for defined base DN; used for checking BDII, Globus MDS and GridICE
  check_tcp     15 min          checks defined TCP port; used for DPNS ping

Developed Sensors

  Sensor                 Used interval   Description
  CA distribution        1 day           checks CA distribution version
  Certificate lifetime   1 day           uses GridFTP or HTTPS to fetch server certificate & verifies lifetime
  DPNS                   1 hour          lists /dpm directory and looks for the remote server's domain
  EDG Broker             1 hour          submits a test job, waits for the job to finish, fetches and verifies the output
  Gatekeeper ping        15 min          performs authorization only
  Gatekeeper hostname    1 hour          executes hostname and verifies the output
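The verification half of a certificate lifetime sensor can be sketched as follows. It assumes the certificate's `notAfter` field has already been fetched as a string in the usual OpenSSL text form; the warning and critical thresholds are illustrative values, not those used in the presentation.

```python
import ssl
import time


def certificate_status(not_after, warn_days=30, crit_days=7, now=None):
    """Map a certificate's notAfter string to a Nagios-style (code, message)."""
    now = time.time() if now is None else now
    expires = ssl.cert_time_to_seconds(not_after)  # e.g. "Jan  5 09:34:43 2018 GMT"
    days_left = int((expires - now) // 86400)
    if days_left < 0:
        return 2, f"CRITICAL - certificate expired {-days_left} day(s) ago"
    if days_left < crit_days:
        return 2, f"CRITICAL - certificate expires in {days_left} day(s)"
    if days_left < warn_days:
        return 1, f"WARNING - certificate expires in {days_left} day(s)"
    return 0, f"OK - certificate valid for {days_left} more day(s)"
```

Fetching the certificate over GridFTP or HTTPS is a separate step and is omitted here.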

Developed Sensors (continued)

  Sensor            Used interval   Description
  Gatekeeper LRMS   2 hours         executes command through LRMS
  GridFTP           1 hour          transfers file to remote computer and back and compares copies
  LFC               15 min          lists /grid directory
  Match list        1 hour          CE: matches CE against multiple RBs; RB: compares number of matches with data from BDII
  MyProxy           15 min          creates proxy certificate, gets the proxy info and destroys it

Developed Sensors (continued)

  Sensor               Used interval   Description
  SRM ping             15 min          performs SRM ping with glite-srm-ping
  SRM                  1 hour          transfers file to remote computer and back and compares copies
  VOMS Proxy           15 min          creates VOMS proxy for given VO
  VOMS Gridmap         1 hour          creates gridmap file for given VO and reports number of users
  WMS                  1 hour          same as EDG Broker, uses glite-job-*
  WMProxy delegation   15 min          delegates proxy to WMProxy
  WMProxy              1 hour          same as EDG Broker, uses glite-job-wms-*

Sensor Hierarchy
  • Host hierarchy
      - router @ SRCE is parent to all hosts
  • Parent services
      - lightweight
      - more frequent (15 min)
  • Child services
      - heavyweight & complex
      - less frequent (1 hour)
  • Less overhead on monitored objects!

  Parent Service        Child Service
  DPNS ping             DPNS list
  Gatekeeper ping       Gatekeeper hostname
                        Gatekeeper LRMS
  GridFTP ping          GridFTP transfer
  CA Distribution       GridFTP transfer
  Tomcat                Certificate lifetime
  SRM ping              SRM transfer
  VOMS Tomcat           VOMS Gridmap
  WMProxy delegation    WMProxy
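The overhead saving from the parent/child split can be sketched as a gate in front of the heavyweight checks: a child sensor runs only if its lightweight parent last reported OK. The mapping below uses a subset of the pairs from the slide; the function and the data layout are hypothetical, and a parent with no recorded status is treated as not OK.

```python
# Parent -> children, a subset of the pairs listed on the slide.
CHILDREN = {
    "DPNS ping": ["DPNS list"],
    "Gatekeeper ping": ["Gatekeeper hostname"],
    "SRM ping": ["SRM transfer"],
    "WMProxy delegation": ["WMProxy"],
}


def checks_to_run(due_children, parent_status):
    """Split due child checks into those to run and those to skip.

    parent_status maps a parent service name to its last status string.
    A child whose parent is not "OK" is skipped.
    """
    parent_of = {c: p for p, kids in CHILDREN.items() for c in kids}
    run, skip = [], []
    for child in due_children:
        parent = parent_of.get(child)
        if parent is None or parent_status.get(parent) == "OK":
            run.append(child)
        else:
            skip.append(child)
    return run, skip
```

A failed SRM ping therefore suppresses the hourly SRM transfer test, which would only fail more slowly and more expensively.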

Complex Sensors
  • Case when one service (target) depends on another service (mediator)
      - e.g. submitting a job through a grid scheduler to a specific CE, storing a file through LFC to SE
  • Sensor can use any available mediator service
  • We developed a Nagios interface for retrieving the list of available mediators
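Mediator selection might look like the sketch below. It assumes the list of candidate mediators and their last known states are available as plain Python data. The presentation's actual Nagios interface for retrieving mediators is not described, so every name here is hypothetical.

```python
def pick_mediator(candidates, status, preferred=None):
    """Return a usable mediator, or None if none is available.

    candidates : ordered list of mediator names (e.g. resource brokers)
    status     : dict mediator -> last known status string
    """
    available = [m for m in candidates if status.get(m) == "OK"]
    if preferred in available:
        return preferred
    return available[0] if available else None


def check_target_via_mediator(target, candidates, status, run_check):
    """Run `run_check(mediator, target)`; UNKNOWN when no mediator is usable."""
    mediator = pick_mediator(candidates, status)
    if mediator is None:
        return 3, f"UNKNOWN - no mediator available to test {target}"
    return run_check(mediator, target)
```

Returning Unknown rather than Critical when no mediator is usable keeps a broken mediator from being blamed on the target.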

Configuration Management
  • GOCDB
  • Static configuration
      - e.g. nodes which are not in GOCDB, special contacts
  • BDII
      - retrieving site-specific data (e.g. queue names, ports)
  • Commands
      - more site-specific data
      - e.g. check_ping, check_ldap

GOCDB Integration
  • Site information
      - nodes
      - node types
      - site BDII
      - site contact
      - Site Admins (for web interface authorization)
  • Scheduled downtimes
      - data pulled 3 times a day
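Generating Nagios configuration from site data can be sketched as simple templating. The record layout below (a site with a name and a list of nodes, each with a hostname and node type) is an assumption and not the GOCDB schema. The template name `grid-host` and the `<site>-admins` contact group are likewise hypothetical.

```python
def host_definition(node, site, template="grid-host"):
    """Render one Nagios host object from a site-database record."""
    lines = [
        "define host {",
        f"    use             {template}",
        f"    host_name       {node['hostname']}",
        f"    alias           {node['type']} at {site['name']}",
        f"    address         {node['hostname']}",
        f"    hostgroups      {site['name']},{node['type']}",
        f"    contact_groups  {site['name']}-admins",
        "}",
    ]
    return "\n".join(lines) + "\n"


def site_configuration(site):
    """Concatenate host definitions for every node of a site."""
    return "\n".join(host_definition(n, site) for n in site["nodes"])
```

Regenerating these fragments on each pull keeps the monitored host list in step with the site database, while statically configured nodes stay in separate files.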

Web Interface
  • Authentication
      - we added certificate-based authentication
  • Authorization
      - admins can perform operations on their own sites only
      - region admin can perform operations on all sites
      - super admin can perform global Nagios operations
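The three authorization levels can be sketched as one decision function. The role names, the user record layout and the assumption that a super admin may also perform site-level operations are illustrative and are not taken from the actual web interface code.

```python
def may_operate(user, operation_scope, site=None):
    """Decide whether `user` may perform an operation.

    user            : dict with 'role' ('site', 'region' or 'super') and 'sites'
    operation_scope : 'site' for per-site operations, 'global' for Nagios-wide ones
    """
    role = user["role"]
    if role == "super":
        return True            # assumed to include site-level operations
    if operation_scope == "global":
        return False           # only the super admin acts globally
    if role == "region":
        return True            # any site in the region
    return role == "site" and site in user.get("sites", ())
```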

Demo slides

Future Work
  • Further sensor development
  • Passive checks
      - using other monitoring systems (e.g. Ganglia, Gstat)
  • Distributed deployment
      - Nagios per region/country
      - redundant servers
      - cluster for sensor execution

Thank You

Questions?

Links
  • CE Nagios monitoring site
      - http://cs-egee.srce.hr/nagios
  • CE Nagios documentation
      - http://egee.grid.cyfronet.pl/core-services/nagios
  • Nagios official web page
      - http://www.nagios.org