Monitoring a Control System Using Nagios Ralph Lange

Monitoring a Control System Using Nagios
Ralph Lange, BESSY – Mauro Giacchini, LNL
R. Lange, M. Giacchini: Monitoring a Control System Using Nagios

What is the Situation?
Machine Status vs. Controls Infrastructure Status
• Machine status:
  – usually handled in the Control Room by an operator
  – uses the Alarm Handler or other EPICS tools
  – based on Channel Access connections
• Control System infrastructure can be comparably complex; its status:
  – needs to be handled outside the Control Room
  – with tools that allow remote access
  – using different types of connections/checks: ping, SNMP, HTTP, Channel Access, disk usage, ...
BESSY was starting to see an increasing number of failures due to ageing hardware
One summer day Mauro (preparing an EPICS training in the hot Italian summer) asked me if I knew Nagios...

What is Nagios?
Nagios (“nah-ghee-ose”)
• Open source monitoring framework
  – widely used & actively developed: www.nagios.org
• Host and service problem detection and recovery
• Provides a wide set of basic plugins (checks)
  – easy to develop custom plugins
• Active vs. passive checks
• Centralized vs. distributed deployment
  – also allows redundant Nagios daemons
• High configurability
  – service dependencies, fine-grained notification options
• Web interface
  – status view, administration (e.g. analysis, downtime scheduling)

The Plugin (Check) Interface
Plugins (Checks)
• Checks are command line programs that follow a convention for arguments, stdout output, and return code: nagiosplugins.org
  – Output: one line of status info
  – Return code: OK / WARNING / CRITICAL / UNKNOWN
• Can be written in any (i.e. your favourite) compiled or interpreted language
• Are configured into Nagios for local or remote execution
Passive Checks
• An external application can write check results (following a certain format) into a file (or a pipe)
• Nagios reads from this and accepts the results (if configured)
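The plugin convention on this slide can be illustrated with a small Python sketch. It is not taken from the talk: the function names, the "DISK" label, and the thresholds are made up for illustration. The exit codes 0 to 3 follow the Nagios plugin convention, and `passive_result` formats a line in the `PROCESS_SERVICE_CHECK_RESULT` form that Nagios accepts through its external command file for passive checks.

```python
import time

# Exit codes defined by the Nagios plugin convention.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
STATE_NAMES = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def classify(value, warn, crit):
    """Map a measured value to a plugin state (higher value = worse)."""
    if value is None:
        return UNKNOWN
    if value >= crit:
        return CRITICAL
    if value >= warn:
        return WARNING
    return OK


def plugin_line(label, value, warn, crit):
    """Return (exit_code, one-line status text) as an active check would."""
    state = classify(value, warn, crit)
    shown = "no data" if value is None else f"{value:.1f}%"
    return state, f"{label} {STATE_NAMES[state]} - {shown}"


def passive_result(host, service, state, text, now=None):
    """Format one service check result for the Nagios external command file."""
    stamp = int(time.time() if now is None else now)
    return f"[{stamp}] PROCESS_SERVICE_CHECK_RESULT;{host};{service};{state};{text}"


def main():
    """Hypothetical active check: usage of the root file system in percent."""
    import shutil

    usage = shutil.disk_usage("/")
    percent = 100.0 * usage.used / usage.total
    code, line = plugin_line("DISK", percent, warn=80.0, crit=90.0)
    print(line)   # the one line of status info
    return code   # the plugin's return code
```

A real plugin script would end with `sys.exit(main())` so that Nagios sees the return code. A passive-check producer would instead append the string from `passive_result` to the command file configured in Nagios.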

Nagios + CA Plugin = NAL
Nagios Channel Access Plugins
• caget type plugin (active check) by Mauro Giacchini (LNL)
• camonitor type daemon (passive check) by Debby Quock (APS)
• Integrate data available through CA into the Nagios monitoring framework
• Can check the health of EPICS integrated VME crates, VME IOCs, soft IOCs, PLCs, CA gateways, CA archivers, ... as well as OPI machine and server health, disk status, network device status, NTP, DNS, web services etc.
• Allows NAL (Nagios Alarm Handler) to be the central monitoring system for all control system infrastructure, whereas the ALH in the control room provides similar functionality for the controlled facility
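To make the idea of a caget type active check concrete, here is a rough Python sketch. It is not the LNL plugin. It assumes the pyepics package for Channel Access, a numeric scalar PV, and "higher is worse" thresholds. The function names and the PV name used in the usage note are hypothetical.

```python
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
STATE_NAMES = ("OK", "WARNING", "CRITICAL", "UNKNOWN")


def default_reader(pv_name, timeout):
    """Read one PV through Channel Access using pyepics (assumed installed)."""
    import epics  # imported lazily so the rest of the sketch runs without it

    return epics.caget(pv_name, timeout=timeout)


def check_pv(pv_name, warn, crit, timeout=2.0, reader=default_reader):
    """Return (exit_code, status_line) for a numeric PV; higher value = worse."""
    try:
        value = reader(pv_name, timeout)
    except Exception as exc:  # e.g. library missing or network error
        return UNKNOWN, f"CA UNKNOWN - {pv_name}: {exc}"
    if value is None:
        # pyepics returns None when the channel does not connect in time.
        return CRITICAL, f"CA CRITICAL - {pv_name} not connected"
    if value >= crit:
        state = CRITICAL
    elif value >= warn:
        state = WARNING
    else:
        state = OK
    return state, f"CA {STATE_NAMES[state]} - {pv_name} = {value}"
```

A wrapper script would call something like `check_pv("ioc1:LOAD", warn=0.8, crit=0.9)`, print the status line, and exit with the returned code. The `reader` parameter exists only so the logic can be tried without a running IOC.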

Current Configuration at BESSY
Servers
• All machines: ping, disk usage, load, processes, users, SSH
• Some: DNS (foreign and internal addresses), NTP
vxWorks IOCs
• Ping, CPU load, memory usage, FD usage
Services
• Wikis, web server, help pages, issue trackers (Trac/Redmine), elog
• Oracle servers: Ping, ODB Telnet, ODB TNS for important DBs
=> 296 checks on 111 hosts

[Screen shot] Tactical Overview

[Screen shots, two slides] Service Detail

[Screen shot] Availability Report

[Screen shot] Service Trends

Firefox/Thunderbird Plugin
• Highly configurable, many filtering options
• New alarm starts blinking and may play sound
• Mouse-over opens a pop-up showing the current alarms
• Clicking an alarm opens the related Nagios page in a tab

Experiences
Nagios is a very stable and reliable framework, configuration is flexible, options and plugins are many
Off control room, web based, email notification approach fits our controls group better than ALH
Manual configuration can be tedious, some parts could (should!) be generated from our RDB
Found some network problems, one running system clock, two disks filling up, IOC load and memory saturation on a number of mv162s (which were replaced by mv2100s)
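As a sketch of what generating parts of the configuration could look like, the following Python renders Nagios host and service object definitions from a list of rows. Everything here is an assumption for illustration: the row layout stands in for whatever the RDB would deliver, the host name and address are placeholders, and the `generic-host` / `generic-service` templates and the `check_ping` command are taken to be defined as in the Nagios sample configuration.

```python
HOST_TEMPLATE = """define host {{
    use        generic-host
    host_name  {name}
    address    {address}
}}
"""

SERVICE_TEMPLATE = """define service {{
    use                  generic-service
    host_name            {name}
    service_description  {description}
    check_command        {command}
}}
"""


def render_host(name, address, services):
    """Render one host definition followed by its service definitions."""
    parts = [HOST_TEMPLATE.format(name=name, address=address)]
    for description, command in services:
        parts.append(
            SERVICE_TEMPLATE.format(name=name, description=description, command=command)
        )
    return "\n".join(parts)


def render_config(rows):
    """Render all rows of (name, address, [(description, command), ...])."""
    return "\n".join(render_host(*row) for row in rows)


# Hypothetical rows, standing in for the result of a database query.
EXAMPLE_ROWS = [
    ("ioc-example", "192.0.2.10", [("PING", "check_ping!100.0,20%!500.0,60%")]),
]
```

Calling `render_config(EXAMPLE_ROWS)` returns text that could be written to a file in a directory Nagios is configured to read, so the generated objects stay separate from hand-maintained ones.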

Next Steps
To be configured:
• Soft IOCs, CA Gateways, VME crates (Wiener), Embedded Controllers
• NFS shares usage, switches/routers, printers
Checks to be written:
• Conserver (IOC console access)
• CA Archiver (through ArchiveManager web interface)
• CA access rights (based on cainfo)
Collaborate:
• Integrate CA check plugin development
• Agree on a common place for our plugins (APS? Sourceforge? Nagios?)

LivEPICS Example
Live Example: Mauro Giacchini's LivEPICS distribution includes Nagios 3.0 (configured to look at the EPICS Base example app channels)
Go check it out – now!