Nagios in the Real World
Dave Williams, Technical Architect

  • Slides: 54

Agenda
©Bull, 2011

Agenda
- Introduction
  - General Background
  - System Monitoring Background
- UK Customer Examples
  - Example Implementations of Nagios
- Datacentre Monitoring with Nagios
  - What is a Datacentre?
  - Software & Hardware combinations
  - Vision
- Conclusions

Background
- UK based
  - Mainframe (IBM & Honeywell)
  - Unix (HP-UX, AIX, Solaris)
  - Network (CASE, 3Com, Cisco)
- Working for Bull
  - French computer manufacturer
  - Mainframes, Unix, HPC, Security, Managed Services

Background
- System Monitoring
  - OpenView
  - NetView
  - Open Master
- Open Source Monitoring
  - NetSaint on AIX
  - Nagios

Example Implementations

Crown Office Procurator Fiscal Service
- Responsible for the prosecution of crime in Scotland
- Investigation of suspicious deaths
- Complaints against the Police
- IT locations in Glasgow & Edinburgh
  - Windows at every Court of Justice in Scotland
  - AIX / Oracle DB at Glasgow & Edinburgh

Crown Office Procurator Fiscal Service
- Already used SolarWinds for some network monitoring
- Strategy demanded AIX-based monitoring & reporting
- Nagios selected in a competitive tender
  - Main success points were simplicity and ease of customisation
  - Fitted within the AIX-based distance data replication already in use

Crown Office Procurator Fiscal Service
- 60+ Windows systems monitored for CPU, disk space etc.
- 2 AIX servers monitored for CPU, disk space etc.
- Two Oracle instances monitored for performance and DB space usage
- All alerts shown on monitor screen and, if necessary, sent as SMS text alerts
- Installed 2005, still working
- Provides a 'backstop' to SolarWinds for capacity monitoring on the WAN & LAN
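
CPU and disk-space checks of this kind follow the usual Nagios plugin convention: print one status line and finish with return code 0 (OK), 1 (WARNING), 2 (CRITICAL) or 3 (UNKNOWN). The following is a minimal sketch in Python, not the checks actually deployed at COPFS. The function names `classify` and `check_disk` and the 80% / 90% thresholds are assumptions made for illustration; the slides do not give the values used.

```python
import shutil

# Standard Nagios plugin return codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
STATE_NAMES = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def classify(used_pct, warn_pct, crit_pct):
    """Map a 'higher is worse' percentage to a Nagios state."""
    if used_pct >= crit_pct:
        return CRITICAL
    if used_pct >= warn_pct:
        return WARNING
    return OK


def check_disk(path="/", warn_pct=80.0, crit_pct=90.0):
    """Return (return_code, status_line) for the filesystem holding `path`.

    The default thresholds are illustrative assumptions, not values from the slides.
    """
    try:
        usage = shutil.disk_usage(path)
    except OSError as exc:
        return UNKNOWN, f"DISK UNKNOWN - {exc}"
    if usage.total == 0:
        return UNKNOWN, f"DISK UNKNOWN - {path} reports zero capacity"
    used_pct = 100.0 * usage.used / usage.total
    code = classify(used_pct, warn_pct, crit_pct)
    # Text after the '|' is performance data: label=value[UOM];warn;crit
    line = (f"DISK {STATE_NAMES[code]} - {used_pct:.1f}% used on {path}"
            f" | used={used_pct:.1f}%;{warn_pct};{crit_pct}")
    return code, line


code, line = check_disk("/")
print(line)  # a real plugin would finish with sys.exit(code)
```

Returning the code instead of exiting keeps the logic testable; a wrapper script would pass `code` to `sys.exit` so that Nagios sees the state.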

Rother District Council
- "Working with the community to improve the overall well-being of the District"
- Responsible for Waste Collection, Housing, Planning & Building Control
- The District covers some 200 square miles and serves a population of around 90,000 inhabitants

Rother District Council
- Monitoring 20+ Windows servers for CPU, disk utilisation etc.
- Monitoring numerous disparate applications
- Reporting on availability
- Monitoring printer status
- Unexpected benefits

North Yorkshire County Council
- Internet access system for 30,000 pupils
- Monitoring e-mail, internet access, IDS, AV, web servers
- Reporting on availability
- Monitoring Service Level Indicators
- Mix of application providers (Scalix, Plesk)
- Mix of appliance systems: Cisco, Panda, Radware, NetEnforcer, MyFilter

North Yorkshire County Council - System Schematic

North Yorkshire County Council
- Uses NRPE to perform active checks on hosts
  - Multi-O/S support
    • Debian
    • Red Hat
- Uses NSCA to accept check results from Windows
  - Via NagiosEventLog
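
With NSCA the monitored host submits results itself (a passive check) instead of being polled. The `send_nsca` client reads one result per line on standard input, with tab-separated fields: host name, service description, return code, plugin output. A minimal sketch of building such a line in Python follows; the function name `format_nsca_line` and the host and service names in the usage note are hypothetical, and this is not how NagiosEventLog itself is implemented.

```python
def format_nsca_line(host, service, return_code, output, delim="\t"):
    """Build one service check result in the layout send_nsca reads on stdin:
    host<TAB>service<TAB>return_code<TAB>plugin output<NEWLINE>."""
    if return_code not in (0, 1, 2, 3):
        raise ValueError("return_code must be 0 (OK), 1 (WARNING), 2 (CRITICAL) or 3 (UNKNOWN)")
    # The output must stay on one line and must not contain the delimiter,
    # so collapse every run of whitespace (tabs, newlines) to a single space.
    clean = " ".join(str(output).split())
    return delim.join([host, service, str(return_code), clean]) + "\n"


print(format_nsca_line("winserver01", "EventLog", 2, "Disk error in System log"), end="")
```

The resulting line would typically be piped into `send_nsca -H <central server> -c <config file>`, which encrypts it according to its configuration and forwards it to the NSCA daemon on the Nagios server.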

North Yorkshire County Council
- E-mail
  - Scalix running on Red Hat Cluster
  - Checking all processes, cluster state etc.
- PLESK web server
  - Checking availability of web sites via test installation
  - Monitoring disk utilisation and processor utilisation
- AV systems
  - Monitoring availability
  - Checking on AV database
- MyFilter
  - Monitoring email filters running
  - Checking that sufficient email filters are available

North Yorkshire County Council
- E-mail
  - Nagios server runs external email loopback test every 20 minutes to confirm external reachability
- PLESK web server
  - Straightforward implementation of check_http
- NetBackup
  - Monitoring that backups have run
  - Checking that enough backup tapes are available
- Business Availability
  - Define which services constitute a business line
  - 07:00 check: tell support before the customers come online
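
The "Business Availability" idea, defining which services constitute a business line and reporting one state for it, can be sketched as a simple roll-up. The slides do not say how NYCC implemented it, so everything below is an assumption: the business-line and service names are hypothetical, and the rule used (the line takes the worst state of its member services, ranked OK < WARNING < UNKNOWN < CRITICAL) is just one reasonable choice.

```python
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

# Assumed severity ranking for aggregation: OK < WARNING < UNKNOWN < CRITICAL.
SEVERITY = {OK: 0, WARNING: 1, UNKNOWN: 2, CRITICAL: 3}

# Hypothetical business lines and the service names each one depends on.
BUSINESS_LINES = {
    "pupil-email": ["scalix-smtp", "scalix-imap", "email-loopback", "mail-filter"],
    "pupil-web": ["plesk-http", "internet-access"],
}


def business_line_state(line, service_states):
    """Return the worst Nagios state among the services behind a business line."""
    members = BUSINESS_LINES[line]
    # A service with no recorded result counts as UNKNOWN.
    states = [service_states.get(name, UNKNOWN) for name in members]
    return max(states, key=SEVERITY.__getitem__)


current = {"scalix-smtp": OK, "scalix-imap": WARNING,
           "email-loopback": OK, "mail-filter": OK}
print(business_line_state("pupil-email", current))  # 1, i.e. WARNING
```

Run shortly before 07:00, a roll-up like this gives support one answer per business line before customers come online.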

NYCC - Nagiosgraph
- Uses process_performance_data
- Example of Unix load average
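
When process_performance_data is enabled, Nagios hands the text after the `|` in each plugin's output to the graphing tool. Each item has the form `label=value[UOM];warn;crit;min;max`. The sketch below shows how such a string can be split into metrics in Python; it is an illustration of the data format, not nagiosgraph's own parser, and the function name `parse_perfdata` and the sample load-average values are assumptions.

```python
import re
import shlex

# A number followed by an optional unit of measure (s, %, B, KB, c, ...).
_VALUE = re.compile(r"^(-?\d*\.?\d+)([A-Za-z%]*)$")


def parse_perfdata(perfdata):
    """Parse 'label=value[UOM];warn;crit;min;max' items into a dict keyed by label."""
    metrics = {}
    for item in shlex.split(perfdata):  # shlex copes with 'quoted labels'
        label, _, rest = item.rpartition("=")
        fields = rest.split(";")
        match = _VALUE.match(fields[0])
        if not label or not match:
            continue  # skip anything that is not well formed
        metrics[label] = {
            "value": float(match.group(1)),
            "uom": match.group(2),
            # Thresholds stay as text because they may be ranges such as "10:20".
            "warn": fields[1] if len(fields) > 1 else "",
            "crit": fields[2] if len(fields) > 2 else "",
        }
    return metrics


# Illustrative load-average performance data (values are made up).
sample = "load1=0.420;5.000;10.000;0; load5=0.310;4.000;6.000;0; load15=0.250;3.000;4.000;0;"
for name, metric in parse_perfdata(sample).items():
    print(name, metric["value"], metric["warn"], metric["crit"])
```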

NYCC - Nagios Monitoring - Scalix Email System

NYCC
- Alerts sent via email to customers as well as support
- Backup notifications via SMS text
- Use Nagios Looking Glass for customer view
- nagiosgraph used to capture all service performance data
  - Debian & Red Hat performance metrics
  - Network throughput from LAN switches
  - LDAP response time

Datacentre Monitoring with Nagios

What is a DataCentre?
- A data center (or datacentre) is a facility used to house computer systems and associated components, such as telecommunications and storage systems. It generally includes redundant or backup power supplies, redundant data communications connections, environmental controls and security devices. (Wikipedia)

How good is your DataCentre?
- The TIA-942: Data Center Standards Overview describes the requirements for the data centre infrastructure. The simplest is a Tier 1 data centre, which is basically a server room, following basic guidelines for the installation of computer systems. The most stringent level is a Tier 4 data centre, which is designed to host mission critical computer systems, with fully redundant subsystems and compartmentalized security zones controlled by biometric access controls methods. (Wikipedia)

What is a DataCentre?
- Tier 1 Requirements
  - Single non-redundant distribution path serving the IT equipment
  - Non-redundant capacity components
  - Basic site infrastructure guaranteeing 99.671% availability
- Tier 2 Requirements
  - Fulfills all Tier 1 requirements
  - Redundant site infrastructure capacity components guaranteeing 99.741% availability
- Tier 3 Requirements
  - Fulfills all Tier 1 and Tier 2 requirements
  - Multiple independent distribution paths serving the IT equipment
  - All IT equipment must be dual-powered and fully compatible with the topology of a site's architecture
  - Concurrently maintainable site infrastructure guaranteeing 99.982% availability
- Tier 4 Requirements
  - Fulfills all Tier 1, Tier 2 and Tier 3 requirements
  - All cooling equipment is independently dual-powered, including chillers and heating, ventilating and air-conditioning (HVAC) systems
  - Fault-tolerant site infrastructure with electrical power storage and distribution facilities guaranteeing 99.995% availability
©Uptime Institute
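
The availability percentages are easier to compare when converted to permitted downtime per year. The short Python sketch below does that arithmetic for the four tier figures quoted on the slide; the function name and the use of a 365-day year are assumptions made for illustration.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years

# Availability figures as quoted on the slide.
TIER_AVAILABILITY_PCT = {1: 99.671, 2: 99.741, 3: 99.982, 4: 99.995}


def downtime_hours_per_year(availability_pct):
    """Hours per year a site may be unavailable at the given availability."""
    return (100.0 - availability_pct) / 100.0 * HOURS_PER_YEAR


for tier, pct in sorted(TIER_AVAILABILITY_PCT.items()):
    print(f"Tier {tier}: {pct}% -> {downtime_hours_per_year(pct):.1f} hours/year")
```

This works out to roughly 28.8, 22.7, 1.6 and 0.4 hours of downtime per year for Tiers 1 to 4 respectively.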

What is a Green DataCentre?
- The most commonly used metric to determine the energy efficiency of a data centre is power usage effectiveness, or PUE. This simple ratio is the total power entering the data centre divided by the power used by the IT equipment.
  - PUE = Total Facility Power / IT Equipment Power
- Power used by support equipment, often referred to as overhead load, mainly consists of cooling systems, power delivery, and other facility infrastructure like lighting.
- The average data centre in the US has a PUE of 2.0, meaning that the facility uses one watt of overhead power for every watt delivered to IT equipment. State-of-the-art data centre energy efficiency is estimated to be roughly 1.2.
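
The PUE ratio and the "overhead per IT watt" reading of it can be sketched in a few lines of Python. The function names and the example power figures are assumptions for illustration; only the formula and the PUE values 2.0 and 1.2 come from the slide.

```python
def pue(total_facility_kw, it_equipment_kw):
    """Power usage effectiveness: total facility power / IT equipment power."""
    if it_equipment_kw <= 0:
        raise ValueError("IT equipment power must be positive")
    return total_facility_kw / it_equipment_kw


def overhead_per_it_watt(pue_value):
    """Watts of overhead (cooling, power delivery, lighting) per watt of IT load."""
    return pue_value - 1.0


# Example: a facility drawing 200 kW in total to run 100 kW of IT equipment.
example = pue(200.0, 100.0)
print(example, overhead_per_it_watt(example))  # 2.0 1.0
```

By the same arithmetic a PUE of 1.2 corresponds to about 0.2 W of overhead for each watt delivered to IT equipment.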

Bull Datacentre BC1?
- New datacentre build on an already existing site
- Design criteria
  - PUE 1.6
  - Easily expanded on demand
  - Tier 3

Bull UK Datacentre BC1 - What do you get for £1.2m?

Bull UK Datacentre BC1
- New mains incomer
  - Took feed from 11 kV ring
  - Had to build own substation
- 1.2 MW generator
  - Required 8000 litre fuel tank
  - Switchgear to automatically start generator if mains incomer fails (10-45 seconds)
- 3 x ambient CRAC units
  - Cooling via external temperature differential
  - N+1 configuration
  - Hot aisle containment
- In-line UPS
  - Only required to keep IT equipment running until generator fires up
  - Uses space in cab rows, easily scalable according to load

Bull UK Datacentre BC1 - Monitoring
- Physical Environment
  - APC Netbotz devices translate inputs from sensors
    • Humidity, Temperature, Dew Point
  - SEAL I/O dry contact voltage indicators
    • For CRAC, FM200, Generator, UPS
- Electrical Efficiency
  - PowerLogic ION software reads from power meters
  - Power meter on every distribution board
  - Real-time calculation of PUE
- Power Distribution
  - Every PDU strip (2 per cab) monitored for power consumption & problems
  - A number of PDU strips also have remote control down to socket level
- Management Network
  - LAN infrastructure required to support the datacentre
  - Servers required to support the datacentre
  - External alert mechanisms

Bull UK Datacentre BC1 - What does Netbotz look like?

Bull UK Datacentre BC1 - What does SeaLevel look like?

Bull UK Datacentre BC1 - What does ION look like?

Bull UK Datacentre BC1 - What does a metered PDU look like?

Bull UK Datacentre BC1 - What does a managed PDU look like?

Bull UK Datacentre BC1 - Nagios Map

Bull UK Datacentre BC1 - Nagios Host Groups

Bull UK Datacentre BC1 - Do things go wrong - yes

Bull UK Datacentre BC1 - Do things go wrong - yes & no

Datacentre Monitoring Schematic

Nagios Products in use
- Nagios Core
- NRPE
- NSCA
- Nagios Looking Glass
- NagVis
- EventDB
- SNMPTT
- Nagmap
- NDO

Other Open Source Products in use
- Nedi
- Arpwatch
- PSAD
- SMS-Client
- Bacula
- Confluence (Wiki)
- i-doit (ITIL CMDB)
- MRTG
- Routers2.cgi

BC1 Datacentre Monitoring Elements
- Nagios Core
  - Normal install with direct polling of devices
  - Only looking at Datacentre
- Nagios Customer System
  - Running on an appliance connected to Customer network
  - Sends data via encrypted secured link to Display System
- Nagios Display System
  - Central reporting Nagios
  - Absorbs updates from other Nagios instances
- Information Display
  - Normal system with 5 heads
- Backup System
  - Use tape library
  - Hosts CMDB & Wiki

BC1 Datacentre Nagios Core
- Hardware Platform - Intel
  - Xeon 2.8 GHz, 8 GB memory, 72 GB RAID-1 disk
- O/S CentOS 5
- Nagios 3.2.0
  - Built from source tarball
- Nagios Plugins 1.4.15-2
  - Installed from RPM

BC1 Datacentre Nagios Display System
- Hardware Platform - Intel
  - P4 2.8 GHz, 2.5 GB memory, 76 GB RAID-1 disk
  - Nvidia dual-monitor display card with DVI interfaces
- O/S Fedora Core 9
- Nagios 3.0.6
  - Built from source tarball
- Nagios Plugins 1.4.13-9
  - Installed from RPM

BC1 Datacentre Normal Display System
- Hardware Platform - AMD
  - Athlon 1.2 GHz, 1.0 GB memory, 3 GB disk
  - Matrox G200 Quad Head
- O/S CentOS 5
- Runs console displays: http/RDP/ssh

BC1 Datacentre Customer System
- Hardware Platform - Motion Tablet
  - Pentium M 1.5 GHz, 0.5 GB memory, 30 GB disk
  - Touch-screen tablet system
- O/S Ubuntu 10.04 LTS
- Nagios 3.2.3
  - Built from tarball
  - Sends status (encrypted) to central reporting system
- Nagios Plugins 1.4.15
- Nagios NSCA

BC1 Datacentre Backup System
- Hardware Platform - Intel
  - Xeon 3.06 GHz, 2.0 GB memory, 108 GB disk
- O/S CentOS 5
- Uses Bacula 5.0.3
  - Controls SDLT 20-slot tape library
  - Backs up all datacentre infrastructure
    • Windows
    • CentOS
    • Ubuntu

Conclusions

Conclusions
- Strategic Overall Design
  - Know what you need to monitor
  - Know who needs to be told
- Expect to throw the first version away
  - Only when you have fully engineered the solution will you understand all of the issues
  - Keep a record of design decisions
- You will have to make it pretty for management
  - Accept that an attractive display will be required
  - Reporting will become key
- It must be reliable
  - Make backups
  - Consider clustering & recovery options

& Hints

Hints & Experience
- Separate Display systems from Monitoring systems
  - If you are tracking tens of thousands of services you don't want processor-heavy graphics as well
- Escalation & Alerting take time
  - Firstly to get right with your organisation
  - Secondly to actually physically do!
- Don't give in: there is always a way to get Nagios involved
  - Suppliers go out of their way to make it difficult
  - Screen scrape, email, telnet, RS232 are all possible
- SNMP is your friend
  - When in doubt use SNMP to help you out
  - SNMP v3 with AES cipher is suitably secure for most implementations
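
As an illustration of the SNMP v3 hint, the sketch below builds a net-snmp `snmpget` command line that uses the authPriv security level with AES privacy. It is only a sketch under stated assumptions: the function name, the user name, the pass phrases, the address (from the documentation range 192.0.2.0/24) and the choice of SHA for authentication are all made up for the example; the slides specify only SNMP v3 with an AES cipher.

```python
import shlex


def snmpv3_get_command(host, oid, user, auth_pass, priv_pass):
    """Build a net-snmp `snmpget` argument list for SNMPv3 with
    authentication and AES privacy (security level authPriv)."""
    return [
        "snmpget",
        "-v", "3",              # SNMP version 3
        "-l", "authPriv",       # authenticate and encrypt
        "-u", user,             # security (user) name
        "-a", "SHA", "-A", auth_pass,   # auth protocol (assumed) and pass phrase
        "-x", "AES", "-X", priv_pass,   # privacy protocol and pass phrase
        host, oid,
    ]


# Hypothetical device and credentials; 1.3.6.1.2.1.1.3.0 is sysUpTime.0.
cmd = snmpv3_get_command("192.0.2.10", "1.3.6.1.2.1.1.3.0",
                         "nagios", "auth-secret", "priv-secret")
print(shlex.join(cmd))
```

Keeping the command as an argument list means it can be passed to `subprocess.run` without a shell, which avoids quoting problems with pass phrases.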
