Lemon Monitoring
Miroslav Siket, German Cancio, David Front, Maciej Stepniewski
CERN-IT/FIO-FS
LCG Operations Workshop, Bologna, 24-26 May 2005
Presented 25/05/2005

Outline
• Lemon
• Structure and design
• How it works, deployment
• Use cases, web interface
• Installation and setup
• Summary

Lemon – LHC Era Monitoring
• Lemon is a system containing tools for monitoring status and performance of computers:
  – Distributed monitoring system scalable to ~10k nodes
  – Provides active monitoring of software and hardware in the Computer Center on centrally managed clusters
  – Facilitates early error detection and problem prevention
  – Executes corrective actions and sends notifications
  – Provides persistent storage of the monitoring data
  – Offers a framework for further creation of sensors for monitoring
  – Site independent functionality
• Link: http://cern.ch/lemon
• Part of the ELFms toolsuite: http://cern.ch/elfms

Lemon Use
• It is used inside and outside CERN by:
  – System administrators, service managers, people responsible for clusters
  – Developers and service/data challenges
  – Managers and general users
• Deployments outside CERN:
  – EDG testbeds
  – Accelerator (AB) department at CERN
  – CMS online
  – GridICE
  – BARC India (development partner)

Lemon architecture
[Figure: architecture diagram. Recoverable labels: Nodes, Sensor, Monitoring Agent, TCP/UDP, Monitoring Repository, Repository backend, SOAP, Correlation Engines, Lemon CLI, RRDTool / PHP, apache, HTTP, Web browser, User]

Components
• Lemon is a typical server/client application with the following components:
  – MSA – Monitoring Sensor Agent (Lemon Agent)
    • Daemon on a client machine that spawns multiple Monitoring Sensors to measure data at defined intervals and sends the data to the Monitoring Repository
  – MS – Monitoring Sensor
    • Uses a standard C++ or Perl API, so it is easy to write your own sensor
    • Several sensors exist for performance, process, hardware and software monitoring, grid VO job reporting, database monitoring, security, alarms (260 metrics in total)
  – MR – Monitoring Repository
    • Server application that receives samples and processes/validates them
    • Stores the full monitoring history data
    • Two implementations: flat files or Oracle DB based
  – LRF – Lemon RRD Framework
    • Pre-processes data into rrd files and creates cluster summaries
    • These are used for web graphics
    • Provides service and cluster overview in its web displays
  – LAG – Lemon Alarm Gateway
    • Generic gateway for alarms (in development)
    • Gateways to MonALISA and GridICE exist
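The agent/sensor split above can be illustrated with a small Python sketch. This is not Lemon's C++/Perl sensor API: the names Sensor, Agent and due_samples, and the metric names in the usage note, are hypothetical and only show the idea of measuring each metric at its own interval and collecting samples for a repository.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# One sample as (host, metric, timestamp, value); a hypothetical layout.
Sample = Tuple[str, str, float, float]


@dataclass
class Sensor:
    """Hypothetical sensor: one metric measured every `interval` seconds."""
    metric: str
    interval: float
    measure: Callable[[], float]
    last_run: float = float("-inf")  # never run yet


class Agent:
    """Hypothetical agent: polls its sensors and gathers the samples due."""

    def __init__(self, host: str, sensors: List[Sensor]) -> None:
        self.host = host
        self.sensors = sensors

    def due_samples(self, now: float) -> List[Sample]:
        """Measure every sensor whose interval has elapsed at time `now`."""
        samples: List[Sample] = []
        for sensor in self.sensors:
            if now - sensor.last_run >= sensor.interval:
                samples.append((self.host, sensor.metric, now, sensor.measure()))
                sensor.last_run = now
        return samples
```

A real agent would call something like due_samples in a loop and ship the result to the repository over TCP/UDP; here, an agent built with a 10-second "load" sensor and a 60-second "uptime" sensor would return both samples on the first call and only "load" ten seconds later.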

Lemon at CERN
• Lemon monitors about 2200 computers in ~100 clusters
• On average it collects about 70 metrics from each host
• Integrated with Sure alarm system
• Collecting about 1.5 GB/day
[Figure labels: LEAF (LHC-Era Automated Fabric) for high-level intervention scheduling; Node Configuration Management; Node Management]
Configuration
• Derived from the Quattor Configuration Database (CDB)
• Individual configuration per cluster/host
• Hierarchical structure
Alarm system
• Sure – legacy system receiving alarms from Lemon
• Integration with new LASER system (LHC alarm system) via LAG is ongoing
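For a sense of scale, the figures on this slide can be combined into a back-of-the-envelope estimate. The sketch below assumes that 1 GB means 10**9 bytes and that the 1.5 GB/day is spread evenly over hosts and metrics; neither assumption comes from the slides, and the function name is made up for illustration.

```python
def daily_volume_per_host_and_metric(hosts, metrics_per_host, bytes_per_day):
    """Return (bytes per host per day, bytes per metric per host per day)."""
    per_host = bytes_per_day / hosts
    per_metric = per_host / metrics_per_host
    return per_host, per_metric


# Approximate figures quoted on the slide; 1 GB taken as 10**9 bytes (assumption).
per_host, per_metric = daily_volume_per_host_and_metric(2200, 70, 1.5e9)
print(f"{per_host / 1e3:.0f} kB per host per day")               # prints "682 kB per host per day"
print(f"{per_metric / 1e3:.1f} kB per metric per host per day")  # prints "9.7 kB per metric per host per day"
```

Since the inputs are rounded ("about 2200", "about 70", "about 1.5"), the result is an order-of-magnitude figure only.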

Web interface
• Cluster view displays accumulated statistics and status for all machines in the cluster
• Host view gives overview of the host status with basic metrics
• Other views available:
  – Rack view
  – Hardware type view
  – Other views can be added, working on user defined views
• With the newest version (to be released soon):
  – Generic entry page displaying status overview of the key services
  – Configurable views
• In development: database services monitoring with database specific view

Use(ful) case
• Kernel upgrade
  – Kernel version is "measured" on the boot of the machine
  – Automatic tools for upgrading the kernel on a cluster retrieve information from Lemon and schedule reboot of a machine based on this info
  – Web interface allows monitoring of the progress
[Figure: reboot occurrence history graph]
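The kernel upgrade flow can be sketched in Python as follows. This is only an illustration of the idea: the slides do not show the actual tools or how they query Lemon, so the input dictionary (host name to kernel version measured at boot), the function names, and the batching of reboots are all assumptions.

```python
def hosts_needing_reboot(measured_kernels, target_kernel):
    """Hosts whose kernel version measured at boot differs from the target."""
    return sorted(
        host for host, version in measured_kernels.items()
        if version != target_kernel
    )


def schedule_batches(hosts, batch_size):
    """Split hosts into reboot batches so a cluster is not drained at once.

    Batching is an assumption of this sketch, not something the slides state.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [hosts[i:i + batch_size] for i in range(0, len(hosts), batch_size)]
```

With three hosts still on the old kernel and a batch size of two, schedule_batches yields one batch of two hosts followed by one batch of one; progress could then be followed as hosts report the new version after reboot.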

Computer Center display
• Lemon Web Interface can be interfaced with a Computer Center database of objects (racks, silos, …)
• Provides search of objects as well as listing
• Interfaced through an XML-defined geometry of the computer center
• Generic design that can be used anywhere:

    <?xml version="1.0" ?>
    <CC>
      <ROOM ID="0513-S-0034" DESCRIPTION="Tape Vault" R="0" G="0" B="0">
        <DOORS R="0" G="255" B="0">
          <DOOR X="63" Y="39" LX="64" LY="39" />
          <DOOR X="34" Y="0" LX="36" LY="0" />
        </DOORS>
        <RACKS R="0" G="0" B="203">
          <RACK ID="EA01" X="73" Y="9" LX="75" LY="10" PLANNED="0"/>
          <RACK ID="EA03" X="73" Y="8" LX="75" LY="9" PLANNED="0"/>
        </RACKS>
        <WALLS R="0" G="0" B="0">
          <WALL X="0" Y="0" LX="0" LY="60" />
          <WALL X="0" Y="0" LX="76" LY="0" />
        </WALLS>
        <STEPS R="255" G="163" B="0">
          <STEP X="47" Y="36" LX="52" LY="37" />
          <STEP X="47" Y="37" LX="52" LY="38" />
        </STEPS>
      </ROOM>
    </CC>
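A geometry file of this kind can be read with Python's standard xml.etree.ElementTree module. The sketch below uses a trimmed, cleaned-up copy of the sample and is not Lemon's own (PHP-based) loader. The slide does not explain what X, Y, LX and LY mean, so the code only converts them to integers; treating PLANNED="1" as "planned but not installed" is an assumption, as is the function name.

```python
import xml.etree.ElementTree as ET

# Trimmed, cleaned-up copy of the sample geometry from the slide.
GEOMETRY = """<?xml version="1.0" ?>
<CC>
  <ROOM ID="0513-S-0034" DESCRIPTION="Tape Vault" R="0" G="0" B="0">
    <RACKS R="0" G="0" B="203">
      <RACK ID="EA01" X="73" Y="9" LX="75" LY="10" PLANNED="0"/>
      <RACK ID="EA03" X="73" Y="8" LX="75" LY="9" PLANNED="0"/>
    </RACKS>
  </ROOM>
</CC>
"""


def load_racks(xml_text):
    """Return one dict per RACK element, tagged with the ID of its room."""
    root = ET.fromstring(xml_text)
    racks = []
    for room in root.iter("ROOM"):
        for rack in room.iter("RACK"):
            racks.append({
                "room": room.get("ID"),
                "rack": rack.get("ID"),
                # Coordinates are kept as plain integers; their exact
                # meaning is not given in the slide.
                "coords": tuple(int(rack.get(k)) for k in ("X", "Y", "LX", "LY")),
                # Assumption: "1" marks a rack that is planned, not installed.
                "planned": rack.get("PLANNED") == "1",
            })
    return racks


for entry in load_racks(GEOMETRY):
    print(entry["room"], entry["rack"], entry["coords"])
```

The same loop over DOOR, WALL or STEP elements would recover the rest of the room outline.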

Service challenges, GRID VOs
• Lemon allows for
  – Virtual clusters
    • clusters defined on request by service managers
    • or defined by scripts – updated dynamically on demand
    • or defined for specific purpose
    • Examples: Alice MDC, network challenges, …
  – Clusters defined dynamically
    • example: hosts running GRID jobs on the batch cluster belonging to the given Virtual Organization
    • hooks in Lemon for defining any dynamic grouping of hosts
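The dynamic grouping in the last example can be pictured with a few lines of Python. The slides do not describe the hook interface, so this sketch simply assumes that some script can obtain (host, VO) pairs for the jobs currently running; the function name and the host and VO names in the usage note are made up.

```python
from collections import defaultdict


def group_hosts_by_vo(running_jobs):
    """Build one dynamic cluster per Virtual Organization.

    `running_jobs` is an iterable of (host, vo) pairs, one per running job.
    A host running jobs for several VOs appears in each of their clusters.
    """
    clusters = defaultdict(set)
    for host, vo in running_jobs:
        clusters[vo].add(host)
    # Sort for a stable, readable result.
    return {vo: sorted(hosts) for vo, hosts in clusters.items()}
```

Re-running such a function whenever the job list changes gives the "updated dynamically on demand" behaviour: for jobs [("lxb001", "alice"), ("lxb002", "cms"), ("lxb001", "cms")] the clusters would be alice = [lxb001] and cms = [lxb001, lxb002].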

Automatic recovery actions and Alarms
• Alarm Sensor
  – For defined values of measured metrics an actuator is called with predefined action
  – An example: ssh daemon dead – action /sbin/service sshd start
  – Definition: metric X, field Y <op> reference value Z => call actuator
    • <op> can be ==, <, >, regexp, range, etc.
    • If success log only, else call action up to max times
  – Each occurrence is logged in the Monitoring Repository
  – Already about 70 predefined alarms with automatic recovery actions
  – After first month of deployment it reduced number of problem tickets by half
• Correlation engine (CMDaemon)
  – Allows 'global' correlations, and in the future client/server alarms and recovery actions
• Lemon Alarm gateway (LAG)
  – Lemon's LAG can be used to feed alarms into arbitrary alarm systems (under development)
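The alarm definition "metric X, field Y <op> reference value Z => call actuator" can be sketched in Python as below. This is one reading of the slide, not Lemon's implementation: the class and function names are hypothetical, a sample is assumed to be a plain dict of fields, and "call action up to max times" is interpreted as re-checking the metric after each actuator call and stopping once the alarm condition clears.

```python
import operator
import re
from dataclasses import dataclass

# Comparison operators named on the slide; "etc." is left unimplemented.
OPS = {
    "==": operator.eq,
    "<": operator.lt,
    ">": operator.gt,
    "regexp": lambda value, pattern: re.search(pattern, str(value)) is not None,
    "range": lambda value, bounds: bounds[0] <= value <= bounds[1],
}


@dataclass
class AlarmRule:
    metric: str          # metric X
    field_name: str      # field Y
    op: str              # one of the keys of OPS
    reference: object    # reference value Z
    max_attempts: int = 3

    def triggered(self, sample):
        """True when the sampled field matches the alarm condition."""
        return OPS[self.op](sample[self.field_name], self.reference)


def handle_alarm(rule, read_sample, actuator, log):
    """Evaluate a rule; on alarm, run the actuator up to max_attempts times."""
    if not rule.triggered(read_sample()):
        return "ok"
    log.append(f"alarm raised on {rule.metric}")
    for attempt in range(1, rule.max_attempts + 1):
        actuator()  # e.g. a wrapper that runs /sbin/service sshd start
        if not rule.triggered(read_sample()):
            log.append(f"recovered after {attempt} attempt(s)")
            return "recovered"
    log.append("recovery failed")
    return "failed"
```

For the sshd example, a rule such as AlarmRule("sshd", "running", "==", 0) would fire when the daemon count drops to zero, and the log list stands in for the entries that Lemon records in the Monitoring Repository.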

Installation and setup (I)
Lemon installation consists of three steps:
1. Server installation
2. Client installation
3. Web interface installation

1. Server installation:
  – Install edg-fabricMonitoring-server rpm ("flat file" server)
  – Configure receiving port in /etc/edg-fmon-server.conf
  – Start the server daemon
2. Client installation:
  – Install edg-fabricMonitoring-agent rpm (comes with default metric configuration)
  – Configure server and its port in /etc/edg-fmon-agent.conf
  – Start the client daemon on all monitored hosts

Installation and setup (II)
3. Web interface installation
  – Install and start apache server (with php) on your server
  – Install rrdtool and lrf (lemon rrd framework) rpms
  – Configure your clusters in clusters.conf file and start lemonmrd daemon
• Drink Champagne… you have Lemon up and running! ;-)
  – You can do all this on your laptop!
• Possible additional components:
  – Computer center synoptic view through xml file
  – Problem tracking system integration (through php plug-in to your DB/application)
  – Quattor CDB configuration view – through CDB xml profiles
  – Oracle based Repository (for very large installations with high scalability and increased functionality)
  – Other, new components are easy to add
• View detailed instructions at: http://cern.ch/lemon/doc/installation.html

Summary
• Lemon provides monitoring information about the farms in Computer Centers (or your laptop).
• Lemon provides a framework for recovery actions and alarms.
• Lemon is easy to install (…and it is easy to add your own metrics and visualize them).
• It is flexible with respect to your needs: you can add clusters and views, and specify your own definitions of virtual and dynamic clusters.
• It has been a useful tool for general monitoring of performance and also for system administrators in debugging problems.
• For more information check http://cern.ch/lemon