MONITORING WITH MONALISA
Costin Grigoras <costin.grigoras@cern.ch>


MONITORING WITH MONALISA
-What is MonALISA?
-MonALISA communication architecture
-Monitoring modules
-ApMon
-Data storage model
-MonALISA clients
-AliEn monitoring architecture
-The Repository
-Automatic actions
-Putting it all together
-Useful links

MONALISA
What is MonALISA?
-Caltech project started in 2002: http://monalisa.caltech.edu/
-Java-based set of distributed, self-describing services
-Offers the infrastructure to collect any type of information
-Can process it in near real time
-The services can cooperate in performing the monitoring tasks
-Can act as a platform for running distributed user agents

MONALISA COMMUNICATION ARCHITECTURE
MonALISA software components and the connections between them
[Diagram: layered architecture, top to bottom]
-Clients, HL services: data consumers
-Proxies: multiplexing layer, helps firewalled endpoints connect
-Agents, MonALISA services: data gathering services
-Network of JINI-Lookup Services (secure & public): registration and discovery
Fully distributed system with no single point of failure

MONALISA COMMUNICATION ARCHITECTURE
Subscriber/notification paradigm
[Diagram labels: ML Service (Agents, Filters/Triggers, Data Store, Monitoring Modules); Lookup Service (Registration, Discovery); ML Client applications (Predicates & Agents, Data via ML Proxy); Configuration Control (SSL)]
-Monitoring modules: push or pull, depending on device
-Dynamic loading
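
The subscriber/notification paradigm can be illustrated with a minimal Python sketch. This is not the MonALISA API: the `Service` class, the `farm/cluster/node/parameter` key layout and the use of shell-style wildcards as "predicates" are all assumptions made for illustration only.

```python
from fnmatch import fnmatchcase


class Service:
    """Toy stand-in for a monitoring service that notifies subscribers (hypothetical)."""

    def __init__(self):
        # Each subscription is a (pattern, callback) pair.
        self._subscriptions = []

    def subscribe(self, pattern, callback):
        """Register interest in every series whose key matches the wildcard pattern."""
        self._subscriptions.append((pattern, callback))

    def publish(self, farm, cluster, node, parameter, value):
        """Deliver a new value to all subscribers whose pattern matches its key."""
        key = f"{farm}/{cluster}/{node}/{parameter}"
        for pattern, callback in self._subscriptions:
            if fnmatchcase(key, pattern):
                callback(key, value)


received = []
service = Service()
# The client states once what it wants; after that the service pushes matching values.
service.subscribe("CERN/*/*/load1", lambda key, value: received.append((key, value)))
service.publish("CERN", "Master", "host1", "load1", 0.5)
service.publish("CERN", "Master", "host1", "mem_used", 1024.0)
```

After the two `publish` calls only the `load1` value has reached the subscriber; the client never polls, it just declares a predicate and receives updates.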

MONITORING MODULES
MonALISA service includes many modules; easily extendable
The service package includes:
-Local host monitoring (CPU, memory, network traffic, processes and sockets in each state, LM sensors, APC UPSs), log file tailing
-SNMP generic & specific modules
-Condor, PBS, LSF and SGE (accounting & host monitoring), Ganglia
-Ping, tracepath, traceroute, pathload and other network-related measurements
-Ciena, optical switches
-Calling external applications/scripts that return the values as their output
-XDR-formatted UDP messages (such as ApMon)
New modules can be added by implementing a simple Java interface.
Filters can also be defined to aggregate data in new ways.
The service can also react to the monitoring data it receives; the actions it can take are covered later.
MonALISA can run code as distributed agents:
-Used by VRVS/Evo to maintain the tree of connections between reflectors
-[...] optical paths between two network endpoints for particular data transfers

APMON
Embeddable APplication MONitoring library
-ApMon is a collection of libraries in various languages (C, C++, Java, Perl, Python), all offering a simple API for sending monitoring information
-Based on the XDR open format of data packing (efficient packing of the values)
-Implemented over UDP to minimize the impact on the monitored application
-Allows applications to send particular values and also provides local host monitoring
-The Perl version is used by AliEn to send monitoring information from each service and JobAgent to the site-local MonALISA instance
-The C/C++ implementations are used in ROOT (TMonalisaWriter), Proof and Xrootd
-CMS also uses ApMon to send monitoring information from all jobs to a single point
-Can also be used stand-alone, in a wrapper application that loops forever, sending the default host monitoring parameters plus any other interesting values (like the status of some services)
-It can be configured by API, by a local configuration file or by the URL of one
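
To make the "XDR packing over UDP" idea concrete, here is a minimal Python sketch. It does not reproduce the actual ApMon wire format or API: the message layout (cluster, node, parameter count, then name/value pairs) and the function names are assumptions. Only the XDR encoding rules used (big-endian integers and doubles, strings as a 4-byte length followed by data padded to a multiple of 4 bytes) follow the XDR standard.

```python
import socket
import struct


def xdr_string(text: str) -> bytes:
    """Encode a string XDR-style: 4-byte big-endian length, data, zero padding to 4 bytes."""
    data = text.encode("utf-8")
    padding = (4 - len(data) % 4) % 4
    return struct.pack(">I", len(data)) + data + b"\x00" * padding


def xdr_double(value: float) -> bytes:
    """Encode a double as 8 bytes, big-endian IEEE 754."""
    return struct.pack(">d", value)


def pack_message(cluster: str, node: str, params: dict) -> bytes:
    """Hypothetical layout: cluster, node, parameter count, then (name, value) pairs."""
    parts = [xdr_string(cluster), xdr_string(node), struct.pack(">i", len(params))]
    for name, value in params.items():
        parts.append(xdr_string(name))
        parts.append(xdr_double(float(value)))
    return b"".join(parts)


def send_message(payload: bytes, host: str, port: int) -> None:
    """Fire-and-forget UDP send: no connection setup, no acknowledgement, no retry."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))


payload = pack_message("MyCluster", "worker01", {"cpu_usage": 42.5, "jobs_running": 3})
```

A call such as `send_message(payload, host, port)` would then ship the datagram to whatever collector address is configured. Because UDP neither blocks on nor waits for the receiver, a slow or unreachable collector costs the monitored application almost nothing, at the price of possibly losing individual messages.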

DATA STORAGE MODEL
Shared codebase between components
-MonALISA keeps a memory buffer for a minimal monitoring history
-In addition, data can be kept in configurable database structures
-The service keeps one week of raw data and one month of averaged values
-The client creates three averaged structures
-Parallel database backends can be used to increase performance and reliability
[Diagram: a request is answered at the highest resolution available along the time axis. Volatile storage: memory buffer. Persistent storage (DB): short term, high resolution; medium term, lower resolution; long term, low resolution]
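
The multi-resolution idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not MonALISA code: the function names and the one-hour memory buffer are hypothetical, while the one-week raw and one-month averaged retention periods are the service values quoted above.

```python
from statistics import mean


def downsample(samples, bin_seconds):
    """Average (timestamp, value) samples into fixed-width time bins.

    Returns a list of (bin_start, mean_value) pairs sorted by time.
    """
    bins = {}
    for timestamp, value in samples:
        bin_start = timestamp - timestamp % bin_seconds
        bins.setdefault(bin_start, []).append(value)
    return [(start, mean(values)) for start, values in sorted(bins.items())]


def pick_store(stores, query_start, now):
    """Return the highest-resolution store whose retention still covers query_start.

    `stores` is a list of (name, retention_seconds), ordered from highest to
    lowest resolution; the last store is used when nothing else reaches far enough back.
    """
    age = now - query_start
    for name, retention in stores:
        if age <= retention:
            return name
    return stores[-1][0]


DAY = 86400
stores = [
    ("memory", 3600),          # assumed size of the volatile buffer
    ("raw", 7 * DAY),          # one week of raw data
    ("averaged", 30 * DAY),    # one month of averaged values
]
raw = [(0, 1.0), (30, 3.0), (60, 5.0)]
averaged = downsample(raw, 60)
chosen = pick_store(stores, query_start=10 * DAY - 2 * DAY, now=10 * DAY)
```

Here `averaged` holds one mean per 60-second bin, and a query reaching two days back is served from the raw store because the memory buffer no longer covers it.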

MONALISA CLIENTS
Two important clients
GUI client
-Interactive exploring of all the parameters
-Can plot history or real-time values
-Customizable history query interval
-Subscribes to those particular series and updates the plots in real time
Storage client (aka Repository)
-Subscribes to a set of parameters and stores them in database structures suitable for long-term archival
-Is usually complemented by a web interface presenting these values
-Can also be embedded in another controlling application
WebServices clients
-Limited functionality: they lack the subscription mechanism

ALIEN MONITORING ARCHITECTURE
Monitoring follows the general AliEn layout: one service per site collects and aggregates site-local monitoring information
[Diagram: data flows from the instrumented components, through ApMon, to the site MonALISA services, and from there as aggregated data to the central repository]
-Senders (each via ApMon): AliEn Job Agents, AliEn CE, AliEn SE, Cluster Monitor, AliEn Brokers, AliEn TQ, AliEn Optimizers, AliEn IS, API Services, MySQL Servers, CastorGrid Scripts, MyProxy status, LCG Tools
-Example parameters: run time, rss, vsz, cpu time, cpu ksi2k, disk used, free space, open files, sockets, processes, load, net in/out, active sessions, migrated mbytes, nr. of files, job status, jobs status, job slots, queued Job Agents
-Collectors: MonALISA at each AliEn site, MonALISA LCG Site, MonALISA @CERN
-MonALISA Repository with Long History DB: http://pcalimonitor.cern.ch/
-Outputs: Alerts, Actions

THE REPOSITORY
MonALISA repository for ALICE


THE REPOSITORY
Monitoring the ALICE Grid: some numbers
-85 sites defined
-9000 worker nodes
-14000 parallel jobs
-25 central machines
-More than 1.1M parameters
-Storing only aggregated data where possible, we have reached >35000 active parameters in the database
-We store new values at ~100 Hz
-15000 dynamic pages / day
-1 client/web server + 3 database instances
-170 GB of history

AUTOMATIC ACTIONS
What to do with this monitoring data?
[Diagram: ML Services receive monitoring data (traffic, jobs, hosts, apps) and sensor readings (temperature, humidity, A/C power, ...); each ML Service takes local decisions and triggers actions, global ML Services take global decisions; connections use SSL]
Decisions taken upon:
-Absence/presence of some parameter(s)
-Values above/below predefined thresholds
-Arbitrary correlations between values
-User-defined code
Action types:
-Notifications
-RSS/Atom feeds
-Annotation of charts with the events
-Calling external programs
-Logging
-Running custom code, for example in ALICE:
  -restart site services
  -maintain DNS-based load balancing
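
The first two decision types (absence/presence of a parameter, values above/below a threshold) can be sketched as a small rule evaluator in Python. The rule format, parameter names and action names below are hypothetical and do not reflect how MonALISA actions are actually configured.

```python
def evaluate(latest, rules):
    """Return the names of the actions triggered by the latest parameter values.

    `latest` maps parameter names to their most recent value; a missing key
    means no data arrived for that parameter.
    Each rule is (parameter, kind, threshold, action), kind in {"absent", "above", "below"}.
    """
    triggered = []
    for parameter, kind, threshold, action in rules:
        value = latest.get(parameter)
        if kind == "absent":
            if value is None:
                triggered.append(action)
        elif value is not None:
            if kind == "above" and value > threshold:
                triggered.append(action)
            elif kind == "below" and value < threshold:
                triggered.append(action)
    return triggered


rules = [
    ("temperature", "above", 30.0, "notify_operators"),
    ("free_space_gb", "below", 100.0, "notify_disk_full"),
    ("service_status", "absent", None, "restart_site_service"),
]
latest = {"temperature": 31.0, "free_space_gb": 500.0}
actions = evaluate(latest, rules)
```

With these inputs the temperature rule and the missing `service_status` rule fire, while the disk rule does not. In a real deployment each action name would map to one of the action types listed above, such as sending a notification or calling an external program.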

PUTTING IT ALL TOGETHER
How to monitor your system
Step 0: basic monitoring support
-Install a MonALISA service instance or reuse the one installed by AliEn
Step 1: instrument the applications with monitoring calls
-ApMon covers both the host monitoring and sending your own parameters. Use a pointer to the actual configuration
Step 2: collect the data
-Install a Repository and collect only the parameters that you need
Step 3: present the data
-Learning from the past is important. Add annotations of changes
Step 4: react to events
-Start by sending emails when things don't work. Continue by automating the manual procedures
Step 5: go to a workshop in the Philippines.

USEFUL LINKS
Exploring the ML world can start from here
Main documentation page
-http://monalisa.caltech.edu/
Experiment repositories
-ALICE: http://pcalimonitor.cern.ch/
-PANDA: http://mlr2.gla.ac.uk:7001/
Cluster monitoring repositories
-GSI: http://lxgrid2.gsi.de:8080/
-Muenster: http://gridikp.uni-muenster.de:8080/
-CAF: http://pcalimonitor.cern.ch/stats?page=CAF/machines

2Q || !2Q ?
Thanks a lot for your attention! Any questions?
