IT Monitoring Service Status and Progress
Alberto AIMAR, CERN-IT, for the MONIT Team
Outline
• Monitoring Overview
• Demo

General Questions
• Architecture and data flow
• Main features and how to use them (visualizations, processing, etc.)
• Status, plans and main topics/issues
9/25/2020

Architecture and Data Flow
Current Monitoring
Monitoring areas: Infrastructure Monitoring, Job Monitoring, Data mgmt and transfers, Data Centres Monitoring
• Data Sources: Metrics Manager, Lemon Agent, XSLS, ATLAS Rucio, FTS Servers, DPM Servers, XROOTD Servers, CRAB2, CRAB3, WM Agent, Farmout, Grid Control, CMS Connect, PANDA WMS, ProdSys, Nagios, VOFeed, OIM, GOCDB, REBUS
• Transport: Flume, AMQ, Kafka, GLED, HTTP Collector, SQL Collector, MonALISA Collector, HTTP GET, HTTP PUT
• Storage & Search: HDFS, ElasticSearch, Oracle
• Processing & Aggregation: Spark, Hadoop Jobs, GNI, Oracle PL/SQL, ESPER, ES Queries
• Display and Access: Kibana, Jupyter, Zeppelin, Dashboards (ED), Real Time (ED), Accounting (ED), API (ED), SSB (ED), SAM3 (ED)
Unified Monitoring
• Data Sources: Metrics Manager, Lemon Agent, XSLS, ATLAS Rucio, FTS Servers, DPM Servers, XROOTD Servers, CRAB2, CRAB3, WM Agent, Farmout, Grid Control, CMS Connect, PANDA WMS, ProdSys, Nagios, VOFeed, OIM, GOCDB, REBUS
• Transport: Flume, AMQ, Kafka
• Storage & Search: Hadoop HDFS, ElasticSearch, InfluxDB
• Processing & Aggregation: Spark, Hadoop Jobs, GNI
• Data Access: Kibana, Grafana, Zeppelin, Jupyter (Swan)
Unified Monitoring Architecture
• Data Sources: FTS, Rucio, XRootD, Jobs, Lemon, syslog, app log, Your Data, …
• Transport: MONIT AMQ (AMQ), MONIT DB (DB), MONIT HTTP (HTTP feed), MONIT Logs (logs), MONIT Metrics (Lemon metrics); Flume Kafka sink; Kafka cluster (buffering) *; Flume sinks
• Processing: Data enrichment, Data aggregation, Batch Processing, Your Jobs
• Storage & Search: HDFS, ElasticSearch, Others (influxdb), …
• Data Access: Your Views, CLI, API
* Today: 200 GB/day, 12 h Kafka. Target: 500 GB/day, 48 h Kafka
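The "today" and "target" figures on this slide can be turned into a rough buffer size. The sketch below assumes that "12 h Kafka" and "48 h Kafka" mean the retention window of the Kafka buffer, and it ignores replication and compression; the function name is illustrative only.

```python
def kafka_buffer_gb(daily_volume_gb: float, retention_hours: float) -> float:
    """Raw data held in the buffer at steady state (no replication, no compression)."""
    return daily_volume_gb * retention_hours / 24.0

today = kafka_buffer_gb(200, 12)   # 200 GB/day kept for 12 h
target = kafka_buffer_gb(500, 48)  # 500 GB/day kept for 48 h

print(f"today:  {today:.0f} GB buffered")
print(f"target: {target:.0f} GB buffered ({target / today:.0f}x today)")
```

Under these assumptions the target is about ten times the buffered volume of today, before any replication factor is applied.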
Unified Data Sources
• Data Sources: FTS, Rucio, XRootD, Jobs, Lemon, syslog, app log, collectd, LogStash, Your Data
• Entry points: MONIT AMQ (AMQ), MONIT DB (DB), MONIT HTTP (HTTP feed), MONIT Log (logs), MONIT Lemon Metrics (Lemon metrics), MONIT Metrics (http)
• Transport: Flume Kafka sink
• Data is channeled via Flume, validated and modified if necessary
• Adding new Data Sources is documented and fairly simple
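As an illustration of "validated and modified if necessary", here is a minimal sketch of what such a step could look like for a JSON document. The required field names (`producer`, `type`, `timestamp`) are assumptions made for this example, not the actual MONIT schema, and the real validation runs inside Flume rather than in Python.

```python
import json
import time

# Hypothetical required fields; the real schema may differ.
REQUIRED_FIELDS = ("producer", "type", "timestamp")

def validate(raw: str):
    """Return a cleaned document, or None if the input must be rejected."""
    try:
        doc = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(doc, dict):
        return None
    # "Modified if necessary": fill in a missing timestamp (ms since epoch).
    doc.setdefault("timestamp", int(time.time() * 1000))
    if any(field not in doc for field in REQUIRED_FIELDS):
        return None
    return doc

print(validate('{"producer": "fts", "type": "transfer"}'))  # accepted, timestamp added
print(validate('{"producer": "fts"}'))                      # rejected: no "type"
```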
Unified Processing
• Transport: Flume Kafka sink, Kafka cluster (buffering), Flume sinks
• Processing: Your Jobs (e.g. enrich FTS transfer metrics with WLCG topology from AGIS/GOCDB)
• Proven useful many times
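The enrichment example on this slide amounts to a lookup join. The sketch below shows the idea in plain Python; the topology table, host names and field names are hypothetical, and a production job would run on Spark with topology derived from AGIS/GOCDB.

```python
# Hypothetical topology table keyed by host name.
TOPOLOGY = {
    "se01.example.org": {"site": "SITE-A", "tier": 1},
    "se02.example.org": {"site": "SITE-B", "tier": 2},
}

def enrich(transfer: dict, topology: dict = TOPOLOGY) -> dict:
    """Return a copy of the transfer record with site and tier added for both ends."""
    enriched = dict(transfer)
    for side in ("src", "dst"):
        info = topology.get(transfer.get(f"{side}_host"), {})
        enriched[f"{side}_site"] = info.get("site", "UNKNOWN")
        enriched[f"{side}_tier"] = info.get("tier")
    return enriched

record = {"src_host": "se01.example.org", "dst_host": "se99.example.org", "bytes": 1024}
print(enrich(record))
```

Hosts missing from the topology are kept and marked `UNKNOWN` rather than dropped, so gaps in the topology stay visible downstream.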
Data Processing
Stream processing
• Data enrichment: join information from several sources (e.g. WLCG topology)
• Data aggregation
  - Over time (e.g. summary statistics for a time bin)
  - Over other dimensions (e.g. compute a cumulative metric for a set of machines hosting the same service)
• Data correlation: advanced alarming, detecting anomalies and failures by correlating data from multiple sources (e.g. data centre topology-aware alarms)
Batch processing
• Reprocessing, data compression, reports
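Aggregation over time and over another dimension can be sketched as grouping by (time bin, service). The sample tuples and the 5-minute bin are assumptions for illustration; the slides name Spark as the actual processing engine.

```python
from collections import defaultdict
from statistics import mean

def aggregate(samples, bin_seconds=300):
    """Group (timestamp, service, value) samples by time bin and service."""
    bins = defaultdict(list)
    for timestamp, service, value in samples:
        bin_start = timestamp - timestamp % bin_seconds
        bins[(bin_start, service)].append(value)
    return {
        key: {"count": len(vals), "mean": mean(vals), "max": max(vals), "sum": sum(vals)}
        for key, vals in sorted(bins.items())
    }

samples = [
    (0, "fts", 10), (120, "fts", 30),  # same 5-minute bin, same service
    (310, "fts", 5),                   # next bin
    (20, "xrootd", 7),                 # other service, first bin
]
for key, stats in aggregate(samples).items():
    print(key, stats)
```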
Monitoring Processing Platform
Technologies:
• Reliable and scalable job execution (Spark)
• Job orchestration and scheduling (Marathon/Chronos)
• Lightweight and isolated deployment (Docker)
Your Jobs
Unified Access
• Storage & Search: HDFS (HDFS sinks), ElasticSearch (ES sinks), … (other sinks, e.g. InfluxDB)
• Data Access: Your Views, Plots, Reports, Scripts, CLI, API
• Multiple data access methods (dashboards, notebooks, CLI)
• Mainstream and evolving technology
Data and Visualization
Data Storage and Search
• ElasticSearch: short-term storage and index (months, depends on data)
• InfluxDB: short-term time series storage (months)
• HDFS: long-term archive (years)
Visualization
• Kibana: data from ElasticSearch; dashboards and full search/filter/discovery of data
• Grafana: data from ElasticSearch, InfluxDB; dashboards optimized for time series plots
• Zeppelin: data from HDFS, ElasticSearch, InfluxDB; notebooks for analysis, reports and plots; native support for Spark
• Swan (Jupyter): data from HDFS or ElasticSearch, InfluxDB; notebooks for analysis, reports and plots; integration with HEP toolbox (ROOT, CERNBOX, CVMFS, etc.)
Status
Data Sources and Transport
• Moving all data via new transport (Flume, AMQ, Kafka)
Storage and Search
• Data in ES and HDFS, identical format
Processing
• Doing aggregation and processing via Spark
Display and reports
• Using only standard features of ES, Kibana, Spark, Hadoop
• Introduce notebooks (Zeppelin, Swan) and data discovery
Services Proposed
Monitor, collect, visualize, process, aggregate, alarm
• Metrics and Logs
Infrastructure operations and scale
Helping and supporting
• Interfacing new data sources
• Developing custom processing, aggregations, alarms
• Building dashboards and reports
Reference and Contact
• Kibana Dashboards: monit.cern.ch
• Feedback/Requests (SNOW): cern.ch/monit-support
• Early-Stage Documentation: cern.ch/monitdocs

Examples
• Kibana
• Grafana
• Zeppelin
http://monit.cern.ch - Kibana homepage
Data Available
- FTS
- XRootD
- Job Monitoring RealTime
- Job Monitoring Accounting
- SAM raw data
- ETF
Also on the page: ALICE, Examples, Documentation, Link to Custom Projects
Transfers Overview
Transfer failures over IPv6
Success and failures from a src_site or a domain
Investigate and filter for a specific error
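The filters applied in these Kibana views correspond to Elasticsearch queries. As a rough sketch, the function below builds a query DSL body that restricts to one source site, matches an error text and counts hits per destination. Apart from `src_site`, which appears on the slide, the field names (`timestamp`, `error_message`, `dst_site`) and the site value are assumptions, and no index name or endpoint is implied.

```python
import json

def failures_query(src_site: str, error_text: str, since: str = "now-24h") -> dict:
    """Build an Elasticsearch query DSL body (field names are assumptions)."""
    return {
        "size": 0,  # return only the aggregation, no individual hits
        "query": {
            "bool": {
                "filter": [
                    {"term": {"src_site": src_site}},
                    {"range": {"timestamp": {"gte": since}}},
                ],
                "must": [{"match_phrase": {"error_message": error_text}}],
            }
        },
        "aggs": {"by_destination": {"terms": {"field": "dst_site", "size": 10}}},
    }

print(json.dumps(failures_query("SITE-A", "connection timed out"), indent=2))
```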
Filter and then access the raw data and logs
Site Overview showing: FTS, Jobs, ETF
Grafana
Access several sources in the same dashboard (METER and MONIT)

Notebooks with Zeppelin
Extract data from HDFS or ES
Manipulate the data and plot with common languages and tools (Python, Scala, numpy)

Notebooks with Swan
Extract data from ES
ROOT, Python, C++, CVMFS
Sending Logs
- ES
- HDFS
OpenStack data
OpenStack: custom fields from logs
OpenStack dashboards
Job Monitoring: Real Time and Accounting
Filter by Site, User, cores, etc.
Site Overview showing: FTS, Jobs, SAM Raw (cern.ch hosts)
SAM Raw Data
Access to raw data: select by any of the fields
Important to know the structure of the data
21/07/2016 ASDF meeting
Data on service and hosts
Select cern.ch hosts
Select the errors (CRITICAL)
See all details of the errors
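The same selection can be applied to raw records once they are extracted, for instance in a notebook. This is a minimal sketch only: the record layout (`host`, `metric`, `status`) and the sample values are hypothetical, which is why knowing the real structure of the data matters.

```python
def critical_results(results, domain="cern.ch"):
    """Keep results whose status is CRITICAL and whose host is in the domain."""
    suffix = "." + domain
    return [
        r for r in results
        if r.get("status") == "CRITICAL" and r.get("host", "").endswith(suffix)
    ]

# Hypothetical raw test results.
results = [
    {"host": "node1.cern.ch", "metric": "hypothetical.probe", "status": "CRITICAL"},
    {"host": "node2.cern.ch", "metric": "hypothetical.probe", "status": "OK"},
    {"host": "node3.example.org", "metric": "hypothetical.probe", "status": "CRITICAL"},
]
for r in critical_results(results):
    print(r["host"], r["metric"], r["status"])
```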
User Documentation: cern.ch/monitdocs
Common fields
Naming Conventions
Access the raw data, then plot and select by any of the fields.
User Documentation: cern.ch/monitdocs
UMA – Overview of Security
• Components: data sources (FTS, Rucio, XRootD, Jobs, Lemon, syslog, app log, …); MONIT AMQ, MONIT DB, MONIT HTTP, MONIT Logs, MONIT Metrics; Flume Kafka sink; Kafka cluster (buffering) *; Flume sinks; data enrichment, data aggregation, batch processing; HDFS; ElasticSearch; others (influxdb); CLI, API
• Security mechanisms shown: IP filtering, ACLs, Krb AuthN + SSL, ACLs on Kafka topics, SSO