Monitoring Evolution and IPv6
Alberto AIMAR, IT-CM-MM
Outline
• Context
• Data Centres Monitoring
• Experiment Dashboards
• Architecture
• Plans
• Status
• Demo
Monitoring
Data Centre Monitoring
• Monitoring of DC at CERN and Wigner
• Hardware, operating system, and services
• Data Centres equipment (PDUs, temperature sensors, etc.)
• Used by service providers in IT, experiments
Experiment Dashboards
• Sites availability, data transfers, job information, reports
• Used by WLCG, experiments, sites and users
Both hosted by CERN IT, in different teams
Context
Focus for 2016
• Regroup monitoring activities hosted by CERN/IT (Data Centres, Experiment Dashboards, ETF, HammerCloud, etc.)
• Continue existing services
• Uniform with CERN IT practices
• Management of services, communication, tools (e.g. GGUS and SNOW tickets)
Starting with
• Merge Data Centres and Experiment Dashboards monitoring technologies
• Review existing monitoring usage and needs (IT, WLCG, etc.)
• Investigate new technologies
• Unchanged support while collecting user feedback
Unified Monitoring Architecture
[Diagram: data sources (Data Centres, WLCG), transport (Kafka), processing, storage/search, views]
Experiment Dashboards
[Diagram: dashboard areas and audiences. Areas: Job Monitoring (analysis + production, real time and accounting views), Data Management Monitoring (data transfer, data access), Infrastructure Monitoring (Site Status Board, SAM3), Outreach (Google Earth Dashboard). Audiences: operation teams, sites, users, general public. 300-500 users per day.]
Experiment Dashboards
• Job monitoring, sites availability, data management and transfers
• Used by experiments operation teams, sites, users, WLCG
Current Monitoring
[Diagram: separate pipelines for Data Centres Monitoring, Data mgmt and transfers, Job Monitoring, and Infrastructure Monitoring]
• Data Sources: Metrics Manager, Lemon Agent, XSLS, ATLAS Rucio, FTS Servers, DPM Servers, XROOTD Servers, CRAB2, CRAB3, WM Agent, Farmout, Grid Control, CMS Connect, PANDA WMS, ProdSys, Nagios, VOFeed, OIM, GOCDB, REBUS
• Transport: Flume, AMQ, Kafka, GLED, HTTP Collector, SQL Collector, MonALISA Collector, HTTP GET, HTTP PUT
• Storage & Search: HDFS, ElasticSearch, Oracle
• Processing & Aggregation: Spark, Hadoop Jobs, GNI, Oracle PL/SQL, ESPER, ES Queries
• Display & Reports: Kibana, Jupyter, Zeppelin, Dashboards (ED): Real Time, Accounting, API, SSB, SAM3
Unified Monitoring
[Diagram: one pipeline shared by all data sources]
• Data Sources: Metrics Manager, Lemon Agent, XSLS, ATLAS Rucio, FTS Servers, DPM Servers, XROOTD Servers, CRAB2, CRAB3, WM Agent, Farmout, Grid Control, CMS Connect, PANDA WMS, ProdSys, Nagios, VOFeed, OIM, GOCDB, REBUS, other
• Transport: Flume, AMQ, Kafka
• Storage & Search: Hadoop HDFS, ElasticSearch, other
• Processing & Aggregation: Spark, Hadoop Jobs, GNI, other
• Display & Reports: Kibana, Jupyter, Zeppelin, other
Status
Producers and Transport
• Moving all data via new transport (Flume, AMQ, Kafka)
Storage and Search
• Data in ES and Hadoop
Processing
• Doing aggregation and processing via Spark
Display and reports
• Experimenting using only the standard features of ES, Kibana, Spark, Hadoop
• Introduce notebooks and data discovery
General
• Selecting technologies, learning on the job, looking for expertise
• Evolve interfaces (e.g. dashboards for users, shifters, sites, managers)
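To make the "aggregation and processing" step concrete, here is a minimal sketch in plain Python. It is only a stand-in for the Spark jobs mentioned on the slide, not the actual implementation. The record fields (`site`, `timestamp`, `bytes`) and the function name `aggregate_transfers` are hypothetical.

```python
from collections import defaultdict


def aggregate_transfers(records, window_seconds=3600):
    """Count transfers and sum bytes per (site, window start).

    Assumption: each record is a dict with hypothetical keys
    'site', 'timestamp' (seconds since epoch) and 'bytes'.
    """
    totals = defaultdict(lambda: {"transfers": 0, "bytes": 0})
    for record in records:
        # Round the timestamp down to the start of its time window.
        window_start = record["timestamp"] - record["timestamp"] % window_seconds
        bucket = totals[(record["site"], window_start)]
        bucket["transfers"] += 1
        bucket["bytes"] += record["bytes"]
    return dict(totals)
```

The same grouping key (site plus time window) would carry over to a streaming framework; only the execution engine changes.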
IPv6 and Monitoring for WLCG
[Diagram: Data Sources column of the unified architecture (Metrics Manager, Lemon Agent, XSLS, ATLAS Rucio, FTS Servers, DPM Servers, XROOTD Servers, CRAB2, CRAB3, WM Agent, Farmout, Grid Control, CMS Connect, PANDA WMS, ProdSys, Nagios, VOFeed, OIM, GOCDB, REBUS)]
We are confident that there are no major issues:
• No major changes vs the check in 2013
• Evolution to the new architecture will take IPv6 into account
• Using the mainstream technologies, very little code of our own
Data sources
• Relies on external systems providing monitoring data
• Depends on data provided by FTS, Rucio, Panda, CRAB3, Xrootd, etc.
• MonALISA is external, used by ALICE and other projects (tested by ML devs)
Transport
• Receives data via AMQ/Stomp, Flume, UDP, databases and HTTP sources
• It is a matter of staying up to date with IPv6-ready versions
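One simple readiness check for a transport endpoint is to see which IP families its name resolves to. The sketch below uses only the Python standard library. The helper name `address_families` is hypothetical, and the check covers name resolution only, not whether the service actually accepts IPv6 connections.

```python
import socket


def address_families(host, port=None):
    """Return the set of IP families ('IPv4', 'IPv6') that host resolves to.

    Returns an empty set when the name cannot be resolved.
    """
    names = {socket.AF_INET: "IPv4", socket.AF_INET6: "IPv6"}
    try:
        infos = socket.getaddrinfo(host, port, proto=socket.IPPROTO_TCP)
    except socket.gaierror:
        return set()
    # Each entry is (family, type, proto, canonname, sockaddr).
    return {names[family] for family, *_ in infos if family in names}
```

A broker or collector host that returns only `{"IPv4"}` would need a AAAA record, in addition to an IPv6-ready software version, before IPv6-only nodes could reach it.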
IPv6 and Monitoring for WLCG
[Diagram: Transport (Flume, AMQ, Kafka), Storage & Search (Hadoop HDFS, ElasticSearch, other), Processing & Aggregation (Spark, Hadoop Jobs, GNI, other), Display & Reports (Kibana, Jupyter, Zeppelin, other)]
Storage
• Storing mostly the host names only, as strings
• In a few cases the current Experiment Dashboards may store WN IP and will be fixed in the migration
• ElasticSearch has an IPv4 data type, but not IPv6 at the moment. Will come.
Display and reports
• Only showing IPv4 and IPv6 hosts, names as strings
• Web applications can easily be made reachable by IPv6 nodes, actions will be needed (just like any other web server)
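Where a record carries an address rather than a host name, one option is to normalise the value to a canonical string before storing it, so that IPv4 and IPv6 values are handled the same way. The following is a sketch of that idea with the standard `ipaddress` module, not the migration code itself. The function name `normalise_host` and the example values are hypothetical.

```python
import ipaddress


def normalise_host(value):
    """Return (kind, text), where kind is 'ipv4', 'ipv6' or 'hostname'.

    Addresses are returned in their canonical compressed form;
    anything that is not an IP address is treated as a host name.
    """
    text = value.strip()
    # Accept bracketed IPv6 literals such as "[2001:db8::1]".
    candidate = text[1:-1] if text.startswith("[") and text.endswith("]") else text
    try:
        address = ipaddress.ip_address(candidate)
    except ValueError:
        return ("hostname", text.lower())
    return (f"ipv{address.version}", address.compressed)
```

Storing the canonical string keeps the field usable as plain text, while the `kind` tag leaves room to switch to a native address type later.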
Plans
Unified architecture and technologies
• Focus on migrating to common architecture
• Review the existing architecture, areas and data
Update to new technologies in several areas
• Better performance and new versions with new features and major improvements
• Look into technologies as needed (collectd, Kafka, Grafana, etc.)
• Benefit from experience and feedback received from Experiments, WLCG and IT groups
Move to central services
• Central service for ES is being created, InfluxDB for time series (DBoD)
• Continue to use central Hadoop services
Continue with standard operations and upgrades
• At least for all 2016
• Make available the new monitoring platform, in parallel with the existing ones
Conclusions
• No major changes vs. what was reported in 2013
• Mainstream technologies benefit from community effort and/or official support for IPv6 readiness
• Evolution to the new architecture is used to review the whole monitoring data
• IPv6 is one of the reviews we will do
• No specific issues for monitoring detected
Demo
Data in ElasticSearch
• FTS and Xrootd transfers data
Examples of Dashboards
• Data discovery and error investigation
Specific views for specific tasks
• VO overview
• Site manager
Data Centres Monitoring
Monitoring Technologies
Area | Services and Components | Technologies | Functions
Data Collectors | Metric Manager | CERN | Metrics registration
Data Collectors | Lemon Agent | CERN + Flume | Metrics producers (about 15000)
Data Collectors | Gateway | Flume | Transport host metrics
Data Collectors | XSLS | Flume | Service metrics
Transport | Messaging | ActiveMQ | Messaging of metrics
Transport | River | Kafka | Streaming of metrics
Aggregation | Foz | Spark | Processing streamed metrics
Archive | HDFS | Hadoop | Long term storage
Displays | Meter | ElasticSearch + Kibana | Dashboard for metrics
Displays | Timber | ElasticSearch + Kibana | Dashboard for logs
Displays | Meter Proxy | CERN | CLI and HTTP Interface to ES
Alerts | GNI | CERN | Alarms handling
- Slides: 18