CWG 10: Control, Configuration and Monitoring
Status and plans for Control, Configuration and Monitoring
16 December 2014
ALICE O2 Asian Workshop 2014 @ Pusan

Outline
▶ Motivation
▶ A brief overview of data taking operations
▶ Lessons learned from Run 1
▶ CCM Overview
▶ Performance tests
▶ Next steps
ALICE O2 CWG 10 Control, Configuration and Monitoring | ALICE O2 Asian Workshop 2014

Motivation
▶ Why do we need a Control System?
  ▶ Start and stop processes
  ▶ Sequence of operations, synchronization
  ▶ External systems
  ▶ Automation
▶ Why do we need a Configuration System?
  ▶ Configure processes
▶ Why do we need a Monitoring System?
  ▶ Detect abnormal conditions
  ▶ Automation

Team
▶ CERN
▶ KMUTT, Thailand
  ▶ See next presentation by Khanasin for an update

A brief overview of data taking operations
▶ A typical LHC year
[Timeline figure, January to December: shutdown for maintenance, proton-proton collisions, heavy-ion collisions]
Disclaimer: current system, not O2

A brief overview of data taking operations
▶ A typical LHC fill (up to 30 hours)
[Timeline figure with three phases: beam injection, stable beams, beam dump]
• ALICE safe
• Detector calibration
• Prepare trigger configuration
• Partial ALICE READY
• Full ALICE READY
• Data taking (ideally a single run)
• Detector calibration
Disclaimer: current system, not O2

A brief overview of data taking operations
▶ A typical ALICE run
Start-of-Run
• Configure detector electronics
• Start online systems
• Store data taking conditions
Data taking
• Readout
• Event building
• Online data monitoring
• Online calibration data
End-of-Run
• Export data taking conditions and calibration data to Offline
• Stop online systems
Disclaimer: current system, not O2

A brief overview of data taking operations
▶ Run 1 SOR (Start-of-Run) sequence (high level)
[Figure: Run 1 SOR sequence diagram]
Disclaimer: current system, not O2

Lessons learned from Run 1 (2010-2013)
▶ Must be fast when changing run (Run 2: Fast SOR/EOR)
  ▶ More runs than expected
  ▶ Not everything needs to be restarted
▶ Must be flexible (Run 2: Pause and Recover)
  ▶ Not every problem needs to stop a run
▶ Must monitor everything (Run 2: MAD)
  ▶ Data flow monitoring

Control in O2 - Overview
▶ Process Management
  ▶ Start/stop processes
  ▶ Send commands to processes (CONFIGURE, PAUSE/RESUME, etc.)
  ▶ Estimated: O(100k) processes
▶ Task Management
  ▶ Ensure that actions are executed in the correct order
▶ Automation
  ▶ Automatically recover from errors
  ▶ Automatically react to internal events (e.g. need more EPNs) and external events (e.g. start of LHC collisions)
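The command handling and ordering described on this slide could be sketched as a small state machine. This is only an illustration, not the O2 design: the state names, the START and STOP commands, and the `ControlledProcess` and `run_sequence` helpers are assumptions; only CONFIGURE and PAUSE/RESUME are named on the slide.

```python
from enum import Enum


class State(Enum):
    # Hypothetical process states, chosen for illustration only.
    STANDBY = "standby"
    CONFIGURED = "configured"
    RUNNING = "running"
    PAUSED = "paused"


# Allowed transitions: (current state, command) -> next state.
# START and STOP are assumed commands; the slide names only
# CONFIGURE and PAUSE/RESUME.
TRANSITIONS = {
    (State.STANDBY, "CONFIGURE"): State.CONFIGURED,
    (State.CONFIGURED, "CONFIGURE"): State.CONFIGURED,  # reconfigure in place
    (State.CONFIGURED, "START"): State.RUNNING,
    (State.RUNNING, "PAUSE"): State.PAUSED,
    (State.PAUSED, "RESUME"): State.RUNNING,
    (State.RUNNING, "STOP"): State.CONFIGURED,
    (State.PAUSED, "STOP"): State.CONFIGURED,
}


class ControlledProcess:
    """A process that accepts commands only when its state allows them."""

    def __init__(self, name):
        self.name = name
        self.state = State.STANDBY

    def handle(self, command):
        key = (self.state, command)
        if key not in TRANSITIONS:
            raise ValueError(
                f"{self.name}: {command} not allowed in {self.state.name}"
            )
        self.state = TRANSITIONS[key]
        return self.state


def run_sequence(processes, commands):
    """Apply each command to every process before moving to the next command.

    This mimics a synchronization point: no process starts until all
    processes have been configured.
    """
    for command in commands:
        for proc in processes:
            proc.handle(command)
```

A call such as `run_sequence(procs, ["CONFIGURE", "START"])` configures every process before any of them starts, while a command that arrives in the wrong state raises an error instead of being silently applied.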

Control in O2 - Notes
▶ Includes processes from online and offline
▶ Must control both synchronous and asynchronous tasks
▶ Cannot be seen as a batch system
  ▶ Bound to external events (e.g. start of collisions)
  ▶ Sequence of operations, synchronization points
  ▶ Low latency very important

Configuration in O2 - Overview
▶ Configuration distribution
  ▶ Provide processes with needed configuration parameters
▶ Dynamic process (re)configuration
  ▶ Essential to achieve fast run transition
▶ O(1 GB) of configuration data

Monitoring in O2 - Overview
▶ Data collection and archival
  ▶ System monitoring (CPU, memory, I/O, etc.)
  ▶ Application monitoring (data rates, link backpressure, internal buffer status, etc.)
  ▶ O(600 kHz) of monitoring data
▶ Alarms and action triggering
  ▶ Support shift crew, experts
  ▶ Feedback to Control system

Monitoring in O2 - Notes
▶ Includes metrics from online and offline
▶ Includes both low and high frequency metrics
  ▶ Low: every 30 seconds, system metrics
  ▶ High: every second, link status
▶ Permanent storage will be the limiting factor
  ▶ No need to store everything, can filter "interesting" values
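One possible reading of filtering "interesting" values is a deadband filter, which stores a sample only when it differs from the last stored value by more than a threshold. The slides do not say which criterion is intended, so the function name, the threshold and the sample layout below are all assumptions made for illustration.

```python
def filter_interesting(samples, threshold):
    """Yield only samples that moved more than `threshold` since the last stored one.

    `samples` is an iterable of (timestamp, value) pairs. The first sample
    is always kept so that there is a baseline to compare against.
    """
    last_stored = None
    for timestamp, value in samples:
        if last_stored is None or abs(value - last_stored) > threshold:
            last_stored = value
            yield timestamp, value


# Hypothetical link-status readings taken once per second.
readings = [(0, 10.0), (1, 10.1), (2, 10.2), (3, 12.0), (4, 12.1)]
stored = list(filter_interesting(readings, threshold=0.5))
```

With these made-up readings only the samples at t=0 and t=3 would reach permanent storage, which shows how such a filter reduces the write rate without hiding large changes.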

Performance Tests: Control
▶ Tool: SMI (State Machine Interface)
▶ Setup:
  ▶ Level 0 SMI domain: Partition CCM
  ▶ Level 1 SMI domain: Detector CCMs, EPN Cluster CCM
  ▶ Level 2 SMI domain: FLP CCMs, EPN CCMs
  ▶ Level 2 SMI proxy: local process

Performance Tests: Control
▶ Setup:
  ▶ 46 hosts
  ▶ 1 Level 0 domain
  ▶ 20 Level 1 domains
  ▶ 1350 Level 2 domains
  ▶ 67500 proxies
▶ Increase due to initial lookup in DIM DNS
▶ Conclusion: cannot use in current version
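To get a feel for the scale of this test setup, the figures on the slide can be combined into per-domain and per-host averages. The sketch below assumes that domains and proxies are spread evenly, which the slides do not state.

```python
# Figures taken from the slide.
HOSTS = 46
LEVEL0_DOMAINS = 1
LEVEL1_DOMAINS = 20
LEVEL2_DOMAINS = 1350
PROXIES = 67500

# Derived quantities, assuming an even distribution (an assumption,
# not something the slides confirm).
total_domains = LEVEL0_DOMAINS + LEVEL1_DOMAINS + LEVEL2_DOMAINS
proxies_per_level2_domain = PROXIES / LEVEL2_DOMAINS
domains_per_host = total_domains / HOSTS
proxies_per_host = PROXIES / HOSTS

print(f"SMI domains in total:        {total_domains}")
print(f"Proxies per Level 2 domain:  {proxies_per_level2_domain:.0f}")
print(f"Domains per host (average):  {domains_per_host:.1f}")
print(f"Proxies per host (average):  {proxies_per_host:.0f}")
```

Under that assumption each Level 2 domain manages 50 proxies, and each of the 46 hosts carries roughly 30 domains and well over a thousand proxies.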

Performance Tests: Monitoring
▶ MonALISA + ApMon
▶ Setup:
  ▶ 10 sender nodes, up to 1000 threads per host (ApMon)
  ▶ 1 MonALISA service, all historical records disabled
▶ Result: 52 kHz without data loss
▶ Conclusion: could use 12+ collectors to reach 600 kHz
By Costin Grigoras

Performance Tests: Monitoring
▶ Zabbix
▶ Setup:
  ▶ 10 sender nodes, up to 10 processes per host
  ▶ 1 Zabbix Server node, 200 threads, permanent storage disabled (in-memory history enabled)
▶ Result: 30 kHz without data loss
▶ Conclusion: could use 20+ collectors to reach 600 kHz
By Andres Gomez Ramirez
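The collector counts quoted for the two monitoring tests follow from dividing the 600 kHz target by the measured rate of a single collector and rounding up. The sketch below reproduces that arithmetic; it assumes throughput scales linearly with the number of collectors, which the tests as described do not verify, and the function name is illustrative.

```python
import math


def collectors_needed(target_khz, per_collector_khz):
    """Smallest number of collectors whose combined rate reaches the target.

    Assumes perfect linear scaling across collectors, so the result is
    a lower bound rather than a sizing recommendation.
    """
    return math.ceil(target_khz / per_collector_khz)


TARGET_KHZ = 600  # estimated monitoring rate from the overview slide

monalisa = collectors_needed(TARGET_KHZ, 52)  # MonALISA: 52 kHz measured
zabbix = collectors_needed(TARGET_KHZ, 30)    # Zabbix: 30 kHz measured

print(f"MonALISA collectors: {monalisa}")
print(f"Zabbix collectors:   {zabbix}")
```

The results, 12 and 20, match the "12+" and "20+" on the slides; the "+" leaves room for overhead that a linear estimate ignores, and both tests ran without permanent storage.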

Next steps
▶ Finalise TDR
▶ Perform more tests:
  ▶ Control: boost library + ZeroMQ
  ▶ Configuration: ZooKeeper
  ▶ Monitoring: MonALISA, Zabbix with permanent storage
▶ Provide CCM systems for ALFA prototype (CWG 13)
▶ Refine design