Fabric Management
Olof Barring, IT/FIO
CERN - IT Department, CH-1211 Genève 23, Switzerland, www.cern.ch/it
22/06/2009
Summary
• Services for physics
• Fabric management at CERN
• Tools and frameworks
  – Configuration: Quattor
  – Monitoring: Lemon
  – Logistic management workflows: LEAF
Services for physics
• Interactive clusters
• Grid services
• Batch service (LSF)
• Managed Storage service (CASTOR)
• Shared file system (AFS)
Services for physics (sizes)
• Grid services
• Batch service (LSF)
• Interactive clusters
• Further figures shown in the diagram: ~200 nodes; ~130 nodes, ~10 distinct services; ~1900 nodes
• Managed Storage service (CASTOR): ~1400 nodes, 6.5 PB disk, 153 tape drives, 43 PB tape media
• Shared file system (AFS): ~100 nodes
Fabric management at CERN
• Fabric management challenges
  – Clusters of heterogeneous and unreliable hardware
  – Frequent intrusive interventions: kernel rollouts, firmware updates, …
  – Complex services that in general require engineering-level expert knowledge
• Frameworks and tools
  – Configuration management: Quattor
  – Monitoring: Lemon
  – Service Level Status: SLS
  – Logistic management and workflows: LEAF
• Service management based on the same set of tools, procedures and workflows
Vision
[Diagram: the building blocks of the fabric management vision]
• LEAF: Logistical Management
• Lemon: Performance & Exception Monitoring
• Node Configuration Management
• Node Management
Quattor
Used by 18 organisations besides CERN, including two distributed implementations with 5 and 18 sites.
[Diagram: Quattor architecture]
• Configuration server: CDB with SQL backend and XML backend; access via SQL scripts, SOAP, CLI and GUI; XML configuration profiles served over HTTP
• SW server(s): SW Repository holding RPMs / PKGs, served over HTTP
• Install server: Install Manager and System installer; base OS delivered over HTTP / PXE
• Managed Nodes: Node Configuration Manager (NCM) with components Comp A, Comp B, Comp C for Service A, Service B, Service C; SW Package Manager (SPMA) handling RPMs
Configuration Hierarchy
[Diagram: settings defined at site, cluster and node level]
• CERN CC (site level)
  – name_srv1: 192.168.5.55
  – time_srv1: ip-time-1
• Clusters
  – lxbatch: cluster_name: lxbatch, master: lxmaster01, pkg_add (lsf7.0)
  – lxplus: cluster_name: lxplus, pkg_add (lsf7.0)
  – disk_srv
• lxplus nodes
  – lxplus001: eth0/ip: 192.168.0.246, pkg_add (lsf7.0_debug)
  – another lxplus node: eth0/ip: 192.168.0.41
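The hierarchy on this slide can be read as layered key/value settings, where a node profile combines site, cluster and node values. The following is a minimal Python sketch of that idea, not Quattor's actual template language or API. It assumes (the slides do not state this) that scalar values from a more specific level override less specific ones and that pkg_add entries accumulate; the `resolve` helper and the `packages` key are hypothetical names used for illustration only.

```python
# Hypothetical sketch of hierarchical configuration resolution.
# Values are taken from the slide; the merge rules are assumptions.
SITE = {"name_srv1": "192.168.5.55", "time_srv1": "ip-time-1"}
CLUSTER = {"cluster_name": "lxplus", "packages": ["lsf7.0"]}
NODE = {"eth0/ip": "192.168.0.246", "packages": ["lsf7.0_debug"]}


def resolve(*levels):
    """Merge levels from least to most specific into one node profile.

    Scalars: the most specific level wins.
    Lists (e.g. packages): entries from all levels are concatenated.
    """
    profile = {}
    for level in levels:
        for key, value in level.items():
            if isinstance(value, list):
                # Build a new list so the input levels are never mutated.
                profile[key] = profile.get(key, []) + value
            else:
                profile[key] = value
    return profile


profile = resolve(SITE, CLUSTER, NODE)
for key in sorted(profile):
    print(f"{key}: {profile[key]}")
```

In this sketch the node inherits the name and time servers from the site level and ends up with both the cluster-wide and the node-specific package.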
Lemon
[Diagram: Lemon architecture]
• Nodes: Monitoring Agent with Sensors, sending data over TCP/UDP
• Monitoring Repository: repository backend (SQL), accessed via SOAP
• Correlation Engines
• RRDTool / PHP on apache, served over HTTP
• User Workstations: Lemon CLI, Web browser
What is monitored?
• Node-based Lemon sensors cover all the usual system parameters and more
  – System load, file system usage, network traffic, daemon count, software version, …
  – SMART monitoring for disks
  – Oracle monitoring
    • Number of logons, cursors, logical and physical I/O, user commits, index usage, parse statistics, …
  – AFS client monitoring
  – …
• It is also possible to provide "non-node" sensors. At CERN these allow integration of
  – Information from the building management system
    • Power demand, UPS status, temperature, …
    • Full feedback is possible (although not implemented), e.g. system shutdown on power failure
  – High-level mass-storage and batch service details
    • Queue lengths, file lifetime on disk, …
  – Hardware reliability data
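To make the idea of a node-based sensor concrete, here is a minimal Python sketch that samples two of the parameters listed above (system load and file system usage). It is not the Lemon sensor API: the function name `sample_node_metrics` and the metric names are hypothetical, and a real sensor would hand its samples to the monitoring agent rather than print them.

```python
import os
import shutil
import time


def sample_node_metrics(path="."):
    """Return one timestamped sample of two basic node metrics."""
    usage = shutil.disk_usage(path)
    try:
        load_1min = os.getloadavg()[0]
    except (AttributeError, OSError):
        # Load average is not available on every platform.
        load_1min = None
    return {
        "timestamp": time.time(),
        "load_1min": load_1min,
        "fs_used_fraction": usage.used / usage.total,
    }


sample = sample_node_metrics()
print(sample)
```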
Monitoring displays
Service-oriented views
• Service overviews for end-users
  – Availability probes imitating basic end-user usage, e.g. copying a file in and out of CASTOR
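An availability probe of this kind can be sketched as a small round-trip test: copy a file in, copy it back out, and check that the content survived. The Python sketch below uses a local directory as a stand-in for the storage service. A real probe would use the service's own client tools, which are not shown in the slides; the function name `copy_probe` and the file names are hypothetical.

```python
import contextlib
import filecmp
import shutil
import tempfile
import time
from pathlib import Path


def copy_probe(service_dir, payload=b"availability probe\n"):
    """Copy a small file into service_dir and back out again.

    Returns whether the round trip succeeded and how long it took.
    """
    start = time.monotonic()
    with tempfile.TemporaryDirectory() as scratch:
        source = Path(scratch) / "probe_in.dat"
        returned = Path(scratch) / "probe_out.dat"
        remote = Path(service_dir) / "probe.dat"
        source.write_bytes(payload)
        try:
            shutil.copyfile(source, remote)    # "copy in"
            shutil.copyfile(remote, returned)  # "copy out"
            ok = filecmp.cmp(source, returned, shallow=False)
        except OSError:
            ok = False
        finally:
            # Best-effort cleanup of the probe file on the service side.
            with contextlib.suppress(OSError):
                remote.unlink()
    return {"available": ok, "seconds": time.monotonic() - start}


# Usage: a temporary directory plays the role of a healthy service,
# a non-existent path the role of an unavailable one.
with tempfile.TemporaryDirectory() as fake_service:
    result = copy_probe(fake_service)
missing = copy_probe("/nonexistent/probe/target")
print(result["available"], missing["available"])
```

Reporting the duration alongside the pass/fail result lets the same probe feed both an availability view and a simple performance indicator.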
Workflows: LHC Era Automated Fabric - LEAF
• LEAF is a collection of workflows for high-level node hardware and state management, on top of Quattor and Lemon
• HMS (Hardware Management System):
  – Tracks systems through all physical steps in their lifecycle, e.g. installation, moves, vendor calls, retirement
  – Automatically issues requests for installs, retirements etc. to technicians
  – GUI to locate equipment physically
  – HMS implementation is CERN specific, but concepts and design should be generic
• SMS (State Management System):
  – Automated handling (and tracking) of high-level configuration steps
    • Reconfigure and reboot all LXPLUS nodes for a new kernel and/or physical move
    • Drain and reconfigure nodes for diagnosis / repair operations
  – Issues all necessary (re)configuration commands via Quattor
  – Extensible framework: plug-ins for site-specific operations possible
• Example: SMS transition Production → Standby of a CASTOR disk server:
  – Running transfers continue
  – No new transfers scheduled
  – Outbound transfers for tape migration or file replication allowed
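The Production → Standby example amounts to a small scheduling policy. The Python sketch below encodes the three rules from the slide under stated assumptions: the class `DiskServer`, the transfer kinds and the method names are hypothetical and do not reflect the actual SMS or CASTOR interfaces.

```python
from enum import Enum


class State(Enum):
    PRODUCTION = "production"
    STANDBY = "standby"


# Transfer kinds still accepted in Standby (outbound only, per the slide).
OUTBOUND_KINDS = {"tape_migration", "replication"}


class DiskServer:
    def __init__(self, name):
        self.name = name
        self.state = State.PRODUCTION
        self.running = []  # transfers currently in progress

    def set_state(self, new_state):
        # A state change never touches running transfers: they continue.
        self.state = new_state

    def schedule(self, kind):
        """Try to schedule a new transfer; return True if it was accepted."""
        allowed = self.state is State.PRODUCTION or kind in OUTBOUND_KINDS
        if allowed:
            self.running.append(kind)
        return allowed


server = DiskServer("diskserver01")
server.schedule("user_write")                # accepted in Production
server.set_state(State.STANDBY)
rejected = server.schedule("user_read")      # no new transfers scheduled
accepted = server.schedule("tape_migration")  # outbound transfer allowed
print(server.running, rejected, accepted)
```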
Integration in Action
• Simple
  – Operator alarms masked according to system state
• Complex
  – Disk and RAID failures detected on disk storage nodes lead automatically to a reconfiguration of the mass storage system
[Diagram: Disk Server (Lemon Agent: RAID degraded) → Lemon Alarm Monitor (alarm) → Alarm Analysis → SMS: set Standby → Mass Storage System: set Draining]
Draining: no new connections allowed; existing data transfers continue.
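Both integration cases can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the CERN implementation: the names `DiskServer`, `analyse_alarm` and `operator_sees` are hypothetical, and the masking rule (operators only see alarms for nodes in production) is an assumption, since the slide only says that masking depends on system state.

```python
from dataclasses import dataclass, field

# Alarms that trigger the automated reaction (from the slide: disk and RAID failures).
AUTOMATED_ALARMS = {"RAID degraded", "disk failure"}


@dataclass
class DiskServer:
    name: str
    sms_state: str = "production"
    storage_state: str = "enabled"  # hypothetical name for the normal state
    log: list = field(default_factory=list)


def analyse_alarm(server, alarm):
    """Complex case: react to a hardware alarm by reconfiguring the node.

    Returns True if an automated action was taken.
    """
    if alarm in AUTOMATED_ALARMS and server.sms_state == "production":
        server.sms_state = "standby"       # SMS: set Standby
        server.storage_state = "draining"  # mass storage system: set Draining
        server.log.append(f"{alarm}: set standby, set draining")
        return True
    return False


def operator_sees(server, alarm):
    """Simple case: mask operator alarms according to system state (assumed rule)."""
    return server.sms_state == "production"


server = DiskServer("diskserver01")
visible_before = operator_sees(server, "RAID degraded")
acted = analyse_alarm(server, "RAID degraded")
visible_after = operator_sees(server, "RAID degraded")
acted_again = analyse_alarm(server, "RAID degraded")
print(server.sms_state, server.storage_state, visible_before, visible_after)
```

Once the node has left production, repeated alarms neither trigger the action again nor reach the operator in this sketch.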
Conclusions
• The same frameworks, tools, procedures and workflows are used for all physics computing services
  – ~4000 nodes
  – ~15 distinct services, some very complex (CASTOR, WMS)
  – End-user support
• Concise framework for automation
  – Monitoring: Lemon (http://cern.ch/lemon)
    • Performance
    • Exceptions, alarms
    • Automated actions (actuators)
  – Service views: SLS (http://cern.ch/sls)
  – Configuration: Quattor (http://quattor.org)
  – Logistics management and workflows: LEAF (http://cern.ch/leaf)