Intelligent Platform Management Interface IPMI Monitoring and Control

Intelligent Platform Management Interface (IPMI) Monitoring and Control
Ian Collier, RAL Tier 1 Fabric Team
July 2nd 2009, HEPSYSMAN
With apologies/thanks to Massimiliano Masi at CERN

IPMI at RAL Tier 1
• At RAL Tier 1 we are just beginning to roll out significant use of IPMI
• In our new building we are able to implement a separate management network for IPMI, APC PDUs, etc.

What and Why
• Started in 1998, IPMI is now at revision 2.0
• It is a standard accepted by Dell, IBM, Sun, Intel and many others, including Supermicro of course
• Goal 1: IPMI is a specification for monitoring and controlling the machine via dedicated hardware, the Baseboard Management Controller (BMC)
• Goal 2: Serial over LAN (SOL), a method to redirect serial connections over an Ethernet cable
• Many cards now also provide KVM over LAN, eliminating the need for expensive network KVMs!

What and Why? Major IPMI concepts:
• Sensors (fan speed, CPU temperature, voltage)
• Events (what should the BMC do when the CPU temperature reaches 100 degrees? For example, send SNMP traps)
• SDR (Sensor Data Record repository, where the sensor records are kept)
• SEL (System Event Log, a log of all critical situations)
• Session (between the client and the BMC)

What and Why? SECURITY
• Can define users
• Can define privileges
• Can encrypt communication with the BMC
The security depends on the version of the specification:
• Version 2.0: RMCP/RMCP+, based on RAKP messages (an HMAC-like protocol)
• Serial over LAN is encrypted with RMCP+ only
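The general idea behind an HMAC-style exchange can be sketched in a few lines of Python. This is a generic challenge-response illustration and not the RAKP message format from the IPMI 2.0 specification; the function names, the nonce size, and the choice of SHA-1 are assumptions made for the example.

```python
import hashlib
import hmac
import os


def make_challenge() -> bytes:
    """Return a random nonce that the verifier sends to the client."""
    return os.urandom(16)


def sign_challenge(password: bytes, challenge: bytes, username: bytes) -> bytes:
    """Prove knowledge of the password without sending the password itself."""
    return hmac.new(password, challenge + username, hashlib.sha1).digest()


def verify(password: bytes, challenge: bytes, username: bytes, proof: bytes) -> bool:
    """Recompute the HMAC on the verifier side and compare in constant time."""
    expected = sign_challenge(password, challenge, username)
    return hmac.compare_digest(expected, proof)


if __name__ == "__main__":
    challenge = make_challenge()
    proof = sign_challenge(b"secret", challenge, b"ADMIN")
    print(verify(b"secret", challenge, b"ADMIN", proof))  # correct password
    print(verify(b"wrong", challenge, b"ADMIN", proof))   # wrong password
```

Because only the keyed digest crosses the network, an eavesdropper who sees the challenge and the proof still does not learn the password directly.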

Manufacturers provide GUIs

Open source tools
• OpenIPMI (ipmitool)
• lm_sensors
• FreeIPMI (no drivers)

ipmitool sensor local output
[root@lcg0954 ~]# ipmitool sensor
CPU Temp 1       | 35.000     | degrees C
CPU Temp 2       | 34.000     | degrees C
CPU Temp 3       | na         | degrees C
CPU Temp 4       | na         | degrees C
Sys Temp         | 31.000     | degrees C
CPU1 Vcore       | 1.184      | Volts
CPU2 Vcore       | 1.192      | Volts
3.3V             | 3.264      | Volts
5V               | 4.920      | Volts
12V              | 11.712     | Volts
1.5V             | 1.488      | Volts
5VSB             | 4.896      | Volts
VBAT             | 3.280      | Volts
Fan1             | 10500.000  | RPM
Fan2             | 8700.000   | RPM
Fan3             | 10500.000  | RPM
Fan4             | 8700.000   | RPM
Fan5             | 10400.000  | RPM
Fan6             | 8800.000   | RPM
Fan7             | 0.000      | RPM
Fan8             | 0.000      | RPM
Power Supply     | 0x0        | discrete
CPU0 Internal E  | 0x0        | discrete
CPU1 Internal E  | 0x0        | discrete
CPU Overheat     | 0x0        | discrete
Thermal Trip0    | 0x0        | discrete
Thermal Trip1    | 0x0        | discrete
Remaining columns (status, then six threshold columns), one column per line; the slide does not preserve which row each value belongs to:
status      | ok | na | ok | ok | ok | ok | nr | 0x0000 | 0x0000
threshold 1 | na | na | na | 0.680 | 2.912 | 4.416 | 10.464 | 1.296 | 4.416 | 2.912 | 200.000 | 200.000 | na | na | na
threshold 2 | na | na | na | 0.688 | 2.928 | 4.440 | 10.560 | 1.312 | 4.440 | 2.928 | 300.000 | 300.000 | na | na | na
threshold 3 | na | na | na | 0.696 | 2.944 | 4.464 | 10.656 | 1.328 | 4.464 | 2.944 | 400.000 | 400.000 | na | na | na
threshold 4 | 76.000 | 1.624 | 3.648 | 5.520 | 13.344 | 1.664 | 5.520 | 3.648 | na | na | na | na
threshold 5 | 78.000 | 1.632 | 3.664 | 5.544 | 13.440 | 1.680 | 5.544 | 3.664 | na | na | na | na
threshold 6 | 80.000 | 1.640 | 3.680 | 5.568 | 13.536 | 1.696 | 5.568 | 3.680 | na | na | na | na
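Output in this pipe-delimited form is easy to consume from a script. The following is a minimal sketch, assuming each line carries at least a sensor name, a reading, and a unit separated by "|"; the function name and the returned dictionary layout are illustrative choices, not part of ipmitool.

```python
from typing import Optional


def parse_sensor_line(line: str) -> Optional[dict]:
    """Split one pipe-delimited line of `ipmitool sensor` output.

    Returns None for lines that lack a name, a reading, and a unit.
    """
    fields = [f.strip() for f in line.split("|")]
    if len(fields) < 3 or not fields[0]:
        return None
    name, raw, unit = fields[0], fields[1], fields[2]
    try:
        value = float(raw)   # numeric reading such as "35.000"
    except ValueError:
        value = None         # "na", or a discrete state such as "0x0"
    return {"name": name, "raw": raw, "value": value, "unit": unit}


SAMPLE = """\
CPU Temp 1       | 35.000     | degrees C
CPU Temp 3       | na         | degrees C
Sys Temp         | 31.000     | degrees C
Power Supply     | 0x0        | discrete
"""

if __name__ == "__main__":
    for line in SAMPLE.splitlines():
        print(parse_sensor_line(line))
```

Keeping the raw string alongside the parsed value means discrete sensors and unpopulated ("na") sensors are not silently lost.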

ipmitool sensor remote output
# ipmitool -I lanplus -H 172.16.177.64 -U ADMIN sensor get 'CPU1 Temp'
Password:
Locating sensor record...
Sensor ID              : CPU1 Temp (0x0)
 Entity ID             : 3.0
 Sensor Type (Discrete): OEM reserved #c0
Note that the lanplus option encrypts communication, including passwords

ipmitool remote power control
# ipmitool -I lanplus -H 172.16.177.64 -U ADMIN power off
Password:
Chassis Power Control: Down/Off
# ipmitool -I lanplus -H 172.16.177.64 -U ADMIN power status
Password:
Chassis Power is off
# ipmitool -I lanplus -H 172.16.177.64 -U ADMIN power on
Password:
Chassis Power Control: Up/On
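For scripted use across many nodes, the same commands can be wrapped in a small helper. This is a sketch under stated assumptions: the helper names are invented for the example, and it passes ipmitool's -E option so that the password is read from the IPMI_PASSWORD environment variable instead of the interactive prompt shown above.

```python
import subprocess
from typing import List

POWER_ACTIONS = {"status", "on", "off"}  # the subcommands shown on the slide


def power_command(host: str, user: str, action: str) -> List[str]:
    """Build the ipmitool argument list for a remote power command.

    -E tells ipmitool to take the password from the IPMI_PASSWORD
    environment variable, which suits unattended scripts.
    """
    if action not in POWER_ACTIONS:
        raise ValueError(f"unsupported power action: {action!r}")
    return ["ipmitool", "-I", "lanplus", "-H", host, "-U", user, "-E",
            "power", action]


def run_power(host: str, user: str, action: str) -> str:
    """Run the command and return its output (requires ipmitool installed)."""
    result = subprocess.run(power_command(host, user, action),
                            capture_output=True, text=True, check=True)
    return result.stdout.strip()


if __name__ == "__main__":
    # Only print the command; nothing is sent to a BMC here.
    print(" ".join(power_command("172.16.177.64", "ADMIN", "status")))
```

Restricting the accepted actions to an explicit set is a deliberate choice: a typo in a batch script then fails locally instead of reaching the BMC.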

ipmitool remote power control = fewer visits to the machine room!

ipmitool serial over lan
# ipmitool -I lanplus -H 172.16.176.143 -U ADMIN sol activate
Password:
[SOL Session operational. Use ~? for help]
Scientific Linux SL release 4.6 (Beryllium)
Kernel 2.6.9-78.0.22.ELsmp on an i686
gdss328.gridpp.rl.ac.uk login:

ipmitool serial over lan = even fewer visits to the machine room!

Gathering IPMI metrics in Ganglia
• Perl script runs ipmitool sensor and pulls out non-null values
• Metric labels vary with manufacturer and specific BMC
Test deployment at: http://ganglia.gridpp.rl.ac.uk/ganglia/?m=load_one&r=hour&s=descending&c=Workers_SL4&h=lcg0954.gridpp.rl.ac.uk
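The RAL script itself is written in Perl and is not shown in the slides. As a rough Python sketch of the same idea, the code below keeps only numeric readings and builds one gmetric invocation per sensor. Publishing through gmetric, the "ipmi_" prefix, and all function names are assumptions made for this example.

```python
import re
from typing import Iterator, List, Tuple


def numeric_readings(sensor_output: str) -> Iterator[Tuple[str, float, str]]:
    """Yield (name, value, unit) for every sensor with a numeric reading."""
    for line in sensor_output.splitlines():
        fields = [f.strip() for f in line.split("|")]
        if len(fields) < 3:
            continue
        try:
            value = float(fields[1])
        except ValueError:
            continue  # skip "na" and discrete sensors such as "0x0"
        yield fields[0], value, fields[2]


def metric_name(sensor_name: str) -> str:
    """Turn a BMC sensor label into a safe metric name."""
    slug = re.sub(r"[^A-Za-z0-9]+", "_", sensor_name).strip("_").lower()
    return "ipmi_" + slug


def gmetric_commands(sensor_output: str) -> List[List[str]]:
    """Build one gmetric argument list per numeric sensor reading."""
    return [
        ["gmetric", "--name", metric_name(name), "--value", f"{value:g}",
         "--type", "float", "--units", unit]
        for name, value, unit in numeric_readings(sensor_output)
    ]


if __name__ == "__main__":
    sample = "Sys Temp | 31.000 | degrees C\nCPU Temp 3 | na | degrees C"
    for cmd in gmetric_commands(sample):
        print(cmd)
```

Normalising the labels matters because, as noted above, they vary between manufacturers and BMCs; a per-vendor mapping table would be the next step if metrics need to be comparable across hardware.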

Future
• Our new hardware has BMCs that support KVM over LAN as well, with Supermicro's web interface
• The data gathered by Ganglia can be mined for very granular information about conditions in the machine room, indicating airflow problems etc.
• Useful in diagnosing hardware problems after the event
• Configure SNMP traps for alarms
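One simple way to mine such data is to compare each node's system temperature with the median across the cluster. The sketch below is only an illustration: the host names and readings are invented, and the 5-degree margin is an arbitrary assumption that would need tuning for a real machine room.

```python
from statistics import median
from typing import Dict, List


def hot_hosts(sys_temps: Dict[str, float], margin: float = 5.0) -> List[str]:
    """Return hosts whose temperature exceeds the median by more than margin."""
    if not sys_temps:
        return []
    mid = median(sys_temps.values())
    return sorted(h for h, t in sys_temps.items() if t - mid > margin)


if __name__ == "__main__":
    # Hypothetical readings in degrees C; the host names are invented.
    readings = {"node01": 31.0, "node02": 30.0, "node03": 38.5, "node04": 31.5}
    print(hot_hosts(readings))
```

Hosts flagged this way that sit in the same rack or row would be a hint of an airflow problem instead of a single failing machine.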