Reliability of Embedded FPGA-SoC Systems in Long-Term Physics Experiments. Robin Bauknecht, Institute for Data Processing and Electronics (IPE), KIT – The Research University in the Helmholtz Association, www.kit.edu
ECHo: measure electron neutrino mass; 15 boards; 1 year run time. 27 October 2021, Robin Bauknecht, Institute for Data Processing and Electronics (IPE)
CMS Track Trigger: high energy particle physics; over 100 boards; multiple years run time
Target System
ZYNQ UltraScale+ Block Diagram
Dependency Analysis
Created Dependency Graphs
Fault Tree Analysis Legend
Fault Tree Analysis Legend Avoid
Created Fault Trees: 12 fault trees
Fault Tree Example: no probability calculation; failure causes in the leaf nodes
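A fault tree without probabilities, where the failure causes sit in the leaf nodes, can be sketched as a small recursive data structure. This is a minimal illustration, not the representation used in the talk: the `FaultNode` class, the `leaf_causes` helper, the gate labels, and every event name in the example tree are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FaultNode:
    """One event in a fault tree. Nodes without children are failure causes."""
    name: str
    gate: str = "LEAF"  # "AND" or "OR" for inner nodes (hypothetical labels)
    children: List["FaultNode"] = field(default_factory=list)


def leaf_causes(node: FaultNode) -> List[str]:
    """Collect the failure causes (leaf names) below a node, without duplicates."""
    if not node.children:
        return [node.name]
    causes: List[str] = []
    for child in node.children:
        for cause in leaf_causes(child):
            if cause not in causes:
                causes.append(cause)
    return causes


# Hypothetical example tree; the event names are not taken from the slides.
tree = FaultNode("Board not reachable", "OR", [
    FaultNode("Network link down"),
    FaultNode("OS does not boot", "OR", [
        FaultNode("SD card corrupted"),
        FaultNode("Boot image missing"),
    ]),
])

print(leaf_causes(tree))
# ['Network link down', 'SD card corrupted', 'Boot image missing']
```

Because no probabilities are attached, the tree serves only to enumerate causes; each collected leaf could then become a candidate entry in a failure mode list.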
Failure Mode and Effect Analysis: 73 entries
Failure Mode and Effect Analysis Details: five device states (Operational, Warning, Degraded, Failover, Failed)
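The five device states can be written down as an enumeration. The sketch below assumes that the order in which the slide lists them is an order of increasing severity, which the slides do not state; the `DeviceState` and `overall_state` names are hypothetical.

```python
from enum import IntEnum
from typing import Iterable


class DeviceState(IntEnum):
    """Five device states from the FMEA; the numeric order is an assumption."""
    OPERATIONAL = 0
    WARNING = 1
    DEGRADED = 2
    FAILOVER = 3
    FAILED = 4


def overall_state(states: Iterable[DeviceState]) -> DeviceState:
    """Return the most severe state among several effects (assumed rule)."""
    return max(states, default=DeviceState.OPERATIONAL)


print(overall_state([DeviceState.WARNING, DeviceState.DEGRADED]).name)
# DEGRADED
```

Treating the worst state as the overall one is only one possible aggregation rule; an FMEA table could equally record one resulting state per failure mode.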
Possible Mitigation Techniques Mitigation Benefit Overhead Very High Failover High Medium Monitoring High Medium APU Error Detection Medium High DDR4 ECC Medium Low FPGA TMR Low High Memory Isolation Low Modern File System Low Plugin Reload Low Medium RAID Low/High Very Low Medium Redundancy FPGA Scrubbing
Monitoring
Monitoring: Modular & Extensible
Monitoring Framework: Monitor, Monit, Custom Checks
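A custom check for a Monit-based setup can be as small as a program whose exit status reports health, since Monit's program checks act on the exit status of the command they run. The sketch below is an assumption about what such a check might look like, not one of the checks from the talk: the disk-usage metric, the 90 % limit, and the function names are all hypothetical.

```python
import shutil

OK = 0      # exit status Monit would treat as healthy
FAILED = 1  # any non-zero status can be used to trigger an alert


def disk_check_status(used: int, total: int, max_used_fraction: float = 0.9) -> int:
    """Map a disk usage reading to an exit status (hypothetical 90 % limit)."""
    if total <= 0:
        return FAILED  # no usable reading is treated as a failure
    return OK if used / total <= max_used_fraction else FAILED


def check_path(path: str, max_used_fraction: float = 0.9) -> int:
    """Read the usage of the file system containing `path` and evaluate it."""
    usage = shutil.disk_usage(path)
    return disk_check_status(usage.used, usage.total, max_used_fraction)


print(disk_check_status(used=50, total=100))  # 0
print(disk_check_status(used=95, total=100))  # 1
```

A wrapper script would pass the result of `check_path(...)` to `sys.exit`, so that the monitoring framework sees a non-zero status when the limit is exceeded. Keeping the evaluation separate from the measurement makes each check easy to test off the target board.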
Failover: improvement possible for 25 FMEA items
Boot Sequence: HW Init -> Boot Mode SD -> Load U-Boot -> OS
Failover Boot Sequence: HW Init -> Boot Mode QSPI -> Load U-Boot -> FOS
Failover Boot Sequence: HW Init -> Boot Mode QSPI -> Load U-Boot -> Boot FOS
Failover Boot Sequence: HW Init -> Boot Mode QSPI -> Load U-Boot -> FOS? Yes: Boot FOS
Failover Boot Sequence: HW Init -> Boot Mode QSPI -> Load U-Boot -> FOS? Yes: Boot FOS; No: Boot OS
Failover Boot Sequence: HW Init -> Boot Mode QSPI -> Load U-Boot -> FOS? Yes: Boot FOS; No: Boot Mode Register + System Reset -> Boot Mode SD -> Load U-Boot -> OS
Failover Boot Sequence: HW Init -> Boot Mode QSPI -> Load U-Boot -> Loop Counter++ -> Count > 3? Yes: Boot FOS; No: Boot Mode SD -> Load U-Boot -> OS -> Reset Loop Counter
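The loop-counter decision in the failover boot sequence can be simulated in a few lines. This is a sketch under stated assumptions rather than the actual U-Boot logic: it assumes the counter lives in storage that survives a reset, that a failed OS boot ends in a reset back into the QSPI stage, and that the OS clears the counter once it is up. The names `PersistentState`, `boot_once`, and `boot_until_stable` are hypothetical; only the threshold of 3 comes from the slide.

```python
from dataclasses import dataclass

BOOT_LOOP_LIMIT = 3  # from the slide: "Count > 3?"


@dataclass
class PersistentState:
    """Stand-in for a counter that survives resets (storage is an assumption)."""
    loop_counter: int = 0


def boot_once(state: PersistentState, os_boots_ok: bool) -> str:
    """One pass through the QSPI stage; returns 'FOS', 'OS', or 'RESET'."""
    state.loop_counter += 1               # Loop Counter++
    if state.loop_counter > BOOT_LOOP_LIMIT:
        return "FOS"                      # too many attempts: boot the failover OS
    # Otherwise switch to SD boot mode and try the regular OS.
    if os_boots_ok:
        state.loop_counter = 0            # Reset Loop Counter after a good boot
        return "OS"
    return "RESET"                        # assumed: a failed boot ends in a reset


def boot_until_stable(state: PersistentState, os_boots_ok: bool,
                      max_resets: int = 10) -> str:
    """Repeat boot attempts until the OS or the failover OS is running."""
    for _ in range(max_resets):
        result = boot_once(state, os_boots_ok)
        if result != "RESET":
            return result
    raise RuntimeError("no stable boot target reached")


print(boot_until_stable(PersistentState(), os_boots_ok=True))   # OS
print(boot_until_stable(PersistentState(), os_boots_ok=False))  # FOS
```

With a broken OS image, the first three attempts each end in a reset and the fourth pass exceeds the limit, so the board lands in the failover OS instead of looping indefinitely.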
Failover Results. Tests: 13 FMEA items covered; 7 items not improved; 5 items untested. Characteristics: boot time increased from 21 to 28 seconds; failover image size ~20 MB
Failover Real Life Application: remote hardware access without bricking
Conclusion: analysis allows evaluation; monitoring makes errors visible; failover enables remote maintenance; development usage demonstrates effectiveness
Further Improvements: Boot Loop Detection Modern FS Memory
- Slides: 36