Industrial Automation Industrielle Automation 9 2 Dependability Evaluation

Industrial Automation (Industrielle Automation)
9.2 Dependability - Evaluation (Estimation de la fiabilité, Verlässlichkeitsabschätzung)
Dr. Jean-Charles Tournier, CERN, Geneva, Switzerland
2015 - JCT
The material of this course was initially created by Prof. Dr. H. Kirrmann and adapted by Dr. Y-A. Pignolet & J-C. Tournier.

Dependability Evaluation
This part of the course applies to any system that may fail.
Dependability evaluation (fiabilité prévisionnelle, Verlässlichkeitsabschätzung) determines:
• the expected reliability,
• the requirements on component reliability,
• the repair and maintenance intervals and
• the amount of necessary redundancy.
Dependability analysis is the basis on which risks are taken and contracts established.
Dependability evaluation must be part of the design process; it is quite useless once a system has been put into service.

9.2.1 Reliability definitions
9.2.2 Reliability of series and parallel systems
9.2.3 Considering repair
9.2.4 Markov models
9.2.5 Availability evaluation
9.2.6 Examples

Reliability = probability that a mission is executed successfully (definition of success? A question of satisfaction…)
Reliability depends on:
• duration ("tant va la cruche à l'eau…", "der Krug geht zum Brunnen, bis er bricht": the pitcher goes to the well until it breaks)
• environment: temperature, vibrations, radiation, etc.
[Figure: R(t) versus time, starting at 1.0, with lim R(t) = 0 for t → ∞; curves labelled by temperature (25º, 40º, 85º) and environment (laboratory, vehicle)]
Such graphics are obtained by observing a large number of systems, or calculated for a system knowing the expected behaviour of the elements.

Reliability and failure rate - Experimental view
Experiment: a large quantity of light bulbs.
[Figure: remaining good bulbs R(t) over time, starting at 100%; failure rate λ(t) over time with infancy, mature and aging phases; the interval from t to t + Δt is marked]
Reliability R(t): number of good bulbs remaining at time t divided by the initial number of bulbs.
Failure rate λ(t): number of bulbs that failed in the interval [t, t + Δt], divided by the number of remaining bulbs (per unit of time).

Bathtub Curve
Empirical studies showed that the evolution of the failure rate over time usually follows a "bathtub" curve.
A typical bathtub curve comprises three phases:
• Infant mortality: failure rate is decreasing
• Useful life: failure rate is constant
• End of life: failure rate is increasing
[Figure: failure rate over time with the phases infant mortality, useful life and end of life]
Reminder: a bathtub curve does not depict the failure rate of a single item, but describes the relative failure rate of an entire population of products over time.

Hardware Failure
Hardware failures during a product's life can be attributed to the following causes:
• Design failures: this class of failures takes place due to inherent design flaws in the system. In a well-designed system this class of failures should make a very small contribution to the total number of failures.
• Infant mortality: this class of failures causes newly manufactured hardware to fail. It can be attributed to manufacturing problems like poor soldering, leaking capacitors etc. These failures should not be present in systems leaving the factory, as these faults will show up in factory system burn-in tests.
• Random failures: random failures can occur during the entire life of a hardware module. These failures can lead to system failures. Redundancy is provided to recover from this class of failures.
• Wear out: once a hardware module has reached the end of its useful life, degradation of component characteristics will cause hardware modules to fail. This type of fault can be weeded out by preventive maintenance and routing of hardware.

Infant Mortality
• For critical systems, infant mortality is unacceptable
• Stress tests and burn-in tests should be implemented
• Stress tests are used to identify the failure root cause (design, process, material)
• Burn-in tests are used to identify failures for which the root cause cannot be found
• Both tests are similar, but one is implemented before mass production (stress test), while the other is implemented on the product leaving the factory (burn-in)
Stress testing
• Should be started at the earliest development phases and used to evaluate design weaknesses and uncover specific assembly and materials problems.
• The failures should be investigated and design improvements should be made to improve product robustness. Such an approach can help to eliminate design and material defects that would otherwise show up as product failures in the field.
• Parameters: temperature, humidity, vibrations, etc.
Burn-in tests
• Ensure that a device or system functions properly before it leaves the manufacturing plant
• For example, running a new computer for several days before committing it to its real intent
• For ships or craft, and in general for complete systems, burn-in tests are called shakedown tests

Reliability R(t) definition
[Figure: two-state model, good → bad (failure)]
Reliability R(t): probability that a system does not enter a terminal state until time t, given that it was initially in a good state at time t = 0.
$R(0) = 1$; $\lim_{t \to \infty} R(t) = 0$
Failure rate λ(t) = probability that a (still good) element fails during the next time unit dt.
Definition: $\lambda(t) = -\dfrac{dR(t)/dt}{R(t)}$, hence $R(t) = e^{-\int_0^t \lambda(x)\,dx}$
MTTF = mean time to fail = surface below R(t).
Definition: $\mathrm{MTTF} = \int_0^\infty R(t)\,dt$

Assumption of constant failure rate
Reliability = probability of not having failed until time t, expressed:
• by discrete expression: $R(t + \Delta t) = R(t) - R(t)\,\lambda\,\Delta t$
• by continuous expression, simplified when λ = constant: $R(t) = e^{-\lambda t}$
The assumption of λ = constant is justified by experience and simplifies computations significantly.
[Figure: bathtub curve of λ(t) with childhood (burn-in), mature and aging phases; R(t) = e^{-0.001 t} (λ = 0.001/h) compared with R(t) for a bathtub failure rate; MTTF marked on the time axis]
MTTF = mean time to fail = surface below R(t):
$\mathrm{MTTF} = \int_0^\infty e^{-\lambda t}\,dt = \dfrac{1}{\lambda}$
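The relation between the discrete and the continuous expression can be checked numerically. The following is a minimal sketch, not part of the original slides: the function names, the step size and the integration horizon are illustrative assumptions; only λ = 0.001/h comes from the slide.

```python
import math

def reliability_discrete(lam, t_end, dt):
    """Iterate R(t + dt) = R(t) - R(t) * lam * dt, starting from R(0) = 1."""
    r = 1.0
    for _ in range(int(round(t_end / dt))):
        r -= r * lam * dt
    return r

def reliability_continuous(lam, t):
    """Closed form for a constant failure rate: R(t) = exp(-lam * t)."""
    return math.exp(-lam * t)

def mttf_numeric(lam, dt=1.0, horizon_factor=30.0):
    """Area under R(t) by the trapezoidal rule, truncated at horizon_factor / lam."""
    n = int(horizon_factor / lam / dt)
    area = 0.0
    for k in range(n):
        left = math.exp(-lam * k * dt)
        right = math.exp(-lam * (k + 1) * dt)
        area += 0.5 * (left + right) * dt
    return area

LAMBDA = 0.001  # failure rate per hour, value used on the slide

print(reliability_discrete(LAMBDA, 1000.0, 1.0))   # discrete recursion at t = 1000 h
print(reliability_continuous(LAMBDA, 1000.0))      # exp(-1)
print(mttf_numeric(LAMBDA), 1.0 / LAMBDA)          # numeric area versus 1 / lambda
```

With a step of one hour the recursion stays close to the exponential, and the numeric area approaches 1/λ.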

Examples of failure rates
To avoid the negative exponents, values are often given in FIT (Failures In Time):
1 fit = $10^{-9}$ /h = 1 failure per 114'000 years
FIT reports the number of expected failures per one billion hours of operation for a device. This term is used particularly by the semiconductor industry.

Element              Rating         Failure rate
resistor             0.25 W         0.1 fit
capacitor (dry)      100 nF         0.5 fit
capacitor (elect.)   100 µF         10 fit
processor            486            500 fit
RAM                  4 MB           1 fit
Flash                4 MB           12 fit
FPGA                 5000 gates     80 fit
PLC                  compact        6500 fit
digital I/O          32 points      2000 fit
analog I/O           8 points       1000 fit
battery              per element    400 fit
VLSI                 per package    100 fit
soldering            per point      0.01 fit

These figures can be obtained from catalogues such as MIL-HDBK-217F or from the manufacturer's data sheets.
Warning: design failures outweigh hardware failures for small series.
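As a small illustration of how FIT values are used, the sketch below converts FIT to a failure rate per hour and to an MTTF in years. The bill of materials is hypothetical (it is not from the slides), and summing the FIT values assumes that every element is needed and fails independently with a constant rate.

```python
HOURS_PER_YEAR = 8760.0

def fit_to_rate_per_hour(fit):
    """1 fit = 1e-9 failures per hour."""
    return fit * 1e-9

def mttf_years(fit):
    """MTTF in years for a constant failure rate given in fit."""
    return 1.0 / fit_to_rate_per_hour(fit) / HOURS_PER_YEAR

# Hypothetical bill of materials: (element, quantity, fit per element).
# The fit values are taken from the table; the quantities are made up.
bom = [
    ("PLC compact", 1, 6500.0),
    ("digital I/O 32 points", 2, 2000.0),
    ("analog I/O 8 points", 1, 1000.0),
]

# Assumption: series system, so the failure rates simply add up.
total_fit = sum(quantity * fit for _, quantity, fit in bom)

print(mttf_years(1.0))          # MTTF in years for 1 fit
print(total_fit)                # total failure rate in fit
print(mttf_years(total_fit))    # MTTF of the hypothetical assembly in years
```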

MIL HDBK 217 (1)
MIL Handbook 217 B lists failure rates of common elements.
Failure rates depend strongly on the environment: temperature, vibration, humidity, and especially the location:
- Ground: benign, fixed, mobile
- Naval: sheltered, unsheltered
- Airborne: inhabited, uninhabited, cargo, fighter
- Airborne: rotary, helicopter
- Space: flight
The application of MIL HDBK 217 usually gives pessimistic results in terms of the overall system reliability (the computed reliability is lower than the actual reliability).
To obtain more realistic estimations it is necessary to collect failure data based on the actual application instead of using the generic values from MIL HDBK 217.

Failure rate catalogue MIL HDBK 217 (2)
Stress is expressed by λ factors.
Basic models:
– discrete components (e.g. resistor, transistor etc.): $\lambda = \lambda_b \, \pi_E \, \pi_Q \, \pi_A$
– integrated components (ICs, e.g. microprocessors etc.): $\lambda = \pi_Q \, \pi_L \, (C_1 \, \pi_T \, \pi_V + C_2 \, \pi_E)$
The MIL handbook gives curves/rules for different element types to compute the factors:
– $\lambda_b$ based on ambient temperature $\Theta_A$ and electrical stress S
– $\pi_E$ based on environmental conditions
– $\pi_Q$ based on production quality and burn-in period
– $\pi_A$ based on component characteristics and usage in application
– $C_1$ based on the complexity
– $C_2$ based on the number of pins and the type of packaging
– $\pi_T$ based on chip temperature $\Theta_J$ and technology
– $\pi_V$ based on voltage stress
Example: $\lambda_b$ usually grows exponentially with temperature $\Theta_A$ (Arrhenius law)

What can go wrong…
[Figures, by caption:]
• poor soldering (manufacturing)
• broken wire (vibrations)
• broken isolation (assembly)
• chip cracking (thermal stress)
• tin whiskers (lead-free soldering)

Failures that affect logic circuits
• Thermal stress (different dilatation coefficients, contact creeping)
• Electrical stress (electromagnetic fields)
• Radiation stress (high-energy particles, cosmic rays in the high atmosphere)
Errors that are transient in nature (called "soft errors") can be latched in memory and become firm errors. "Solid errors" will not disappear at restart.
E.g. an FPGA with 3 M gates, exposed to $9.3 \times 10^8$ neutrons/cm², exhibited 320 FIT at sea level and 150'000 FIT at 20 km altitude (see: http://www.actel.com/products/rescenter/ser/index.html)
Things are getting worse with smaller integrated circuit geometries!

Exercise: Failure Modeling - Weibull Analysis
The development of λ(t) towards the end of the lifetime of a component is usually described by a Weibull distribution: $\lambda(t) = \beta\, t^{\beta - 1}$ with $\beta > 0$.
a) Draw the functions λ(t) for the parameters β = 1, 2, 3 in a common coordinate system.
b) Compute the reliability function R(t) from λ(t).
c) Draw the reliability functions for the parameters β = 1, 2, 3 in a common coordinate system.
d) Compare the wearout behavior with the behavior assuming constant failure rates λ(t) = λ.
e) Draw the function λ(t) for the parameters β = 0.5, 1 and 3. Compare with a bathtub curve.

Cold, Warm and Hot redundancy
Hot redundancy: the reserve element is fully operational and under stress; it has the same failure rate as the operating element.
Warm redundancy: the reserve element can take over in a short time; it is not operational and has a smaller failure rate.
Cold redundancy (cold standby): the reserve is switched off and has zero failure rate.
[Figure: R(t) over time for the redundant element and for the reserve element, with the failure of the primary element and the switchover marked]

9.2.2 Reliability of series and parallel systems (combinatorial)

Reliability of a system of unreliable elements
[Figure: chain of elements 1, 2, 3, 4 in series]
The reliability of a system consisting of n elements, each of which is necessary for the function of the system, and which fail independently, is:
$R_{total} = R_1 \cdot R_2 \cdots R_n = \prod_{i=1}^{n} R_i$
Assuming constant failure rates makes it easy to calculate the failure rate of a system by summing the failure rates of the individual components:
$R_{NooN} = e^{-\sum_i \lambda_i t}$
This is the basis for the calculation of the failure rate of systems (MIL-HDBK-217F).

Example: series system, combinatorial solution
[Figure: inverter / power supply, motor with encoder, and controller in series]
λ_control = 0.00005 h^-1
λ_supply = 0.001 h^-1
λ_motor = 0.0001 h^-1
$R_{tot} = R_{supply} \cdot R_{motor} \cdot R_{control} = e^{-\lambda_{supply} t} \cdot e^{-\lambda_{motor} t} \cdot e^{-\lambda_{control} t} = e^{-(\lambda_{supply} + \lambda_{motor} + \lambda_{control})\,t}$
$\lambda_{total} = \lambda_{supply} + \lambda_{motor} + \lambda_{control} = 0.00115\ \mathrm{h^{-1}}$
Warning: this calculation no longer applies to redundant systems!
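The series example can be reproduced in a few lines. This is only a sketch: the dictionary layout and function name are illustrative, the three failure rates are the ones given on the slide, and the 100 hour mission time is an arbitrary choice.

```python
import math

# Failure rates per hour from the slide.
rates = {"supply": 0.001, "motor": 0.0001, "control": 0.00005}

def series_reliability(rates, t):
    """Product of the element reliabilities exp(-lambda_i * t)."""
    return math.prod(math.exp(-lam * t) for lam in rates.values())

total_rate = sum(rates.values())

print(total_rate)                          # summed failure rate per hour
print(1.0 / total_rate)                    # MTTF of the chain in hours
print(series_reliability(rates, 100.0))    # reliability after 100 h
print(math.exp(-total_rate * 100.0))       # same value via the summed rate
```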

Exercise: Reliability estimation
An electronic circuit consists of the following elements:

Quantity   Element               MTTF per element   Pins per element
1          processor             600 years          48
30         resistors             100'000 years      2
6          plastic capacitors    50'000 years       2
1          FPGA                  300 years          24
2          tantalum capacitors   10'000 years       2
1          quartz                20'000 years       2
1          connector             5000 years         16

The MTTF of one solder point (pin) is 200'000 years.
What is the expected Mean Time To Fail of this system?
Repair of this circuit takes 10 hours, replacing it by a spare takes 1 hour. What is the availability in both cases?
The machine where it is used costs 100 € per hour, with production 24 hours a day and a 30-year installation lifetime. What should the price of the spare be?

Exercise: MTTF calculation
An embedded controller consists of:
- one microprocessor 486
- 2 x 4 MB RAM
- 1 x Flash EPROM
- 50 dry capacitors
- 5 electrolytic capacitors
- 200 resistors
- 1000 soldering points
- 1 battery for the real-time clock
What is the MTTF of the controller and what is its weakest point? (Use the numbers of a previous slide.)

Redundant, parallel system: 1-out-of-2 with no repair - combinatorial solution
Simple redundant system: the system is good if any one (or both) of the elements is good.
[Figure: R1 and R2 in parallel; the system is ok in the cases R1 good / R2 good, R1 good / R2 down, R1 down / R2 good]
$R_{1oo2} = R_1 R_2 + R_1 (1 - R_2) + (1 - R_1) R_2$
$R_{1oo2} = 1 - (1 - R_1)(1 - R_2)$
with $R_1 = R_2 = R$: $R_{1oo2} = 2R - R^2$
with $R = e^{-\lambda t}$: $R_{1oo2} = 2 e^{-\lambda t} - e^{-2\lambda t}$

Combinatorial: R1oo2, no repair
Example R1oo2: airplane with two motors
MTTF of one motor = 1000 hours (this value is rather pessimistic)
Flight duration t = 2 hours
- What is the probability that any motor fails?
- What is the probability that both motors did not fail until time t (landing)?
Apply:
• $R_{1oo1} = e^{-\lambda t}$: a single motor does not fail: 0.998 (0.2 % chance it fails)
• $R_{2oo2} = e^{-2\lambda t}$: no motor failure: 0.996 (0.4 % chance that at least one fails)
• $R_{1oo2} = 2 e^{-\lambda t} - e^{-2\lambda t}$: both motors fail: 0.0004 % chance
assuming there is no common mode of failure (bad fuel or oil, hail, birds, …)
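The three figures of the airplane example can be recomputed as follows. This is a minimal sketch; the function names are illustrative, and independence of the two motors (no common mode failure) is assumed, as on the slide.

```python
import math

def r_1oo1(lam, t):
    """One unit survives until t."""
    return math.exp(-lam * t)

def r_2oo2(lam, t):
    """Both units survive until t (series of two)."""
    return math.exp(-2.0 * lam * t)

def r_1oo2(lam, t):
    """At least one of two independent units survives until t."""
    return 2.0 * math.exp(-lam * t) - math.exp(-2.0 * lam * t)

lam = 1.0 / 1000.0   # MTTF of one motor = 1000 h
t = 2.0              # flight duration in hours

print(r_1oo1(lam, t))                    # a single motor does not fail
print(r_2oo2(lam, t))                    # no motor fails
print((1.0 - r_1oo2(lam, t)) * 100.0)    # chance in percent that both motors fail
```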

R(t) for 1oo2 redundancy
[Figure: R(t) for 1oo2 compared with 1oo1, λ = 1; horizontal axis t in units of MTTF (0 to 2), vertical axis R (0 to 1)]

MIF, ARL, reliability of redundant structures
ARL: Acceptable Reliability Level
[Figure: R(t) over time for a simplex system and a system with redundancy; the ARL line defines the mission times MT1 and MT2; the reliabilities R1 and R2 are marked at a given mission time]
MIF: Mission Time Improvement Factor (for given ARL): MIF = MT2 / MT1
RIF: Reliability Improvement Factor (at given mission time): RIF = (1 - R_without) / (1 - R_with) = quotient of unreliabilities

R1oo2 Reliability Improvement Factor
Reliability improvement factor: RIF = (1 - R_without) / (1 - R_with)
RIF for a 10 hours mission with λ = 0.001 h^-1: R1oo1 = 0.990; R1oo2 = 0.999901; RIF ≈ 100
[Figure: R(t) for 1oo1 and 1oo2 with λ = 0.001, with the 10 hours mission marked]
but:
$\mathrm{MTTF}_{1oo2} = \int_0^\infty (2 e^{-\lambda t} - e^{-2\lambda t})\,dt = \dfrac{3}{2\lambda}$
No spectacular increase in MTTF!
1oo2 without repair is only suited when the mission time << 1/λ.
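The following sketch shows how the improvement shrinks as the mission gets longer. It takes the RIF as the unreliability without redundancy divided by the unreliability with redundancy, which is the reading that reproduces the RIF of about 100 quoted for the 10 hour mission. The function names and the additional mission times are illustrative assumptions.

```python
import math

def unreliability_1oo1(lam, t):
    """1 - exp(-lam * t), computed with expm1 to limit rounding error."""
    return -math.expm1(-lam * t)

def unreliability_1oo2(lam, t):
    """Both independent units must have failed."""
    return unreliability_1oo1(lam, t) ** 2

def rif_1oo2(lam, t):
    """Unreliability without redundancy over unreliability with redundancy."""
    return unreliability_1oo1(lam, t) / unreliability_1oo2(lam, t)

LAM = 0.001  # per hour, as on the slide

for mission_hours in (10.0, 100.0, 1000.0):
    print(mission_hours, rif_1oo2(LAM, mission_hours))

# Ratio of MTTF values: (3 / (2 * lam)) / (1 / lam)
print((1.5 / LAM) / (1.0 / LAM))
```

The RIF is large only while the mission time is much shorter than 1/λ, whereas the MTTF ratio stays at 1.5.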

Combinatorial: 2 out of 3 system
E.g. three computers, majority voting
[Figure: R1, R2 and R3 in parallel, followed by a 2/3 voter; the system works in the cases all good, R1 bad, R2 bad, R3 bad (one unit bad at a time) and fails otherwise]
$R_{2oo3} = R_1 R_2 R_3 + (1 - R_1) R_2 R_3 + R_1 (1 - R_2) R_3 + R_1 R_2 (1 - R_3)$
with identical elements $R_1 = R_2 = R_3 = R$: $R_{2oo3} = 3R^2 - 2R^3$
with $R = e^{-\lambda t}$: $R_{2oo3} = 3 e^{-2\lambda t} - 2 e^{-3\lambda t}$

2 out of 3 without repair - combinatorial solution
$R_{2oo3} = 3R^2 - 2R^3 = 3 e^{-2\lambda t} - 2 e^{-3\lambda t}$
$\mathrm{MTTF}_{2oo3} = \int_0^\infty (3 e^{-2\lambda t} - 2 e^{-3\lambda t})\,dt = \dfrac{5}{6\lambda}$
RIF < 1 when t > 0.7 MTTF!
2oo3 without repair is not interesting for long missions.
[Figure: R(t) curves for 1oo1, 1oo2 and 2oo3]
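The two statements on this slide (MTTF of 5/(6λ) and the crossover near 0.7 MTTF) follow from a short calculation, sketched here under the same assumptions as the slide: identical units, constant failure rate λ, perfect voter.

```latex
\begin{align*}
\mathrm{MTTF}_{2oo3}
  &= \int_0^\infty \left(3e^{-2\lambda t} - 2e^{-3\lambda t}\right) dt
   = \frac{3}{2\lambda} - \frac{2}{3\lambda}
   = \frac{5}{6\lambda} \\[4pt]
R_{2oo3} = R
  &\iff 3R^2 - 2R^3 - R = 0
   \iff -R\,(2R - 1)(R - 1) = 0 \\
  &\Rightarrow R = \tfrac{1}{2}
   \quad\Rightarrow\quad e^{-\lambda t} = \tfrac{1}{2}
   \quad\Rightarrow\quad t = \frac{\ln 2}{\lambda} \approx 0.69\ \mathrm{MTTF}_{1oo1}
\end{align*}
```

For t beyond this point R is below 1/2 and the 2oo3 structure is less reliable than a single unit, which is why the MTTF of 2oo3 is smaller than 1/λ.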

General case: k out of N Redundancy (1)
K-out-of-N computer (KooN)
• N units perform the function in parallel
• K fault-free units are necessary to achieve a correct result
• N - K units are "reserve" units, but can also participate in the function
E.g.:
• aircraft with 8 engines: 6 are needed to accomplish the mission.
• voting in computers: if the output is obtained by voting among all N units, N ≤ 2K - 1
  (worst-case assumption: all faulty units fail in the same way)

What is better?
• 4 motors, three of which are sufficient to accomplish the mission (fly 21 days, MTTF = 10'000 h per motor)
• 12 motors, 8 of which are sufficient to accomplish the mission (fly 21 days, MTTF = 5'000 h per motor)

General case: k out of N redundancy (2)
[Figure: example with N = 4]
The probabilities of all cases (no fail, one of N fail, two of N fail, …, all fail) sum to 1:
$1 = R^N + \binom{N}{1}(1-R)\,R^{N-1} + \binom{N}{2}(1-R)^2 R^{N-2} + \dots + \binom{N}{i}(1-R)^i R^{N-i} + \dots + (1-R)^N$
The terms correspond to N of N, (N-1) of N, (N-2) of N, … units working. A K-out-of-N system works as long as at most N - K units have failed (i counts the failed units):
$R_{KooN} = \sum_{i=0}^{N-K} \binom{N}{i} (1 - R)^i R^{N-i}$
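A K-out-of-N reliability can be evaluated directly from the binomial terms. The sketch below sums over the number of failed units, from 0 up to N - K; the function names are illustrative, and identical, independent units with constant failure rate are assumed. As a usage example it evaluates the two motor configurations of the "What is better?" slide (21 days = 504 h).

```python
import math

def r_koon(k, n, r):
    """Probability that at least k of n identical, independent units work.

    i counts the failed units; the system survives up to n - k failures.
    """
    return sum(
        math.comb(n, i) * (1.0 - r) ** i * r ** (n - i)
        for i in range(n - k + 1)
    )

def unit_reliability(mttf_hours, t_hours):
    """R = exp(-t / MTTF) for a constant failure rate."""
    return math.exp(-t_hours / mttf_hours)

mission = 21 * 24.0  # 21 days in hours

r_3oo4 = r_koon(3, 4, unit_reliability(10_000.0, mission))
r_8oo12 = r_koon(8, 12, unit_reliability(5_000.0, mission))

print("3oo4 :", r_3oo4)
print("8oo12:", r_8oo12)
```

The same function reproduces the special cases: `r_koon(1, 2, r)` equals 2R - R² and `r_koon(2, 3, r)` equals 3R² - 2R³.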

Comparison chart
[Figure: R versus t (0 to 2, in units of MTTF) for the structures 1oo1, 1oo2, 1oo4, 2oo3, 2oo4, 3oo4 and 8oo12]

What does cross redundancy bring?
[Figure: reliability chains controller - network, duplicated, without and with cross-coupling]
• separate chains: a double fault brings the system down
• cross-coupling: better in principle, since some double faults can be outlived
• but cross-coupling needs a switchover logic (UL), so availability sinks again

Summary
Assumes: all units have identical failure rates and the comparison/voting hardware does not fail.
• 1oo1 (non redundant): $R_{1oo1} = R$
• 1oo2 (duplication and error detection): $R_{1oo2} = 2R - R^2$
• 2oo3 (triplication and voting): $R_{2oo3} = 3R^2 - 2R^3$
• KooN (K out of N must work; i counts the working units): $R_{KooN} = \sum_{i=K}^{N} \binom{N}{i} R^i (1 - R)^{N-i}$

Exercise: 2oo3 considering voter unreliability
Compute the MTTF of the following 2-out-of-3 system with the component failure rates:
– redundant units: λ1 = 0.1 h^-1
– voter unit: λ2 = 0.001 h^-1
[Figure: the input feeds three redundant units R1 in parallel, followed by a 2/3 voter R2 that drives the output]

Complex systems
[Figure: reliability block diagram combining series and parallel (redundant) blocks R1 to R9]
Reliability is dominated by the non-redundant parts; as a first approximation, forget the redundant parts.

Exercise: Reliability of Fault-Tolerant Structures
Assume that all units in the sequel have a constant failure rate λ. Compute the reliability functions (and MTTF) for the following structures:
a) non-redundant
b) 1/2 system
c) 2/3 system
assuming perfect (λ_p = 0) voters, error detection, reconfiguration circuits etc.
d) Draw all functions in a common coordinate system.
e) For a railway signalling system, which structure is preferable?
f) Is the answer different for a space application with a given mission time? Why?

9.2.3 Considering repair

Repair
Fault-tolerance does not improve reliability under all circumstances. It is a solution for short mission durations.
Solution: repair (preventive maintenance, off-line repair, on-line repair)
Example:
• short mission time, high MTTF: pilot, co-pilot
• long mission time, low MTTF: how to reach the stars? (hibernation, reproduction in space)
Problem:
• exchange of faulty parts during operation (safety!)
• reintegration of new parts, teaching and synchronization

Preventive maintenance
[Figure: R(t) over time, restored to 1 at each preventive maintenance; MTBPM = Mean Time Between Preventive Maintenance]
Preventive maintenance reduces the probability of failure, but does not prevent it.
In systems with wear, preventive maintenance prevents aging (e.g. replace oil, filters).
Preventive maintenance is a regenerative process (maintained parts are as good as new).

Considering Repair
Beyond combinatorial reliability, more suitable tools are required.
The basic tool is the Markov Chain (or Markov Process).

9.2.4 Markov models

Markov
Describe the system through states, with transitions depending on fault-relevant events.
States must be
– mutually exclusive
– collectively exhaustive
Let $p_i(t)$ = probability of being in state $S_i$ at time t, then $\sum_{\text{all states}} p_i(t) = 1$
The probability of leaving a state depends only on the current state (it is independent of how much time was spent in the state or how the state was reached).
Example: protection failure
[Figure: states OK (normal), PD (protection not working) and DG (danger); OK → PD on protection failure, PD → OK on repair (rate µ); lightning strikes in OK are not dangerous, a lightning strike in PD leads to DG]
What is the probability that the protection is down when lightning strikes?

Continuous Markov Chains
[Figure: State 1 (P1) → State 2 (P2) at rate λ, State 2 → State 1 at rate µ]
Time is considered continuous. Instead of transition probabilities, the temporal behavior is given by transition rates (i.e. transition probabilities per infinitesimal time step).
A system remains in the same state unless it goes to a different state.
Relationships between state probabilities are modeled by differential equations, e.g.
$dP_1/dt = \mu P_2 - \lambda P_1$, $dP_2/dt = \lambda P_1 - \mu P_2$
For any state (inflow minus outflow, with $\lambda_{ki}$ the rate from state k to state i):
$\dfrac{dp_i(t)}{dt} = \sum_{k \ne i} \lambda_{ki}\, p_k(t) - \sum_{j \ne i} \lambda_{ij}\, p_i(t)$

Markov Chain Simplification Rules (1)
Parallel transitions:
[Figure: two transitions A → B with rates λ1 and λ2 are merged into a single transition A → B with rate λ1 + λ2]
Intermediate states:
• The states have the same outgoing events leading to the same state(s).
• No other incoming/outgoing transitions exist.
[Figure: A → B (λ1), A → C (λ2), A → D (λ3), with B, C and D each leading to E at rate λ4, is reduced to A → F at rate λ1 + λ2 + λ3 and F → E at rate λ4]

Markov Chain Simplification Rules (2)
Side step events:
[Figure: states A, B and C with transitions λ1 and λ2, simplified to A → C at rate λ2]

Markov - hydraulic analogy
[Figure: tanks representing the states (S1, S2, …) with levels p1(t), p2(t), …; flows with rates λ12, λ32 and λ42 arrive in S2 from other states; a pump with rate µ carries the flow µ p2(t) out of S2]
Output flow = probability of being in a state P × output rate of the state
Simplification: output rate λ_j = constant (not a critical simplification)

Reliability expressed as state transition
One element:
[Figure: good (P0) → fail (P1) at rate λ(t)]
$dp_0/dt = -\lambda p_0$
$dp_1/dt = +\lambda p_0$
$R(t) = p_0(t) = e^{-\lambda t}$, $R(t = 0) = 1$
Arbitrary transitions:
[Figure: non-terminal states (all ok, up 1, up 2) and terminal states (fail 1, fail 2)]
$R(t) = 1 - (p_{fail1} + p_{fail2})$

Reliability and Availability expressed in Markov
Reliability:
[Figure: good → bad at failure rate λ(t); timeline: good until the failure, then bad; MTTF marked]
Definition: "probability that an item will perform its required function in the specified manner and under specified or assumed conditions over a given time period"
Availability:
[Figure: up → down at failure rate λ, down → up at repair rate µ; timeline alternating between up and down (repair); MDT marked]
Definition: "probability that an item will perform its required function in the specified manner and under specified or assumed conditions at a given time"

Reliable systems have absorbing states; they may include repair, but eventually they will fail.

Redundancy calculation with Markov: 1 out of 2 (no repair)
Markov: P0 (good) → P1 at rate 2λ, P1 → P2 (fail) at rate λ, with λ = constant
What is the probability that the system is in state S0 or S1 until time t?
Linear differential equations:
$dp_0/dt = -2\lambda p_0$
$dp_1/dt = +2\lambda p_0 - \lambda p_1$
$dp_2/dt = +\lambda p_1$
Initial conditions: $p_0(0) = 1$ (initially good), $p_1(0) = 0$, $p_2(0) = 0$
Solution:
$p_0(t) = e^{-2\lambda t}$
$p_1(t) = 2 e^{-\lambda t} - 2 e^{-2\lambda t}$
$R(t) = p_0(t) + p_1(t) = 2 e^{-\lambda t} - e^{-2\lambda t}$ (same result as combinatorial - QED)
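The differential equations of this three-state chain can also be integrated numerically and compared with the closed-form solution. The sketch below uses a hand-written fourth-order Runge-Kutta step to stay free of external libraries; the step count, the value λ = 0.001 per hour and the 1000 hour horizon are illustrative assumptions.

```python
import math

def derivatives(p, lam):
    """Right-hand sides of the 1oo2 (no repair) state equations."""
    p0, p1, p2 = p
    return (-2.0 * lam * p0,
            2.0 * lam * p0 - lam * p1,
            lam * p1)

def rk4_step(p, lam, h):
    """One classical Runge-Kutta step of size h."""
    k1 = derivatives(p, lam)
    k2 = derivatives(tuple(x + 0.5 * h * k for x, k in zip(p, k1)), lam)
    k3 = derivatives(tuple(x + 0.5 * h * k for x, k in zip(p, k2)), lam)
    k4 = derivatives(tuple(x + h * k for x, k in zip(p, k3)), lam)
    return tuple(
        x + h / 6.0 * (a + 2.0 * b + 2.0 * c + d)
        for x, a, b, c, d in zip(p, k1, k2, k3, k4)
    )

def integrate(lam, t_end, steps=1000):
    """State probabilities (p0, p1, p2) at t_end, starting from (1, 0, 0)."""
    p = (1.0, 0.0, 0.0)
    h = t_end / steps
    for _ in range(steps):
        p = rk4_step(p, lam, h)
    return p

lam, t_end = 0.001, 1000.0
p0, p1, p2 = integrate(lam, t_end)

print(p0 + p1)                                                # numeric R(t)
print(2.0 * math.exp(-lam * t_end) - math.exp(-2.0 * lam * t_end))  # closed form
```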

Reliable 1-out-of-2 with on-line repair (1oo2)
[Figure: four-state model. P0: good; S1 (P1): on-line unit failed; S2 (P2): back-up unit failed; P3: fail. Failure rates λn (on-line unit fails) and λb (back-up also fails), repair rates µn and µb]
$dp_0/dt = -2\lambda p_0 + \mu p_1 + \mu p_2$
$dp_1/dt = +\lambda p_0 - (\lambda + \mu)\, p_1$
$dp_2/dt = +\lambda p_0 - (\lambda + \mu)\, p_2$
$dp_3/dt = +\lambda p_1 + \lambda p_2$
With λn = λb = λ and µn = µb = µ, this is equivalent to a three-state model P0, P1+2, P3:
$dp_0/dt = -2\lambda p_0 + \mu p_{1+2}$
$dp_{1+2}/dt = +2\lambda p_0 - (\lambda + \mu)\, p_{1+2}$
$dp_3/dt = +\lambda (p_1 + p_2)$
It is easier to model with a repair team for each failed unit (no serialization of repair).

Reliable 1-out-of-2 with on-line repair (1oo2)
What is the probability that a system fails while one failed element awaits repair?
Markov: P0 → P1 at failure rate 2λ, P1 → P0 at repair rate µ, P1 → P2 at failure rate λ; P2 is the absorbing state.
Linear differential equations:
$dp_0/dt = -2\lambda p_0 + \mu p_1$
$dp_1/dt = +2\lambda p_0 - (\lambda + \mu)\, p_1$
$dp_2/dt = +\lambda p_1$
Initial conditions: $p_0(0) = 1$ (initially good), $p_1(0) = 0$, $p_2(0) = 0$
Ultimately, the absorbing states will be "filled", the non-absorbing will be "empty".

Results: reliability R(t) of 1oo2 with repair rate µ
$R(t) = P_0 + P_1 = \dfrac{(3\lambda + \mu) + W}{2W}\, e^{-\frac{1}{2}(3\lambda + \mu - W)\,t} - \dfrac{(3\lambda + \mu) - W}{2W}\, e^{-\frac{1}{2}(3\lambda + \mu + W)\,t}$
with: $W = \sqrt{\lambda^2 + 6\lambda\mu + \mu^2}$
[Figure: R(t) over time in hours for λ = 0.01 h^-1, with curves for µ = 10 h^-1, µ = 1.0 h^-1, µ = 0.1 h^-1 and 1oo2 without repair. Notes: we do not consider short mission times; repair does not interrupt the mission]
R(t) is accurate, but not very helpful; MTTF is a better index for long mission times.
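The closed-form result can be evaluated for the repair rates shown in the plot. This sketch uses the exponents -(3λ + µ ∓ W)t/2, which is what solving the two non-absorbing state equations gives and which reduces to 2e^{-λt} - e^{-2λt} when µ = 0. The function name and the 100 hour evaluation point are illustrative assumptions.

```python
import math

def r_1oo2_repair(lam, mu, t):
    """Reliability of 1oo2 with on-line repair (three-state Markov model)."""
    w = math.sqrt(lam**2 + 6.0 * lam * mu + mu**2)
    a = 3.0 * lam + mu
    slow = (a + w) / (2.0 * w) * math.exp(-0.5 * (a - w) * t)
    fast = (a - w) / (2.0 * w) * math.exp(-0.5 * (a + w) * t)
    return slow - fast

LAM = 0.01  # per hour, as in the plot

for mu in (0.0, 0.1, 1.0, 10.0):
    print(mu, r_1oo2_repair(LAM, mu, 100.0))
```

A faster repair keeps R(t) closer to 1 over the same period, while µ = 0 reproduces the no-repair curve.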

Mean Time To Fail (MTTF)
[Figure: Markov graph with states P0 to P4, divided into non-absorbing states i and absorbing states j; plot of R(t), the total probability of the non-absorbing states, over time]
$\mathrm{MTTF} = \sum_{\text{non-absorbing states } i} \int_0^\infty p_i(t)\,dt$

MTTF calculation in Laplace (example 1oo2)
Laplace transform of the differential equations:
$s P_0(s) - p_0(t=0) = -2\lambda P_0(s) + \mu P_1(s)$
$s P_1(s) - 0 = +2\lambda P_0(s) - (\lambda + \mu)\, P_1(s)$
$s P_2(s) - 0 = +\lambda P_1(s)$
Initial conditions: $p_0(t=0) = 1$ (initially good)
Apply the boundary theorem: $\lim_{t \to \infty} \int_0^t p(x)\,dx = \lim_{s \to 0} P(s)$
Only include non-absorbing states (number of equations = number of non-absorbing states):
$-1 = -2\lambda P_0 + \mu P_1$
$0 = +2\lambda P_0 - (\lambda + \mu)\, P_1$
Solution of the linear equation system:
$\mathrm{MTTF} = P_0 + P_1 = \dfrac{\lambda + \mu}{2\lambda^2} + \dfrac{1}{\lambda} = \dfrac{\mu/\lambda + 3}{2\lambda}$
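The two linear equations can be solved mechanically, which is how larger models are handled in practice. The sketch below applies Cramer's rule to the 2x2 system for the non-absorbing states; the function name is illustrative, and exact fractions are used so that the result can be compared with the closed form without rounding error.

```python
from fractions import Fraction

def mttf_1oo2_repair(lam, mu):
    """Solve  -1 = -2*lam*P0 + mu*P1,  0 = 2*lam*P0 - (lam + mu)*P1.

    Returns P0 + P1, the sum of the integrals of the non-absorbing states.
    """
    a11, a12 = -2 * lam, mu
    a21, a22 = 2 * lam, -(lam + mu)
    b1, b2 = -1, 0
    det = a11 * a22 - a12 * a21
    p0 = (b1 * a22 - a12 * b2) / det   # Cramer's rule
    p1 = (a11 * b2 - a21 * b1) / det
    return p0 + p1

lam = Fraction(1, 100_000)   # 1e-5 per hour
mu = Fraction(1)             # 1 per hour

mttf_hours = mttf_1oo2_repair(lam, mu)
print(float(mttf_hours))                              # MTTF in hours
print(float((mu / lam + 3) / (2 * lam)))              # closed form from the slide
print(float(mttf_1oo2_repair(lam, Fraction(0))))      # no repair: 3 / (2 * lam)
```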

General equation for calculating MTTF
1) Set up the differential equations
2) Identify the terminal (absorbing) states
3) Set up the Laplace transform for the non-absorbing states:
   $(-1, 0, 0, \dots)^T = M \cdot P_{na}$
   The degree of the equation is equal to the number of non-absorbing states.
4) Solve the linear equation system
5) The MTTF of the system is equal to the sum of the non-absorbing state integrals.
6) To compute the probability of not entering a certain state, assign a dummy (very low) repair rate to all other absorbing states and recalculate the matrix

Example: 1oo2 control computer in standby
[Figure: the input feeds an on-line unit (failure rate λw) and an idle stand-by unit (failure rate λs); each unit has error detection (ED), which also covers idle parts; the output comes from the on-line unit. Repair rate µ is the same for both units; coverage = c]

Correct diagram for 1oo2
Consider that the failure rate λ of a device in a 1oo2 system is divided into two failure rates:
1) a benign failure, immediately discovered, with probability c
 - if the device is on-line, switchover to the stand-by device is successful and repair is called
 - if the device is on stand-by, repair is called
2) a malicious failure, which is not discovered, with probability (1 - c)
 - if the device is on-line, switchover to the stand-by device fails and the system fails
 - if the device is on stand-by, switchover will be unsuccessful should the on-line device fail
States:
 P0: both units good
 P1: on-line fails, fault detected (successful switchover and repair), or standby fails, fault detected, successful repair
 P2: standby fails, fault not detected
 P3: both fail, system down (absorbing state)
Transitions: P0 → P1 at (λw + λs) c; P0 → P2 at λs (1 - c); P0 → P3 at λw (1 - c); P1 → P0 at µ; P1 → P3 and P2 → P3 when the remaining unit fails.
With λw = λs = λ:
$-1 = -2\lambda P_0 + \mu P_1$
$0 = +2\lambda c\, P_0 - (\lambda + \mu)\, P_1$
$0 = +\lambda (1 - c)\, P_0 - \lambda P_2$
$\mathrm{MTTF} = \dfrac{(2 + c) + (2 - c)\,\mu/\lambda}{2\,(\lambda + (1 - c)\,\mu)}$

Approximation found in the literature
This simplified diagram considers that the undetected failure of the spare immediately causes a system failure (simplified with λw = λs = λ).
[Figure: P0 → P1 at rate 2λc; P0 → P3+P2 (absorbing state) at rate 2λ(1 - c); P1 → P0 at rate µ; P1 → P3+P2 at rate λ]
Applying Markov:
$-1 = -2\lambda P_0 + \mu P_1$
$0 = +2\lambda c\, P_0 - (\lambda + \mu)\, P_1$
The absorbing state, $s P_3(s) = 2\lambda (1 - c)\, P_0(s) + \lambda P_1(s)$, is not needed for the MTTF.
$\mathrm{MTTF} = P_0 + P_1 = \dfrac{(1 + 2c) + \mu/\lambda}{2\,(\lambda + (1 - c)\,\mu)}$
The results are nearly the same as with the previous four-state model, showing that state 2 has a very short duration.
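To see how close the two models are, both MTTF expressions can be evaluated over a range of coverage values. This is a sketch under the slide's simplification λw = λs = λ; the function names and the list of coverage values are illustrative, and λ = 1e-5 per hour with µ = 1 per hour are the values used in the coverage example.

```python
def mttf_four_state(lam, mu, c):
    """MTTF of the four-state 1oo2 model with coverage c."""
    return ((2.0 + c) + (2.0 - c) * mu / lam) / (2.0 * (lam + (1.0 - c) * mu))

def mttf_simplified(lam, mu, c):
    """MTTF of the simplified three-state model with coverage c."""
    return ((1.0 + 2.0 * c) + mu / lam) / (2.0 * (lam + (1.0 - c) * mu))

LAM, MU = 1e-5, 1.0
HOURS_PER_YEAR = 8760.0

for c in (1.0, 0.999999, 0.9999, 0.99, 0.9, 0.5, 0.0):
    four = mttf_four_state(LAM, MU, c) / HOURS_PER_YEAR
    simple = mttf_simplified(LAM, MU, c) / HOURS_PER_YEAR
    print(f"c = {c:<9} four-state: {four:12.1f} y   simplified: {simple:12.1f} y")
```

Both expressions coincide for perfect coverage; for c < 1 the simplified model is the more pessimistic of the two, since it treats an undetected standby failure as an immediate system failure.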

Influence of coverage (2)
Example: λ = 10^-5 h^-1 (MTTF = 11.4 years), µ = 1 h^-1
MTTF with perfect coverage = 570468 years
[Figure: MTTF(c) in years (0 to 600000) versus coverage, from 1.00 over 0.999999, 0.9999, 0.99, 0.9 and 0.6 down to 0]
When coverage falls below 60%, the redundant (1oo2) system performs no better than a simplex one!
Therefore, coverage is a critical success factor for redundant systems!
In particular, redundancy is useless if failure of the spare remains undetected (lurking error).
Limits (from the simplified model):
$\lim_{c \to 1} \mathrm{MTTF} = \dfrac{1}{\lambda}\left(\dfrac{3}{2} + \dfrac{\mu}{2\lambda}\right)$
$\lim_{\lambda/\mu \to 0} \mathrm{MTTF} = \dfrac{1}{2\lambda\,(1 - c)}$ (for c < 1)

Application: 1oo2 for drive-by-wire
[Figure: two controllers, each with control and self-check, acting on the actuators a1 and a2]
Coverage is assumed to be the probability that the self-check detects an error in the controller.
When the self-check detects an error, it passivates the controller (the output is disconnected) and the other controller takes control.
One assumes that an accident occurs if both controllers act differently, i.e. if a computer does not fail to silent behaviour.
Self-check is not instantaneous, and there is a probability that the self-check logic is not operational and fails in underfunction (overfunction is an availability issue).

Results 1oo2c, applied to drive-by-wire
λ = failure rate of one chain (sensor to brake) = 10^-5 h^-1 (MTTF ≈ 11.4 years)
c = coverage: variable (expressed as uncoverage: 3 nines = 99.9 % detected)
µ = repair rate = parameter
- 1 second: reboot and restart
- 6 minutes: go to side and stop
- 30 minutes: go to next garage
[Figure: log(MTTF) (0 to 16) versus uncoverage in nines (1 = poor to 10 = excellent), one curve per repair time (1 second, 6 minutes, 30 minutes); marked: 0.1% undetected; 1 Mio years, or once per year on a million vehicles]
Conclusion: the repair interval does not matter when coverage is poor.

Protection system (general)
In protection systems, the dangerous situation occurs when the plant is threatened (e.g. short circuit) and the protection device is unable to respond.
The threat is a stochastic event, therefore it can be treated as a failure event.
[Figure: states OK (normal), PD (protection down) and DG (danger); OK → PD on protection failure, PD → OK at rate µ (detection and repair); a threat to the plant (rate σ) is not dangerous in OK and leads from PD to DG]
The repair rate µ includes the detection time t! This directly impacts the maintenance rate.
What is an acceptable repair interval?
Note: another way to express the reliability of a protection system will be shown under "availability".

Protection system: how to compute test intervals

Rates:
• λ1 = overfunction of protection
• λ2 = lurking overfunction
• λ3 = lurking underfunction
• plant threat = plant suffers attack
• t = test rate (e.g. 1 / 6 months)
• µ = repair rate (e.g. 1 / 8 hours)

[Figure: Markov model with states P0 to P6. Labels: normal (P0); protection failed by immediate overfunction (plant down, single fault); lurking overfunction (unwanted trip at next attack), found at test rate t; lurking underfunction, found at test rate t; plant down, double fault; protection failed by underfunction (fail-to-trip) under plant threat (danger); a second lurking failure is marked as unlikely. Detected errors are repaired with rate µ. The states other than P0 are marked as unavailable states.]

Since back-up protection systems exist, utilities are more concerned by non-productive states.

9.2.5 Availability evaluation

Availability

[Figure: timeline alternating between up and down periods]

Availability expresses how often a piece of repairable equipment is functioning. It depends on the failure rate λ and the repair rate µ.

• Punctual or point availability = probability that the system is working at time t (not relevant for most processes).
• Stationary availability = duty cycle (percentage of time spent in the up state); it impacts financial results.

$A = \lim_{t \to \infty} \frac{\sum \text{up times}}{\sum (\text{up times} + \text{down times})} = \frac{MTTF}{MTTF + MTTR}$

Unavailability is the complement of availability ($U = 1.0 - A$) and is often the more convenient expression (e.g. 5 minutes downtime per year corresponds to an availability of 99.999 %).

Assumption behind the model: renewable system

$R(t) \le A(t)$ due to repair or preventive maintenance (exchange of parts that did not yet fail). After repair, the system is as new.

[Figure: A(t) over time t, between 0 and 1]

Stationary availability over the lifetime: $A = \frac{MTTF}{MTTF + MTTR}$

Examples of availability requirements

• substation automation: availability > 99.95 % (4 hours per year)
• telecom power supply: unavailability $5 \cdot 10^{-7}$ (15 seconds per year)
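
The two requirements above can be cross-checked by converting an availability figure into downtime per year. The following is a minimal sketch; the function name and the year length of 8765 hours (the value used on the later slides) are assumptions made for illustration.

```python
HOURS_PER_YEAR = 8765.0  # assumed year length, as used on the later slides


def downtime_hours_per_year(unavailability):
    """Expected downtime per year for a given stationary unavailability U = 1 - A."""
    return unavailability * HOURS_PER_YEAR


# Substation automation: availability > 99.95 %, i.e. U < 5e-4
substation_h = downtime_hours_per_year(1.0 - 0.9995)

# Telecom power supply: unavailability 5e-7, expressed in seconds per year
telecom_s = downtime_hours_per_year(5e-7) * 3600.0

print(f"substation automation: {substation_h:.1f} hours per year")
print(f"telecom power supply:  {telecom_s:.0f} seconds per year")
```

Both results land close to the rounded figures quoted in the list (about 4 hours and about 15 seconds).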

Availability expressed in Markov states

[Figure: Markov diagram with states P0 to P4, partitioned into up states and down states]

Availability = $\sum_{\text{up states } i} p_i(t = \infty)$

Unavailability = $\sum_{\text{down states } j} p_j(t = \infty)$ (the down states are non-absorbing)

Availability of repairable system

Markov states: P0 (up) and P1 (down state, but not absorbing), with failure rate λ from P0 to P1 and repair rate µ from P1 to P0.

$\frac{dp_0}{dt} = -\lambda p_0 + \mu p_1$
$\frac{dp_1}{dt} = +\lambda p_0 - \mu p_1$

Stationary state: $\lim_{t \to \infty} \frac{dp_0}{dt} = \frac{dp_1}{dt} = 0$. Due to linear dependency, add the condition $p_0 + p_1 = 1$.

Availability: $A = \frac{1}{1 + \lambda/\mu}$
Unavailability: $U = 1 - A = \frac{1}{1 + \mu/\lambda}$

e.g.
MTBF = 100 Y -> λ = 1 / (100 * 8765) h-1
MTTR = 72 h -> µ = 1/72 h-1
-> A = 99.991 %
-> U = 43 min / year
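
The numerical example for the two-state model can be reproduced in a few lines. This is a sketch only; the function name is an assumption, and the stationary probabilities are simply the solution of the two balance equations together with p0 + p1 = 1.

```python
HOURS_PER_YEAR = 8765.0  # year length used in the slide's example


def two_state_stationary(lam, mu):
    """Stationary probabilities (p0 up, p1 down) of the two-state repairable model."""
    p0 = mu / (lam + mu)   # equals 1 / (1 + lam/mu)
    p1 = lam / (lam + mu)  # equals 1 / (1 + mu/lam)
    return p0, p1


lam = 1.0 / (100 * HOURS_PER_YEAR)  # MTBF = 100 years
mu = 1.0 / 72.0                     # MTTR = 72 hours

availability, unavailability = two_state_stationary(lam, mu)
minutes_down = unavailability * HOURS_PER_YEAR * 60.0

print(f"A = {availability * 100:.4f} %")
print(f"U = {minutes_down:.1f} minutes per year")
```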

Example: Availability of 1oo2 (1-out-of-2)

Markov states: P0 (both units up), P1 (one unit failed), P2 (both units failed: down state, but not absorbing). Transition rates: 2λ from P0 to P1, λ from P1 to P2, µ from P1 to P0, 2µ from P2 to P1.

Assumption: devices can be repaired independently (little impact when λ << µ).

$\frac{dp_0}{dt} = -2\lambda p_0 + \mu p_1$
$\frac{dp_1}{dt} = +2\lambda p_0 - (\lambda + \mu) p_1 + 2\mu p_2$
$\frac{dp_2}{dt} = +\lambda p_1 - 2\mu p_2$

Stationary state: $\lim_{t \to \infty} \frac{dp_0}{dt} = \frac{dp_1}{dt} = \frac{dp_2}{dt} = 0$. Due to linear dependency, add the condition $p_0 + p_1 + p_2 = 1$.

Solving these equations gives the unavailability

$U = 1 - A = p_2 = \frac{\lambda^2}{(\lambda + \mu)^2} \approx \left(\frac{\lambda}{\mu}\right)^2$ for $\lambda \ll \mu$

e.g.
MTBF = 100 Y -> λ = 1 / (100 * 8765) h-1
MTTR = 72 h -> µ = 1/72 h-1
-> A = 99.9999993 %
-> U = 0.2 s / year
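
Because the 1oo2 model is a simple chain P0 <-> P1 <-> P2, its stationary state can also be obtained from the balance of flows between neighbouring states, without solving a matrix. The sketch below assumes exactly the rates of the differential equations above (2λ, λ upward; µ, 2µ downward); the function name is hypothetical.

```python
def chain_stationary(forward_rates, backward_rates):
    """Stationary probabilities of a chain 0 <-> 1 <-> ... <-> n.

    forward_rates[k]  : rate from state k to state k+1
    backward_rates[k] : rate from state k+1 to state k
    In the stationary state the flow k -> k+1 equals the flow k+1 -> k.
    """
    weights = [1.0]
    for forward, backward in zip(forward_rates, backward_rates):
        weights.append(weights[-1] * forward / backward)
    total = sum(weights)
    return [w / total for w in weights]


HOURS_PER_YEAR = 8765.0
lam = 1.0 / (100 * HOURS_PER_YEAR)  # MTBF = 100 years per device
mu = 1.0 / 72.0                     # MTTR = 72 hours

p0, p1, p2 = chain_stationary([2 * lam, lam], [mu, 2 * mu])
unavailability = p2                 # only P2 is a down state
seconds_down = unavailability * HOURS_PER_YEAR * 3600.0

print(f"U = {unavailability:.3e}  ({seconds_down:.2f} s per year)")
```

The result agrees with the order of magnitude quoted on the slide (about 0.2 s per year) and with the closed form λ²/(λ+µ)² that follows from the same equations.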

Availability calculation

1) Set up the differential equations for all states.
2) Identify the up and down states (no absorbing states allowed!).
3) Remove one state equation (arbitrary; for numerical reasons take an unlikely state).
4) Add as first equation the precondition $1 = \sum p$ (all states), so that the system reads $(1, 0, 0, \dots)^T = M \cdot P_{all}$.
5) The degree of the equation system is equal to the number of states.
6) Solve the linear equation system, yielding the percentage of time spent in each state.
7) The unavailability is equal to the sum of the probabilities of the down states.

We do not use Laplace for calculating the availability!
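
The procedure above can be sketched in plain Python. This is an illustrative implementation under stated assumptions, not the course's own tool: the helper names, the dictionary format `rates[(i, j)]` and the use of Gaussian elimination are choices made here. The 1oo2 model with independent repair serves as the test case, and the balance equation of the unlikely state P2 is the one replaced by the normalisation.

```python
def solve_linear(a, b):
    """Solve a*x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= factor * m[col][k]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        known = sum(m[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (m[i][n] - known) / m[i][i]
    return x


def stationary_probabilities(rates, n_states, drop):
    """Steps 1 to 6: rates[(i, j)] is the transition rate from state i to j."""
    a = [[0.0] * n_states for _ in range(n_states)]
    for (i, j), rate in rates.items():
        a[j][i] += rate  # inflow into state j
        a[i][i] -= rate  # outflow from state i
    b = [0.0] * n_states
    a[drop] = [1.0] * n_states  # replace one equation by sum(p) = 1
    b[drop] = 1.0
    return solve_linear(a, b)


lam = 1.0 / (100 * 8765)  # per-device failure rate, 1/h
mu = 1.0 / 72.0           # per-device repair rate, 1/h
rates = {(0, 1): 2 * lam, (1, 0): mu, (1, 2): lam, (2, 1): 2 * mu}

p = stationary_probabilities(rates, n_states=3, drop=2)
down_states = [2]
unavailability = sum(p[j] for j in down_states)  # step 7
print(f"p = {p}")
print(f"U = {unavailability:.3e}")
```

The same two functions apply to any model without absorbing states; only the `rates` dictionary and the list of down states change.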

1oo2 including coverage

Markov states: P0 (both units up), P1 (one unit failed, failure detected), P2 (down state, but not absorbing). Transition rates: 2λc from P0 to P1, 2λ(1-c) from P0 to P2, λ from P1 to P2, µ from P1 to P0, 2µ from P2 to P1.

Assumption: devices can be repaired independently (little impact when λ << µ).

$\frac{dp_0}{dt} = -2\lambda p_0 + \mu p_1$
$\frac{dp_1}{dt} = +2\lambda c\, p_0 - (\lambda + \mu) p_1 + 2\mu p_2$
$\frac{dp_2}{dt} = +2\lambda (1-c)\, p_0 + \lambda p_1 - 2\mu p_2$

Stationary state: $\lim_{t \to \infty} \frac{dp_0}{dt} = \frac{dp_1}{dt} = \frac{dp_2}{dt} = 0$. Due to linear dependency, add the condition $p_0 + p_1 + p_2 = 1$.

Availability $A = p_0 + p_1$, unavailability $U = 1 - A = p_2$; the case of interest is $\mu/\lambda \gg 1$.
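
As a worked step, the stationary unavailability can be derived directly from the three equations as written on this slide (rates 2λc, 2λ(1-c), λ, µ and 2µ). The derivation below is a sketch that takes those equations at face value; it is not copied from the slide.

```latex
% dp0/dt = 0:
p_1 = \frac{2\lambda}{\mu}\, p_0
% dp2/dt = 0, with p_1 substituted:
p_2 = \frac{2\lambda(1-c)\,p_0 + \lambda p_1}{2\mu}
    = \left[ (1-c)\frac{\lambda}{\mu} + \left(\frac{\lambda}{\mu}\right)^{2} \right] p_0
% p0 + p1 + p2 = 1:
p_0 = \frac{1}{1 + (3-c)\frac{\lambda}{\mu} + \left(\frac{\lambda}{\mu}\right)^{2}}
% unavailability:
U = p_2 = \frac{(1-c)\frac{\lambda}{\mu} + \left(\frac{\lambda}{\mu}\right)^{2}}
               {1 + (3-c)\frac{\lambda}{\mu} + \left(\frac{\lambda}{\mu}\right)^{2}}
  \approx (1-c)\frac{\lambda}{\mu} + \left(\frac{\lambda}{\mu}\right)^{2}
  \quad \text{for } \mu/\lambda \gg 1
```

For c = 1 this reduces to λ²/(λ+µ)², the result without coverage. With λ/µ ≈ 8.2 · 10^-5 (MTBF = 100 years, MTTR = 72 h) and an illustrative coverage of c = 0.99, the first term is about 8.2 · 10^-7 while the second is about 6.7 · 10^-9, so the undetected failures dominate the unavailability.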

Exercise

A repairable system has a constant failure rate λ = $10^{-4}$ / h. Its mean time to repair (MTTR) is one hour.

a) Compute the mean time to failure (MTTF).
b) Compute the MTBF and compare it with the MTTF.
c) Compute the stationary availability.

Assume that the unavailability has to be halved. How can this be achieved
d) by only changing the repair time?
e) by only changing the failure rate?

f) Make a drawing that shows how a varying repair time influences availability.

9.2.6 Examples

Exercise: Markov diagram

[Figure: Markov diagram with states 0 to 4 and transition rates including µ1, µ2, b and n]

• Is this a reliable or an available system?
• Set up the differential equations for this Markov model.
• Compute the probability of not reaching state 4 (set up the equations).

Case study: Swiss Locomotive 460 control system availability

[Figure: normal member N and reserve member R, connected over the MVB to the I/O system]

Assumption: each unit has a back-up unit which is switched on when the on-line unit fails.

The error detection coverage c of each unit is imperfect.

The switchover is not always bumpless: when the back-up unit is not correctly updated, the main switch trips and the locomotive is stuck on the track.

What is the probability that the locomotive gets stuck on the track?

Markov model: SBB Locomotive 460 availability

[Figure: Markov model with the states: all OK (P0); member N failure detected; member R failure detected; bumpless takeover (member R on-line); takeover unsuccessful (train stop and reboot); member N fails undetected; member R fails undetected; stuck on track]

Parameters:
• λ = $10^{-4}$: failure rate of member N or member R (MTTF is 10000 hours or 1.2 years)
• µ = 0.1: repair rate for member N or member R (repair takes 10 hours, including travel to the works)
• c = 0.9: probability of detected failure (coverage factor): 9 out of 10 errors are detected
• probability of bumpless recovery (train continues): 9 out of 10 take-overs are successful
• probability of unsuccessful recovery (train stuck) = 0.01: 1 failure in 100 cannot be recovered
• reboot and restart rate = 10: mean time to reboot and restart the train is 6 minutes
• periodic maintenance check rate = 1/8765: mean time to periodic maintenance is one year

SBB Locomotive 460 results

[Pie chart: how the down-time is shared among: unsuccessful recovery; stuck, 2nd failure before maintenance; stuck, 2nd failure before repair; stuck after reboot; OK after reboot. Values shown: 7 %, 32 %, 61 %, 0.0009 %, 0.00045 %]

Under these conditions:
• unavailability will be 0.5 hours a year,
• stuck on track is once every 20 years,
• recovery will be successful 97 % of the time.

Recommendation: increase coverage by using members N and R alternately (at least at every start-up).

Example protection device

[Figure: protection device with current sensor and circuit breaker]

Probability to Fail on Demand for safety (protection) system

IEC 61508 characterizes a protection device by its Probability to Fail on Demand (PFD):

PFD = 1 - availability of the non-faulty system (state 0)

[Figure: Markov model with states P0 (good), P1 (underfunction, entered with rate λu), P3 (overfunction, plant down, entered with rate λ(1-u)) and P4 (plant damaged); repair transitions labelled R lead back to P0]

u = probability of underfunction

Protection system with error detection (self-test) 1oo1

u: probability of underfunction [IEC 61508: 50 %]
c: coverage, probability of failure detection by self-check

States:
• P0: normal
• P1: protection failed in underfunction, failure detected by self-check (instantaneous), repaired with rate µR = 1/MRT
• P2: protection failed in underfunction, failure detected by periodic check with rate µT = 2/TestPeriod
• P3: protection failed in overfunction, plant down
• P4: system threatened, protection inactive, danger

Transitions out of P0: λuc to P1, λu(1-c) to P2, λ(1-u) to P3.

$PFD = 1 - P_0 = 1 - \frac{1}{1 + \frac{\lambda u (1-c)}{\mu_T} + \frac{\lambda u c}{\mu_R}} \approx \lambda u \left( \frac{1-c}{\mu_T} + \frac{c}{\mu_R} \right)$

with:
• λ = $10^{-7}$ h-1
• MTTR = 8 hours -> µR = 0.125 h-1
• Test period = 3 months -> µT = 2/4380
• coverage = 90 %

PFD = $1.1 \cdot 10^{-5}$

For S1 and S2 to have the same probability: c = 99.8 %!
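
The PFD figure quoted for the 1oo1 protection system can be checked numerically. The sketch below evaluates both the exact expression 1 - P0 and its approximation, using the rates as they are printed on the slide (in particular µT = 2/4380 h-1, which is the rate of a 4380-hour, i.e. half-year, test period) and u = 50 % as quoted for IEC 61508. The function name is an assumption.

```python
def pfd_1oo1(lam, u, c, mu_r, mu_t):
    """PFD of a 1oo1 protection with self-check coverage c.

    Returns (exact, approximate), where
      exact       = 1 - 1 / (1 + lam*u*(1-c)/mu_t + lam*u*c/mu_r)
      approximate = lam*u*((1-c)/mu_t + c/mu_r)
    """
    load = lam * u * ((1.0 - c) / mu_t + c / mu_r)
    exact = 1.0 - 1.0 / (1.0 + load)
    return exact, load


lam = 1e-7           # failure rate of the protection, 1/h
u = 0.5              # share of failures that are underfunctions
c = 0.9              # self-check coverage
mu_r = 1.0 / 8.0     # repair rate, MTTR = 8 h
mu_t = 2.0 / 4380.0  # periodic-test detection rate as printed on the slide

pfd_exact, pfd_approx = pfd_1oo1(lam, u, c, mu_r, mu_t)
print(f"PFD exact  = {pfd_exact:.3e}")
print(f"PFD approx = {pfd_approx:.3e}")
```

With these inputs the term (1-c)/µT = 219 h dominates c/µR = 7.2 h, so the undetected 10 % of the failures account for almost all of the PFD.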

Example: Protection System

[Figure: (a) tripping algorithm 1 and tripping algorithm 2 receive the inputs, and their outputs are combined by an AND gate (&) into the trip signal; (b) tripping algorithm 1 and tripping algorithm 2 with comparison and repair]

For the AND combination:
• overfunctions reduced: $P = P_o^2$
• underfunctions increased: $P = 2 P_u - P_u^2$

With comparison and repair, dynamic modeling is necessary.
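
The two expressions for the AND-combined tripping algorithms follow from elementary probability if the two algorithms are assumed to fail independently: the pair overfunctions only if both trip spuriously, and it underfunctions if at least one fails to trip. The sketch below illustrates this; the function name and the probability values are purely illustrative and not taken from the slides.

```python
def and_combination(p_under, p_over):
    """Failure probabilities of two independent trip algorithms behind an AND gate.

    p_under : probability that one algorithm fails to trip (underfunction)
    p_over  : probability that one algorithm trips spuriously (overfunction)
    """
    # Underfunction: at least one of the two does not trip.
    under = 1.0 - (1.0 - p_under) ** 2  # equals 2*p_under - p_under**2
    # Overfunction: both trip spuriously at the same time.
    over = p_over ** 2
    return under, over


# Illustrative values only.
p_u, p_o = 0.01, 0.01
under, over = and_combination(p_u, p_o)
print(f"underfunction: {p_u} -> {under:.4f}")
print(f"overfunction:  {p_o} -> {over:.6f}")
```

The underfunction probability roughly doubles while the overfunction probability drops by orders of magnitude, which is the trade-off the slide points out.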

Markov Model for a protection system

[Figure: Markov model with the states: OK; detectable error (1 chain, repair); latent overfunction (1 chain, not detectable); latent underfunction (2 chains, not detectable); latent underfunction (not detectable); overfunction; underfunction. The transition rates are sums of the failure rates λ1, λ2, λ3, weighted by the coverage c or by (1-c).]

Parameters [1/Y]: λ1 = 0.01, λ2 = λ3 = 0.025; two further rates with indices 1 and 2 of 5 and 1; one further rate of 365; c = 0.9

Analysis Results

[Figure: mean time to underfunction [Y] (axis ticks up to 400) versus mean time to overfunction [Y] (axis ticks 50 and 5000), for: permanent comparison (SW), assumption: SW error-free; permanent comparison (red. HW); weekly test; 2-yearly test]

Example: CIGRE model of protection device with self-check

[Figure: Markov model with states S1 to S11, grouped into PLANT DOWN SINGLE FAULT, PLANT DOWN DOUBLE FAULT and DANGER, with branches for self-check overfunction and self-check underfunction; the transition rates carry indices 1 to 3 and are weighted by the coverage c or by (1-c)]

• P8, P9: error detection failed
• P10, P11: failure detectable by self-check
• P4, P3: failure detectable by inspection

Summary: difference reliability - availability

Reliability
[Figure: Markov chain from "all ok" through up states to an absorbing down state, with "fail" transitions only]
• look for: Mean Time To Fail (integral over time of all non-absorbing states)
• set up the linear equation with s = 0 and initial conditions S(T = 0) = 1.0
• solve the linear equation

Availability
[Figure: Markov chain between "all ok", up and down states, with "fail" and "good" (repair) transitions]
• look for: stationary availability A (t = ∞) (duty cycle in UP states)
• set up the differential equations (no absorbing states!)
• the initial condition is irrelevant
• solve the stationary case with ∑p = 1

Exercise: set up the Markov model for this system

A brake can fail open or fail close. A car is unable to brake if both brakes fail open. A car is unable to cruise if any of the brakes fails close. A fail open brake is detected at the next service. There is a hydraulic and an electric brake.

• electric brake: λe = $10^{-5}$ h-1, ce = 0.9 (99% fail close)
• hydraulic brake: λh = $10^{-6}$ h-1, ch = .99 fail close (.01 fail open)
• service every month
