System Reliability Resit Unal Engineering Management Systems Engineering

System Reliability Resit Unal Engineering Management & Systems Engineering Dept. Old Dominion University runal@odu.edu Slide 1

System Life Cycle Concepts NEED → Acquisition: 1. CONCEPTUAL/PRELIMINARY DESIGN 2. DETAIL DESIGN & DEVELOPMENT 3. PRODUCTION/CONSTRUCTION → Utilization: 4. OPERATION & SUPPORT 5. PHASE OUT / DISPOSAL Slide 2

Lifecycle Costs (LCC) • 60–80% of LCC is spent during the operation phase • Reliability is a major cost driver (failures, repair, lost operation time, redesign, ...) • 70% of LCC is committed during the design phase: design fixes how the system will be operated and maintained Slide 3

System Reliability • Engineering is concerned with how products/systems work, but engineers also need to understand the ways in which they fail, the effects of failures, and the aspects of design that affect the likelihood of failure. • This is the domain of Reliability Engineering. Slide 4

Defining Reliability Reliability is the probability that a given system will perform as anticipated under given operating conditions. It can be used to predict the probability that a system will operate for a specified number of hours, or a certain average time between failures. Slide 5

Failure Patterns: Bathtub Curve [Figure: failure rate λ(t) versus time, in three regions: Burn-In (decreasing failure rate), Useful Life (constant failure rate, random failures), Wear-Out (increasing failure rate)] Slide 6

Non-Repairable systems • The instantaneous probability of the first and only failure is called the failure rate. • Mean Time to Failure (MTTF) Slide 7

Repairable systems • Slide 8

Tasks of Reliability T = time to failure $F(t) = P(T \le t)$ $R(t) = 1 - F(t) = P(T > t)$ I) First task is to derive & study this equation II) Second task is to find the best way to increase reliability Slide 9

Find Best Ways to Increase Reliability 1. Reduce complexity 2. Increase R of components/subsystems 3. Parallel redundancy 4. Stand-by redundancy 5. Preventive Maintenance 6. Repair 7. Combination Slide 10

Failure f(t): Exponential Distribution 1. Failures occur at random intervals. 2. Failure rate stays constant (time independent). [Figure: exponential f(t) versus t; the Constant Failure Rate (CFR) region, λ(t) = λ, lies between the decreasing λ(t) and increasing λ(t) regions] Slide 11

Exponential Distribution; CFR $f(t) = \lambda e^{-\lambda t}$ $R(t) = e^{-\lambda t}$ $F(t) = 1 - e^{-\lambda t}$ $\mathrm{MTTF} = 1/\lambda$ [Figure: failure rate λ versus time (time unit); constant failure rate, time independent] Slide 12
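The constant-failure-rate formulas on this slide can be evaluated directly. The sketch below is illustrative only: the function names, the failure rate, and the mission time are assumptions, not values from the slides.

```python
import math

def exp_pdf(lam, t):
    """Failure density f(t) = lam * exp(-lam * t)."""
    return lam * math.exp(-lam * t)

def exp_reliability(lam, t):
    """Reliability R(t) = exp(-lam * t) for a constant failure rate lam."""
    return math.exp(-lam * t)

def exp_unreliability(lam, t):
    """Failure probability F(t) = 1 - R(t)."""
    return 1.0 - exp_reliability(lam, t)

def exp_mttf(lam):
    """Mean time to failure, MTTF = 1 / lam."""
    return 1.0 / lam

# Hypothetical inputs, chosen only for illustration.
lam = 0.001       # failures per hour
mission = 100.0   # hours

print(f"R({mission:g}) = {exp_reliability(lam, mission):.4f}")
print(f"F({mission:g}) = {exp_unreliability(lam, mission):.4f}")
print(f"MTTF = {exp_mttf(lam):.0f} hours")
```

Note that at t = MTTF the reliability is e^-1, about 0.37, so most units with a constant failure rate fail before reaching their mean life.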

Example Slide 13

Time Dependent Failure Distributions • [Figure: failure rate versus time] Slide 14

Weibull Distribution [Figure: failure rate versus time] Slide 15

Weibull Distribution • Example: Ball bearing, Weibull distribution with m = 4, θ = 100, 50-hour mission. $F(t) = 1 - e^{-(t/\theta)^m}$ $F(50) = 0.0606$ $R(50) = 0.9394$ [Figure: failure rate versus time] Slide 16
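The ball-bearing numbers can be checked with a few lines of code. This is a minimal sketch; the function names are assumptions, and the failure-rate expression is the standard two-parameter Weibull hazard, which the slide does not state explicitly.

```python
import math

def weibull_reliability(t, m, theta):
    """R(t) = exp(-(t/theta)^m) for shape m and scale theta."""
    return math.exp(-((t / theta) ** m))

def weibull_failure_rate(t, m, theta):
    """Standard Weibull hazard: (m/theta) * (t/theta)^(m-1)."""
    return (m / theta) * (t / theta) ** (m - 1)

# Values from the ball-bearing example: m = 4, theta = 100, 50-hour mission.
r50 = weibull_reliability(50.0, m=4.0, theta=100.0)
f50 = 1.0 - r50
print(f"R(50) = {r50:.4f}, F(50) = {f50:.4f}")  # R(50) = 0.9394, F(50) = 0.0606
```

With m > 1 the failure rate grows with time, which is the wear-out behavior of the bathtub curve.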

General System Reliability Models • 1. Series (non-redundant) system [Block diagram: components 1 and 2 in series, with reliabilities R_1 and R_2] $R_1 = e^{-\lambda_1 t}$ $R_2 = e^{-\lambda_2 t}$ $R_{SS} = R_1 \cdot R_2$ Slide 17

Series System [Block diagram: two components in series, R_1 = R_2 = 0.85] $R_{SS} = (0.85)(0.85) = 0.7225$ For series systems, high reliability of components/subsystems is required. Slide 18

Parallel Reliability Model • Active Redundancy [Block diagram: components R_1 and R_2 in parallel] Reliability for parallel system: $R_{PS} = R_1 + R_2 - R_1 R_2$ Slide 19

Parallel Reliability Model • Two Component System [Block diagram: two components in parallel, each with reliability 0.85] $R_{PS} = R_1 + R_2 - R_1 R_2$ $R_1 = R_2 = 0.85$ $R_{PS} = 0.9775 \approx 0.98$ Active redundancy: Reliability increases Slide 20
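The series and active-parallel models generalize to any number of independent components. The sketch below assumes independence and uses hypothetical function names; it reproduces the two 0.85 examples.

```python
import math

def series_reliability(reliabilities):
    """Series system: every component must work, so multiply the reliabilities."""
    return math.prod(reliabilities)

def parallel_reliability(reliabilities):
    """Active parallel system: fails only if every component fails."""
    return 1.0 - math.prod(1.0 - r for r in reliabilities)

components = [0.85, 0.85]
print(f"Series:   {series_reliability(components):.4f}")    # 0.7225
print(f"Parallel: {parallel_reliability(components):.4f}")  # 0.9775
```

For two components the parallel expression 1 - (1 - R_1)(1 - R_2) expands to R_1 + R_2 - R_1 R_2, the form used on the slide.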

m-out-of-N Units System • Active redundancy. At least m units out of N must function for the system to operate normally. If the units are identical and independent → Binomial distribution: $R_{m/N} = \sum_{k=m}^{N} \binom{N}{k} R^{k} (1-R)^{N-k}$ Slide 21

m-out-of-N Units System Aircraft has 4 identical, independent engines with R = 0.98. At least 2 engines must function (active redundancy). N = 4, m = 2 • $R_{2/4} = \sum_{k=2}^{4} \binom{4}{k} (0.98)^{k} (1 - 0.98)^{4-k}$ • $R_{2/4} \approx 0.99997$ Slide 22
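The binomial sum for the aircraft example is easy to evaluate numerically. A minimal sketch, assuming identical and independent units; the function name is hypothetical.

```python
import math

def m_out_of_n_reliability(m, n, r):
    """Probability that at least m of n identical, independent units work."""
    return sum(math.comb(n, k) * r**k * (1.0 - r) ** (n - k)
               for k in range(m, n + 1))

# Aircraft example: 4 engines with R = 0.98, at least 2 must function.
r_2_of_4 = m_out_of_n_reliability(2, 4, 0.98)
print(f"R(2 of 4) = {r_2_of_4:.6f}")  # R(2 of 4) = 0.999968
```

The two limiting cases are useful checks: m = N gives the series result R^N, and m = 1 gives the parallel result 1 - (1 - R)^N.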

Complex System Reliability Analysis Methods • Network Reduction Approach • Fault Tree Analysis • FMEA: Failure Modes and Effects Analysis • FMECA: Failure Modes, Effects and Criticality Analysis Slide 23

System Reliability Analysis Methods I. Network Reduction Approach Ex: [Reliability block diagram with blocks a1, a2, a3, a4, b1, b2, and c] $R_{SYST} = (2R_b - R_b^2) R_c$ Slide 24

Fault-Tree Analysis (FTA) • FTA – Top down approach (Bell Labs/Boeing) – Start with identifying an undesirable event: TOP EVENT – Events that can lead to the Top-Event are described with Logic Operators (AND, OR, EOR, ...) Slide 25

FTA Logic Operators • AND Gate • OR Gate • AND Gate: Provides a true output if ALL inputs are true.
A | B | A AND B
0 | 0 | 0
0 | 1 | 0
1 | 0 | 0
1 | 1 | 1
Slide 26

FTA Logic Operators • OR Gate: Provides a true output if one or more inputs are true.
A | B | A OR B
0 | 0 | 0
0 | 1 | 1
1 | 0 | 1
1 | 1 | 1
Slide 27

FTA Logic Operators • $F_{OR} = 1 - \prod_{i=1}^{n} (1 - F_i)$ • $F_{AND} = \prod_{i=1}^{n} F_i$ Slide 28
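The two gate formulas compose naturally when a fault tree is evaluated from the bottom up. The sketch below assumes independent basic events; the function names, the tree shape, and the probabilities are hypothetical and are not taken from any example in these slides.

```python
import math

def f_or(failure_probs):
    """OR gate: output event occurs if at least one input event occurs."""
    return 1.0 - math.prod(1.0 - f for f in failure_probs)

def f_and(failure_probs):
    """AND gate: output event occurs only if all input events occur."""
    return math.prod(failure_probs)

# Hypothetical tree: TOP = OR(event A, AND(event B, event C)).
f_a, f_b, f_c = 0.01, 0.05, 0.2
f_bc = f_and([f_b, f_c])    # both redundant items fail
f_top = f_or([f_a, f_bc])   # either branch causes the top event
print(f"F(B and C) = {f_bc:.4f}")
print(f"F(top)     = {f_top:.4f}")
print(f"R(system)  = {1.0 - f_top:.4f}")
```

An AND gate over redundant items lowers the failure probability, while an OR gate over single points of failure raises it, which is why the structure of the tree matters as much as the individual probabilities.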

FTA Example System designed to deliver emergency cooling to a nuclear reactor. Failure events: • Protection system will not deliver a signal to pump & valve actuators (p = 0.0001) • Pump will fail to start when the actuation signal is received (p = 0.02) • A valve will fail to open when the actuation signal is received (p = 0.1) • The reservoir will be empty at the time of the accident (p = 0.00005) Slide 29

FTA: Emergency cooling to a nuclear reactor Slide 30

FTA: Emergency Cooling to a Nuclear Reactor Using FTA Analysis: • Probability of failure = 0.000915 • Reliability = 0.9991 Slide 31

FTA Use Advantages/Issues 1. One event at a time 2. Provides insight into system behavior 3. Top-down approach 4. FTA can get complicated for large systems 5. Difficult to handle degraded component states. Slide 32

FMEA = Failure Modes and Effects Analysis • Concerned with determining design R by considering potential failures and their effects on the system. • List each failure mode and effect on paper. • Bottom-Up Approach. Slide 33

FMEA • “Military Standards: Procedures for performing failure modes, effects and criticality analysis” (1980) • TYPICAL STEPS IN FMEA: 1. SYSTEM DEFINITION. Identify systems that may fail. Slide 34

FMEA 2. IDENTIFICATION OF FAILURE MODES. Ways components may fail: • Short • Rupture • Fracture • Power Loss • Out-of-Tolerance • Operational & Environmental Conditions should be listed. Slide 35

FMEA 3. DETERMINE CAUSE. – Stress – Contamination – Evaporation – Fatigue – Wear-Out – Corrosion – Errors Slide 36

FMEA Documentation
Failure Mode | Cause | Failure Mechanism | Action
Fracture | Excessive Vibration | Fatigue | Redesign Mounts
Slide 37

FMEA Documentation 4. ASSESSMENT OF THE EFFECT. (leakage, rupture)
Failure Mode | Cause | Failure Mechanism | Effect
Brittle seal | Sustained low temperature | Leakage | Critical
Slide 38

FMEA 5. CLASSIFICATION OF SEVERITY I. Catastrophic: Major damage/loss of life II. Critical: Mission may be lost III. Marginal: System degraded IV. Negligible: Minor, with no effect on performance Slide 39

FMEA 6. PROBABILITY OF OCCURRENCE. Sources: reliability testing, failure data, expert judgment. When sufficient data do not exist: Military Standard: Procedures for performing a FMECA (1980) Slide 40

FMEA • Slide 41

FMECA • List: – Failure Modes – Causes of Failure – Possible Effects – Probability of Occurrence – Criticality – Possible Action (FMECA, Handbook of Reliability Engineering and Management) Slide 42

FMECA Slide 43

FMEA/FMECA • Serves as a technique for detecting each possible failure mode • All possible failure modes & effects on mission, people, & system can be identified • Provides useful input data for system safety and maintainability analysis • Systematic approach to classifying hardware failures Slide 44

FMEA/FMECA • Provides input for development of built in test software and equipment • Can be used for design comparison studies • Provides improved communication • Procedure begins from detailed level and works upward. Slide 45

Failure Data Collection, Analysis • FAILURE DATA USES 1. Compute Failure Rate 2. Determine Failure Distribution 3. Decisions on Redundancy 4. Trade-off Studies 5. Replacement Studies 6. Preventive Maintenance Decisions 7. Availability 8. Design Changes Slide 46

Failure Data Collection, Analysis • LIFE TESTING – Time-to-failure (DOE Techniques) • FIELD DATA – # of Failures Slide 47

Identifying Failure Distribution We try to fit the data to a known distribution f(t) 1. Collect data 2. Hypothesize a distribution 3. Plot data on appropriate graph paper for this distribution 4. If there is a good fit: the data points will be clustered along a straight line 5. Estimate distribution parameters from the slope & intercept Slide 48

Fitting Data to an Exponential Distribution Constant Failure Rate (λ) $R(t) = e^{-\lambda t}$ $F(t) = 1 - e^{-\lambda t}$ $\ln\left(\frac{1}{1-F}\right) = \lambda t$ This is in the form of y = mx; estimate λ from the slope. Slide 49

Fitting Data to an Exponential Distribution Example: Failure data given. We think it is Exponential.
i | t_i | ln(1/(1-F))
1 | 80 | 0.11778
2 | 134 | 0.25132
3 | 148 | 0.40546
4 | 186 | 0.58778
5 | 238 | 0.81093
6 | 450 | 1.09861
7 | 581 | 1.50407
8 | 890 | 2.19722
Slide 50

Fitting Data to an Exponential Distribution • $\ln\left(\frac{1}{1-F}\right) = \lambda t$ has the form y = mx • Slope is λ • λ = 0.0025 • MTTF = 1/λ • MTTF = 400 hrs [Figure: ln(1/(1-F)) versus t (0 to 1000) with a linear fit, R² = 0.9783] Slide 51
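The graphical estimate of λ can be approximated numerically. The sketch below makes two assumptions that the slides do not state: the plotting position F_i = i/(n+1), which reproduces the tabulated ln(1/(1-F)) values for the eight failure times, and a least-squares line forced through the origin to match the y = mx form.

```python
import math

# Failure times (hours) from the exponential example.
times = [80, 134, 148, 186, 238, 450, 581, 890]
n = len(times)

# Assumed plotting position F_i = i / (n + 1); y_i = ln(1 / (1 - F_i)).
y = [math.log(1.0 / (1.0 - i / (n + 1))) for i in range(1, n + 1)]

# Least-squares slope for a line through the origin: sum(t*y) / sum(t^2).
lam = sum(t * v for t, v in zip(times, y)) / sum(t * t for t in times)
mttf = 1.0 / lam

print(f"lambda = {lam:.4f} per hour")
print(f"MTTF   = {mttf:.0f} hours")
```

The fitted slope rounds to the 0.0025 read from the graph; the MTTF computed from the unrounded slope comes out slightly below 400 hours.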

Fitting Data to Weibull Distribution (m, θ) • Slide 52

Fitting Data to Weibull Distribution (m, θ) Failure data given. We think it is Weibull distributed.
i | t_i | ln t | ln(ln(1/(1-F)))
1 | 67 | 4.204 | -1.706
2 | 120 | 4.787 | -0.904
3 | 130 | 4.867 | -0.366
4 | 220 | 5.393 | 0.092
5 | 290 | 5.669 | 0.582
Slide 53

Fitting Data to Weibull Distribution (m, θ) From graph: m = 1.53 (slope), θ = 197 hrs [Figure: ln(ln(1/(1-F))) versus ln t with a linear fit, R² = 0.9676] Slide 54
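The same fit can be sketched with ordinary least squares instead of graph paper. Taking logarithms of F(t) = 1 - exp(-(t/θ)^m) twice gives ln(ln(1/(1-F))) = m ln t - m ln θ, so the slope is m and θ = exp(-intercept/m). The plotting position F_i = i/(n+1) is an assumption; it closely matches the tabulated column for the five failure times.

```python
import math

# Failure times (hours) from the Weibull example.
times = [67, 120, 130, 220, 290]
n = len(times)

x = [math.log(t) for t in times]
# Assumed plotting position F_i = i / (n + 1).
y = [math.log(math.log(1.0 / (1.0 - i / (n + 1)))) for i in range(1, n + 1)]

# Ordinary least-squares line y = m * x + c.
x_bar = sum(x) / n
y_bar = sum(y) / n
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
m = s_xy / s_xx
c = y_bar - m * x_bar

# Intercept c = -m * ln(theta), so theta = exp(-c / m).
theta = math.exp(-c / m)

print(f"m     = {m:.2f}")
print(f"theta = {theta:.0f} hours")
```

A shape parameter above 1 indicates an increasing failure rate, so this data set suggests wear-out rather than random failures.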

Operational Reliability Analysis • Using Reducible Markov Chains – MARKOV CHAIN ANALYSIS • A Probabilistic Technique Slide 55

Space Transportation Vehicle, STV Success oriented path: • STV on the launch site, no problems • Launch preparations • Launch pad operations • STV in powered ascent • Orbital operations • Re-entry • Landing, Site-1 • Post flight checkout Slide 56

What can go wrong? • Delay due to problems in launch preparations • Launch delay, minor problems • Launch delay, major problems • Abort • Landing, contingency site • Post flight check, minor problems • Post flight check, major problems • Attrition • Major damage/scrap Slide 57

Reducible Markov Chains • What information can we get? E = Expected number of times the process will cycle before the STV is trapped in an absorbing state (expected life) A = Probabilities of reaching a particular trapping (failing) state Slide 58
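Both quantities come from the standard absorbing-chain calculation: with Q the transient-to-transient transition block and R the transient-to-absorbing block, the fundamental matrix N = (I - Q)^-1 gives expected visit counts and B = N R gives absorption probabilities. The sketch below uses a hypothetical two-state toy chain with made-up probabilities; it is not the STV model from these slides.

```python
def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination with pivoting."""
    n = len(matrix)
    aug = [[float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def absorbing_chain(q, r):
    """Return (N, B): N = (I - Q)^-1 and B = N R for an absorbing Markov chain."""
    n = len(q)
    i_minus_q = [[(1.0 if i == j else 0.0) - q[i][j] for j in range(n)]
                 for i in range(n)]
    fundamental = invert(i_minus_q)
    absorption = [[sum(fundamental[i][k] * r[k][j] for k in range(n))
                   for j in range(len(r[0]))] for i in range(n)]
    return fundamental, absorption

# Hypothetical toy chain (all probabilities are illustrative assumptions).
# Transient states: 0 = ready for launch, 1 = in flight.
# Absorbing states: 0 = attrition, 1 = major damage.
Q = [[0.00, 0.98],
     [0.95, 0.00]]
R = [[0.00, 0.02],
     [0.03, 0.02]]

N, B = absorbing_chain(Q, R)
print(f"Expected launch cycles E = {N[0][0]:.2f}")
print(f"P(attrition) = {B[0][0]:.3f}, P(major damage) = {B[0][1]:.3f}")
```

N[0][0] counts the expected visits to the starting state, which plays the role of E, and the row B[0] plays the role of A; each row of B sums to 1 because the process is eventually absorbed.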

Operational Reliability Model for STV Slide 59

Results of Markov Chain Analysis E = 47.98 LCC = $11,018 Probability of Attrition = 0.64 Probability of Major Damage = 0.36
Sensitivity Analysis
LAUNCH RELIABILITY | EXPECTED LIFE
0.995 | 47.98
0.99 | 33.66
0.98 | 25.31
0.95 | 14.5
Improved reliability makes a significant difference in the expected life of the STV. Slide 60

Maintained Systems I. Preventive Maintenance: Performed before failure occurs. Measure: Resulting increase in reliability II. Corrective Maintenance: Performed after failure occurs (repair). Measure: Availability, the probability that the system will be operational when needed Slide 61

Maintained Systems • Maintenance Issues – Cost – Safety – Prob. of Maintenance Introducing Failure – Human Reliability Repair Times & Maint Probability are more Variable than Failure Rates of Hardware Slide 62

Preventive Maintenance • Assume Ideal Preventive Maintenance: • System is Restored to as-good-as-new Condition. • How much reliability improvement from preventive maintenance? Slide 63

Preventive Maintenance: CFR Exponential: Constant Failure Rate • Preventive Maint. has No Effect On Reliability [Figure: constant failure rate λ versus time] DON’T DO IT, as Preventive Maintenance itself may introduce failures Slide 64

Preventive Maintenance (Wear Out) • Effect of Preventive Maintenance on aging or wear (Weibull m > 1) • WEIBULL $R(t) = e^{-(t/\theta)^m}$, m > 1 [Figure: increasing failure rate versus time] Preventive maintenance has a Positive Effect Slide 65

Preventive Maintenance [Figure: failure rate versus time] • WEIBULL m < 1 (decreasing failure rate): DON’T • EXPONENTIAL, CFR: “LEAVE IT ALONE” • WEIBULL m > 1 (increasing failure rate): DO Slide 66
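The three cases can be compared numerically. The sketch below assumes ideal preventive maintenance at a fixed interval T that restores the system to as-good-as-new, which gives the textbook expression R_M(t) = R(T)^N · R(t - N·T) with N completed intervals; that expression, the function names, and the parameter values are assumptions and do not appear on the slides.

```python
import math

def weibull_reliability(t, m, theta):
    """R(t) = exp(-(t/theta)^m); m = 1 is the exponential (CFR) case."""
    return math.exp(-((t / theta) ** m))

def reliability_with_pm(reliability, t, interval):
    """Ideal periodic PM: R_M(t) = R(T)^N * R(t - N*T), N = completed intervals."""
    n = int(t // interval)
    return reliability(interval) ** n * reliability(t - n * interval)

t, interval, theta = 50.0, 25.0, 100.0   # hypothetical mission and PM interval
for label, m in [("Weibull m < 1", 0.5), ("Exponential", 1.0), ("Weibull m > 1", 4.0)]:
    r = lambda x, m=m: weibull_reliability(x, m, theta)
    print(f"{label:14s} no PM: {r(t):.4f}   with PM: "
          f"{reliability_with_pm(r, t, interval):.4f}")
```

Under these assumptions the exponential case is unchanged by maintenance, the m > 1 case improves, and the m < 1 case gets worse because each renewal returns the unit to its high early failure rate.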

Corrective Maintenance (Repair) • Corrective Maint: Performed after Failure • Interested in: • Reliability, but Also, • # of Failures • Time Required To Make Repairs Slide 67

Corrective Maintenance (Repair) • With corrective maintenance, two new parameters come into play: I. AVAILABILITY II. MAINTAINABILITY Slide 68

Corrective Maintenance (Repair) • AVAILABILITY: The probability that a system is available for use at a given time (the fraction of time a system is in an operational state) • MAINTAINABILITY: A measure of how fast a system can be repaired after a failure. Slide 69

AVAILABILITY • Slide 70

Steady State Availability • Slide 71

Mean Time to Repair (MTTR) • Slide 72

Availability • Slide 73

EXAMPLE Tf = Time Failed, Tr = Time Repaired
i | Tf (DAYS) | Tr (DAYS)
1 | 12.8 | 13
2 | 14.8 (only one value given)
3 | 25.4 | 25.8
4 | 31.4 | 33.3
5 | 35.3 | 35.6
6 | 56.4 | 57.3
7 | 62.8 (only one value given)
8 | 131.2 | 134.9
9 | 146.7 | 150.0
10 | 177.1 (only one value given)
Slide 74

EXAMPLE • a) Calculate 6-month (182.5 days) availability from data. There are 10 failures. • b) Estimate MTTF & MTTR from data. Results: A(t) = 0.937, MTTF = 16.56 days, MTTR = 1.15 days Slide 75
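The stated results can be checked for consistency without the full data table. The sketch below assumes that the interval availability is the fraction of the 182.5-day period spent operational and that total downtime equals the number of failures times MTTR; the steady-state expression MTTF / (MTTF + MTTR) is the standard textbook form and is included only for comparison.

```python
def interval_availability(total_downtime, period):
    """Fraction of the observation period during which the system was up."""
    return 1.0 - total_downtime / period

def steady_state_availability(mttf, mttr):
    """Standard long-run availability: MTTF / (MTTF + MTTR)."""
    return mttf / (mttf + mttr)

n_failures = 10
period = 182.5          # days
mttf, mttr = 16.56, 1.15  # days, as stated in the example results

total_downtime = n_failures * mttr  # about 11.5 days of repair in total
a_interval = interval_availability(total_downtime, period)
a_steady = steady_state_availability(mttf, mttr)

print(f"Total downtime        = {total_downtime:.1f} days")
print(f"Interval availability = {a_interval:.3f}")
print(f"Steady-state estimate = {a_steady:.3f}")
```

The interval value reproduces the stated 0.937. The steady-state estimate is slightly lower because the observation period ends with the system running after the last repair, and that uptime does not count toward MTTF.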

Conclusions • Reliability is a major cost driver • Reliability Definitions • Failure Patterns, Distributions • How to determine failure patterns • Failure Data Analysis Methods • Operational Reliability Modeling • Maintainability, Maintenance Decisions • Availability Slide 76

Resources • Reliability & Maintainability Engineering: C. Ebeling. • Reliability Engineering: E. E. Lewis. • Handbook of Reliability Engineering. • Military Standards: Procedures for performing failure modes, effects and criticality analysis. Slide 77