System Reliability • Resit Unal, Engineering Management & Systems Engineering Dept., Old Dominion University, runal@odu.edu Slide 1
System Life Cycle Concepts • NEED • Acquisition phase: 1. Conceptual/preliminary design 2. Detail design & development 3. Production/construction • Utilization phase: 4. Operation & support 5. Phase-out/disposal Slide 2
Lifecycle Costs (LCC) • 60-80% of LCC is spent during the operation phase • Reliability is a major cost driver (failures, repair, lost operation time, redesign, ...) • 70% of LCC is committed during the Design Phase: design fixes how the system will be operated and maintained Slide 3
System Reliability • Engineering is concerned with how products/systems work, but we also need to understand: • The ways in which they fail, the effects of failures, and the aspects of design that affect the likelihood of failure • This is the subject of Reliability Engineering. Slide 4
Defining Reliability • Reliability is the probability that a given system will perform as anticipated under given operating conditions. • It can be used to predict the probability that a system will operate for a specified number of hours, or to estimate the average time between failures. Slide 5
Failure Patterns: Bathtub Curve [Figure: failure rate λ(t) versus time. Burn-in: decreasing failure rate. Useful life: constant failure rate (random failures). Wear-out: increasing failure rate.] Slide 6
Non-Repairable systems • The instantaneous rate of the first and only failure, given survival up to that time, is called the failure rate λ(t). • Mean Time to Failure (MTTF) Slide 7
Repairable systems • Slide 8
Tasks of Reliability • $F(t) = P(T \le t)$, $R(t) = 1 - F(t) = P(T > t)$, where $T$ = time to failure • I) First task is to derive & study this equation • II) Find the best way to increase reliability Slide 9
Find Best Ways to Increase Reliability 1. Reduce complexity 2. Increase R of components/subsystems 3. Parallel redundancy 4. Stand-by redundancy 5. Preventive Maintenance 6. Repair 7. Combination Slide 10
Failure f(t): Exponential Distribution 1. Failures occur at random intervals. 2. Failure rate stays constant (time independent). [Figure: exponential density f(t) versus t, and the Constant Failure Rate (CFR) region, λ(t) = λ, between the decreasing and increasing failure-rate regions] Slide 11
Exponential Distribution; CFR • $f(t) = \lambda e^{-\lambda t}$ • $R(t) = e^{-\lambda t}$ • $F(t) = 1 - e^{-\lambda t}$ • $\mathrm{MTTF} = 1/\lambda$ [Figure: failure rate λ versus time; constant failure rate (time independent)] Slide 12
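A minimal Python sketch of the constant-failure-rate formulas on this slide. The failure rate and mission time are hypothetical values chosen only for illustration; they do not come from the slides.

```python
import math

def exp_reliability(lam: float, t: float) -> float:
    """R(t) = exp(-lam * t) for a constant failure rate lam."""
    return math.exp(-lam * t)

def exp_unreliability(lam: float, t: float) -> float:
    """F(t) = 1 - R(t), the probability of failure by time t."""
    return 1.0 - exp_reliability(lam, t)

def exp_mttf(lam: float) -> float:
    """MTTF = 1 / lam for the exponential distribution."""
    return 1.0 / lam

# Hypothetical inputs (assumed, not from the slides)
lam = 0.001   # failures per hour
t = 100.0     # mission time in hours

print(f"R({t:.0f}) = {exp_reliability(lam, t):.4f}")
print(f"F({t:.0f}) = {exp_unreliability(lam, t):.4f}")
print(f"MTTF = {exp_mttf(lam):.0f} hours")
```

Because the failure rate is constant, the result depends only on the product of λ and t, not on how old the unit already is.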
Example Slide 13
Time Dependent Failure Distributions [Figure: failure rate versus time] Slide 14
Weibull Distribution [Figure: failure rate versus time] Slide 15
Weibull Distribution • Example: Ball Bearing, Weibull Distribution with $m = 4$, $\theta = 100$, 50 Hour Mission • $F(t) = 1 - e^{-(t/\theta)^m}$ • $F(50) = 0.0606$, $R(50) = 0.9394$ [Figure: failure rate versus time] Slide 16
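The ball-bearing numbers can be checked with a short Python sketch. It uses only the values given on the slide (m = 4, θ = 100, t = 50); the function name is an illustrative choice.

```python
import math

def weibull_reliability(t: float, m: float, theta: float) -> float:
    """R(t) = exp(-(t/theta)^m) for shape m and scale theta."""
    return math.exp(-((t / theta) ** m))

# Values from the slide: shape m = 4, scale theta = 100, 50 hour mission
r50 = weibull_reliability(50.0, m=4.0, theta=100.0)
f50 = 1.0 - r50

print(f"F(50) = {f50:.4f}")  # 0.0606
print(f"R(50) = {r50:.4f}")  # 0.9394
```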
General System Reliability Models • 1. Series (non-redundant) system: components 1 and 2 in series • $R_1 = e^{-\lambda_1 t}$, $R_2 = e^{-\lambda_2 t}$ • $R_{SS} = R_1 R_2$ Slide 17
Series System • $R_1 = R_2 = 0.85$ • $R_{SS} = (0.85)^2 = 0.7225$ • For series systems, high reliability of components/subsystems is required. Slide 18
Parallel Reliability Model • Active Redundancy • Reliability for a parallel system: $R_{PS} = R_1 + R_2 - R_1 R_2$ Slide 19
Parallel Reliability Model • Two Component System • $R_{PS} = R_1 + R_2 - R_1 R_2$ • $R_1 = R_2 = 0.85$ gives $R_{PS} = 0.9775 \approx 0.98$ • Active redundancy: Reliability increases Slide 20
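The series and parallel results for two 0.85-reliability components can be reproduced with a small Python sketch. It assumes independent components; the function names are illustrative. The parallel formula is written as one minus the product of the unreliabilities, which equals R1 + R2 - R1 R2 for two components.

```python
from math import prod

def series_reliability(rs):
    """All components must work: product of the component reliabilities."""
    return prod(rs)

def parallel_reliability(rs):
    """At least one component must work: 1 - product of unreliabilities."""
    return 1.0 - prod(1.0 - r for r in rs)

components = [0.85, 0.85]  # values from the slides

print(f"Series:   {series_reliability(components):.4f}")    # 0.7225
print(f"Parallel: {parallel_reliability(components):.4f}")  # 0.9775
```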
m-out-of-N Units System • Active redundancy. At least m units out of N must function for the system to operate normally. • If the units are identical and independent, the number of working units follows a Binomial Distribution: $R_{m/N} = \sum_{k=m}^{N} \binom{N}{k} R^k (1-R)^{N-k}$ Slide 21
m-out-of-N Units System • Aircraft has 4 identical, independent engines with $R = 0.98$. At least 2 engines must function (Active redundancy), so $N = 4$ and $m = 2$. • $R_{2/4} = \sum_{k=2}^{4} \binom{4}{k} (0.98)^k (1 - 0.98)^{4-k}$ • $R_{2/4} = 0.999968 \approx 0.99997$ Slide 22
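The aircraft-engine example can be evaluated directly from the binomial sum. This Python sketch assumes identical, independent units as stated on the slide; the function name is an illustrative choice.

```python
from math import comb

def m_out_of_n_reliability(m: int, n: int, r: float) -> float:
    """Probability that at least m of n identical, independent units work."""
    return sum(comb(n, k) * r**k * (1.0 - r) ** (n - k) for k in range(m, n + 1))

# Values from the slide: 4 engines, at least 2 must function, R = 0.98
r_2_of_4 = m_out_of_n_reliability(2, 4, 0.98)
print(f"R(2 out of 4) = {r_2_of_4:.6f}")
```

With m = 1 and N = 2 the same function reduces to the two-component parallel formula.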
Complex System Reliability Analysis Methods • Network Reduction Approach • Fault Tree Analysis • FMEA: Failure Modes and Effects Analysis • FMECA: Failure Modes, Effects and Criticality Analysis Slide 23
System Reliability Analysis Methods I. Network Reduction Approach • Example: $R_{SYST} = (2R_b - R_b^2) R_c$ [Figure: reliability block diagram with blocks a1 to a4, b1, b2, and c, reduced step by step to a single block R_system] Slide 24
Fault-Tree Analysis (FTA) • FTA: Top-down approach (Bell Labs/Boeing) • Start by identifying an undesirable event: the TOP EVENT • Events that can lead to the Top Event are described with Logic Operators (AND, OR, EOR, ...) Slide 25
FTA Logic Operators • AND Gate • OR Gate • AND Gate: Provides a true output if ALL inputs are true.
A  B  A AND B
0  0  0
0  1  0
1  0  0
1  1  1
Slide 26
FTA Logic Operators • OR Gate: Provides a true output if one or more inputs are true.
A  B  A OR B
0  0  0
0  1  1
1  0  1
1  1  1
Slide 27
FTA Logic Operators • $F_{OR} = 1 - \prod_{i=1}^{n} (1 - F_i)$ • $F_{AND} = \prod_{i=1}^{n} F_i$ Slide 28
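A Python sketch of the two gate formulas, assuming the input events are independent. The small tree and its probabilities are hypothetical, made up only to show how gates combine; they are not the emergency-cooling example.

```python
from math import prod

def or_gate(failure_probs):
    """Output event occurs if any input occurs: 1 - prod(1 - F_i)."""
    return 1.0 - prod(1.0 - f for f in failure_probs)

def and_gate(failure_probs):
    """Output event occurs only if all inputs occur: prod(F_i)."""
    return prod(failure_probs)

# Hypothetical tree (assumed for illustration):
# TOP = power loss OR (pump A fails AND pump B fails)
p_power_loss = 0.001
p_pump_a = 0.02
p_pump_b = 0.02

p_both_pumps = and_gate([p_pump_a, p_pump_b])
p_top = or_gate([p_power_loss, p_both_pumps])

print(f"P(both pumps fail) = {p_both_pumps:.6f}")
print(f"P(top event)       = {p_top:.6f}")
```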
FTA Example • System designed to deliver emergency cooling to a nuclear reactor. Failure events: • Protection system will not deliver a signal to pump & valve actuators (p of failure = 0.0001) • Pump will fail to start when the actuation signal is received (p = 0.02) • A valve will fail to open when the actuation signal is received (p = 0.1) • The reservoir will be empty at the time of the accident (p = 0.00005) Slide 29
FTA: Emergency cooling to a nuclear reactor Slide 30
FTA: Emergency Cooling to a Nuclear Reactor • Using FTA: • Probability of failure = 0.000915 • Reliability = 0.9991 Slide 31
FTA Use Advantages/Issues 1. One event at a time 2. Provides insight into system behavior 3. Top-down approach 4. FTA can get complicated for large systems 5. Difficult to handle degraded component states. Slide 32
FMEA = Failure Modes and Effects Analysis • Concerned with determining design R by considering potential failures and their effects on the system. • List each failure mode and effect on paper. • Bottom-Up Approach. Slide 33
FMEA • “Military Standards: Procedures for performing failure modes, effects and criticality analysis” (1980) • TYPICAL STEPS IN FMEA: 1. SYSTEM DEFINITION. Identify systems that may fail. Slide 34
FMEA 2. IDENTIFICATION OF FAILURE MODES. Ways components may fail: • Short • Rupture • Fracture • Power Loss • Out-of-Tolerance • Operational & Environmental Conditions should be listed. Slide 35
FMEA 3. DETERMINE CAUSE. • Stress • Contamination • Evaporation • Fatigue • Wear-Out • Corrosion • Errors Slide 36
FMEA Documentation • Failure Mode: Fracture • Cause: Excessive Vibration • Failure Mechanism: Fatigue • Action: Redesign Mounts Slide 37
FMEA Documentation 4. ASSESSMENT OF THE EFFECT. (leakage, rupture) Failure Mode Cause Failure Mechanism Effect Brittle seal Sustained low temperature Leakage Critical Slide 38
FMEA 5. CLASSIFICATION OF SEVERITY I. Catastrophic: Major damage/loss of life II. Critical: Mission may be lost III. Marginal: System degraded IV. Negligible: Minor, with no effect on performance Slide 39
FMEA 6. PROBABILITY OF OCCURRENCE. Sources: Reliability testing, Failure Data, Expert Judgment. When sufficient data do not exist: Military Standard: Procedures for performing a FMECA (1980) Slide 40
FMEA • Slide 41
FMECA • List: • Failure Modes • Causes of failure • Possible Effects • Probability of Occurrence • Criticality • Possible Action (FMECA, Handbook of Reliability Engineering and Management) Slide 42
FMECA Slide 43
FMEA/FMECA • Serves as a technique for detecting each possible failure mode • All possible failure modes & their effects on mission, people, & system can be identified • Provides useful input data for system safety and maintainability analysis • Systematic approach to classifying hardware failures Slide 44
FMEA/FMECA • Provides input for development of built in test software and equipment • Can be used for design comparison studies • Provides improved communication • Procedure begins from detailed level and works upward. Slide 45
Failure Data Collection, Analysis • FAILURE DATA USES 1. Compute Failure Rate 2. Determine failure distribution 3. Decisions on Redundancy 4. Trade-off Studies 5. Replacement Studies 6. Preventive Maintenance Decisions 7. Availability 8. Design Changes Slide 46
Failure Data Collection, Analysis • LIFE TESTING – Time-to-failure (DOE Techniques) • FIELD DATA – # of Failures Slide 47
Identifying Failure Distribution We try to fit the data to a known distribution f(t) 1. Collect data 2. Hypothesize a distribution 3. Plot data on appropriate graph paper for this distribution 4. If there is a good fit: the data points will be clustered along a straight line 5. Estimate distribution parameters from the slope & intercept Slide 48
Fitting Data to an Exponential Distribution • Constant Failure Rate (λ) • $R(t) = e^{-\lambda t}$ • $F(t) = 1 - e^{-\lambda t}$ • $\ln\left(\frac{1}{1-F}\right) = \lambda t$ • This is in the form of y = mx • Estimate λ from the slope Slide 49
Fitting Data to an Exponential Distribution • Example: Failure data given. We think it is Exponential.
i   t_i   ln(1/(1-F))
1   80    0.11778
2   134   0.25132
3   148   0.40546
4   186   0.58778
5   238   0.81093
6   450   1.09861
7   581   1.50407
8   890   2.19722
Slide 50
Fitting Data to an Exponential Distribution • $\ln\left(\frac{1}{1-F}\right) = \lambda t$ • y = mx; the slope is λ • λ = 0.0025 • MTTF = 1/λ = 400 hrs [Figure: ln(1/(1-F)) versus t (0 to 1000) with linear fit, R² = 0.9783] Slide 51
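The exponential fit can be sketched in Python using the eight failure times from the example. Two assumptions are made here that the slides do not state: the plotting position F_i = i/(n+1), which reproduces the tabulated ln(1/(1-F)) values, and a least-squares line forced through the origin to match the y = mx form. A spreadsheet trendline with an intercept would give a slightly different slope.

```python
import math

# Failure times t_i from the example table
times = [80, 134, 148, 186, 238, 450, 581, 890]
n = len(times)

# Assumed plotting position F_i = i / (n + 1); it reproduces the table values
y = [math.log(1.0 / (1.0 - i / (n + 1))) for i in range(1, n + 1)]

# Least-squares slope for y = lam * t (line through the origin)
lam = sum(t * yi for t, yi in zip(times, y)) / sum(t * t for t in times)
mttf = 1.0 / lam

print(f"lambda = {lam:.5f} per hour")
print(f"MTTF   = {mttf:.0f} hours")
```

The slope comes out close to the 0.0025 read from the graph; the slide's MTTF of 400 hrs follows from that rounded value.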
Fitting Data to Weibull Distribution (m, θ) • Slide 52
Fitting Data to Weibull Distribution (m, θ) • Failure data given. We think it is Weibull distributed.
i   t_i   ln t    ln(ln(1/(1-F)))
1   67    4.204   -1.706
2   120   4.787   -0.904
3   130   4.867   -0.366
4   220   5.393   0.092
5   290   5.669   0.582
Slide 53
Fitting Data to Weibull Distribution (m, θ) • From graph: m = 1.53 (slope), θ = 197 hrs [Figure: ln(ln(1/(1-F))) versus ln t with linear fit, R² = 0.9676] Slide 54
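The Weibull fit can be sketched the same way. Taking logs of R(t) = exp(-(t/θ)^m) twice gives ln(ln(1/(1-F))) = m ln t - m ln θ, so the slope of the line is m and θ = exp(-intercept/m). The plotting position F_i = i/(n+1) is an assumption; it approximately reproduces the tabulated values. The helper fit_line is a hypothetical name for an ordinary least-squares fit.

```python
import math

def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar

# Failure times t_i from the example table
times = [67, 120, 130, 220, 290]
n = len(times)

xs = [math.log(t) for t in times]
# Assumed plotting position F_i = i / (n + 1)
ys = [math.log(math.log(1.0 / (1.0 - i / (n + 1)))) for i in range(1, n + 1)]

m, intercept = fit_line(xs, ys)
theta = math.exp(-intercept / m)  # from intercept = -m * ln(theta)

print(f"m     = {m:.2f}")
print(f"theta = {theta:.0f} hours")
```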
Operational Reliability Analysis • Using Reducible Markov Chains – MARKOV CHAIN ANALYSIS • A Probabilistic Technique Slide 55
Space Transportation Vehicle, STV • Success oriented path: • STV on the launch site, no problems • Launch preparations • Launch pad operations • STV in powered ascent • Orbital operations • Re-entry • Landing, Site-1 • Post flight checkout Slide 56
What can go wrong? • Delay due to problems in launch preparations • Launch delay, minor problems • Launch delay, major problems • Abort • Landing, Contingency site • Post Flight Check, Minor problems • Post Flight Check, Major problems • Attrition • Major Damage/scrap Slide 57
Reducible Markov Chains • What information can we get? • E = Expected number of times the process will cycle before the STV is trapped in an absorbing state (expected life) • A = Probabilities of reaching a particular trapping (failing) state Slide 58
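Both quantities can be obtained from the standard absorbing-chain result: with Q the transient-to-transient transition block and R the transient-to-absorbing block, the fundamental matrix N = (I - Q)^-1 gives expected visit counts and B = N R gives absorption probabilities. The Python sketch below uses a hypothetical two-state chain with made-up probabilities; it is not the STV model from the slides.

```python
def invert(a):
    """Invert a small square matrix by Gauss-Jordan elimination with pivoting."""
    n = len(a)
    aug = [[float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def absorbing_chain(q, r):
    """Return (N, B): expected visits and absorption probabilities."""
    n = len(q)
    i_minus_q = [[(1.0 if i == j else 0.0) - q[i][j] for j in range(n)]
                 for i in range(n)]
    fundamental = invert(i_minus_q)
    absorb = [[sum(fundamental[i][k] * r[k][j] for k in range(n))
               for j in range(len(r[0]))] for i in range(n)]
    return fundamental, absorb

# Hypothetical chain (assumed): transient states 0 = on pad, 1 = in flight;
# absorbing states 0 = attrition, 1 = major damage/scrap.
Q = [[0.00, 0.99],
     [0.97, 0.00]]
R = [[0.01, 0.00],
     [0.02, 0.01]]

N, B = absorbing_chain(Q, R)
print(f"Expected visits to the pad state: {N[0][0]:.2f}")
print(f"P(attrition) = {B[0][0]:.3f}, P(major damage) = {B[0][1]:.3f}")
```

Here N[0][0] counts visits to the starting state, including the first one, which plays the role of the expected number of cycles E; each row of B sums to 1.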
Operational Reliability Model for STV Slide 59
Results of Markov Chain Analysis • E = 47.98 • LCC = $11,018 • Probability of Attrition = 0.64 • Probability of Major damage = 0.36 • Sensitivity Analysis:
LAUNCH RELIABILITY   EXPECTED LIFE
0.995                47.98
0.99                 33.66
0.98                 25.31
0.95                 14.5
Improved reliability makes a significant difference in the expected life of the STV. Slide 60
Maintained Systems I. Preventive Maintenance: Performed before Failure Occurs. Measure: Resulting Increase in Reliability II. Corrective Maintenance: Performed after Failure Occurs (Repair). Measure: Availability: The Probability That the System will be Operational When Needed Slide 61
Maintained Systems • Maintenance Issues – Cost – Safety – Prob. of Maintenance Introducing Failure – Human Reliability Repair Times & Maint Probability are more Variable than Failure Rates of Hardware Slide 62
Preventive Maintenance • Assume Ideal Preventive Maintenance: • System is Restored to as-good-as-new Condition. • How much reliability improvement from preventive maintenance? Slide 63
Preventive Maintenance - CFR • Exponential: Constant Failure Rate • Preventive Maint. has No Effect On Reliability • DON'T DO IT, as Preventive Maintenance itself may introduce failures [Figure: constant failure rate λ versus time] Slide 64
Preventive Maintenance (Wear Out) • Effect of Preventive Maintenance on aging or wear (Weibull m > 1) • WEIBULL: $R(t) = e^{-(t/\theta)^m}$, $m > 1$ • Preventive maintenance has a Positive Effect [Figure: increasing failure rate versus time] Slide 65
Preventive Maintenance [Figure: failure rate versus time for three cases] • WEIBULL m < 1 (decreasing failure rate): DON'T • EXPONENTIAL CFR: "LEAVE IT ALONE" • WEIBULL m > 1 (increasing failure rate): DO Slide 66
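The three cases can be compared numerically. Under the ideal (as-good-as-new) preventive maintenance assumption, a system maintained every T hours has reliability R(T)^n · R(t - nT) at time t, where n is the number of completed maintenance intervals. This Python sketch applies that relation to a Weibull model; θ, T, and t are hypothetical values, and the chance that maintenance itself introduces failures is ignored.

```python
import math

def weibull_r(t: float, m: float, theta: float) -> float:
    """Weibull reliability R(t) = exp(-(t/theta)^m)."""
    return math.exp(-((t / theta) ** m))

def reliability_with_pm(t: float, interval: float, m: float, theta: float) -> float:
    """Reliability at time t with ideal PM every `interval` hours."""
    n = int(t // interval)  # completed maintenance intervals
    return weibull_r(interval, m, theta) ** n * weibull_r(t - n * interval, m, theta)

# Hypothetical values (assumed): scale 100 h, PM every 20 h, evaluated at 100 h
theta, interval, t = 100.0, 20.0, 100.0

for m in (0.5, 1.0, 2.0):
    without_pm = weibull_r(t, m, theta)
    with_pm = reliability_with_pm(t, interval, m, theta)
    print(f"m = {m}: no PM {without_pm:.4f}, with PM {with_pm:.4f}")
```

For m = 1 the two values coincide, for m > 1 maintenance raises reliability, and for m < 1 it lowers it, matching the DO / "LEAVE IT ALONE" / DON'T guidance.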
Corrective Maintenance (Repair) • Corrective Maint: Performed after Failure • Interested in: • Reliability, but Also, • # of Failures • Time Required To Make Repairs Slide 67
Corrective Maintenance (Repair) • With corrective maintenance, two new parameters come into play: I. AVAILABILITY II. MAINTAINABILITY Slide 68
Corrective Maintenance (Repair) • AVAILABILITY: The probability that a system is available for use at a given time (the fraction of time a system is in an operational state) • MAINTAINABILITY: Is a measure of how fast a system may be repaired after a failure. Slide 69
AVAILABILITY • Slide 70
Steady State Availability • Slide 71
Mean Time to Repair (MTTR) • Slide 72
Availability • Slide 73
EXAMPLE • Tf = Time Failed, Tr = Time Repaired (entries marked "(missing)" do not appear in the source)
i    Tf (DAYS)   Tr (DAYS)
1    12.8        13
2    14.8        (missing)
3    25.4        25.8
4    31.4        33.3
5    35.3        35.6
6    56.4        57.3
7    62.8        (missing)
8    131.2       134.9
9    146.7       150.0
10   177.1       (missing)
Slide 74
EXAMPLE • a) Calculate the 6 month (182.5 days) availability from the data. There are 10 failures. • b) Estimate MTTF & MTTR from the data. • Results: A(t) = 0.937, MTTF = 16.56 DAYS, MTTR = 1.15 DAYS Slide 75
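The reported availability can be cross-checked from the summary figures alone. This Python sketch assumes that total downtime equals the number of failures times MTTR and that availability over the period is uptime divided by total time; it also evaluates the standard steady-state expression MTTF / (MTTF + MTTR) for comparison. The individual repair times are not used because some are missing from the table.

```python
# Summary values from the slide
period_days = 182.5   # 6 month observation period
n_failures = 10
mttf_days = 16.56
mttr_days = 1.15

# Assumed: total downtime = number of failures * mean repair time
downtime_days = n_failures * mttr_days
availability_period = (period_days - downtime_days) / period_days

# Standard steady-state availability, shown for comparison
availability_steady = mttf_days / (mttf_days + mttr_days)

print(f"Total downtime        = {downtime_days:.1f} days")
print(f"Availability (period) = {availability_period:.3f}")
print(f"Availability (steady) = {availability_steady:.3f}")
```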
Conclusions • Reliability is a major cost driver • Reliability Definitions • Failure Patterns, Distributions • How to determine failure patterns • Failure Data Analysis Methods • Operational Reliability Modeling • Maintainability, Maintenance Decisions • Availability Slide 76
Resources • Reliability & Maintainability Engineering: C. Ebeling. • Reliability Engineering: E. E. Lewis. • Handbook of Reliability Engineering. • Military Standards: Procedures for performing failure modes, effects and criticality analysis. Slide 77