2. Fault Tolerance
Reliable System Design, 2010
By: Amir M. Rahmani

Fault - Error - Failure
• Fault = physical defect or flaw occurring in some component (hardware or software)
• Error = incorrect behavior caused by a fault
  - manifestation of a fault
• Failure = inability of the system to perform its specified service
• Latent fault = a fault that has not yet produced an error
• Latent error = an error that has not yet produced a failure

Fault - Error - Failure
• Note: the presence of a fault does not ensure that an error will occur, e.g. a memory cell stuck-at-0 produces no error as long as the value it is supposed to hold is 0
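
The stuck-at-0 note can be illustrated with a small sketch. The class `StuckAtZeroBit` and its methods are hypothetical names chosen for this example, and a one-bit cell is an assumed simplification of a real memory:

```python
class StuckAtZeroBit:
    """Hypothetical one-bit memory cell with a stuck-at-0 fault."""

    def __init__(self):
        self.intended = 0  # value the system meant to store

    def write(self, bit):
        # The write is accepted, but the faulty cell cannot hold a 1.
        self.intended = bit

    def read(self):
        # Stuck-at-0: the cell always reads back 0.
        return 0

    def has_error(self):
        # An error exists only when the value read differs from the intended one.
        return self.read() != self.intended


cell = StuckAtZeroBit()
cell.write(0)
latent = not cell.has_error()  # fault present, but no error yet (latent fault)
cell.write(1)
activated = cell.has_error()   # the fault now manifests as an error
```

The fault is there from the start; it becomes an error only once a 1 has to be stored.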

Origin of Defects in Objects (HW/SW)
• Good object wearing out with age
  - Hardware (software can age too)
  - Incorrect maintenance/operation
• Good object, unforeseen hostile environment
  - Environmental fault
• Marginal object: occasionally fails in target environment
  - Tight design/bad inputs
• Implementation mistakes
• Specification mistakes
Note: from top to bottom, increasing human responsibility

Bathtub Curve
• Three phases of system lifetime
  - Infant mortality
  - Normal lifetime
  - Wear-out period

Life-time of a Software system (1)

Life-time of a Software system (2)
• Software failure rate during useful life depends on the following factors:
  1. software process used to develop the design and code
  2. complexity of software
  3. size of software
  4. experience of the development team
  5. percentage of code reused from a previous stable project
  6. depth of testing at test/debug (I) phase

Fault Characteristics
1 - Cause
• Specification errors
  - very dangerous generic fault
• Implementation errors
  - very hard to formally verify
• Random component faults
  - random, not manufacturing defects
• External disturbance
  - noise, EMP, vibration, radiation
  - much like random component faults

Fault Characteristics
2 - Origin
• software or hardware
  - Physical device level (HW)
  - Logic level (HW)
  - Chip level (HW)
  - System level (HW/SW): interfacing, specifications, …
• don’t care, except:
  - hardware can be analog
  - indeterminate voltage level

Fault Characteristics
3 - Duration
• Permanent fault
  - occurs and doesn’t go away
  - easiest to diagnose
• Transient fault
  - occurs once and disappears
  - 10 times as likely as a permanent fault
• Intermittent fault
  - occurs occasionally
  - may appear to be transient (if long period)
  - hard and expensive to detect

Fault Characteristics
4 - Extent
• Global
  - a power supply fault
• Local
  - a memory fault
5 - Value
• Determinate
  - memory stuck-at-0
• Indeterminate
  - a fault sensitive to data or time

What to do about Faults
• Finding & identifying faults:
  - Fault detection: is a fault there?
  - Fault location: where?
  - Fault diagnosis: which fault is it?
• Automatic handling of faults
  - Fault containment: blocking error flow
  - Fault masking: fault has no effect
  - Fault recovery: back to correct operation

System Response to Faults
• Error on output: may be acceptable in noncritical systems if it happens only rarely
• Fault masking: output correct even when a fault from a specific class occurs
  - Critical applications: air/space/manufacturing
• Fault-secure: output correct or error indication
  - Retryable: banking, telephony
• Fail safe: output correct or in safe state
  - Flashing red traffic light, disabled ATM

What is Fault-Tolerance?
• A fault-tolerant system is one that continues to perform at the desired level of service, according to its specification, in the presence of faults. There are no failures in a fault-tolerant system.
• Fault-tolerance is the ability of a system to provide a service complying with the specification in spite of faults. A better title might have been Dependable or Reliable or Available computing.

Fault Tolerance
• In the physical universe:
  - Fault detection
  - Fault location
  - Fault containment
  - Fault recovery
  - Continue servicing
• In the informational universe:
  - Error detection
  - Error location
  - Error containment
  - Error recovery
  - Continue servicing

Fault Recovery
• How quickly is the fault detected?
• How soon can recovery begin?
  - Does it require human intervention?
  - How is the system admin notified?
• How long does recovery take?
  - Restore from backup?
  - Purchase new HW?

Fault Coverage (C)
• Measure of system’s ability to perform:
  - fault detection
  - fault location
  - fault containment
  - (and/or fault recovery)
• C = P(fault detection | fault occurrence)
• C = P(fault recovery | fault occurrence)
• Note:
  - recovery implies that the system as a whole is operational
  - this does not imply that a repair occurred
  - e.g. a duplex system with a benign fault can recover to continue operation on one non-faulty processor
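
Because coverage is a conditional probability, one common way to approximate it is to count outcomes over a set of fault occurrences (for instance, injected faults). The sketch below assumes such counts are available; the function name `estimate_coverage` and the numbers in the usage line are illustrative only, not data from the slides:

```python
def estimate_coverage(handled, occurred):
    """Estimate C = P(fault handled | fault occurrence) from counts.

    `handled` is the number of faults that were detected (or recovered from,
    depending on which coverage is meant); `occurred` is the total number of
    fault occurrences observed. Both are assumed inputs.
    """
    if occurred <= 0:
        raise ValueError("need at least one fault occurrence")
    if not 0 <= handled <= occurred:
        raise ValueError("handled must be between 0 and occurred")
    return handled / occurred


# Hypothetical experiment: 1000 fault occurrences, 970 of them detected.
detection_coverage = estimate_coverage(970, 1000)
```

The same function serves for detection coverage and recovery coverage; only the meaning of `handled` changes.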

Design Philosophies to Combat Faults
• Fault avoidance (off-line)
  - Attempts to prevent faults from occurring, through:
    • Design review
    • Component selection
    • Quality control
    • Shielding
    • Testing
• Fault masking (on-line)
  - Attempts to prevent a fault in a system from introducing errors
    • Error correcting memory
    • Majority voting
• Fault tolerance (on-line)
  - Attempts to enable a system to continue performing its expected tasks after the occurrence of faults

Design Philosophies to Combat Faults
[Figure: fault avoidance, fault masking, fault tolerance]

Fault Avoidance vs. Tolerance
• Fault avoidance: eliminate problem sources
  - Remove defects: testing and debugging
  - Robust design: reduce probability of defects
  - Minimize environmental stress: radiation shielding etc.
  - Impossible to avoid faults completely
• Fault tolerance: add redundancy to mask effect
  - Additional resources needed (more later)
  - Examples:
    • Error correction coding
    • Backup storage
    • Spare tire etc.

Fault Forecasting vs. Tolerance
• Fault tolerance
  - Execution-time techniques that deal with the effects of faults
• Fault forecasting
  - Estimate the current number, future incidence and likely consequences of faults
• You can’t tolerate what you don’t expect
• But if we expected it, we would avoid or eliminate the fault!
• In general:
  - We can itemize the classes of faults that can occur
  - We can define what we want done if the fault occurs and if the error is detected
• Example: automobile tire
  - Expect it to lose air
  - Do not expect it to experience electrical overload

Fault Tolerant Computing
• Deterministic approaches
  - Based on simplifying assumptions: “fault model”
  - Obtain methods using the models: test generation
  - Evaluation of effectiveness
  - Used for testing & combinatorial fault-tolerance
• Probabilistic approaches
  - We can’t predict exactly when a person will die, but we can still get “life expectancy = 77.2”, if we have data
  - Used for evaluating, achieving and optimizing reliability
  - Random testing

Fault Tolerance vs. Performance
• There are many fault-tolerance approaches that sacrifice performance to tolerate faults
• Ex. 1:
  - Periodically stop the system and checkpoint its state to disk
    * If a fault occurs, recover state from the checkpoint and resume
• Ex. 2:
  - Log all changes made to system state in case recovery is needed
    * During recovery, undo the changes from the log
• Ex. 3:
  - Run two identical systems in parallel, compare their results before using them
• Ex. 4:
  - Run software with lots of error checking
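
Two of these examples can be sketched in a few lines. The sketch below is a simplification under stated assumptions: the checkpoint is kept in memory rather than on disk, and the names `CheckpointedState` and `run_duplex` are hypothetical. It shows where the performance cost comes from: every checkpoint copies the state, and the duplex run does every computation twice.

```python
import copy


class CheckpointedState:
    """Hypothetical state holder with checkpoint/rollback (in memory, not disk)."""

    def __init__(self):
        self.state = {"total": 0}
        self._saved = copy.deepcopy(self.state)

    def add(self, value):
        self.state["total"] += value

    def checkpoint(self):
        # Cost: a full copy of the state each time.
        self._saved = copy.deepcopy(self.state)

    def rollback(self):
        # Recovery: discard everything done since the last checkpoint.
        self.state = copy.deepcopy(self._saved)


def run_duplex(primary, shadow, x):
    """Run two supposedly identical computations and compare before use."""
    a, b = primary(x), shadow(x)
    if a != b:
        raise RuntimeError("duplex mismatch: result must not be used")
    return a


s = CheckpointedState()
s.add(5)
s.checkpoint()
s.add(100)      # suppose a fault corrupts this step
s.rollback()    # state is back to the checkpointed value
```

Note that comparison in `run_duplex` only detects a disagreement; it cannot tell which of the two copies is wrong.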

Fault Tolerance vs. Cost
• There are many fault-tolerance approaches that sacrifice cost to tolerate faults
• Ex. 1:
  - Replicate the hardware 3 times and vote to determine correct output
• Ex. 2:
  - Mirror the disks (RAID-1) to tolerate disk failures
• Ex. 3:
  - Use multiple independent versions of software to tolerate bugs (called N-version programming)
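
The replicate-and-vote idea from Ex. 1 can be sketched as follows. The replicated "hardware" is modeled as three plain Python functions, which is an assumption made for illustration; `majority_vote` and `tmr` are hypothetical names:

```python
from collections import Counter


def majority_vote(outputs):
    """Return the value produced by a strict majority of modules."""
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        raise RuntimeError("no majority among module outputs")
    return value


def tmr(modules, x):
    """Feed the same input to every replica and vote on the results."""
    return majority_vote([m(x) for m in modules])


def good(x):
    return x + 1


def faulty(x):
    return 0  # a replica with a fault: output stuck at 0


# One faulty replica out of three is outvoted by the other two.
result = tmr([good, faulty, good], 4)
```

The cost is visible in the call: three modules are needed to deliver the service of one, plus the voter.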

Fault Tolerance vs. Power
• There are many fault-tolerance approaches that sacrifice power to tolerate faults
• Ex. 1, 2 & 3 (same as previous slide)
  - Replicate the hardware 3 times and vote to determine correct output
  - Mirror the disks (RAID-1) to tolerate disk failures
  - Use multiple independent versions of software to tolerate bugs
• Ex. 4
  - Add continuously running checking hardware to system
• Ex. 5
  - Add extra code to check for software faults

Need for Fault Tolerance: Universal
• Natural objects:
  - Fat deposits in body: survival in starvation
  - Duplication of eyes: graceful degradation upon failure
• Man-made objects
  - Redundancy in ordinary text
  - Asking for password twice during initial setup
  - Duplicate tires in trucks

Forms of Redundancy
• Hardware redundancy
  - add extra hardware for detection or tolerating faults
• Software redundancy
  - add extra software for detection and possibly tolerating faults
• Information redundancy
  - extra information, i.e. codes
• Time redundancy
  - extra time for performing tasks for fault tolerance
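
As a small illustration of information redundancy, a single even-parity bit is one of the simplest codes. The slides do not name a specific code, so the choice of parity here is an assumption, and the helper names are hypothetical:

```python
def add_even_parity(bits):
    """Append one redundant bit so the total number of 1s is even."""
    return bits + [sum(bits) % 2]


def parity_ok(word):
    """A word with an odd number of 1s must contain an error."""
    return sum(word) % 2 == 0


word = add_even_parity([1, 0, 1, 1])  # three 1s, so the parity bit is 1
corrupted = word.copy()
corrupted[2] ^= 1                     # flip a single bit

intact_passes = parity_ok(word)
flip_detected = not parity_ok(corrupted)
```

One parity bit detects any single-bit error but cannot locate it, and an even number of flipped bits goes unnoticed.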

Redundancy base
• Time: Try, Retry
• Space: Try, Try

Time redundancy
• Suppose the data is transmitted over a parallel bus
  1. At time t0, the original data is transmitted.
  2. Then, the data is complemented and retransmitted at time t0 + ΔD.
  3. The two results are compared to check whether they are complements of each other.
  4. Any disagreement indicates a fault.
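
The four steps can be sketched for an 8-bit bus. The bus width, the stuck-at-0 line model, and all names below are assumptions made for the example:

```python
WIDTH = 8                    # assumed bus width
MASK = (1 << WIDTH) - 1      # all bus lines set to 1


def fault_free_bus(word):
    return word & MASK


def stuck_at_zero_bus(line):
    """Model a bus whose given line is stuck at 0 (hypothetical fault model)."""
    def bus(word):
        return word & ~(1 << line) & MASK
    return bus


def transmit_with_complement(bus, data):
    first = bus(data)                    # step 1: send the original data
    second = bus(~data & MASK)           # step 2: send the complemented data
    ok = (first ^ second) == MASK        # step 3: are they bitwise complements?
    return first, ok                     # step 4: ok == False indicates a fault


data = 0b10100101                        # bit 3 of this word is already 0
good = transmit_with_complement(fault_free_bus, data)
bad = transmit_with_complement(stuck_at_zero_bus(3), data)
```

In the faulty case the first transmission happens to arrive intact, because the stuck line carries a 0 anyway; the fault shows up only in the complemented retransmission, where that line should carry a 1.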

Time redundancy - Example
• The alternating logic concept can be used for detecting faults in logic circuits that implement self-dual functions.
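
A function f is self-dual when complementing every input complements the output. Alternating logic exploits this: apply the inputs, then the complemented inputs, and check that the two outputs differ. The sketch below uses the 3-input majority function, a standard self-dual function picked here for illustration; the slide's own example is not recoverable, and all names are hypothetical:

```python
from itertools import product


def majority3(a, b, c):
    return (a & b) | (b & c) | (a & c)


def is_self_dual(f, n):
    """Check f(not x) == not f(x) for every n-bit input x."""
    return all(
        f(*[1 - v for v in x]) == 1 - f(*x)
        for x in product((0, 1), repeat=n)
    )


def alternating_check(circuit, inputs):
    """Evaluate on the inputs, then on their complement; outputs must differ."""
    out_true = circuit(*inputs)
    out_comp = circuit(*[1 - v for v in inputs])
    return out_true, out_true != out_comp


def faulty_majority3(a, b, c):
    return 1  # modeled fault: circuit output stuck at 1


healthy = alternating_check(majority3, (1, 0, 1))
broken = alternating_check(faulty_majority3, (1, 0, 1))
```

The check costs a second evaluation in time instead of a second copy of the circuit in space. It relies on the function being self-dual: for a function such as 2-input AND, equal outputs on complementary inputs are normal and prove nothing.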