COP 5611 Operating Systems
Spring 2010
Dan C. Marinescu
Office: HEC 439 B
Office hours: M-Wd 1:00-2:00 PM

Lecture 13
- Reading assignment: Chapter 8 from the online textbook
- Homework 3 due on March 3
- Midterm: Wednesday, March 17, the first week after Spring Break
- Last time:
    - End-to-end layer
    - Resource management: congestion
- Today:
    - Faults, failures, and fault-tolerant design
    - Measures of reliability and failure tolerance
    - Tolerating active faults
- Next time

Reliable Systems from Unreliable Components
- Problem first investigated in the mid-1940s by John von Neumann.
- Steps to build reliable systems:
    - Error detection
        - Network protocols (link and end-to-end)
    - Error containment: limit the effect of errors
        - Enforced modularity: client-server architectures, virtual memory, etc.
    - Error masking: ensure correct operation in the presence of errors
        - Network protocols: error correction, repetition, interpolation for data with real-time constraints

Faults and errors
- Fault: a flaw with the potential to cause problems
    - Software, hardware, design, implementation, operation, environment
- Types of faults
    - Latent
    - Active
- Error: the consequence of an active fault.

Error containment in a layered system
- Several design strategies are possible. The layer where an error occurs:
    - Masks the error: corrects it internally so that the higher layer is not aware of it.
    - Detects the error and reports it to the higher layer (fail-fast).
    - Stops (fail-stop).
    - Does nothing.
- Types of faults
    - Transient (caused by a passing external condition) / persistent
    - Soft / hard: can or cannot be masked by a retry.
    - Intermittent: occurs only occasionally and is not reproducible.
- Latency of a fault: the time until the fault causes an error
    - A long latency may allow errors to accumulate and defeat periodic error correction.
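The difference between masking a soft fault and reporting a hard one can be sketched in a few lines. This is a minimal illustration, not part of the lecture: `TransientFault`, `read_with_retry`, and `make_flaky_reader` are hypothetical names, and the retry limit of 3 is an arbitrary assumption.

```python
class TransientFault(Exception):
    """Hypothetical error type for a fault that a retry may mask."""


def read_with_retry(read_once, max_attempts=3):
    """Mask soft faults by retrying; report hard ones to the caller (fail-fast)."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return read_once()          # success: the higher layer never sees the error
        except TransientFault as err:
            last_error = err            # remember the error and try again
    # The fault persisted across all retries, so it is reported instead of masked.
    raise RuntimeError(f"not masked after {max_attempts} attempts") from last_error


def make_flaky_reader(failures_before_success, value):
    """Hypothetical lower layer that fails a fixed number of times, then succeeds."""
    state = {"calls": 0}

    def read_once():
        state["calls"] += 1
        if state["calls"] <= failures_before_success:
            raise TransientFault("passing external condition")
        return value

    return read_once
```

With two initial failures the retry loop masks the fault and returns the value; with more failures than attempts, the layer gives up and reports the error upward.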

The fault-tolerance design process is iterative
1. Begin the design of a fault-tolerant model:
    1. Identify potential faults.
    2. Estimate the risk of each one.
    3. Design methods to detect the errors for the highest-risk faults.
    4. Design methods to deal with the errors for the highest-risk faults.
2. Contain the damage from high-risk errors through modularity.
3. Design procedures to contain the detected errors using:
    1. Temporal redundancy (retry the operation)
    2. Spatial redundancy (deploy multiple components)
4. Update the model to account for the error-masking procedures.
5. Iterate until the probability of untolerated faults is small.
6. Observe the system in the real world:
    1. Study the error logs.
    2. Identify the cause of each error.
7. Use the information collected to improve the model and iterate again.
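Spatial redundancy from step 3 can be sketched as a majority voter over replicated components. This is an illustrative sketch under assumptions: the function name `vote` and the choice to raise an exception when no majority exists are not from the lecture, and results are assumed to be hashable.

```python
from collections import Counter


def vote(replicas, x):
    """Run every replica on x and return the result a strict majority agrees on."""
    results = [replica(x) for replica in replicas]
    value, count = Counter(results).most_common(1)[0]
    if count <= len(results) // 2:
        # No strict majority: the error is detected but cannot be masked.
        raise RuntimeError("no majority among replicas")
    return value


# Three replicas of the same computation; the third one is faulty.
replicas = [lambda n: n * n, lambda n: n * n, lambda n: n * n + 1]
result = vote(replicas, 3)  # the two healthy replicas outvote the faulty one
```

With three replicas the voter masks any single faulty component; two simultaneous disagreeing faults would only be detected, not masked.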

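Measures of reliability
- TTF: time to failure
    - MTTF: mean time to failure, $\mathrm{MTTF} = \frac{1}{N}\sum_{i=1}^{N} \mathrm{TTF}_i$
- TTR: time to repair
    - MTTR: mean time to repair, $\mathrm{MTTR} = \frac{1}{N}\sum_{i=1}^{N} \mathrm{TTR}_i$
- MTBF: mean time between failures, $\mathrm{MTBF} = \mathrm{MTTF} + \mathrm{MTTR}$
- $\mathrm{Availability} = \mathrm{MTTF} / \mathrm{MTBF}$
- $\text{Down time} = 1 - \mathrm{Availability} = \mathrm{MTTR} / \mathrm{MTBF}$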
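As a small worked example, the measures above can be computed directly from observed failure and repair intervals. The function name and the sample values (in hours) are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def reliability_measures(ttf_samples, ttr_samples):
    """Compute MTTF, MTTR, MTBF, availability and down time from observations."""
    mttf = sum(ttf_samples) / len(ttf_samples)   # mean time to failure
    mttr = sum(ttr_samples) / len(ttr_samples)   # mean time to repair
    mtbf = mttf + mttr                           # mean time between failures
    return {
        "MTTF": mttf,
        "MTTR": mttr,
        "MTBF": mtbf,
        "availability": mttf / mtbf,
        "down_time": mttr / mtbf,                # equals 1 - availability
    }


# Hypothetical observations: three failures and the three repairs that followed.
measures = reliability_measures([900.0, 1100.0, 1000.0], [8.0, 12.0, 10.0])
```

Here MTTF is 1000 hours and MTTR is 10 hours, so MTBF is 1010 hours and the availability is 1000/1010, roughly 99%.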

The conditional failure rate
[Figure: the slide's plot is not reproduced here.]

Reliability functions
- Unconditional failure rate: $f(t) = \Pr(\text{module fails between } t \text{ and } t + dt)$
- Reliability: $R(t) = \Pr(\text{module functions at time } t \text{, given that it was functioning at time } 0)$. This function is memoryless when the conditional failure rate is constant.
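The memoryless property can be checked numerically for one specific case. The sketch below assumes a constant conditional failure rate of 1/MTTF, which gives the exponential form R(t) = exp(-t/MTTF); that assumption, the function names, and the numbers are illustrative and not taken from the slide.

```python
import math


def reliability(t, mttf):
    """R(t), ASSUMING a constant conditional failure rate of 1/mttf."""
    return math.exp(-t / mttf)


def conditional_survival(s, t, mttf):
    """Pr(module functions at s + t | it was functioning at s) = R(s + t) / R(s)."""
    return reliability(s + t, mttf) / reliability(s, mttf)


# A module that has already run for 500 hours is as likely to survive the next
# 100 hours as a brand-new module is to survive its first 100 hours.
aged = conditional_survival(500.0, 100.0, 1000.0)
fresh = reliability(100.0, 1000.0)
```

Under this assumption `aged` and `fresh` agree up to floating-point rounding, which is what "memoryless" means: past survival carries no information about future failure.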

Active faults
- How to respond
    - Do nothing
    - Fail-fast: report the error
    - Fail-safe: operate in an acceptable manner (e.g., stop sign/flashing lights)
    - Fail-soft: operate correctly with reduced performance
    - Mask the error
- Errors
    - Detectable: the error can be reliably detected.
    - Maskable (correctable)
    - Untolerated: an error that is undetectable, undetected, unmaskable, or unmasked.
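A fail-fast response to a detectable error might look like the following sketch. It assumes a CRC-32 checksum as the detection mechanism (computed with the standard `zlib.crc32`); the names `seal`, `open_fail_fast`, and `DetectedError` are hypothetical.

```python
import zlib


class DetectedError(Exception):
    """Hypothetical exception a fail-fast module uses to report an error."""


def seal(payload: bytes) -> bytes:
    """Append a CRC-32 so that later corruption becomes detectable."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def open_fail_fast(frame: bytes) -> bytes:
    """Return the payload, or report the error rather than pass bad data upward."""
    payload, stored = frame[:-4], int.from_bytes(frame[-4:], "big")
    if zlib.crc32(payload) != stored:
        raise DetectedError("checksum mismatch")
    return payload
```

The module does not try to correct anything; it only guarantees that a corrupted frame is reported instead of being silently delivered. Masking would require extra redundancy on top of detection.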

Fault tolerance models
- Categorize all errors.
- Evaluate the probability of occurrence of each error.
- Modify the design to make the most likely errors detectable.
- Implement a detection procedure for each detectable error and identify the modules in which it is detected as fail-fast.
- Try to devise a masking procedure for each detectable error.
- Evaluate the probability of occurrence, the cost of the masking procedure, and the cost of failure for each detectable error.
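One way to use the three quantities from the last step is a simple expected-cost comparison. The decision rule below (mask when the expected loss avoided exceeds the masking cost) is an assumption made for illustration, not a rule stated in the lecture, and the error names and figures are invented.

```python
def worth_masking(probability, cost_of_failure, cost_of_masking):
    """ASSUMED rule: mask when the expected loss exceeds the cost of masking."""
    return probability * cost_of_failure > cost_of_masking


# Hypothetical catalogue: (error, probability, cost of failure, cost of masking).
catalogue = [
    ("disk sector decay", 0.01, 50_000, 200),
    ("single bit flip in a cache line", 0.0001, 1_000, 50),
]

# Keep only the errors for which the assumed rule recommends masking.
to_mask = [name for name, p, fail, mask in catalogue if worth_masking(p, fail, mask)]
```

Errors that fail this comparison remain detectable but unmasked, so they should still be reported by a fail-fast module.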

Errors in analog and digital systems
- Analog systems: the designer specifies a tolerance to deal with small errors.
- Digital systems
    - Spatial redundancy
    - Temporal redundancy
- Coding
    - Forward error correction
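Forward error correction can be illustrated with the simplest possible code, a triple-repetition code decoded by majority vote. This particular code is chosen here only because it is short; the slide does not name a specific code, and the function names are hypothetical.

```python
def encode(bits):
    """Repeat each bit three times (a rate-1/3 repetition code)."""
    return [b for b in bits for _ in range(3)]


def decode(coded):
    """Majority-vote each group of three; corrects one flipped bit per group."""
    return [1 if sum(coded[i:i + 3]) >= 2 else 0 for i in range(0, len(coded), 3)]


message = [1, 0, 1, 1]
received = encode(message)
received[4] ^= 1                 # a single bit is corrupted in transit
recovered = decode(received)     # the receiver corrects it without a retransmission
```

The receiver repairs the error using only the redundancy carried in the data, at the price of sending three times as many bits. Two flips within the same group of three would be decoded incorrectly.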