Fault Tolerant Systems Unit I Introduction Redundancy Techniques

Fault Tolerant Systems

Unit I
Introduction. Redundancy Techniques: Hardware Redundancy, Software Redundancy, Information Redundancy, Time Redundancy. Reliability Modeling and Evaluation: Empirical Models, Analytical Techniques. Fault Modeling.

Introduction
Fault tolerance, reliability, and availability are becoming major design issues in massively parallel distributed computing systems.

Introduction
Examples: mission-critical systems (aircraft and railway operation and control), computation-intensive systems, transaction systems, and mobile/wireless computing systems and networks.

Fundamental Concepts in Fault Tolerance and Reliability Analysis

Outline
i. Introduction
ii. Redundancy techniques
    i. Hardware redundancy
    ii. Software redundancy
    iii. Information redundancy
    iv. Time redundancy
iii. Reliability Modelling and Evaluation
    i. Empirical Models
    ii. The Analytical Technique
        i. Combinatorial (continuous) Model

Introduction
Computer system development phases: specification, design, prototyping, and implementation.
Fault: a physical defect that occurs in some part of a system.
Faults manifest as errors, and errors can lead to failure: Fault -> Error -> Failure.

Fault categorization
Cause: specification mistake, design mistake, implementation mistake, component defects, external disturbance
Duration: permanent, transient, intermittent
Nature: hardware, software
Extent: local, global
Value: determinate, indeterminate

Fault categorization
1. Permanent fault: persists until it is repaired. Example: software bugs.
2. Transient fault: occurs and disappears at an unknown frequency. Example: the effect of lightning.
3. Intermittent fault: occurs and disappears at a frequency that can be characterized. Example: a loose contact.
Fault causes can lead to hardware or software errors, which can in turn cause the system to fail to deliver its function.

Fault categorization
Different techniques to deal with faults:
1. Fault avoidance: prevents the occurrence of faults through quality control, e.g. design review, component screening, testing, and shielding from interference.
2. Fault masking: prevents faults from introducing errors, e.g. error-correcting codes and majority voting.
3. Fault tolerance: provided by a fault-tolerant system (FTS).

Fault categorization
Fault tolerance: a fault-tolerant system (FTS) is a system that continues to function correctly in the presence of hardware failures and/or software errors.
A typical FTS has the following attributes:
1. Fault detection
2. Fault location
3. Fault containment: allows a system to limit the impact of manifested faults to some predefined system boundaries.
4. Recovery

Fault Model
1. Fault model: a logical abstraction that describes the functional effect of a physical defect.
2. Fault modeling can be done at different levels, from the lowest physical (geometric) level through the switch level and gate level up to the functional level.
    Lower levels: detecting the fault is computationally intensive.
    Higher levels: less accurate in representing the actual physical defect.
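
As a gate-level illustration of a fault model, the sketch below uses the single stuck-at model, which the slide does not name and is assumed here only as a common example. The circuit z = (a AND b) OR c and the function names are hypothetical. The code lists the input vectors for which the faulty circuit's output differs from the fault-free output, i.e. the vectors that would detect the fault.

```python
from itertools import product

def circuit(a, b, c, stuck=None):
    """z = (a AND b) OR c. If `stuck` is 0 or 1, the internal AND output
    is forced to that value (an assumed stuck-at fault on that net)."""
    n = a & b
    if stuck is not None:
        n = stuck
    return n | c

def detecting_vectors(stuck):
    """Input vectors (a, b, c) whose output differs between the fault-free
    circuit and the circuit with the AND output stuck at `stuck`."""
    return [v for v in product((0, 1), repeat=3)
            if circuit(*v) != circuit(*v, stuck=stuck)]

print(detecting_vectors(0))
print(detecting_vectors(1))
```

Enumerating all inputs is only feasible for tiny circuits, which matches the slide's remark that lower-level modeling is computationally intensive.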

Failure Rate
1. Failure rate of a system: λ.
2. It is defined as the expected number of failures of the system per unit time.
3. Reliability R(t) of a system is the conditional probability that the system operates correctly throughout the interval [t_0, t], given that it was operating correctly at time t_0.
4. Mean Time To Failure (MTTF): the expected time that a system will operate before the first failure occurs.
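
The definitions above can be made concrete under one extra assumption that the slide does not state: a constant failure rate λ. In that case R(t) = exp(-λt) (taking t_0 = 0) and MTTF = 1/λ. The sketch below is a minimal illustration; the numeric value of λ is hypothetical.

```python
import math

def reliability(failure_rate, t):
    """R(t) = exp(-lambda * t), assuming a constant failure rate and t_0 = 0."""
    return math.exp(-failure_rate * t)

def mttf(failure_rate):
    """MTTF = 1 / lambda under the same constant-failure-rate assumption."""
    return 1.0 / failure_rate

lam = 1e-4  # hypothetical value: failures per hour
print(reliability(lam, 1000.0))     # probability of surviving 1000 hours
print(mttf(lam))                    # expected hours before the first failure
print(reliability(lam, mttf(lam)))  # R(MTTF) = exp(-1) for any lambda
```

Under this assumption, the probability of surviving until t = MTTF is only about 37%, so MTTF should not be read as a guaranteed lifetime.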

System Repair
1. Repair of a system requires the use of a workshop characterized by a repair rate μ.
2. Repair rate μ: the average number of repairs that can be performed per unit time.
3. Mean Time To Repair (MTTR): the expected time that the system spends under repair.

Availability
1. Availability of the system A(t): the probability that the system is operating correctly at instant t.
2. A_ss: the steady-state availability of the system.
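
The slide does not give a formula for A_ss. A standard textbook relation, stated here as an assumption rather than taken from the slides, is A_ss = MTTF / (MTTF + MTTR). With constant failure rate λ and repair rate μ (so MTTF = 1/λ and MTTR = 1/μ), this becomes μ / (λ + μ). The function names and numbers below are hypothetical.

```python
def steady_state_availability(mttf, mttr):
    """A_ss = MTTF / (MTTF + MTTR): long-run fraction of time the system is up."""
    return mttf / (mttf + mttr)

def availability_from_rates(failure_rate, repair_rate):
    """Same quantity written with rates: A_ss = mu / (lambda + mu)."""
    return repair_rate / (failure_rate + repair_rate)

# Hypothetical system: fails on average every 999 hours, takes 1 hour to repair.
print(steady_state_availability(999.0, 1.0))
print(availability_from_rates(1.0 / 999.0, 1.0))
```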

Reliability
1. No service degradation can be tolerated (e.g. space missions): a highly reliable system is demanded.
2. Short service degradation can be tolerated (e.g. banking systems): a highly available system is demanded.

Other attributes
1. Maintainability M(t): the probability that a failed system will be restored to operation within time t.
2. Safety S(t): the probability that the system either performs correctly or discontinues its operation without disturbing other systems.

Redundancy Techniques
1. Redundancy: extra resources that allow a system to deliver its expected service in the presence of errors caused by faults.
2. Forms: hardware redundancy, software redundancy, information redundancy, and time redundancy.

Hardware redundancy
Extra hardware: concurrent operations
i. Passive (static) hardware redundancy
ii. Active (dynamic) hardware redundancy
iii. Hybrid hardware redundancy

Passive (static) hardware redundancy
1. The effects of faults are masked with no specific indication of their occurrence; they are hidden from the rest of the system.
2. Example: N-Modular Redundancy (NMR).
3. Simple but expensive.
4. Provides uninterrupted service in the presence of faults.
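
A minimal sketch of NMR masking, assuming N replicated modules feed a majority voter. The reliability function further assumes independent module failures and a perfect voter, neither of which is stated on the slide. For N = 3 (triple modular redundancy) the result reduces to 3R^2 - 2R^3. All names are hypothetical.

```python
from collections import Counter
from math import comb

def majority_vote(outputs):
    """Return the value produced by a strict majority of the modules."""
    value, count = Counter(outputs).most_common(1)[0]
    if 2 * count <= len(outputs):
        raise ValueError("no majority among module outputs")
    return value

def nmr_reliability(r, n=3):
    """Probability that a majority of n modules work, each with reliability r.
    Assumes independent failures and a perfect voter."""
    need = n // 2 + 1
    return sum(comb(n, k) * r**k * (1 - r)**(n - k) for k in range(need, n + 1))

print(majority_vote([5, 5, 7]))  # one faulty module is masked
print(nmr_reliability(0.9))      # TMR built from modules with R = 0.9
```

The voter hides the single wrong output without reporting it, which is the "no specific indication" property noted above.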

Active (dynamic) hardware redundancy
Involves removing or replacing faulty units in the system in response to a system failure.
Triggered either by internal error-detection mechanisms in the faulty units or by the detection of errors in the outputs of these units.
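
One way to picture this is standby sparing, sketched below in software. The class, the unit functions, and the output check are all hypothetical: a raised exception stands in for an internal error-detection mechanism, and `output_ok` stands in for detection of errors at the unit's output. A unit that fails either check is removed and the next spare takes over.

```python
import math

class StandbySystem:
    """units[0] is the active unit; the remaining entries are spares."""

    def __init__(self, units):
        self.units = list(units)

    def call(self, x, output_ok):
        while self.units:
            active = self.units[0]
            try:
                y = active(x)          # an exception models internal detection
                if output_ok(x, y):    # models detection at the unit's output
                    return y
            except Exception:
                pass
            self.units.pop(0)          # remove the faulty unit, switch to a spare
        raise RuntimeError("no working units left")

def faulty_sqrt(x):   # hypothetical faulty unit
    return x / 2

def good_sqrt(x):     # hypothetical spare
    return math.sqrt(x)

def sqrt_ok(x, y):    # output check: squaring the result must give back x
    return abs(y * y - x) < 1e-9

system = StandbySystem([faulty_sqrt, good_sqrt])
print(system.call(9.0, sqrt_ok))
print(len(system.units))  # the faulty unit has been removed
```

Unlike passive redundancy, the fault here is detected and acted on, at the cost of a brief interruption while the spare is switched in.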

Software Redundancy
i. Static software redundancy techniques
    i. N-version programming (NVP)
    ii. Transactions
    iii. Ad hoc techniques
ii. Dynamic software redundancy techniques
    i. Forward error recovery
    ii. Backward error recovery
    iii. Use of recovery blocks
    iv. Some other software fault detection techniques
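
A rough sketch of the recovery block idea, which combines backward error recovery with alternate implementations: save a checkpoint, run the primary version, apply an acceptance test, and on failure roll back to the checkpoint and try the next alternate. The function names, the state layout, and the acceptance test are hypothetical.

```python
import copy

def recovery_block(state, alternates, acceptance_test):
    """Run each alternate on a fresh copy of the checkpointed state and
    return the first result that passes the acceptance test."""
    checkpoint = copy.deepcopy(state)
    for alternate in alternates:
        candidate = copy.deepcopy(checkpoint)  # roll back to the checkpoint
        try:
            alternate(candidate)
        except Exception:
            continue                           # treat a crash as a failed attempt
        if acceptance_test(candidate):
            return candidate
    raise RuntimeError("all alternates failed the acceptance test")

def buggy_sort(state):    # hypothetical primary version: loses an element
    state["data"] = sorted(state["data"])[:-1]

def simple_sort(state):   # hypothetical alternate version
    state["data"].sort()

def accept(state):        # acceptance test: three elements, non-decreasing
    d = state["data"]
    return len(d) == 3 and all(a <= b for a, b in zip(d, d[1:]))

result = recovery_block({"data": [3, 1, 2]}, [buggy_sort, simple_sort], accept)
print(result["data"])
```

The acceptance test only checks plausibility of the result, so its coverage bounds how many software faults the block can catch.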

Information redundancy
1. Error detecting codes
    i. Borden codes
    ii. Parity codes
    iii. Berger code
    iv. Bose code
2. Error correcting codes
    i. Hamming codes
3. SEC-DED codes
    i. Polynomial representation of binary vectors
    ii. Cyclic codes
    iii. Generating and decoding systematic cyclic codes
4. CRC codes
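
To illustrate an error correcting code from the list above, here is a minimal sketch of the Hamming (7,4) code, chosen as an example; the slide names Hamming codes without fixing parameters. Three parity bits at positions 1, 2, and 4 protect four data bits, and the syndrome gives the position of a single flipped bit. This version corrects single errors only; a SEC-DED variant would add an overall parity bit, which is not shown.

```python
def hamming74_encode(data):
    """Encode 4 data bits as a 7-bit codeword (positions 1..7, parity at 1, 2, 4)."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4   # covers positions 3, 5, 7
    p2 = d1 ^ d3 ^ d4   # covers positions 3, 6, 7
    p4 = d2 ^ d3 ^ d4   # covers positions 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]

def hamming74_decode(code):
    """Correct at most one flipped bit; return (data bits, error position or 0)."""
    c = [0] + list(code)                 # shift to 1-based positions
    s1 = c[1] ^ c[3] ^ c[5] ^ c[7]
    s2 = c[2] ^ c[3] ^ c[6] ^ c[7]
    s4 = c[4] ^ c[5] ^ c[6] ^ c[7]
    position = s1 + 2 * s2 + 4 * s4      # syndrome read as a bit position
    if position:
        c[position] ^= 1                 # flip the erroneous bit back
    return [c[3], c[5], c[6], c[7]], position

word = hamming74_encode([1, 0, 1, 1])
corrupted = list(word)
corrupted[4] ^= 1                        # flip the bit at position 5
print(word)
print(hamming74_decode(corrupted))
```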

Time Redundancy
i. Permanent Error Detection and Time Redundancy
    i. Recomputing with Shifted Operands (RESO)

Reliability Modelling and Evaluation
i. Empirical Models
ii. The Analytical Technique
    i. Combinatorial (continuous) Model
    ii. Discrete (Markov) Model
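
The two analytical models can be sketched briefly. For the combinatorial model, the code assumes independent components arranged in series (all must work) or in parallel (at least one must work). For the Markov model, it assumes a two-state system (up, down) with constant failure rate λ and repair rate μ, stepped in small time increments dt. These assumptions, the function names, and the numbers are illustrative and not taken from the slides.

```python
from math import prod

def series_reliability(rs):
    """All components must work: product of the component reliabilities."""
    return prod(rs)

def parallel_reliability(rs):
    """At least one component must work: 1 - product of the unreliabilities."""
    return 1.0 - prod(1.0 - r for r in rs)

def two_state_markov_availability(lam, mu, dt, steps):
    """Iterate P(up) for a two-state Markov chain, starting in the up state.
    Per step: up -> down with probability lam*dt, down -> up with mu*dt."""
    p_up = 1.0
    for _ in range(steps):
        p_up = p_up * (1.0 - lam * dt) + (1.0 - p_up) * mu * dt
    return p_up

print(series_reliability([0.9, 0.9]))
print(parallel_reliability([0.9, 0.9]))
# Hypothetical rates per hour; P(up) settles at mu / (lam + mu).
print(two_state_markov_availability(0.01, 0.1, 0.1, 10000))
```

The Markov iteration can capture repair, which the purely combinatorial formulas cannot, and its fixed point is the steady-state availability μ / (λ + μ).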