Software Fault Tolerance and Recovery
Introduction to Software Fault Tolerance
Francis Palma
19th June 2017
Lakehead University, Thunder Bay, ON
Reference Books
Types of Systems
• Common requirement for the software used in the above systems?
• Safe and reliable software
©Francis Palma, Lakehead University, 2017
Examples of Events
• Problems in the backup tracking software delayed the launch of Atlantis for 3 days
• A 9-hour US-wide blockade occurred when one switch experienced abnormal behavior and attempted recovery, because of a flaw in the recovery-recognition software
• In the Gulf War, clock drift in the Patriot system caused it to miss a Scud missile that hit an American barracks, killing 29 and injuring 97. The clock drift was caused by the use of two different representations (24-bit and 48-bit) of the value 0.1 in software
• In 2016, Qatar Airways walked away from 4-6 aircraft orders after problems affected the A320's hydraulics and software
Topics Covered
• Fault, Failure, and Error
• Dependable Software and Means to Achieve Dependability
• Types of Error Recovery in Software Fault Tolerance
• Types of Redundancy for Software Fault Tolerance
• Checking the Understanding
An Introduction to Fault, Failure, and Error
Fault, Failure, and Error
• A fault is the identified cause of an error, also known as a 'bug'
  • The actual 'mistake' in the code
• An error is the part of the system state that is liable to lead to a failure
  • The bad state in the system that results from the fault
• A failure is the deviation from expected behaviour observed by the user as a result of an error
• Software Fault Tolerance
  • To prevent failures by tolerating faults whose occurrences are known when errors are detected
Fault, Failure, and Error: The Relationship
• Fault: bug or mistake in the code
• Error: a state that may lead to failure
• Failure: actual result deviates from expected result; represented by an exception
• Fault → causes → Error → results in → Failure
Fault, Failure, and Error: An Example
1 int doubleValue(int i) {
2     int result;
3     result = i * i;
4     print result;
5 }
• Fault: Line 3 should be result = 2 * i
• Error: If i = 1 then result = 1 * 1, which is 1 (expected 2); if i = 3 then result = 3 * 3, which is 9 (expected 6), ...
• Failure: If i = 1 then 1 is printed; if i = 3 then 9 is printed, ...
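The distinction between a fault and a failure can be made concrete with a minimal Python sketch. It assumes a hypothetical `double_value` that returns its result instead of printing it, and a hypothetical helper `failing_inputs`; neither name comes from the slides. The fault from line 3 is kept on purpose, and the sketch shows that the fault is always present while only some inputs turn it into an observable failure.

```python
def double_value(i: int) -> int:
    """Intended to return 2 * i; the fault from the slide is kept on purpose."""
    result = i * i  # fault: should be 2 * i
    return result


def failing_inputs(inputs):
    """Return the inputs for which the observed output deviates from 2 * i."""
    return [i for i in inputs if double_value(i) != 2 * i]


if __name__ == "__main__":
    # i = 0 and i = 2 mask the fault because i * i == 2 * i for those values.
    print(failing_inputs(range(5)))
```

Inputs 0 and 2 produce the expected output even though the code is faulty, which is one reason testing alone cannot guarantee the absence of faults.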
An Introduction to Dependable Software
Dependable Software
• For dependable software:
  1. the users will have full trust in it
  2. the users will have confidence that it will operate as expected and will not ‘fail’ in normal use
• Dependability
  • Impairments: Faults, Errors, Failures
  • Means
    • Construction: (1) Fault avoidance, (4) Fault tolerance
    • Validation: (2) Fault removal, (3) Fault forecasting
  • Attributes: Availability, Reliability, Safety, Confidentiality, Integrity, Maintainability
Fault Avoidance
• Fault avoidance techniques that contribute to system dependability include:
  • Rigorous System Requirements Specification: System failure may occur due to logic errors incorporated into the requirements specification
  • Structured Design and Programming Methods: The principles of decoupling, modularization, and encapsulation (e.g., information hiding) reduce the overall complexity of the software, making it easier to understand and implement
  • Software Reuse: Reduces the number of components that must be developed from scratch (object-oriented principles)
• Despite fault prevention efforts, faults are created, so Fault Removal is required
Fault Removal
• Fault removal techniques contribute to system dependability during software verification and validation:
  • Software Testing: The most common fault removal technique
  • Formal Inspection: A rigorous process to examine source code to find and correct the faults, and then verify the corrections (widely applied in industry)
  • Formal Design Proofs: Using executable specifications, test cases can be automatically generated to improve the software verification process
• Fault removal is not perfect, so Fault Forecasting and Fault Tolerance are needed
Fault Forecasting
• Fault forecasting is done during the validation of software to estimate the presence of faults and usually focuses on the reliability measure of dependability:
  • Reliability Estimation
    • Determines current software reliability by applying statistical inference techniques to failure data obtained during system testing (or system operation)
  • Reliability Prediction
    • Determines future software reliability based upon available software metrics and measures
• Fault forecasting can indicate the need for Fault Tolerance
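Reliability estimation from failure data can be illustrated with a small Python sketch. The slides do not name a statistical model, so this example assumes the simplest one: a constant failure rate (exponential model), where R(t) = exp(-λt). The function names and the sample figures (4 failures in 2000 test hours) are made up for illustration.

```python
import math


def estimate_failure_rate(num_failures: int, total_test_hours: float) -> float:
    """Point estimate of a constant failure rate (failures per hour)."""
    if total_test_hours <= 0:
        raise ValueError("total_test_hours must be positive")
    return num_failures / total_test_hours


def reliability(failure_rate: float, mission_hours: float) -> float:
    """Probability of running mission_hours without failure: exp(-rate * t)."""
    return math.exp(-failure_rate * mission_hours)


if __name__ == "__main__":
    # Assumed sample data: 4 failures observed over 2000 hours of system testing.
    rate = estimate_failure_rate(num_failures=4, total_test_hours=2000.0)
    print(f"failure rate: {rate:.4f} per hour")
    print(f"R(100 h) = {reliability(rate, 100.0):.3f}")
```

Real reliability growth models are more elaborate; the point here is only that the estimate is derived from observed failure data rather than from code metrics.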
Fault Tolerance
• Fault tolerance techniques that contribute to system dependability during software development include:
  • Single Version Software Environment: Partially tolerates software design faults through monitoring techniques or exception handling
  • Multiple Version Software Environment (design diverse): Functionally equivalent, independently developed software versions can provide tolerance to faults
    • Examples: Recovery Blocks (RcB), N-Version Programming (NVP), and N Self-Checking Programming (NSCP)
  • Multiple Data Representation Environment (data diverse): Different representations of input data are utilized to provide tolerance to software design faults
    • Examples: Retry Blocks (RtB) and N-Copy Programming (NCP)
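As an illustration of the design-diverse approach, the following Python sketch shows the general shape of a Recovery Block: a primary variant, an independently designed alternate, and an acceptance test that decides whether a result can be used. All function names are hypothetical, the fault in `primary_sqrt` is seeded deliberately, and the variants are pure functions, so the state restoration a real recovery block performs between variants is not needed here.

```python
import math


def acceptance_test(x: float, root: float) -> bool:
    """Accept a candidate square root if squaring it reproduces x closely."""
    return root >= 0 and math.isclose(root * root, x, rel_tol=1e-9, abs_tol=1e-12)


def primary_sqrt(x: float) -> float:
    """Hypothetical primary variant with a seeded design fault for x > 100."""
    return x / 2 if x > 100 else math.sqrt(x)


def alternate_sqrt(x: float) -> float:
    """Independently designed alternate: Newton's iteration."""
    if x < 0:
        raise ValueError("negative input")
    if x == 0:
        return 0.0
    guess = x
    for _ in range(60):
        guess = 0.5 * (guess + x / guess)
    return guess


def recovery_block(x: float) -> float:
    """Try each variant in order; return the first result that passes the test."""
    for variant in (primary_sqrt, alternate_sqrt):
        try:
            result = variant(x)  # every variant receives the same input
        except Exception:
            continue  # an exception counts as a failed variant
        if acceptance_test(x, result):
            return result
    raise RuntimeError("all variants failed the acceptance test")
```

For an input such as 144.0 the primary result is rejected by the acceptance test and the alternate supplies the answer; for 9.0 the primary result is accepted and the alternate never runs.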
The Fault Tolerance Process and Types of Error Recovery
The Fault Tolerance Process
• A set of activities whose goal is to remove errors and their effects from the computational state before a failure occurs
  1. Error Detection: An erroneous state is identified
  2. Error Diagnosis: The damage is assessed and the cause of the error is determined
  3. Error Containment / Isolation: Further damage is prevented, i.e., the error is prevented from propagating
  4. Error Recovery: The erroneous state is replaced with an error-free state
Types of Error Recovery
• Backward Recovery
  • Attempts to return the system to a previously saved error-free state by restoring or rolling back the system
  • System states are saved at predetermined recovery points called checkpoints
  • Advantages:
    • Can handle unpredictable errors caused by unresolved design faults
    • Requires no knowledge of the errors in the system state
  • Disadvantages:
    • Requires significant resources (e.g., time, computation, and stable storage)
    • The system might need to be halted temporarily
    • A domino effect may occur, i.e., a series of interdependent roll-backs
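Checkpointing and rollback can be sketched in a few lines of Python. The class `CheckpointedAccount` and the function `safe_withdraw` are hypothetical names chosen for this illustration, and the checkpoint is kept in memory rather than in the stable storage a real system would need.

```python
import copy


class CheckpointedAccount:
    """Toy stateful component with a single recovery point (checkpoint)."""

    def __init__(self, balance: int) -> None:
        self.state = {"balance": balance, "history": []}
        self._checkpoint = copy.deepcopy(self.state)

    def checkpoint(self) -> None:
        """Save the current state as the recovery point."""
        self._checkpoint = copy.deepcopy(self.state)

    def rollback(self) -> None:
        """Backward recovery: restore the last saved state."""
        self.state = copy.deepcopy(self._checkpoint)

    def withdraw(self, amount: int) -> None:
        self.state["history"].append(("withdraw", amount))
        self.state["balance"] -= amount
        # Error detection: a negative balance is an erroneous state.
        if self.state["balance"] < 0:
            raise RuntimeError("erroneous state detected")


def safe_withdraw(account: CheckpointedAccount, amount: int) -> bool:
    """Checkpoint, attempt the operation, and roll back if an error is detected."""
    account.checkpoint()
    try:
        account.withdraw(amount)
        return True
    except RuntimeError:
        account.rollback()
        return False
```

The rollback needs no knowledge of what went wrong; it only needs the saved state. The cost is the deep copy taken at every checkpoint, which mirrors the resource overhead listed among the disadvantages.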
Backward Recovery
[Figure: backward recovery]
*Source: Laura L. Pullum, Software Fault Tolerance Techniques and Implementation, 2001.
Types of Error Recovery
• Forward Recovery
  • With a full backup image, roll forward through the archive logs to recover to a specific System Change Number (SCN), date/time, or until an administrator cancels the recovery process
  • Alternatively, error compensation based on a redundancy model can be used, where redundant software processes are executed in parallel and a Fault Detection and Handling Unit (a.k.a. Adjudicator) selects the one with the correct result, e.g., the NVP fault tolerance technique
  • Advantages:
    • Fairly efficient in terms of overhead (time and memory)
    • Anticipated faults or potential loss of data can be handled well using redundancy and forward recovery
  • Disadvantages:
    • Requires thorough knowledge of the error
    • Application-specific and must be tailored to each situation or program
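Error compensation with an adjudicator can be illustrated with a small N-version sketch in Python. The three versions and the majority voter below are hypothetical; one version carries a seeded fault, and the versions run one after another here rather than in parallel, which is enough to show how the voter masks a single faulty result.

```python
from collections import Counter


def version_a(x: int) -> int:
    return abs(x)


def version_b(x: int) -> int:
    return x if x >= 0 else -x


def version_c(x: int) -> int:
    """Hypothetical variant with a seeded fault: wrong for negative inputs."""
    return x


def majority_vote(results):
    """Adjudicator: return the value produced by a strict majority of versions."""
    value, count = Counter(results).most_common(1)[0]
    if count * 2 <= len(results):
        raise RuntimeError("no majority among version results")
    return value


def n_version_abs(x: int) -> int:
    """Run all versions on the same input and let the voter pick the output."""
    results = [version(x) for version in (version_a, version_b, version_c)]
    return majority_vote(results)
```

No rollback takes place: the faulty result is simply outvoted and execution continues, which is what makes this a forward recovery scheme. The voter fails only when no strict majority exists.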
Forward Recovery
[Figure: forward recovery]
*Source: Laura L. Pullum, Software Fault Tolerance Techniques and Implementation, 2001.
Types of Redundancy and Software Fault Tolerance
Types of Redundancy
• Hardware redundancy includes replicated and supplementary hardware added to the system to support fault tolerance
  • The most common use of redundancy
• Software redundancy includes the additional programs, modules, or objects used in the system to support fault tolerance
• Information or data redundancy uses additional information with data to assist in hardware or software fault tolerance
• Temporal redundancy involves the use of additional time to perform the tasks required to support fault tolerance
Software Redundancy
• Software faults cannot be detected by simple replication of identical software units, because the same fault will exist in each copy
• Solution: Introduce diversity into the software replicas
• Basic Approach: Start with the same specification and have different programming teams develop the variants independently, which will result in functionally equivalent, design-diverse software components
• However, we need to decide on the acceptability of the results obtained by the variants. The component that performs this task is called the Adjudicator
Review of the Lecture
(1) In practice, software development is not error-free even if the best people, practices, and tools are used.
(2) The goal of software fault tolerance is to prevent failures by tolerating faults whose occurrences are known when errors are detected.
(3) Four means to achieve dependable software are: fault avoidance, fault removal, fault/failure forecasting, and fault tolerance.
(4) The fault tolerance process consists of four activities: error detection, error diagnosis, error containment/isolation, and error recovery.
(5) There are two types of recovery: backward recovery and forward recovery.
(6) There are four types of redundancy for fault tolerance: hardware redundancy, software redundancy, information or data redundancy, and temporal redundancy.
Checking the Understanding
Right or Wrong?
1. An incorrect statement in a requirements document is often caused by a human mistake.
2. A defect is also known as an error.
3. Bug and fault are synonyms.
4. Design errors can lead to the wrong data stored in a database.
5. A coding mistake is one example of a software failure.
6. An incorrect total in a printed report is an example of an error.
7. Incorrect logic statements in a program are examples of defects.
8. When a system stops unexpectedly, it is called a failure.
Questions?
Information or Data Redundancy
• Diverse data (not simple redundant copies) can be used for tolerating software faults
• A data re-expression algorithm (DRA) produces different representations of a module's input data
• This transformed data is input to copies of the module in data-diverse software fault tolerance techniques
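A data re-expression algorithm can be illustrated with a retry-block style sketch in Python. It assumes a hypothetical module `faulty_sine` with a seeded fault in a small input region, and uses the identity sin(x) = sin(x + 2π) as the re-expression; the names and the acceptance test are illustrative only.

```python
import math


def faulty_sine(x: float) -> float:
    """Hypothetical module with a seeded fault in a small input region."""
    if 1.0 <= x < 1.1:
        return 2.0  # out-of-range result caused by the fault
    return math.sin(x)


def re_express(x: float) -> float:
    """Data re-expression: sin(x) == sin(x + 2*pi), so the input is equivalent."""
    return x + 2.0 * math.pi


def acceptance_test(result: float) -> bool:
    """A sine value must lie in [-1, 1]."""
    return -1.0 <= result <= 1.0


def retry_block_sine(x: float, max_retries: int = 3) -> float:
    """Run the module; on a rejected result, retry with re-expressed input."""
    current = x
    for _ in range(max_retries + 1):
        result = faulty_sine(current)
        if acceptance_test(result):
            return result
        current = re_express(current)
    raise RuntimeError("no acceptable result after re-expression")
```

The same module code is executed each time; only the representation of the input changes, which moves the computation out of the input region that triggers the fault.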
Temporal Redundancy
• Temporal redundancy commonly comprises repeating an execution using the same software and hardware resources involved in the initial, failed execution
• Backward recovery schemes typically use a combination of temporal and software redundancy
• Temporal redundancy is mainly used in human-interactive programs
• Applications with hard real-time constraints are not suitable for using temporal redundancy
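Re-executing a failed operation with the same resources can be sketched as a simple retry loop in Python. The exception type `TransientFault`, the helper `retry`, and the simulated `read_sensor` operation are hypothetical names introduced for this illustration.

```python
class TransientFault(Exception):
    """Raised by an operation that may succeed if simply executed again."""


def retry(operation, max_attempts: int = 3):
    """Re-execute the same operation with the same inputs up to max_attempts times."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return operation()
        except TransientFault as error:
            last_error = error  # spend extra time: run the same code again
    raise RuntimeError("operation failed on every attempt") from last_error


def make_flaky_reader(failures_before_success: int):
    """Hypothetical operation that fails a fixed number of times, then succeeds."""
    remaining = {"count": failures_before_success}

    def read_sensor() -> int:
        if remaining["count"] > 0:
            remaining["count"] -= 1
            raise TransientFault("temporary glitch")
        return 42

    return read_sensor
```

Each retry adds the full execution time of the operation to the response time, so the worst-case delay grows with `max_attempts`. That is why applications with hard real-time constraints are poor candidates for this kind of redundancy.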