5. Software Redundancy
Reliable System Design 2010
by Amir M. Rahmani

Software
• There are many kinds of software
• System software
  - Operating system (Windows, Linux, Solaris, etc.)
  - Device driver (for printer, graphics card, etc.)
  - Compiler (gcc)
  - Library (DLLs)
  - Distributed system (software shared memory, Napster, etc.)
• User-level software
  - E.g., simulator, word processor, spreadsheet, game, etc.

Software Faults/Errors
• Operating system software (including device drivers)
  - Deadlock (may be able to escape with Ctrl-Alt-Delete)
  - Crash and reboot
  - Incorrect I/O
• User-level software
  - Deadlock (can escape with Control-C)
  - Incorrect algorithm
  - Array bounds violation
  - Memory leak: allocating memory but not deallocating it (C, C++; much rarer in garbage-collected Java)
  - Reference to a NULL pointer (undefined behavior in C and C++; Java raises a NullPointerException instead)
  - Incorrect synchronization in multithreaded code
    - Allowing more than one thread in a critical section at a time
    - Blocking while holding a lock
  - Inability to handle unanticipated inputs
  - Exception that triggers the OS to kill the process
    - Segmentation fault
    - Bus error

Specification vs. Implementation
• There are many, many techniques
• Problem is in two parts:
  1. Correct specification, erroneous implementation
  2. Erroneous specification, correct implementation
• Both parts concern us as software engineers
• Both parts need attention: why a system fails is not important to the users
• Different techniques for the two parts

Static (Pre-Release) Fault Detection in Software
• As with hardware, can try to find faults before shipping the product
  - Design reviews
  - Formal verification (analysis of design)
  - Testing (analysis of implementation)
• Can try to add in redundancy to mask potential faults
  - N-version programming
• Can try to proactively "scrub" the software to remove latent errors (due to aging) before failures occur
  - Software rejuvenation: stopping the running software occasionally, "cleaning" its internal state (e.g., garbage collection, flushing operating system kernel tables, and reinitializing internal data structures), and restarting. E.g., periodically reboot the system to flush out remaining latent problems due to aging

Static Fault Detection with Formal Verification
• Formal verification is a systematic, mathematical way to prove that a system (software or hardware) is correct or incorrect
• Correctness is based on a specification
• Examples of mathematical objects often used to model systems are:
  - finite state machines, Petri nets, timed automata, process algebra
• Two broad approaches to formal verification
  - Theorem proving
  - Model checking

Formal Verification: Theorem Proving
• Theorem proving consists of using a formal version of mathematical reasoning (logical inference) about the system
• Example: theorem proving software such as
  - HOL (Higher Order Logic) theorem prover
  - ACL2 (A Computational Logic for Applicative Common Lisp)
• Develop logical/mathematical equations that describe:
  - System to be verified
  - Specification of correctness for the system
• In the rules of this logic/mathematics, prove that the system is equivalent to its specification
• Theorem proving is difficult for very large complex systems, but can work on small subsystems

Formal Verification: Model Checking
• Model checking consists of a systematically exhaustive exploration of the mathematical model (possible for finite models)
  - Example: describe the system as a finite state machine (FSM)
• Develop logical/mathematical equations that describe required properties of the FSM
• Example properties:
  - Never ends up in state X
  - Can reach every desired state in the FSM
• Software has been developed to perform model checking; logically, this is an exhaustive search
  - Example: Murφ (Murphi) model checker (from Stanford)
• Similar to theorem proving, model checking is difficult for large complicated systems
  - The number of states to explore tends to grow exponentially with the number of state variables (state explosion)
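The exhaustive search described on this slide can be sketched in a few lines. This is a toy illustration, not how Murφ or any real model checker works internally: the FSM, its state names, and the helper `reachable_states` are all hypothetical.

```python
from collections import deque

# Hypothetical FSM: each state maps to the states reachable in one step.
TRANSITIONS = {
    "idle": ["waiting"],
    "waiting": ["critical", "idle"],
    "critical": ["idle"],
    "error": ["error"],  # no transition from the other states leads here
}


def reachable_states(transitions, initial):
    """Exhaustively explore the FSM from the initial state (breadth-first)."""
    seen = {initial}
    frontier = deque([initial])
    while frontier:
        state = frontier.popleft()
        for nxt in transitions[state]:
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return seen


reached = reachable_states(TRANSITIONS, "idle")

# Property 1: the system never ends up in state "error".
never_in_error = "error" not in reached
# Property 2: every desired state can be reached.
all_desired_reachable = {"idle", "waiting", "critical"} <= reached
```

Both properties reduce to questions about the set of reachable states, which is why the cost of the search is tied directly to the size of the state space.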

Verification and Validation
• Validation: "Are we trying to make the right thing?", i.e., is the product specified to the user's actual needs?
• Verification: "Have we made what we were trying to make?", i.e., does the product conform to the specifications?
• The overall checking process is often referred to as V&V

Software Tools for Static Analysis
• There are tools that can analyze software to determine if it has bugs
  - In most cases the analysis is performed on some version of the source code, and in the other cases on some form of the object code
• Can check to see if:
  - All code is reachable
  - Deadlock is possible
• Advantage of static analysis tools
  - Checks all possible control flow paths through the application, so it can detect any possible specified problem, even if it would only occur very rarely in practice
• Disadvantages
  - Must have access to the entire code base, e.g., can't deal with dynamically loaded libraries
  - Difficult to assess the probability of an error occurring in practice

Dynamic Fault Detection in Software
• Must add code to check software as it is running
  - Unless you're willing to wait for it to crash
• Added code = redundancy!
• Most common form of error detection: assertions
  - E.g., assert (Grade >= 0 && Grade <= 20)
• Challenges
  - Knowing which invariants to check
  - Knowing when to check these invariants
  - Dealing with black box code (e.g., libraries)
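The grade assertion from this slide, written as a minimal Python sketch. The function `record_grade` and the dictionary of grades are hypothetical names chosen for illustration.

```python
def record_grade(grades, student, grade):
    """Store a grade, checking the invariant at the point where data enters."""
    # Invariant from the slide: a grade lies between 0 and 20 inclusive.
    assert 0 <= grade <= 20, f"grade out of range: {grade}"
    grades[student] = grade
    return grades
```

Note that Python removes `assert` statements when run with the `-O` flag, so checks that must survive in production are usually written as explicit `if` tests that raise an exception.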

Automatic Dynamic Fault Detection with Meta-Compilation
• Recent research from Berkeley explores how to have the compiler automatically integrate error checking into code
• User can specify general high-level invariants
• Compiler automatically integrates invariant checking into the code
• Examples
  1. 99% of lock_acquire() calls have a corresponding lock_release(), so the other 1% are probably wrong
  2. if (ptr == NULL) { printf("%d", ptr->data); }  // what's wrong here?

Other Forms of Dynamic Fault Detection
• Java has automatic array bounds checking, and it won't let you write beyond the bounds of the array
• The operating system will not let an application process access memory that doesn't belong to it. This is what is happening when you see "segmentation fault"!
• FTP software uses a checksum to make sure that the data that was received is the same as the data that was sent
• Other examples?
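The checksum idea can be sketched as follows. CRC-32 from Python's standard `zlib` module is an assumption made for illustration; it is not a claim about what any particular file transfer program uses, and `send` and `receive` are hypothetical helpers.

```python
import zlib


def send(payload: bytes):
    """Sender side: attach a checksum computed over the payload."""
    return payload, zlib.crc32(payload)


def receive(payload: bytes, checksum: int) -> bool:
    """Receiver side: recompute the checksum and compare with the one sent."""
    return zlib.crc32(payload) == checksum
```

A usage sketch: `data, crc = send(b"hello")` followed by `receive(data, crc)` accepts the transfer, while a payload corrupted in transit fails the comparison.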

Self-Checking Code
• Can we write software that checks that its output is correct?
  - Example: if we divide A/B = C, we can check the result by multiplying B*C. If B*C != A, then the division was incorrect.
  - Detects hardware faults (famous Pentium bug)
  - Detects software faults (assuming a more complicated operation than just division, which is a single instruction)
• Key idea: checking a computation is always at least as easy as performing it (result from computational complexity theory)
• Other examples? Finding paper.
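A minimal sketch of the division check. The slide's test B*C != A assumes exact division; for integers this sketch also carries the remainder, which is an extension not on the slide. The name `checked_divide` and the restriction to positive divisors are assumptions made to keep the example short.

```python
def checked_divide(a: int, b: int):
    """Integer division that verifies its own result by multiplying back."""
    if b <= 0:
        raise ValueError("this sketch only handles positive divisors")
    q, r = divmod(a, b)
    # Self-check: the inverse operation must reproduce the dividend,
    # and the remainder must be smaller than the divisor.
    if q * b + r != a or not (0 <= r < b):
        raise ArithmeticError("division self-check failed")
    return q, r
```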

Hardware for Software Fault-Tolerance
• Difficult for HW to know that SW is in error, because HW doesn't know what SW is trying to do
• Example
  - It's unlikely that a program really wants to divide by zero
  - Any others? Watchdog timer
• Current work at Duke is exploring hardware support for detecting starvation

Software for Hardware Fault-Tolerance
• Many examples of using software to tolerate HW faults
• In fact, all schemes for tolerating software errors will detect hardware errors that manifest themselves in the same way (i.e., they have the same error model)
  - E.g., self-checking software will detect a hardware fault if it leads to an incorrect result
  - Example: if we divide A/B = C, we can check the result by multiplying B*C. If B*C != A, then the division was incorrect.

What is Software Fault Tolerance?
• The term "software fault-tolerance" can mean two things:
  1. "the tolerance of software faults", or
  2. "the tolerance of faults by the use of software"
• Definition 1 is more commonly used. The term "software redundancy" corresponds to definition 2.
• Remember: all software faults are design faults (specification and implementation mistakes)!

[Figure: Cause-and-Effect Relationship]

Software Redundancy
• Software redundancy techniques can be divided into two major classes:
• With diversity
  - Design or data diversity
  - Aim is to tolerate design faults
• Without diversity
  - Implements error detection, recovery, etc.
  - Aim is to handle errors of any origin (physical faults, design faults, operator faults)

Design Diversity
• Design diversity is used to tolerate design faults in hardware and software
• Two techniques for tolerating software design faults:
  - N-version programming
  - Recovery blocks

N-version programming
• Heterogeneous redundancy
  - TMR is homogeneous redundancy
  - Question: why would TMR not work here?
• Uses majority voting on results produced by N program versions
• Program versions are developed by different teams of programmers
• Assumes that programs fail independently
• Looks like masking hardware redundancy
• Uses forward error recovery
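Majority voting over N versions can be sketched as below. The three versions are stand-ins for independently developed programs (here they all square a number, and `version_c` carries a deliberately seeded fault); all names are hypothetical, and the versions run one after another rather than in parallel.

```python
from collections import Counter


def version_a(x):
    return x * x


def version_b(x):
    return x ** 2


def version_c(x):
    # Seeded fault for illustration: wrong answer for the input 3 only.
    return x * x + 1 if x == 3 else x * x


def n_version_execute(versions, x):
    """Run every version on the same input and return the majority result."""
    results = [version(x) for version in versions]
    value, votes = Counter(results).most_common(1)[0]
    if votes * 2 <= len(results):
        raise RuntimeError("no majority among versions")
    return value
```

With three versions the voter masks the single faulty result for input 3. If two versions shared the same fault, the vote would confirm the wrong answer, which is why the independence assumption matters.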

[Figure: N-version programming]

Achieving Version Independence: Diversity
• Different design teams for each version
• Diverse specifications
• Versions with differing capabilities
• Teams working on different modules are forbidden to communicate directly
• Diverse programming languages, development tools, compilers, hardware, operating systems, etc.
• Questions regarding ambiguities in specifications or any other issue have to be addressed to some central authority, who makes any necessary corrections and updates all teams

Causes of Version Correlation
• Common specifications: errors in specifications will propagate to software
• Inherent difficulty of problem: algorithms may be more difficult to implement for some inputs, causing faults triggered by the same inputs
• Common algorithms: the algorithm itself may contain instabilities in certain regions of the input space, so different versions have instabilities in the same region
• Cultural factors: programmers make similar mistakes in interpreting ambiguous specifications
• Common software and hardware platforms: if the same hardware, operating system, and compiler are used, their faults can trigger a correlated failure

N-version programming depends on
• Initial specification: the majority of software faults are thought to stem from inadequate specification. A specification error will manifest itself in all N versions of the implementation
• Independence of effort: experiments produce conflicting results. Where part of a specification is complex, this leads to a lack of understanding of the requirements. If these requirements also refer to rarely occurring input data, common design errors may not be caught during system testing
• Adequate budget: the predominant cost is software. A 3-version system will triple the budget requirement and cause problems of maintenance. Would a more reliable system be produced if the resources potentially available for constructing N versions were instead used to produce a single version?

Evaluation of N-version programming
• Few experimental studies of the effectiveness of N-version programming
• Published results only for work in universities
• Program: anti-missile application
  - 27 versions produced by students at University of Virginia and University of California at Irvine. Published in 1985.
  - Some had no prior industrial experience, while others had over ten years
  - All students were given the same specification
  - All versions were written in Pascal
  - 200 test cases to validate each program
  - 1 million test cases to test independence (simulation of production)
  - 93 correlated faults were identified by standard statistical hypothesis-testing methods
  - No correlation observed between quality of programs produced and experience of programmer

Recovery Block
• N versions, one running: if it fails, execution is switched to a backup
• Uses one primary software module and one or several secondary (back-up) software modules
• Assumes that program failures can be detected by acceptance tests
• Executes only the primary module under error-free conditions
• Looks like dynamic hardware redundancy

[Figure: Recovery Block]

Recovery Block Mechanism (flowchart)
1. Establish recovery point
2. Any alternatives left? If no, fail the recovery block
3. If yes, execute the next alternative
4. Evaluate the acceptance test
   - Pass: discard the recovery point
   - Fail: restore the recovery point and return to step 2

Recovery Block Format
• Acceptance test is provided to check if answers are reasonable
• Format:
    ensure acceptance test
    by primary module
    else by first alternative
    else by second alternative
    ...
    else error
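The ensure / by / else by structure can be sketched as a small driver function. This is an illustration under stated assumptions, not a standard API: the state is a plain dictionary, the recovery point is a deep copy of it, and the names `recovery_block`, `primary`, `secondary`, and `non_negative` are hypothetical.

```python
import copy


def recovery_block(state, acceptance_test, alternatives):
    """Try each alternative in turn until one passes the acceptance test."""
    checkpoint = copy.deepcopy(state)  # establish recovery point
    for alternative in alternatives:
        try:
            result = alternative(state)
            if acceptance_test(state, result):
                return result  # pass: the recovery point is discarded
        except Exception:
            pass  # an exception counts as a failed alternative
        # fail: restore the recovery point before the next alternative
        state.clear()
        state.update(copy.deepcopy(checkpoint))
    raise RuntimeError("recovery block failed: no alternatives left")


def primary(state):
    state["total"] = -1  # seeded fault: corrupts the state
    return state["total"]


def secondary(state):
    state["total"] = sum(state["items"])
    return state["total"]


def non_negative(state, result):
    """Acceptance test: checks reasonableness, not full correctness."""
    return result >= 0
```

Called as `recovery_block(state, non_negative, [primary, secondary])`, the faulty primary is rejected, its changes to the state are rolled back, and the secondary supplies the result.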

Example: Solution to Differential Equation
    ensure Rounding_err_has_acceptable_tolerance
    by Explicit Kutta Method
    else by Implicit Kutta Method
    else error
• Explicit Kutta Method: fast but inaccurate when equations are stiff
• Implicit Kutta Method: more expensive but can deal with stiff equations
  - The above will cope with all equations
  - It will also potentially tolerate design errors in the Explicit Kutta Method if the acceptance test is flexible enough

Construction of Acceptance Tests
• An acceptance test is a software-implemented check designed to detect errors in the results produced by a primary or a secondary module
• The design of the acceptance test is crucial to the efficacy of the Recovery Block scheme
• Acceptance tests often rely on application-specific information
• All the previously discussed error detection techniques can be used to form the acceptance tests
• There is a trade-off between providing comprehensive acceptance tests and keeping overhead to a minimum, so that fault-free execution is not affected
• Note that the term used is acceptance, not correctness; this allows a component to provide a degraded service
• However, care must be taken, as a faulty acceptance test may lead to residual errors going undetected
• Success of the Recovery Block approach depends on failure independence of the different versions (modules) and on the quality of the acceptance test

Examples of how acceptance tests can be constructed
• Satisfaction of requirements (Structural checks)
  - Inversion of mathematical functions; e.g., squaring the result of a square-root operation to see if it equals the original operand
  - Checking sort functions; result should have elements in descending order
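Both checks on this slide can be sketched as small predicates. The function names, the tolerance value, and the added permutation check for the sort are assumptions made for illustration.

```python
import math
from collections import Counter


def sqrt_is_acceptable(operand, result, rel_tol=1e-9):
    """Invert the function: squaring the result should give back the operand."""
    return math.isclose(result * result, operand, rel_tol=rel_tol)


def sort_is_acceptable(original, result):
    """Check descending order and that no element was lost or invented."""
    ordered = all(a >= b for a, b in zip(result, result[1:]))
    same_items = Counter(original) == Counter(result)
    return ordered and same_items
```

The comparison for the square root uses a tolerance rather than exact equality because floating-point rounding means the squared result rarely matches the operand bit for bit.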

Examples of how acceptance tests can be constructed
• Reasonableness checks
  - Checking physical constraints; e.g., speed, pressure, etc.
  - Checking sequence of application states
• Structural checks
  - Structural checks are based on known properties of data structures: the number of elements in a list can be counted, or links and pointers can be verified

Evaluation of Recovery Blocks
• Naval command control system (8000 statements in the Coral language)
• 117 abnormal events:
    Correct recovery                          78%
    Incorrect recovery, program failure        3%
    Incorrect recovery, no program failure    15%
    Unnecessary recovery                       3%
• Anderson, T., et al., "Software Fault Tolerance: An Evaluation," IEEE Trans. on Software Engineering, vol. SE-11, no. 12, Dec. 1985, pp. 1502-1510.

N-Version vs. Recovery Block
• N-version programming
  - Applied at the program level
  - Runs N programs at the same time
  - Looks like static hardware redundancy
  - Vote comparison (error masking)
  - Assumes that independence among program versions is achieved by random differences in programming style among programmers
• Recovery block
  - Applied at the module (subprogram) level
  - Runs only the primary module under error-free conditions
  - Looks like dynamic hardware redundancy
  - Error detection: acceptance test
  - Independence is achieved by intentionally designing the primary and secondary modules to be as different as possible (different algorithms)

Data Diversity
• This technique is cheaper to implement than the design diversity technique
• Popular fault tolerance techniques based on the data diversity concept are:
  - Retry blocks
  - N-copy programming

Retry Blocks
• A retry block is a modification of the recovery block structure that uses data diversity instead of design diversity (data and re-expressed data, such as the complement of the data)
• Rather than the multiple alternate algorithms used in a recovery block, a retry block uses only one algorithm
• A retry block's acceptance test has the same form and purpose as a recovery block's acceptance test
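A retry block can be sketched as one algorithm run repeatedly on re-expressed input. Everything here is hypothetical: the algorithm `faulty_mean` has a seeded fault, rotation of the list is chosen as the re-expression because it does not change the mean, and `max_retries` is an assumed parameter.

```python
def faulty_mean(values):
    # Seeded fault for illustration: wrong when the first value is 0.
    if values[0] == 0:
        return -1.0
    return sum(values) / len(values)


def rotate(values):
    """Re-expression: the same data in a different order."""
    return values[1:] + values[:1]


def in_range(values, result):
    """Acceptance test: a mean must lie between the minimum and the maximum."""
    return min(values) <= result <= max(values)


def retry_block(algorithm, acceptance_test, re_express, data, max_retries=3):
    """Run one algorithm, re-expressing the input after each rejected result."""
    candidate = data
    for _ in range(max_retries + 1):
        result = algorithm(candidate)
        if acceptance_test(data, result):
            return result
        candidate = re_express(candidate)
    raise RuntimeError("retry block failed: all re-expressions rejected")
```

For the input `[0, 2, 4]` the first attempt is rejected by the acceptance test, and the rotated input `[2, 4, 0]` avoids the faulty path.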

N-Copy Programming
• N-copy programming is similar to N-version programming but uses data diversity instead of design diversity
• N copies of a program execute in parallel, each on a set of data produced by re-expression
• The system selects the output to be used by an enhanced voting scheme
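A minimal sketch of N-copy programming, with two simplifications that are assumptions of this example: the copies run sequentially rather than in parallel, and a plain majority vote stands in for the enhanced voting scheme. The program `faulty_total` and the three re-expressions are hypothetical.

```python
from collections import Counter


def faulty_total(values):
    # Seeded fault for illustration: wrong when the first value is 0.
    if values[0] == 0:
        return -1
    return sum(values)


def identity(values):
    return list(values)


def reverse(values):
    return list(reversed(values))


def rotate(values):
    return values[1:] + values[:1]


def n_copy_execute(program, re_expressions, data):
    """Run the same program on N re-expressions of the data and vote."""
    results = [program(re_express(data)) for re_express in re_expressions]
    value, votes = Counter(results).most_common(1)[0]
    if votes * 2 <= len(results):
        raise RuntimeError("no majority among copies")
    return value
```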

Airbus A330
    National origin    Multi-national
    Manufacturer       Airbus
    First flight       2 November 1992
    Status             In production, in service
    Primary users      Cathay Pacific, Delta Air Lines, Qatar Airways, Emirates
    Produced           1993–present
    Number built       1,016 as of 10 October 2013
    Unit cost          A330-300, €215 million (2011)
• http://en.wikipedia.org/wiki/Airbus_A330 (3 Nov. 2013)

Airbus A340
    National origin    Multi-national
    Manufacturer       Airbus
    First flight       25 October 1991
    Status             Out of production, in service
    Primary users      Lufthansa, Iberia, South African Airways, Virgin Atlantic Airways
    Produced           1993-2011
    Number built       375
    Unit cost          A340-600: US$275.4 million

Design Diversity in Airbus A330/A340
• Two types of computers
  - 3 primary computers
  - 2 secondary computers
• Each computer is internally duplicated and consists of two channels
  - Command channel
  - Monitor channel

[Figure: Architecture for A330/A340, showing the flight control primary computers, the flight control secondary computers, and the flight control data concentrators]

Design Diversity in Airbus A330/A340
• Implementation of primary computers
  - Supplier: Aerospatiale (HW & SW)
  - Hardware: two Intel 80386 (one for each channel)
  - Software: assembler for command channel, PL/M for monitor channel
• Implementation of secondary computers
  - Supplier: Sextant Avionique (HW), Aerospatiale (SW)
  - Hardware: two Intel 80186 (one for each channel)
  - Software: assembler for command channel, Pascal for monitor channel

Exception Handling
• An exception indicates that something happened during execution that needs attention
• Control is transferred to an exception handler, a routine which takes appropriate action
• Example: when executing y = a*b, an overflow makes the result incorrect, so an exception is signaled
• Effective exception handling can make a significant improvement to system fault tolerance
• Over half of the code lines in many programs are devoted to exception handling
• Exception handling is a forward error recovery mechanism, as there is no rollback to a previous state; instead, control is passed to the handler so that recovery procedures can be initiated
• However, the exception handling facility can also be used to provide backward error recovery

Example: Domain and Range Failure
• Exceptions can be used to deal with
  - domain or range failure
  - out-of-ordinary event (not failure) needing special attention
  - timing failure
• A domain failure happens when illegal input is used
  - Example: if X, Y are real numbers and X = √Y is attempted with Y = -1, a domain failure occurs
• A range failure occurs when a program produces an output or carries out an operation that is seen to be incorrect in some way
• Examples include:
  - Encountering an end-of-file while reading data from a file
  - Producing a result that violates an acceptance test
  - Trying to print a line that is too long
  - Generating an arithmetic overflow or underflow
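Python's standard `math` module happens to signal both kinds of failure as exceptions, which makes a compact sketch possible: `math.sqrt` raises `ValueError` for a negative argument and `math.exp` raises `OverflowError` when the result is too large for a float. The wrapper `guarded` and its fallback parameter are hypothetical.

```python
import math


def guarded(func, x, fallback):
    """Call func(x); on a domain or range failure, recover with a fallback."""
    try:
        return func(x), "ok"
    except ValueError:
        # Domain failure: the input is illegal for this function.
        return fallback, "domain failure"
    except OverflowError:
        # Range failure: the result cannot be represented.
        return fallback, "range failure"
```

The handler continues with a substitute value instead of rolling back to an earlier state, so this is forward error recovery.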

Timing Failure
• Timing checks: an effective form of software check for detecting errors, even when programs run in a dual redundant execution mode, provided the specification of a component includes timing constraints
• Watchdog timer
  - Used to guard against program hang-ups
  - Also used in communications between CPU and main store
  - Also used in periodic "hello" exchanges (network surveillance) and in I/O operations
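The watchdog idea can be sketched as a timer that the monitored program must reset ("kick") before a deadline. This is a polled software model with hypothetical names; a real watchdog usually runs independently of the monitored code and forces a reset or interrupt when it expires. The `clock` parameter is an assumption added so the timer can be driven by a simulated clock.

```python
import time


class Watchdog:
    """Expires if kick() is not called within timeout_s seconds."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock
        self.last_kick = clock()

    def kick(self):
        """Called periodically by a healthy program to show it is alive."""
        self.last_kick = self.clock()

    def expired(self):
        """True once more than timeout_s has passed since the last kick."""
        return self.clock() - self.last_kick > self.timeout_s
```

A supervisor would poll `expired()` and restart or report the program once it returns true. The same pattern covers the periodic "hello" exchanges: each received message is a kick.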