A survey of dependability patterns
Ingrid Buckley and Eduardo B. Fernandez
Dept. of Computer Science and Engineering
Florida Atlantic University, Boca Raton, FL, USA
January 18, 2007
Secure Systems Research Group - FAU

Introduction
Dependability is the property of a system that allows one to rely on its service.
Dependability is of the utmost importance for critical systems in business and in critical infrastructures such as hospitals, airports, and a country's electricity grid.
Dependability comprises several aspects:
• Fault Tolerance
• Safety
• Availability
• Reliability

Introduction cont’d
• Fault Tolerance, as it relates to systems, software, and hardware, is the ability to remain operable in the presence of faults.
• Safety is the prevention of catastrophic effects on the environment or the users of the system.
• Availability is the ability of a system to perform its functions when needed.
• Reliability measures the success with which the system conforms to its specification.
• We use the Unified Modeling Language (UML) to represent fault tolerance patterns.

Objectives
• Classify software and hardware fault tolerance patterns according to their objectives.
• Analyze and evaluate the classified fault tolerance patterns.
• Determine how to improve upon existing patterns.
• Design new fault tolerance patterns for unsupported areas within critical systems.

Background
• A pattern is an encapsulated solution to a recurrent problem in a given context that can be tailored to fit different situations.
• A fault is a defect in the state of a component or in the design of a system. An error is the manifestation of a fault: a defective value in the state of the system.
• A system failure occurs when there is a deviation from the system’s specification. A failure is the manifestation of an error.
• The System Development Life Cycle (SDLC) is the entire process of formal, logical steps taken to develop software.

Fault Tolerance
• A system that can mask the effects of a fault and continue operating correctly is said to be fault tolerant.
• Fault tolerance requires redundancy and diversity, which are directly linked to reliability and support the availability of a system.
• Diversity here means having different versions of a function or system that all provide the same functionality.
• Integrating hardware and software fault tolerance to cope with the various kinds of faults that can appear in a software system is a good foundation for achieving a fault-tolerant system.
• Several fault tolerance patterns have already been written, and they support different levels of the system architecture. Our aim is to focus on hardware and software fault tolerance patterns.
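As an illustration of masking through redundancy and diversity, the following minimal sketch runs several independently written versions of the same function and returns the value that a strict majority agrees on. Majority voting is only one possible way to combine diverse versions and is an assumption here, not something the slides prescribe; the function names (`majority_vote`, `square_by_multiplication`, and so on) are hypothetical.

```python
from collections import Counter
from typing import Callable, Sequence


def majority_vote(versions: Sequence[Callable[[int], int]], x: int) -> int:
    """Run every version on x and return the result a strict majority agrees on."""
    results = []
    for version in versions:
        try:
            results.append(version(x))
        except Exception:
            # A crashing version simply contributes no vote.
            continue
    if not results:
        raise RuntimeError("all versions failed")
    value, count = Counter(results).most_common(1)[0]
    # Require more than half of ALL versions to agree, not just the survivors.
    if count * 2 <= len(versions):
        raise RuntimeError("no majority among versions")
    return value


# Three hypothetical, independently written versions of "square a number".
def square_by_multiplication(x: int) -> int:
    return x * x


def square_by_power(x: int) -> int:
    return x ** 2


def square_with_fault(x: int) -> int:
    return x * x + 1  # deliberately faulty version


VERSIONS = [square_by_multiplication, square_by_power, square_with_fault]
masked_result = majority_vote(VERSIONS, 7)
```

With one faulty version out of three, the fault is masked and the caller still receives the correct square. If no majority exists, the sketch raises an error rather than guessing.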

Fault Tolerance Cont’d
• Fault tolerance patterns are a fairly new area in association with critical systems; the need for them has grown with the need to secure systems against failures caused accidentally or intentionally by attackers.
• Due to the diversity of attacks on different types of systems, it is highly important to have effective fault tolerance techniques to mitigate faults that may lead to a failure in a critical system.
• To prevent failures, the following are required:
– Detection: detecting the occurrence of errors.
– Diagnosis: locating the unit or component where the error has occurred.
– Masking: masking errors so as to prevent malfunctioning of the system if a fault occurs.
– Containment: confining or delimiting the effects of the error.
– Recovery: reconfiguring the system to remove the faulty unit and erase the effects of the error.

Hardware Fault Tolerant Patterns
Hardware fault tolerance applies hardware replication to enhance system availability and reliability in the presence of hardware faults.
• Hardware fault tolerance patterns:
– Watchdog: The Watchdog pattern primarily provides protection against time-based faults by raising an alarm whenever liveness messages are not received within a given time frame.
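The Watchdog idea can be sketched in a few lines. This is a minimal, software-only illustration under stated assumptions: the class name `Watchdog` and its methods are hypothetical, and timestamps are passed in explicitly so the example is deterministic (a real monitor would more likely read a clock such as `time.monotonic()` and run in dedicated hardware or a separate task).

```python
class Watchdog:
    """Signals an alarm when no liveness message ("stroke") arrives in time."""

    def __init__(self, timeout: float, start: float = 0.0) -> None:
        self.timeout = timeout
        self.last_stroke = start

    def stroke(self, now: float) -> None:
        """Record a liveness message received at time `now`."""
        self.last_stroke = now

    def alarm(self, now: float) -> bool:
        """Return True if more than `timeout` has passed since the last stroke."""
        return now - self.last_stroke > self.timeout


dog = Watchdog(timeout=2.0)
dog.stroke(now=1.0)
on_time = dog.alarm(now=2.5)  # 1.5 time units since the last stroke
late = dog.alarm(now=3.5)     # 2.5 time units since the last stroke
```

Here `on_time` is `False` and `late` is `True`: the alarm fires only once the monitored component has stayed silent for longer than the configured time frame.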

Hardware Fault Tolerant Patterns Cont’d
– Fail Stop Processor: The Fail-Stop Processor pattern mainly aims at transforming errors that would lead to Byzantine (complex) failures into simple halting failures. It is based on redundancy: the outputs of all replicas are compared to reach an agreement.
– Acknowledgement: The Acknowledgement pattern detects crash failures and is based on acknowledging the reception of input within a given time interval.
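A minimal sketch of the Fail-Stop Processor idea follows, with software functions standing in for hardware processors. It assumes agreement means that all replica outputs are identical; the names `FailStopProcessor`, `FailStop`, `healthy`, and `corrupted` are hypothetical. On any disagreement the processor halts and refuses further work instead of emitting a possibly wrong value.

```python
from typing import Callable, Sequence


class FailStop(Exception):
    """Raised when the processor halts instead of producing an answer."""


class FailStopProcessor:
    def __init__(self, replicas: Sequence[Callable[[int], int]]) -> None:
        if len(replicas) < 2:
            raise ValueError("at least two replicas are needed for comparison")
        self.replicas = list(replicas)
        self.stopped = False

    def process(self, x: int) -> int:
        if self.stopped:
            raise FailStop("processor has stopped")
        outputs = [replica(x) for replica in self.replicas]
        # Any mismatch between replicas turns into a clean halt.
        if any(out != outputs[0] for out in outputs):
            self.stopped = True
            raise FailStop(f"replica outputs disagree: {outputs}")
        return outputs[0]


def healthy(x: int) -> int:
    return x + 1


def corrupted(x: int) -> int:
    # Hypothetical replica that misbehaves for large inputs.
    return x + 1 if x < 10 else -1


fsp = FailStopProcessor([healthy, healthy, corrupted])
first = fsp.process(1)  # all replicas agree
try:
    fsp.process(10)     # one replica disagrees
    halted = False
except FailStop:
    halted = True
```

Note that the sketch only detects the error and stops; it does not mask it. Tolerating the fault would require combining it with a recovery or masking mechanism.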

Software Fault Tolerant Patterns
• Software fault tolerance applies software redundancy by means of design diversity to tolerate software faults that can be introduced in the design, programming, or maintenance phases of the software development cycle.
• Software fault tolerance patterns:
– Roll Forward: The Roll Forward pattern is a failure recovery pattern that detects and recovers from a fault by monitoring two replicas for errors.
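The following is a rough sketch of the Roll Forward idea under simplifying assumptions: a replica's error is modeled as an exception raised before its state changes, state is a small dictionary, and the class names `Replica` and `RollForward` are hypothetical. After every successfully processed input the active replica copies its new state to the other replica; when the active replica fails, it is discarded and the other one carries on from the last copied state.

```python
import copy
from typing import Optional


class Replica:
    def __init__(self, name: str) -> None:
        self.name = name
        self.state = {"total": 0}
        self.failed = False  # set to True to inject a fault

    def process(self, value: int) -> None:
        if self.failed:
            raise RuntimeError(f"replica {self.name} has failed")
        self.state["total"] += value


class RollForward:
    def __init__(self) -> None:
        self.active = Replica("A")
        self.standby: Optional[Replica] = Replica("B")

    def handle(self, value: int) -> int:
        try:
            self.active.process(value)
        except RuntimeError:
            if self.standby is None:
                raise  # no unaffected replica is left
            # Discard the failed replica; the other continues with this input.
            self.active, self.standby = self.standby, None
            self.active.process(value)
        if self.standby is not None:
            # Cost paid on every input when no error occurs: copy the new state.
            self.standby.state = copy.deepcopy(self.active.state)
        return self.active.state["total"]


pair = RollForward()
before = [pair.handle(5), pair.handle(3)]
pair.active.failed = True  # inject a fault into the active replica
after = pair.handle(2)     # replica B takes over from total == 8
```

In this run `before` is `[5, 8]` and `after` is `10`: no input is lost, because the surviving replica already holds the state produced by the previous input.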

Software Fault Tolerant Patterns Cont’d
– Input Guard: The Input Guard pattern stops erroneous input from propagating an error inside a component. A guard is placed at every access point of the component to check the validity of the input.
– Fault Container: The Fault Container pattern provides the same benefits as the combination of the Input Guard and Output Guard patterns, because it prevents an error from being propagated into or out of a given component.
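A minimal sketch of a Fault Container, built as an input guard plus an output guard around a component, might look like the following. The component (`percent_to_fraction`), its specification (input in 0..100, output in 0.0..1.0), and all class and function names are hypothetical and chosen only for illustration.

```python
from typing import Callable


class GuardViolation(Exception):
    """Raised when a value does not conform to the component's specification."""


class FaultContainer:
    def __init__(
        self,
        component: Callable[[float], float],
        input_ok: Callable[[float], bool],
        output_ok: Callable[[float], bool],
    ) -> None:
        self.component = component
        self.input_ok = input_ok
        self.output_ok = output_ok

    def call(self, x: float) -> float:
        if not self.input_ok(x):    # input guard: keep errors out
            raise GuardViolation(f"input {x!r} rejected")
        y = self.component(x)
        if not self.output_ok(y):   # output guard: keep errors in
            raise GuardViolation(f"output {y!r} rejected")
        return y


def is_percent(x: float) -> bool:
    return 0 <= x <= 100


def is_fraction(y: float) -> bool:
    return 0.0 <= y <= 1.0


def percent_to_fraction(percent: float) -> float:
    return percent / 100


def faulty_percent_to_fraction(percent: float) -> float:
    return percent  # bug: forgets to divide by 100


good = FaultContainer(percent_to_fraction, is_percent, is_fraction)
leaky = FaultContainer(faulty_percent_to_fraction, is_percent, is_fraction)
quarter = good.call(25)
```

`good.call(150)` is rejected by the input guard and `leaky.call(50)` by the output guard. However, `leaky.call(1)` returns `1`, a wrong value that still conforms to the specification, so guards of this kind cannot stop every error.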

Hardware/Software Fault Tolerance Pattern
• The Software Redundancy Pattern deals with hardware, software, and environmental faults at the same time.

[Figure: Patterns diagram for the fault tolerance domain]

Analysis of Patterns
Pattern: Watchdog
Advantage:
• Can be used to improve deadlock detection, where strokes can be keyed or contain data to identify strokes from different computational steps.
Disadvantage:
• Does not actually check that the internal computation is correct.
Pattern: Acknowledgement
Advantage:
• The design complexity introduced by the pattern is very low.
• Does not introduce any space overhead.
Disadvantage:
• The error on the monitored system is detected only after some input has been issued to it.
• The timeout must be set based on the time it takes for the input to reach the monitored system plus the time it takes for the acknowledgement to reach the monitoring system.
Pattern: Fail Stop Processor
Advantage:
• Introduces low time overhead, since the processors function in parallel.
• The processors are replicas of the original system on which the Fail-Stop Processor pattern is applied, without any additional functionality. In practice this means the processors can be replicas of a legacy system, which cannot be subject to internal changes such as those that would be needed if the processors required additional functionality.
Disadvantage:
• Does not provide means to tolerate faults in a system; rather, it provides means to detect errors.
• Introduces a relatively elevated space overhead that is proportional to the number of simultaneous errors it can deal with.
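The timeout rule stated for the Acknowledgement pattern can be written out as a small calculation. This is a sketch under assumptions: the function names are hypothetical, the `margin` parameter is an added safety allowance that the slides do not mention, and `crash_suspected` is meant to be evaluated once the timeout has elapsed, with `ack_at` set to `None` if no acknowledgement has arrived.

```python
from typing import Optional


def ack_timeout(input_delay: float, ack_delay: float, margin: float = 0.0) -> float:
    """Time for the input to arrive plus time for the acknowledgement to return."""
    return input_delay + ack_delay + margin


def crash_suspected(sent_at: float, ack_at: Optional[float], timeout: float) -> bool:
    """True if no acknowledgement arrived within `timeout` of sending the input."""
    return ack_at is None or ack_at - sent_at > timeout


timeout = ack_timeout(input_delay=0.5, ack_delay=0.5)
in_time = crash_suspected(sent_at=10.0, ack_at=10.75, timeout=timeout)
lost = crash_suspected(sent_at=10.0, ack_at=None, timeout=timeout)
too_late = crash_suspected(sent_at=10.0, ack_at=11.5, timeout=timeout)
```

A timeout shorter than the round trip would flag healthy systems as crashed, which is why both transit times enter the sum. The check can also only run after an input has been sent, matching the first disadvantage listed for this pattern.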

Analysis of Patterns Cont’d
Pattern: Roll Forward
Advantage:
• The time overhead imposed by this pattern is low when errors occur: the failed replica is discarded, and the unaffected replica processes the subsequent inputs.
Disadvantage:
• The time overhead imposed by this pattern in the absence of errors is high: before the replica is able to receive and process new input, it must copy its new state to the other replica.
Pattern: Input Guard
Advantage:
• It stops the contamination of the guarded component by erroneous input that does not conform to the specification of the guarded component.
• The Input Guard pattern can be implemented in various ways, each providing different benefits with respect to the time or space overhead introduced by the guard.
Disadvantage:
• Cannot prevent the propagation of errors that do conform to the specification of the guarded component.
• Has significant time and space overhead.
Pattern: Fault Container
Advantage:
• It stops errors, expressed as input and output content or timing that does not conform to a component specification, from entering or exiting that component.
• The undefined behavior of the container in the presence of errors allows its combination with error detection and error masking patterns.
Disadvantage:
• The Fault Container pattern cannot prevent the propagation of errors that do conform to the specification of the contained component.
• Unless combined with some error detection and system recovery mechanisms, this pattern will result in send- or receive-omission failures (i.e., failure to send output or receive input of the contained component).

Conclusion
• Our analysis shows a need to improve upon current fault tolerance patterns.
• New fault tolerance patterns are necessary to provide dependability in distributed systems, because many of the existing fault tolerance patterns are very similar and do not provide comprehensive support for errors that can lead to failure.

Future Work
• Safety, availability, and reliability patterns are being researched.
• Defining areas of need where current fault tolerance patterns are lacking or require improvement.
• Designing new fault tolerance patterns.

Recommendations and Questions
Feedback: