EDCC-8, the 8th European Dependable Computing Conference
28 April 2010, Valencia, Spain

Emulation of Transient Software Faults for Dependability Assessment: A Case Study

Roberto Natella, Domenico Cotroneo
{roberto.natella, cotroneo}@unina.it
www.mobilab.unina.it

The MobiLab Group
Dipartimento di Informatica e Sistemistica, Università degli Studi di Napoli Federico II
Via Claudio 21, 80125 Napoli, Italy

::. Outline
❖ Context and problem statement
    ▪ Software Fault Injection
    ▪ Bohrbugs and Mandelbugs
❖ Case study from the ATC domain
❖ Evaluation of state-of-the-art fault injection
❖ An experiment involving concurrency faults
❖ Conclusions

::. Rationale (1/2)
◆ Software faults are an important cause of system failures
◆ Despite efforts on verification, fault avoidance, and fault removal, software systems are often delivered with residual software faults
◆ Critical systems adopt Fault Tolerance Mechanisms (FTMs) to avoid failures at run-time

::. Rationale (2/2)
◆ FTMs: a few examples
    ▪ Spatial redundancy
        • CORBA FT, TANDEM 90 Process Pairs
    ▪ Temporal redundancy
        • Checkpointing and rollback
◆ Software Fault Injection (SFI) is a valuable approach for verifying and improving FTMs
◆ To emulate software faults correctly, we need to understand their features

::. Software Faults
◆ Bohrbugs
    ▪ Faults whose activation is reproducible, i.e., whose activation pattern is straightforward to identify
    ▪ Typically detected and then fixed during the testing phase
◆ Mandelbugs
    ▪ Faults whose activation is transient and not systematically reproducible
    ▪ Their activation conditions depend on complex combinations of user inputs, the internal state, and the external environment

::. Problem statement
◆ Mandelbugs are the major cause of failures in mission-critical systems
    ▪ Up to 82% in well-tested software [2] [5] [6]
◆ Mandelbugs are typically tolerated by adopting several redundancy schemes
◆ Are existing SFI techniques able to emulate Mandelbugs adequately?
◆ How should Mandelbugs be emulated?

::. Software Fault Injection (SFI)
◆ To date, the representativeness of injected faults has not been investigated with respect to:
    ▪ Fault manifestation
    ▪ Their effectiveness in testing FTMs (i.e., whether they emulate the faults that occur most often and that FTMs should tolerate)

::. Contributions
◆ We investigate this issue with a simple experimental campaign, but on a complex, real-world software system
    ▪ We evaluated G-SWFIT with respect to Mandelbugs
    ▪ We compared the results with an experiment specifically designed to emulate Mandelbugs
◆ Case study: a fault-tolerant system from the Air Traffic Control (ATC) domain
    ▪ It is a Flight Data Processor (FDPS) based on a CORBA-compliant middleware

::. Case study (1/2)
[Figure not preserved]

::. Case study (2/2)
◆ We modeled the FDPS as a finite state machine (FSM) to support the analysis of faults
◆ A state consists of the following internal variables:
    1) The number of FDP requests queued by the Façade
    2) The number of requests under processing
    3) The number of requests queued by Processing Servers (PSs)
◆ CR, FR, PSC, … are the messages exchanged in the FDPS
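The three-variable state can be sketched as a small value type. This is only an illustration, not the authors' model: the class name `FdpsState`, its field names, and the `a:b:c` label format (chosen to match state labels such as 0:0:0 used in the slides) are assumptions.

```python
from typing import NamedTuple


class FdpsState(NamedTuple):
    """Hypothetical FSM state holding the three counters listed on the slide."""

    queued_by_facade: int  # FDP requests queued by the Facade
    in_processing: int     # requests under processing
    queued_by_ps: int      # requests queued by Processing Servers (PSs)

    def label(self) -> str:
        # Render the state in the colon-separated form used for state names.
        return f"{self.queued_by_facade}:{self.in_processing}:{self.queued_by_ps}"


initial = FdpsState(0, 0, 0)
busy = FdpsState(1, 3, 1)
```

Keeping the state immutable and hashable makes it easy to use as a dictionary key or set member when counting which states an experiment has visited.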

::. Experimental campaign using G-SWFIT (1/4)
◆ We implemented G-SWFIT fault operators in an open-source fault injection tool
◆ The tool analyzes a C/C++ source code file to produce a set of faulty source files
    ▪ Freely available at: http://www.mobilab.unina.it/SFI.htm
[Figure: tool pipeline. C/C++ source files → C preprocessor → C/C++ frontend → abstract syntax tree → fault injector → patch files (with faults). Example input: int main() { if( a && b ) { c++; } }]
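The idea of applying a fault operator to an abstract syntax tree can be sketched as follows. This is not the authors' tool, which works on C/C++ sources: Python's standard `ast` module is swapped in as a stand-in frontend, and the transformer is modelled on an operator of the "missing AND clause in a branch condition" kind. The names `DropAndClause` and `inject_fault` are hypothetical, and `ast.unparse` requires Python 3.9 or later.

```python
import ast


class DropAndClause(ast.NodeTransformer):
    """Remove the last operand of one `and` condition used in an `if`."""

    def __init__(self):
        self.injected = False  # inject a single fault per faulty file

    def visit_If(self, node):
        self.generic_visit(node)  # nested statements are visited first
        test = node.test
        is_and = isinstance(test, ast.BoolOp) and isinstance(test.op, ast.And)
        if not self.injected and is_and:
            remaining = test.values[:-1]
            if len(remaining) == 1:
                node.test = remaining[0]
            else:
                node.test = ast.BoolOp(op=ast.And(), values=remaining)
            self.injected = True
        return node


def inject_fault(source: str) -> str:
    """Return a faulty variant of `source` with one AND clause omitted."""
    tree = DropAndClause().visit(ast.parse(source))
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)


ORIGINAL = "if a and b:\n    c += 1\n"
FAULTY = inject_fault(ORIGINAL)
```

Here `ORIGINAL` mirrors the slide's `if( a && b ) { c++; }` fragment, and `FAULTY` is the same statement with the `b` clause omitted. A real tool would emit one patch file per injectable location instead of stopping at the first match.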

::. Experimental campaign using G-SWFIT (2/4)
◆ 533 faults were injected into the Façade source code
◆ 1599 experiments (3 different workloads)
◆ For each experiment, we collected:
    • Information about a failure (e.g., Façade crash, switch to the backup, missed FDP requests)
    • The state in which a failure occurred
    • The state in which the fault was activated
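A sketch of how such a campaign could be laid out, assuming one run per (fault, workload) pair. Only the counts (533 faults, 3 workloads, 1599 experiments) come from the slide; the `Experiment` record and its field names are hypothetical.

```python
from dataclasses import dataclass
from itertools import product
from typing import Optional


@dataclass
class Experiment:
    """Hypothetical record of the data collected for one run."""

    fault_id: int
    workload: int
    failure: Optional[str] = None           # e.g. "Facade crash"; None if no failure
    failure_state: Optional[str] = None     # FSM state in which the failure occurred
    activation_state: Optional[str] = None  # FSM state in which the fault was activated


N_FAULTS = 533
N_WORKLOADS = 3

# One experiment per injected fault and workload.
campaign = [
    Experiment(fault_id, workload)
    for fault_id, workload in product(range(N_FAULTS), range(N_WORKLOADS))
]
```

The list length is 533 × 3 = 1599, which matches the number of experiments reported above.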

::. Experimental campaign using G-SWFIT (3/4)
◆ G-SWFIT is useful to test important system states (e.g., the checkpointing mechanism)
◆ However, the injected faults did not emulate Mandelbugs well because:
    ▪ A large share of faults (56%) manifest during Façade initialization or during the first request (state 0:0:0), whereas Mandelbugs usually manifest during the operational phase of a system
[Figure legend: faults activated, failures]

::. Experimental campaign using G-SWFIT (4/4)
◆ However, the injected faults did not emulate Mandelbugs well because (continued):
    ▪ In most cases (93%) in which the backup Façade is activated, the backup also fails (i.e., fault activation is simple to reproduce, as with Bohrbugs)
    ▪ Some important states (potentially affected by Mandelbugs) are untested (e.g., when one or more requests are queued by the PSs)
    ▪ State coverage: 65%

::. Concurrency fault emulation (1/3)
◆ To emulate Mandelbugs, we analyzed the scientific literature on software faults
◆ We identified the following fault triggers:
    ▪ Concurrency
    ▪ Timing of external events
    ▪ Wrong memory state
    ▪ Faulty error handlers
    ▪ Complex input sequences
    ▪ Software aging

::. Concurrency fault emulation (2/3)
◆ Features of the most frequent concurrency faults (from a field data study [29]):
    ▪ They are atomicity-violation faults (49%)
    ▪ Only 1 shared variable is involved (66%)
    ▪ At most 2 threads are needed to trigger the fault (90%)
◆ Our fault model:
    ▪ 2 threads access a shared variable without acquiring a lock (race condition)

::. Concurrency fault emulation (3/3)
◆ We propose a fault emulation technique in two phases:
◆ Fault injection:
    ▪ Collects information about critical regions and their memory accesses
    ▪ Removes lock operations before and after a pair of conflicting critical regions
◆ Trigger injection:
    ▪ Submits an input sequence to drive the system to a target state
    ▪ Schedules 2 threads such that their memory accesses interfere with each other
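The two phases can be illustrated with a minimal sketch in which two threads increment a shared counter. It is not the authors' implementation: the function name `run_injected_race`, the counter, and the use of `threading.Event` objects to force the interleaving are assumptions made for the example. The "fault" is that the read-modify-write is left unprotected (as if the lock operations had been removed), and the "trigger" is the forced schedule.

```python
import threading


def run_injected_race() -> int:
    """Run two unlocked increments under a forced interleaving."""
    shared = {"counter": 0}
    t1_has_read = threading.Event()
    t2_has_written = threading.Event()

    def thread_1():
        # Fault injection: no lock is held around this read-modify-write.
        local = shared["counter"]         # thread 1 reads the shared variable
        t1_has_read.set()
        t2_has_written.wait()             # trigger: hold thread 1 between read and write
        shared["counter"] = local + 1     # write based on a stale value

    def thread_2():
        t1_has_read.wait()                # trigger: run only after thread 1 has read
        shared["counter"] = shared["counter"] + 1
        t2_has_written.set()

    workers = [threading.Thread(target=thread_1), threading.Thread(target=thread_2)]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return shared["counter"]
```

Two increments are performed, yet the function returns 1: thread 2's update is lost, which is an atomicity violation on a single shared variable involving two threads. Without the forced schedule the same fault would only show up occasionally, which is what makes it Mandelbug-like.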

::. Preliminary system characterization
◆ To focus on fault triggering, we have to profile the system in order to recognize (and then to drive) the operating state
◆ An input (sent by the tester) is associated with:
    ▪ A sequence of messages sent and received by the Façade
    ▪ A sequence of lock and memory accesses
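One way to picture such a profile is a table from each input to the accesses it causes, from which conflicting inputs can be matched. The sketch below is an assumption throughout: the `Access` record, the variable and lock names, and the specific accesses attributed to CR, CRQ, and FR are invented for illustration and are not taken from the FDPS.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Access:
    variable: str  # shared variable touched in a critical region
    kind: str      # "read" or "write"
    lock: str      # lock protecting the critical region


# Hypothetical profile: input name -> accesses observed while processing it.
PROFILE = {
    "CR": [Access("request_table", "write", "table_lock")],
    "CRQ": [Access("request_table", "read", "table_lock")],
    "FR": [Access("log_buffer", "write", "log_lock")],
}


def conflicting_inputs(profile):
    """Return (input_a, input_b, variable) for accesses that can interfere.

    Two accesses conflict when they touch the same variable and at least
    one of them is a write.
    """
    conflicts = []
    for (name_a, acc_a), (name_b, acc_b) in combinations(sorted(profile.items()), 2):
        for x in acc_a:
            for y in acc_b:
                if x.variable == y.variable and "write" in (x.kind, y.kind):
                    conflicts.append((name_a, name_b, x.variable))
    return conflicts
```

With this made-up profile, only the CR and CRQ inputs conflict, so their critical regions would be the candidate pair for lock removal.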

::. How to trigger a fault?
◆ An algorithm identifies (i) which inputs to send and (ii) in which state to send them in order to trigger a fault
◆ The algorithm exploits the preliminary information to match messages with shared memory accesses
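The slides do not give the algorithm itself. As a rough sketch of the "drive the system to a target state" part only, a breadth-first search over an FSM transition table finds a shortest input sequence. The transition table below, including the effect attributed to the CR and FR messages, is hypothetical, as is the function name `inputs_to_reach`.

```python
from collections import deque

# Hypothetical transition table: (state, input) -> next state.
TRANSITIONS = {
    ("0:0:0", "CR"): "0:1:0",
    ("0:1:0", "CR"): "0:2:0",
    ("0:1:0", "FR"): "0:0:0",
    ("0:2:0", "CR"): "0:2:1",
    ("0:2:0", "FR"): "0:1:0",
}


def inputs_to_reach(start, target, transitions):
    """Return a shortest list of inputs leading from start to target, or None."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        state, path = queue.popleft()
        if state == target:
            return path
        for (source, message), destination in transitions.items():
            if source == state and destination not in seen:
                seen.add(destination)
                queue.append((destination, path + [message]))
    return None  # target not reachable with the known transitions


plan = inputs_to_reach("0:0:0", "0:2:1", TRANSITIONS)
```

Once the target state is reached, the pair of conflicting inputs would be submitted there so that their memory accesses can be made to interfere.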

::. An example of concurrency fault (1/2)
[Animated figure; callouts overlapped in extraction. Recoverable steps: an algorithm processes the FSM to find when to send inputs (next slide); a lock operation is omitted; the CR and CRQ inputs are sent; one thread is blocked; a thread writes a (faulty) value; a thread reads an inconsistent value]

::. Experimental campaign using concurrency faults
◆ 4 injected concurrency faults led to a failure of the primary Façade but not of the backup
◆ We covered 13 out of 14 states (93%) in which faults were injected
◆ Cumulative state coverage: 95%
    ▪ In particular, states *:3:1 were tested (i.e., one or more requests queued by PSs)
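The 93% figure is 13 of 14 rounded to the nearest percent, as the short check below shows. The helper name `coverage_percent` is an assumption; the cumulative 95% is not recomputed here because the slide does not give the underlying counts.

```python
def coverage_percent(covered: int, total: int) -> int:
    """State coverage as a percentage rounded to the nearest integer."""
    return round(100 * covered / total)


injected_state_coverage = coverage_percent(13, 14)  # 92.86 rounds to 93
```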

::. Lessons Learned
◆ Are existing SFI techniques able to emulate Mandelbugs adequately?
    ▪ No: G-SWFIT should be complemented by techniques that take Mandelbugs into account
◆ How should Mandelbugs be emulated?
    ▪ Our solution is to identify the most common fault triggers and to emulate them, in addition to modifying the source code

Thank you! Any questions?

Backup slides

::. G-SWFIT fault operators
◆ G-SWFIT fault operators were derived from a field data study [14]
◆ Fault activation and manifestation were neglected due to lack of data