Software Dependability CIS 376 Bruce R. Maxim UM-Dearborn
Dependability • The extent to which a critical system is trusted by its users • Dependability is usually the most important system property of a critical system • A system does not have to be trusted to be useful • Dependability reflects the extent of the user’s confidence that it will not fail in normal operation
Dimensions of Dependability • Availability – ability of the system to deliver services when requested • Reliability – ability of the system to deliver services as specified • Safety – ability of the system to operate without catastrophic failure • Security – ability of the system to defend itself against intrusion
Maintainability • Concerned with the ease of repairing a system after failure • Many critical system failures are caused by faults introduced during maintenance • Maintainability is the only static dimension of dependability, the other 3 are dynamic
Survivability • Ability of a system to deliver services after a deliberate or accidental attack • This is very important for distributed systems whose security can be compromised • Resilience – ability of system to continue operation despite component failures
Dependability Costs • Tend to increase exponentially as increasing levels of dependability are required • More expensive development techniques and hardware required to achieve higher levels of reliability • Increased testing and validation are required to convince users that higher levels of dependability have been achieved
Dependability and Performance • Untrustworthy systems are rejected by users • System failure costs may be high • It is hard to make existing systems more dependable • It may be possible to compensate for poor performance • Untrustworthy systems may lead to information loss
Dependability Economics • Sometimes it is more cost effective to pay for failures than to try to improve dependability • However, a reputation for products that can’t be trusted can lead to loss of business • The required level of trustworthiness depends on the type of system being developed
Availability and Reliability • Availability – probability that a system, at a given point in time, will be operational and able to deliver the requested services • Reliability – probability of failure-free operation over a specified time period in a given environment for a given purpose
Comparing Availability and Reliability • If a system is not available when it is needed it is unreliable • It is possible to have systems with low reliability and high availability (if failures can be repaired quickly and do not damage data) • Availability must take repair time into account
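The slide above notes that availability must take repair time into account; a minimal sketch of one common way to quantify this (the MTTF/MTTR figures are invented for illustration):

```python
# Sketch: steady-state availability from mean time to failure (MTTF)
# and mean time to repair (MTTR). All figures are illustrative only.

def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Fraction of time the system is up and able to deliver service."""
    return mttf_hours / (mttf_hours + mttr_hours)

# A system that fails often but is repaired within minutes can still be
# highly available, even though its reliability (failure-free operation
# over a period) is low.
print(availability(mttf_hours=20.0, mttr_hours=0.05))    # ~0.998
print(availability(mttf_hours=1000.0, mttr_hours=48.0))  # ~0.954
```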
Faults and Failures • Failures are usually the result of system errors that derive from system faults • Faults do not always result in system errors – a transient faulty state may be corrected before an error occurs • Errors do not always lead to system failures – an error can be corrected by built-in error detection and recovery procedures – a failure can be prevented by protecting system resources from damage
User’s Reliability Perceptions • The formal definition of reliability may not reflect the user’s perception of reliability – the user’s environment may not match the developer’s assumptions about the application environment • The consequences of failure affect the user’s perception of reliability – failures with serious consequences are given more weight by users than failures that are merely inconvenient
Reliability Achievement • Fault Avoidance – development techniques that minimize the possibility of mistakes or reduce the consequences of errors • Fault Detection and Removal – verification and validation techniques that increase the probability of detecting and correcting errors before deployment • Fault Tolerance – run-time techniques used to ensure that system faults do not result in system errors and that system errors do not result in system failures
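A minimal sketch of the fault-tolerance idea above, in the style of a recovery block (all routine names and the acceptance test are invented for illustration): run a primary routine, apply an acceptance test to detect an erroneous result, and fall back to a simpler trusted alternate so the error never becomes a visible failure.

```python
# Sketch of recovery-block style fault tolerance; the routines and the
# acceptance test are illustrative stand-ins, not a real API.

def acceptance_test(inp, out):
    # Partial run-time check: same length and non-decreasing order.
    return len(out) == len(inp) and all(a <= b for a, b in zip(out, out[1:]))

def primary_sort(xs):
    return sorted(xs)                # imagine a fast, possibly faulty routine

def alternate_sort(xs):
    out = list(xs)                   # simple insertion sort as a trusted fallback
    for i in range(1, len(out)):
        j = i
        while j > 0 and out[j - 1] > out[j]:
            out[j - 1], out[j] = out[j], out[j - 1]
            j -= 1
    return out

def fault_tolerant_sort(xs):
    out = primary_sort(xs)
    return out if acceptance_test(xs, out) else alternate_sort(xs)

print(fault_tolerant_sort([3, 1, 2]))   # [1, 2, 3]
```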
Reliability Modeling • You can model a system as an input-output mapping where some inputs lead to erroneous outputs • The reliability of the system is the probability that a particular input lies in the set of inputs which cause erroneous outputs • This probability is not static and depends on the system’s environment
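A minimal sketch of that input-output model (the operational profile and the failure-causing input set are invented for illustration): reliability is estimated as the probability that an input drawn from the usage profile avoids the failure-causing set.

```python
# Sketch of the input-output reliability model; the usage profile and
# the failure-causing inputs are made up for illustration.
import random

FAILURE_INPUTS = {150, 175}              # inputs that produce erroneous outputs

def operational_profile():
    """Random input drawn from an assumed usage distribution."""
    if random.random() < 0.9:
        return random.randrange(0, 20)   # common inputs
    return random.randrange(20, 200)     # rare inputs

def estimate_reliability(trials=100_000):
    failures = sum(operational_profile() in FAILURE_INPUTS for _ in range(trials))
    return 1.0 - failures / trials

# Faults that lie in rarely supplied inputs barely affect reliability as
# perceived under this usage profile (compare the next slide).
print(estimate_reliability())            # close to 1.0
```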
Improving Reliability • Removing X% of the system faults does not always improve system reliability – remember the 90/10 rule • Program defects may lie in code rarely executed by the user, so removing them will do little to improve perceived reliability • A program with known faults may still be perceived by its users as reliable
Safety • System property that reflects the system’s ability to operate (normally or abnormally) without danger to people or to the system’s environment • As more devices become software controlled, safety becomes a greater concern • Safety requirements are exclusive (they exclude undesirable situations rather than specify required system services)
Safety Criticality • Primary safety-critical systems – embedded software systems whose failure can cause the associated hardware to fail and directly threaten people • Secondary safety-critical systems – systems whose faults can cause other systems to fail, which in turn can threaten people
Safety and Reliability • They are related, but not identical • Reliability – concerned with conformance to a specification and delivery of a service • Safety – concerned with ensuring the system cannot cause damage, regardless of its conformance (or nonconformance) to its specification
Unsafe Reliable System • Specification errors – if the specification is incorrect, conformance to it can still cause damage • Hardware failures generating spurious outputs – hard to anticipate in the specification • Context-sensitive commands – e.g. issuing the right command at the wrong time – often caused by operator error
Safety Achievement • Hazard Avoidance – system designed so that some classes of hazard cannot arise • Hazard Detection and Removal – system designed so that hazards are detected and removed before they result in an accident • Damage Limitation – system includes protection features that minimize the damage that may result from an accident
Accidents • Rarely have a single cause in a complex system (e.g. the credit assignment problem) • Most accidents are the result of combinations of malfunctions • Anticipating all combinations of malfunctions may not be possible in a software-controlled system, so complete safety may be impossible
Security • Reflects a system’s ability to protect itself from attack • Security is increasingly important when systems are networked to each other • Security is an essential pre-requisite for availability, reliability, and safety
Fundamental Security • If a system is networked and insecure, then statements about its reliability and safety are unreliable • An intrusion (attack) can change the system’s operating environment or data and invalidate the assumptions upon which reliability and safety arguments are made
Insecurity Damage • Denial of Service – system forced into state where providing service is impossible or significantly degraded • Corruption of Programs or Data – modifications made by unauthorized user • Disclosure of Confidential Information – information managed by system is exposed to people who are not authorized users
Security Assurance • Vulnerability Avoidance – system designed so that vulnerabilities cannot occur – e.g. no network connection • Attack Detection and Elimination – system designed so that attacks on vulnerabilities are detected and neutralized before they cause damage – e.g. use of anti-virus software • Exposure Limitation – system designed so that damage from an attack is minimized – e.g. a backup policy that allows restoration of damaged files