Safety, Reliability, and Robust Design in Embedded Systems

Risk analysis: managing uncertainty

GOAL: be prepared for whatever happens

Risk analysis should be done for ALL PHASES of a project:
---planning phase
---development phase
---the product itself

Identify risks:
---What could you have done during the planning stage to manage each of these "risks"?
---How likely is it (what is the probability) that each one will occur?
---How likely is it (what is the probability) that more than one will occur?
---What actions will best manage the risk if it occurs?

Risk management: identify and plan for risks. During planning, a Risk Table can be generated:

Risk                                     | Type* | Probability | Impact | Plan (Pointer)
System not available                     |       |             |        |
Hardware failure                         |       |             |        |
Color printer unavailable                |       |             |        |
Personnel absent (one meeting)           |       |             |        |
Personnel unavailable (several meetings) |       |             |        |
Personnel have left project              |       |             |        |

*Type: Performance (product won't meet requirements); Cost (budget overruns); Support (project can't be maintained as planned); Schedule (project will fall behind)
Probability: likelihood of this risk occurring
Impact: e.g., catastrophic, critical, marginal, negligible

Risk management: identify and plan for risks (continued). The table is then sorted by probability and impact, and a "cutoff line" is defined. Everything above this line must be managed (with a management plan pointed to in the last column).

Useful reference: Embedded Systems Programming, Nov. 2000
--examples: http://www.embedded.com/2000/0011feat1.htm
Additional interesting reference: H. Petroski, To Engineer is Human: The Role of Failure in Successful Design, Vintage, 1992.

Professional risk analysis is proactive, not reactive.

Important concepts for embedded systems:

Risk = (Probability of failure) * Severity
Increased risk --> decreased safety

Safety failures: possible causes
---incorrect or incomplete specification
---bad design
---improper implementation
---faulty component
---improper use

RELIABILITY: "What is the probability of failure?"
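
Illustrative worked example (hypothetical numbers, not from the slides): if a sensor's probability of failure is estimated at 10^-4 per operating hour and its severity is rated 8 on a 10-point scale, Risk = 10^-4 * 8 = 8 * 10^-4 per hour. Halving either the probability (better component) or the severity (added safeguards) halves the risk, which is why both terms are addressed during design.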

Some ways to determine reliability:
---product performs consistently as expected
---MTBF (mean time between failures) is long
---system behavior is DETERMINISTIC
---system responds or FAILS GRACEFULLY to out-of-bounds or unexpected conditions and recovers if possible
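
Illustrative MTBF arithmetic (hypothetical numbers): if 10 identical units run 1,000 hours each and 2 failures are observed, MTBF = (10 * 1,000 h) / 2 = 5,000 h. With a mean time to repair (MTTR) of 5 h, steady-state availability = MTBF / (MTBF + MTTR) = 5,000 / 5,005, about 99.9%.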

Definitions:

Fault: incorrect or unacceptable state or condition. Fault duration and frequency determine its classification:
---transient: from an unexpected external condition ("soft")
---intermittent: unstable hardware or marginal design; periodic or aperiodic
---permanent: e.g., a failed component ("hard")

Error: static, inherent characteristic of the system
Failure: dynamic, occurs at a specific time

Possible fault consequences:
---inappropriate action
---timing: event occurs too early or too late
---sequence of events incorrect
---quantity: wrong amount of energy or substance used

Achieving reliability:
---safe design
---fault detection
---fault management
---fault tolerance: system recovers, fault is not detected (by the user), e.g., packet transfers (a sketch follows below)

Definition of reliability for an embedded system: the probability that a failure is detected by the user is less than a specified threshold
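
A minimal C sketch of the packet-transfer idea, under my own assumptions (send_packet() is a hypothetical driver call; a stub is included so the sketch compiles): an internal retry masks a transient fault so the user never observes it.

    /* Hedged sketch (not from the slides): a transient fault masked by an
     * internal retry, so the caller -- and ultimately the user -- never
     * observes the failure. */
    #include <stdbool.h>
    #include <stdio.h>

    #define MAX_RETRIES 3

    static int send_packet(const unsigned char *buf, unsigned len)
    {
        (void)buf; (void)len;
        return 0;                       /* stub: pretend the transfer succeeded */
    }

    bool send_reliably(const unsigned char *buf, unsigned len)
    {
        for (int attempt = 0; attempt < MAX_RETRIES; attempt++) {
            if (send_packet(buf, len) == 0)
                return true;            /* transient fault masked by the retry */
        }
        return false;                   /* persistent fault: report upward */
    }

    int main(void)
    {
        unsigned char msg[] = { 0x01, 0x02 };
        printf("sent: %s\n", send_reliably(msg, sizeof msg) ? "yes" : "no");
        return 0;
    }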

Examples (Section 8.5: read these carefully!):
---Ariane 5 rocket: register overflow; a 64-bit value was assigned to a 16-bit register in a reused subsystem
---Mars Pathfinder mission, 1997: lower-priority tasks were allowed to hog resources, so higher-priority tasks could not execute
---2004 Mars mission: file management problems
Many more examples in articles at embedded.com

How do we define safety? One criterion:
---"single point": failure of a single component will not lead to an unsafe condition
---"common-mode failure": failure of multiple components due to a single failure event will not lead to an unsafe condition
Safety must be considered THROUGHOUT the project

Embedded system design: project components (fig_08_00)

Development process ("waterfall model"): Analysis, Design, Implement, Test, Maintain

Alternative process models: risk analysis is needed AT EACH INCREMENT
(A = analysis, D = design, I = implement, T = test, M = maintenance)
---Basic waterfall model: A-->D-->I-->T-->M
---Prototyping: A-->D-->I-->T-->M
---Incremental: A-->D-->I-->T-->M-->A-->D-->I-->T-->...-->M
---Component-based: A-->D-->Library-->Integrate-->T-->M

Specifications:
---Identify hazards
---Calculate risk
---Define safety measures

The specification document should include the safety standards and guidelines with which the system complies, e.g.: Underwriters Laboratory, FCC, FDA, FAA, AEC, NASA, ISO, NHTSA, etc.

Some industry standards / procedures:
---FAA: DO-178B (and the newer DO-178C)
---Medical device industry: ISO 14971
---Nuclear power industry (and others): IEC 61508, "Functional Safety of Electrical/Electronic/Programmable Electronic Safety-related Systems (E/E/PE, or E/E/PES)"

Methods:
--Process and Tool Chain evaluation (this is the main focus of DO-178B)
--Probability-based models
--Formal methods
--Traditional methods for code testing, e.g., basis path testing (sketch below)
--Standard code-checking tools (e.g., avoiding inclusion of redundant code)
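
As a small illustration of basis path testing (my own sketch; the function is hypothetical): cyclomatic complexity V(G) = number of binary decisions + 1 gives the minimum number of independent paths a basis-path test set must exercise.

    /* Hedged sketch: two decisions give V(G) = 2 + 1 = 3, so three test
     * cases (e.g., temp = 120, 70, 20) cover the basis paths. */
    int classify(int temp)
    {
        int code = 0;
        if (temp > 100)         /* decision 1 */
            code = 2;
        else if (temp > 50)     /* decision 2 */
            code = 1;
        return code;
    }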

Design and review process: steps fig_08_01

Coding:

Trade-off: traditional efficiency (speed/space) vs. better reliability. Some examples:

1. Array declarations: const may not be required but is preferred, e.g.:

       const int size = 5;
       int myarray[size];

2. Make sure initialization is explicit; do not depend on the compiler, e.g.:

       int tot = 0;
       for (int j = 0; j < 10; j++)
           tot = tot + j;

3. Do not depend on lazy (short-circuit) evaluation, e.g., instead of

       if ((a != 0) && (b/a < 0))

   write

       if (a != 0)
           if (b/a < 0)

Primitive C error-handling: may not be sufficient for an embedded system. Assert: fig_08_02
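
A minimal sketch of assert() usage (my example, not the book's figure):

    /* assert() checks a condition at run time; if it is false the macro
     * prints the file and line and calls abort().  Compiling with -DNDEBUG
     * removes the check entirely -- one reason it may not be sufficient
     * for fielded embedded code. */
    #include <assert.h>

    int scale_reading(int raw, int divisor)
    {
        assert(divisor != 0);          /* "controlled crash" during debugging */
        return raw / divisor;
    }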

Example: good for the debugging stage, allows a "controlled crash"; not robust enough for final code. fig_08_03
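
One common production-style alternative, sketched here under my own assumptions (not the slide's figure): detect the bad condition, log it, and fail gracefully with a documented fallback instead of aborting.

    #include <stdio.h>

    int scale_reading_safe(int raw, int divisor)
    {
        if (divisor == 0) {
            /* fprintf() stands in for a project-specific error logger */
            fprintf(stderr, "scale_reading_safe: divisor == 0, using fallback\n");
            return 0;                  /* documented safe fallback value */
        }
        return raw / divisor;
    }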

Jump statements: consequences may not be acceptable fig_08_04
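
The figure is not reproduced here; as one hedged illustration (assuming the slide refers to non-local jumps such as setjmp()/longjmp()), a jump bypasses the normal return path, so cleanup after the failing call never runs:

    #include <setjmp.h>
    #include <stdio.h>
    #include <stdlib.h>

    static jmp_buf recover;

    static void process(FILE *f)
    {
        (void)f;                       /* ... work with the file ... */
        longjmp(recover, 1);           /* fatal condition: jump back to setjmp() */
    }

    int main(void)
    {
        if (setjmp(recover) != 0) {
            puts("recovered, but the file opened below was never closed");
            return EXIT_FAILURE;
        }
        FILE *f = fopen("log.txt", "r");
        if (f != NULL) {
            process(f);
            fclose(f);                 /* skipped when longjmp() fires */
        }
        return EXIT_SUCCESS;
    }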

Example: fig_08_05. Better: high compiler warning level, variable typing, e.g. (sketch below):
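
A hedged sketch of what explicit typing plus a high warning level can catch (my example; the gap below is deliberate). With an illustrative build command such as gcc -Wall -Wextra -Werror -c mode.c, the switch draws a -Wswitch warning because one enumerator is not handled.

    #include <stdio.h>

    typedef enum { MODE_IDLE, MODE_RUN, MODE_FAULT } run_mode;

    void report(run_mode m)
    {
        switch (m) {
        case MODE_IDLE: puts("idle");    break;
        case MODE_RUN:  puts("running"); break;
        /* MODE_FAULT missing: -Wall (-Wswitch) flags the unhandled enumerator */
        }
    }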

Example system blocks: control, memory, data / comm, power / reset, peripherals, clock. fig_08_06

Basic method: redundancy (triple): fig_08_07
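
A hedged C sketch of the triple-redundancy idea (the figure itself is not reproduced): three independent channels produce a value and a 2-of-3 majority vote masks a single faulty channel.

    #include <stdio.h>

    int vote3(int a, int b, int c)
    {
        if (a == b || a == c) return a;   /* a agrees with at least one other */
        if (b == c)           return b;   /* a is the odd one out */
        return a;   /* all three disagree: no majority -- flag and handle as a fault */
    }

    int main(void)
    {
        printf("%d\n", vote3(7, 7, 9));   /* prints 7: the faulty third channel is masked */
        return 0;
    }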

Higher redundancy: fig_08_08

Reduced capability in case of failure / error: fig_08_09

Alternative: monitor only fig_08_10

Bussing; interconnection architectures fig_08_11

Sequential: still can fail at one point fig_08_12

Better: ring fig_08_13

Even better: ring with redundancy fig_08_14

Signal values: both magnitude and duration determine the response: ignore, detect / warn, or react. fig_08_15 (a sketch follows)
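
A hedged sketch of the idea (thresholds and tick counts are illustrative only, not from the figure): classify a sampled signal by magnitude AND how long it has been out of bounds; brief excursions are ignored, sustained ones first warn, then force a reaction.

    #define LIMIT          100   /* out-of-bounds magnitude            */
    #define WARN_TICKS       5   /* sustained this long: detect / warn */
    #define REACT_TICKS     20   /* sustained this long: react         */

    typedef enum { SIG_OK, SIG_WARN, SIG_REACT } sig_action;

    sig_action check_signal(int sample)
    {
        static unsigned over = 0;            /* consecutive out-of-bounds ticks */

        if (sample > LIMIT) over++; else over = 0;

        if (over >= REACT_TICKS) return SIG_REACT;
        if (over >= WARN_TICKS)  return SIG_WARN;
        return SIG_OK;                       /* in bounds or too brief: ignore */
    }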

Data errors: detect / correct Example: errors in 3 bits fig_08_16

Error detection example fig_08_17

Hamming code (review): fig_08_18
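
A hedged review sketch of a Hamming(7,4) encoder and syndrome check (my own layout; the slide's figure may use different notation). Data bits d1..d4 occupy codeword positions 3, 5, 6, 7; parity bits p1, p2, p4 occupy positions 1, 2, 4; a nonzero syndrome gives the position of a single-bit error.

    #include <stdio.h>

    /* Encode 4 data bits (d & 0x0F) into a 7-bit codeword; bit i = position i+1. */
    unsigned encode_hamming74(unsigned d)
    {
        unsigned d1 = (d >> 0) & 1, d2 = (d >> 1) & 1,
                 d3 = (d >> 2) & 1, d4 = (d >> 3) & 1;
        unsigned p1 = d1 ^ d2 ^ d4;   /* covers positions 1,3,5,7 */
        unsigned p2 = d1 ^ d3 ^ d4;   /* covers positions 2,3,6,7 */
        unsigned p4 = d2 ^ d3 ^ d4;   /* covers positions 4,5,6,7 */
        return p1 | (p2 << 1) | (d1 << 2) | (p4 << 3) |
               (d2 << 4) | (d3 << 5) | (d4 << 6);
    }

    /* Return the syndrome: 0 = no error, otherwise the 1-based position in error. */
    unsigned syndrome_hamming74(unsigned cw)
    {
        unsigned s1 = ((cw >> 0) ^ (cw >> 2) ^ (cw >> 4) ^ (cw >> 6)) & 1;
        unsigned s2 = ((cw >> 1) ^ (cw >> 2) ^ (cw >> 5) ^ (cw >> 6)) & 1;
        unsigned s4 = ((cw >> 3) ^ (cw >> 4) ^ (cw >> 5) ^ (cw >> 6)) & 1;
        return s1 | (s2 << 1) | (s4 << 2);
    }

    int main(void)
    {
        unsigned cw  = encode_hamming74(0xB);       /* data 1011 */
        unsigned bad = cw ^ (1u << 4);              /* flip the bit at position 5 */
        printf("syndrome = %u\n", syndrome_hamming74(bad));  /* prints 5 */
        return 0;
    }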

Block codes example: lateral & longitudinal parity. fig_08_22
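
A hedged sketch of lateral (per-byte) and longitudinal (per-column) parity over a small block of data; together the row and column checks can locate a single-bit error.

    #include <stdio.h>

    /* Even parity bit for one byte (lateral parity). */
    unsigned char row_parity(unsigned char b)
    {
        unsigned char p = 0;
        for (int i = 0; i < 8; i++) p ^= (b >> i) & 1;
        return p;
    }

    /* Longitudinal parity byte: XOR of all data bytes (one parity bit per column). */
    unsigned char column_parity(const unsigned char *data, int n)
    {
        unsigned char p = 0;
        for (int i = 0; i < n; i++) p ^= data[i];
        return p;
    }

    int main(void)
    {
        unsigned char block[4] = { 0x3A, 0x7F, 0x00, 0xC4 };
        for (int i = 0; i < 4; i++)
            printf("row %d parity = %u\n", i, row_parity(block[i]));
        printf("column parity byte = 0x%02X\n", column_parity(block, 4));
        return 0;
    }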

fig_08_23

More complex codes: use the field Z2 (arithmetic modulo 2). fig_08_24
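
Brief note on why Z2 matters for implementation (standard facts, not taken from the figure): in Z2 the only elements are 0 and 1, addition is XOR (1 + 1 = 0) and multiplication is AND, so encoders and syndrome checkers reduce to XOR gates and shift registers, as in the following figures.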

Shift register for encoding, decoding: fig_08_25

Checking data: fig_08_26

“syndrome” calculator: fig_08_27

Encoding: fig_08_28

Some generator polynomials: the correct one must be chosen. fig_08_29
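
A hedged software sketch of the bitwise CRC that mirrors the shift-register hardware in the preceding figures. The generator polynomial must match on both ends; CRC-16-CCITT (x^16 + x^12 + x^5 + 1, i.e. 0x1021) is used here purely as an example of a standard choice.

    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>

    uint16_t crc16_ccitt(const uint8_t *data, size_t len)
    {
        uint16_t crc = 0xFFFF;                    /* common initial value */
        for (size_t i = 0; i < len; i++) {
            crc ^= (uint16_t)data[i] << 8;
            for (int bit = 0; bit < 8; bit++)     /* one shift-register step per bit */
                crc = (crc & 0x8000) ? (uint16_t)((crc << 1) ^ 0x1021)
                                     : (uint16_t)(crc << 1);
        }
        return crc;
    }

    int main(void)
    {
        const char msg[] = "123456789";
        printf("CRC = 0x%04X\n", crc16_ccitt((const uint8_t *)msg, sizeof msg - 1));
        return 0;
    }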

Power system: fig_08_30

Redundancy and power monitoring: fig_08_31

Potential actions: fig_08_32
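
A heavily hedged sketch only (the figures are not reproduced, and read_supply_mv(), warn_operator(), save_state(), and shutdown_gracefully() are hypothetical project hooks): a power monitor that escalates from warning to saving state to shutdown as the measured supply voltage drops; thresholds are illustrative.

    #define WARN_MV      4750   /* below nominal: warn                 */
    #define CRITICAL_MV  4500   /* save state and shut down gracefully */

    extern unsigned read_supply_mv(void);      /* hypothetical ADC/supervisor read */
    extern void warn_operator(void);           /* hypothetical hooks */
    extern void save_state(void);
    extern void shutdown_gracefully(void);

    void power_monitor_tick(void)
    {
        unsigned mv = read_supply_mv();
        if (mv < CRITICAL_MV) {
            save_state();              /* preserve critical data while power holds */
            shutdown_gracefully();
        } else if (mv < WARN_MV) {
            warn_operator();
        }
    }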

Using backups: fig_08_33

Backups: short-term fix: fig_08_34

Bus faults: buffering fig_08_35

Bus testing: fig_08_36

Interface system monitoring and testing: fig_08_37

Example: common fault analysis table_08_00