CSCI 5230: Project Management
Software Reuse Disasters: Therac-25 and Ariane 5 Flight 501
David Sumpter
12/4/2001

Contents
• Introduction
• Therac-25 – Background
• Therac-25 – Software Process
• Therac-25 – Causes
• Flight 501 – Sequence of Events
• Ariane 5 Software Process
• Flight 501 – Cause of Failure
• Conclusion
• References

Introduction
Two famous software engineering disasters
• Therac-25
  • Medical accelerator to treat tumors
  • 6 known accidents resulting in death or serious injury
    • June 1985 – January 1987
  • Software was adapted from earlier models
• Ariane 5 Flight 501
  • Maiden flight of Ariane 5 launch vehicle
    • Larger, more powerful successor to the Ariane 4
  • Exploded approximately 40 seconds after launch
    • June 1996
  • Loss traced to software carried over virtually unchanged from Ariane 4

Therac-25 – Background
• 25 MeV medical accelerator
  • Designed to destroy tumors
  • Dual mode
    • Electron beam or X-rays
• Successor to Therac-6, Therac-20
  • Computer control added to earlier machines
  • Still capable of stand-alone (no computer) operation
    • All standard hardware safety mechanisms
• Therac-25 more dependent on software
  • Lacked many hardware safety mechanisms of earlier accelerators
• Therac-25 software “evolved from” Therac-6 code
  • PDP-11 assembly, no standard OS
  • Also contained Therac-20 code

Therac-25 – Software Process
• Little, if any, process
• Single programmer
• Minimal unit and software testing
  • Emphasis on integrated system testing
• 1983 safety analysis, in effect, assumed that software had no errors!
  • “Programming errors have been reduced by extensive testing... Any residual software errors are not included in the analysis.”
  • “Computer execution errors are caused by faulty hardware components and by ‘soft’ (random) errors induced by alpha particles and electromagnetic noise.”

Therac-25 – Causes
• For three of the six known incidents, cause is unknown
  • “there is no way to determine what particular design errors were related to the... accidents. Given the unsafe programming practices in the code, it is possible that unknown race conditions or errors could have been responsible”
• For two fatal accidents in Tyler, Texas
  • Race condition led to inconsistent machine settings, leading to massive radiation overdoses
  • Same bug found in Therac-20
    • Hardware safeguards prevented it from causing injuries, and even from being discovered, until after the Tyler accidents
• Fatal accident at Yakima, Washington
  • Overflow of 1-byte variable which led, under rare conditions, to improper machine settings, leading to massive radiation overdose
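Note: the 1-byte overflow can be illustrated with a minimal sketch. It assumes a flag that is incremented on every pass and read as "no check needed" when it is zero. The function and variable names are hypothetical, and this is not the Therac-25 code (which was PDP-11 assembly); it only shows how a wrap-around can silently skip a check under rare conditions.

```python
# Illustrative sketch only: names and the loop structure are hypothetical,
# not taken from the Therac-25 source.
from typing import List


def increment_one_byte(value: int) -> int:
    """Increment a counter stored in a single byte (255 wraps to 0)."""
    return (value + 1) & 0xFF


def passes_where_check_is_skipped(num_passes: int) -> List[int]:
    """Return the pass numbers on which a 'flag == 0 means OK' test is fooled.

    The flag is meant to be non-zero whenever a safety check is still
    required, but it is incremented rather than set, so it wraps to zero.
    """
    flag = 0
    skipped = []
    for pass_number in range(1, num_passes + 1):
        flag = increment_one_byte(flag)  # intended meaning: "check required"
        if flag == 0:                    # wrapped around: check is skipped
            skipped.append(pass_number)
    return skipped


def passes_where_check_is_skipped_fixed(num_passes: int) -> List[int]:
    """Same loop, but the flag is set to a fixed non-zero value each pass."""
    skipped = []
    for pass_number in range(1, num_passes + 1):
        flag = 1                         # set, not incremented: cannot wrap
        if flag == 0:
            skipped.append(pass_number)
    return skipped


print(passes_where_check_is_skipped(600))        # [256, 512]
print(passes_where_check_is_skipped_fixed(600))  # []
```

In this sketch the check is bypassed only on every 256th pass, which is why such a defect can survive integrated system testing and show up only in the field.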

Flight 501 – Sequence of Events
• Near-simultaneous failure of the primary and back-up Inertial Reference Systems (SRIs)
  • 36 seconds after main engine ignition
• Nozzles of the two solid boosters and main engine swivel to extreme positions
  • Nozzles direct rocket thrust, steer launcher
  • Caused launcher to veer abruptly
• Links between the solid boosters and the core stage rupture
  • Triggered self-destruct

Ariane 5 Software Process
• Stringent processes in place, but…
• “the culture within the Ariane program…” only addressed “random hardware failures… which can quite rationally be handled by a backup system”
• “the view had been taken that software should be considered correct until it is shown to be at fault”! (emphasis added)
• SRIs were not included, but simulated by special software, in integrated tests
  • Technically difficult and expensive
  • SRIs considered “fully qualified at equipment level”
    • “The design of the Ariane 5 SRI is practically the same as that of an SRI which is presently used on Ariane 4, particularly as regards the software”

Flight 501 – Cause of Failure
• Software exception “during… data conversion from 64-bit floating point to 16-bit signed integer”
  • Occurred in SRI software
  • Overflow caused by unexpectedly high value for Horizontal Bias (BH) variable
    • BH related to horizontal velocity
  • Not protected to save computer processing power
    • Analysis had determined that overflow could not occur
    • Reasoning not documented in code
    • Ariane 5 has higher horizontal velocity, early in trajectory, than Ariane 4!
• That part of software where error occurred was not needed after launch
  • Requirement to continue operating after launch traces to earlier versions of Ariane
    • Enabled prompt re-start of count-down in event of a hold
    • Did not apply to Ariane 5, but maintained for commonality
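Note: the difference between an unprotected and a protected narrowing conversion can be sketched as follows. The flight software was not written in Python. The names, the numeric values, and the choice of saturation as the protective action are all assumptions made for illustration; they are not taken from the Inquiry Board report.

```python
# Illustrative sketch only. All names and numeric values are hypothetical.

INT16_MIN = -32768
INT16_MAX = 32767


class OperandError(Exception):
    """Stand-in for the run-time exception raised by an out-of-range conversion."""


def convert_unprotected(value: float) -> int:
    """Narrow a 64-bit float to a 16-bit signed integer with no handling.

    Assumes a finite input. An out-of-range value raises, and nothing in
    this function catches the exception, so the caller fails.
    """
    result = int(value)  # truncates toward zero
    if result < INT16_MIN or result > INT16_MAX:
        raise OperandError(f"{value} does not fit in 16 bits")
    return result


def convert_protected(value: float) -> int:
    """Same conversion, but out-of-range values saturate at the type limits.

    Saturation is an assumed protective action chosen for this example.
    Assumes a finite input.
    """
    result = int(value)
    return max(INT16_MIN, min(INT16_MAX, result))


small = 20000.0  # hypothetical value inside the range the analysis assumed
large = 40000.0  # hypothetical value from a faster early trajectory

print(convert_unprotected(small))  # 20000
print(convert_protected(large))    # 32767
try:
    convert_unprotected(large)
except OperandError as exc:
    print("exception:", exc)
```

The protected version costs an extra comparison on every call, which is the kind of processing overhead the slide says was traded away when the analysis concluded the overflow could not occur.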

Conclusion
In both cases, software was carried over from earlier projects where it had seemingly worked well
• Therac-25
  • Software defects in earlier machines were hidden by hardware safeguards
  • No real software development process
  • Apparently no serious evaluation of risks involved in using software in lieu of hardware safeguards
• Ariane 5
  • Known “defect” was non-issue on Ariane 4
  • Established software development process in place
  • Issues were considered, but key factor was missed

Conclusion, cont.
Misunderstanding of software?
• Both were primarily hardware projects
  • Reuse of existing software in the development of new hardware
• Not only underestimated complexity of software, but failed to recognize that it was even an issue
  • Both projects made the absolutely astounding assumption that the software didn’t have errors!
• Assumed “black box” that could be swapped in and out of different applications
  • No evidence that reuse was considered in design of software

References
• Inquiry Board. Ariane 5 Flight 501 Failure. Inquiry Board report (July 1996).
  • Available online at http://www.mssl.ucl.ac.uk/www_plasma/missions/cluster/ariane5rep.html
• Leveson, N., Turner, C. S. An Investigation of the Therac-25 Accidents. IEEE Computer, vol. 26, no. 7 (July 1993), 18-41.
  • Available online at http://courses.cs.vt.edu/~cs3604/lib/Therac_25/Therac_1.html

Thank You