In Spring 2003, I participated in an on-line Software
Engineering class. In the eighth and final week we were assigned the
following question for discussion [1]:
When large software systems are developed by many teams in many different
locations, how should liabilities be assigned? To answer the question,
consider the example of the Ariane 5 Rocket. In 1996, an unmanned Ariane 5
rocket launched by the European Space Agency exploded just forty seconds after
its lift-off from Kourou, French Guiana. The rocket was on its first voyage,
after a decade of development with millions of lines of programming code
costing $7 billion. An inquiry found that the cause of the failure was a
software error in the inertial reference system. Specifically a 64 bit
floating point number relating to the horizontal velocity of the rocket with
respect to the platform was converted to a 16 bit signed integer. The number
was larger than 32,768, the largest integer storable in a 16 bit signed
integer, and thus the conversion failed.
http://www.ima.umn.edu/~arnold/disasters/ariane.html [Arnold2000]
I don't know how many times I have heard about the loss of the Ariane
flight 501 in this or similar terms. Our Software Engineering text holds
this incident up as the kind of failure that a successful Verification and
Validation (V&V) program might have avoided [Sommerville2001:p.468].
The Inquiry Board concurred that the failure could have been prevented if there
had been a full-dress simulated launch of Ariane 5 Flight 501 critical systems,
including the flight profile [Lions1996: 3.1(s), 3.2].
On examining the complete available materials on the failure of
Ariane flight 501, I made this remarkable discovery:
It is simply not
the case that a programming error involving a numeric conversion (some
call it an overflow) was the root cause of the failure of Ariane 5 Flight
501. That's not what happened.
That's right. There was no bug, no coding or implementation
error. The software did what it
was designed to do. This is not about a programming error. It
is about engineering and design failures, borne out in the identification of
remedies [ESA1996].
There was indeed a chain of events that led to a software-detected
out-of-range condition and automatic shutdown of the inertial reference
system. This exposed a serious integration bug and failure mode that led the rocket to
disintegrate, and the disintegration triggered (by non-computer
electromechanical means) the pyrotechnic destruction of the launcher and
its payload.
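To make the distinction concrete, here is a minimal Python sketch of a checked conversion whose failure is handled by shutting the unit down. It is not the flight software, which was written in Ada. `OperandError`, `to_int16`, and `ReferenceUnit` are hypothetical names chosen for illustration, and the shutdown is reduced to a flag.

```python
from typing import Optional

# Illustrative only: not the Ariane flight code. All names are hypothetical.
INT16_MIN, INT16_MAX = -32768, 32767


class OperandError(Exception):
    """Stand-in for the out-of-range condition the software detected."""


def to_int16(value: float) -> int:
    """Convert to a 16-bit signed integer, refusing out-of-range values."""
    result = int(value)  # truncates toward zero
    if not INT16_MIN <= result <= INT16_MAX:
        raise OperandError(f"{value!r} does not fit in 16 bits")
    return result


class ReferenceUnit:
    """Hypothetical unit whose design treats any OperandError as fatal."""

    def __init__(self) -> None:
        self.running = True

    def step(self, reading: float) -> Optional[int]:
        try:
            return to_int16(reading)
        except OperandError:
            self.running = False  # the designed response: shut the unit down
            return None
```

Nothing in the sketch malfunctions: the conversion reports the out-of-range value and the unit shuts down, exactly as written. Whether shutting down is the right response is a design question, not a coding one.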
Understanding how the failure was induced, and why the software was running
under conditions it was not designed for and where it was not even required,
is far more valuable than trivializing the failure as something that proper
debugging or attention to numerical computation could have caught.
Those are certainly important practices, but they are not the valuable lesson of
Ariane 5 Flight 501.
My interest here is to examine this established software myth from the
following perspective:
- What actually happened?
- What were the findings of the Inquiry Board?
- Why is this a success for software engineering?
- What is there that has us portray this event as a lesson about
programming and stupid bugs?
-- Dennis E. Hamilton
Seattle, Washington
2004 April 5
1. What Happened?
2. What Are the Findings?
3. How Did Software Engineering Succeed?
4. Why Do We Trivialize This Lesson?
End Notes
- [Arnold2000]
- Arnold, Douglas N.
The Explosion of the Ariane 5. Education
Related Materials page. Institute for Mathematics and Its
Applications. University of Minnesota. (Minneapolis: 2000 August
23). Posted on the web at <http://www.ima.umn.edu/~arnold/disasters/ariane.html>.
The Ariane 501 flight failure description is found
in a collection of "disasters
attributable to bad numerics" where the explosion is described as
"ultimately the consequence of a simple overflow." There is
in fact no such finding in the Board of Inquiry Report [Lions1996]
cited in this article, and the three quotations are from different parts of
the report and in a different sequence [2].
- [ESA1996]
- ESA. Ariane 501 - Presentation of Inquiry Board report. Press
Release 33-1996. European Space Agency. (Paris: 1996 July
23). Published on-line at <http://www.esa.int/export/esaLA/Pr_33_1996_p_EN.html>.
The PDF version of [Lions1996] is linked from this
page.
- [Lions1996]
- Lions, Jacques-Louis (Chairman). Ariane 5 Flight 501 Failure.
Report by the Inquiry Board. (Paris: 1996 July 19). PDF version
at <http://ravel.esrin.esa.it/docs/esa-x-1819eng.pdf>.
An HTML version of the report, without the presentation slides and diagrams,
has been preserved by Douglas Arnold at <http://www.ima.umn.edu/~arnold/disasters/ariane5rep.html>.
- [Sommerville2001]
- Sommerville, Ian. Software Engineering, ed.6.
Addison-Wesley (Boston: 2001). ISBN 0-201-39815-X. "It is
not uncommon for verification and validation to take up more than 50 percent
of the total development costs for critical-software systems. This
cost is, of course, justified if an expensive system failure is
avoided. For example, in 1996 a mission-critical software system on
the Ariane 5 rocket failed and several satellites were destroyed. The
consequential loss was hundreds of millions of dollars." section 2.1,
Critical system validation, p.468.
- [1] The statement of the assignment includes text similar to that in [Arnold2000].
- The portion spanning from "An inquiry found that" to "and thus
the conversion failed" is a direct quotation except Arnold
correctly gives the largest 16-bit signed-integer value as 32767 (the
largest value fitting in 15 bits).
- [2] The Inquiry Board makes the following statements in
different parts of the report [Lions1996]:
- (a) "On 4 June 1996, the maiden flight of the Ariane 5 launcher ended in a
failure. Only about 40 seconds after initiation of the flight sequence, at an
altitude of about 3700 m, the launcher veered off its flight path, broke up
and exploded." -- quoted from the Foreword
(b) "The internal SRI software exception was caused during execution of a
data conversion from 64-bit floating point to 16-bit signed integer value. The
floating point number which was converted had a value greater than what could
be represented by a 16-bit signed integer." -- quoted from section 2.1, Chain of
Technical Events. That passage continues: "This resulted in an
Operand Error. The data conversion instructions (in Ada code) were not
protected from causing an Operand Error, ... ." We would now say
that the situation caused an exception to be thrown. The SRI was
designed to treat that kind of exception as fatal. The very next passage
identifies the problem with the exception being thrown where and when it occurred
after lift-off: "The error occurred in a part of the software that only
performs alignment of the strap-down inertial platform. This software module
computes meaningful results only before lift-off. As soon as the launcher
lifts off, this function serves no purpose."
(c) "The failure of the Ariane 501 was caused by the complete loss of
guidance and attitude information 37 seconds after start of the main engine
ignition sequence (30 seconds after lift-off). This loss of information was
due to specification and design errors in the software of the inertial
reference system." -- quoted from section 3.2, Cause of the Failure.
This short section has only this to add: "The extensive reviews and tests
carried out during the Ariane 5 Development Programme did not include adequate
analysis and testing of the inertial reference system or of the complete
flight control system, which could have detected the potential failure."
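The arithmetic in note [1] is easy to check. The following Python sketch derives the bounds of a 16-bit signed integer and uses the standard `struct` module to test whether a value fits. `fits_int16` is a hypothetical helper name, not anything taken from the report.

```python
import struct

BITS = 16
largest = 2 ** (BITS - 1) - 1    # one bit holds the sign, 15 bits hold the magnitude
smallest = -(2 ** (BITS - 1))    # two's complement reaches one step further down


def fits_int16(n: int) -> bool:
    """Return True when n can be packed as a 16-bit signed integer."""
    try:
        struct.pack("<h", n)     # "h" is a signed 16-bit short
        return True
    except struct.error:
        return False
```

Under these definitions `largest` is 32767, so 32,768 as given in the assignment is already one past the largest storable value.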
- 0.00 2004-04-05 Create Basic Article
Structure (orcmid)
- Bill Anderson called and pointed out one more place where the folklore
about the Ariane 501 failure being a software problem has come up once
again, and we want to have something somewhere to discuss the
strangeness of that persistent claim, when the failure was quite
different than that. I begin by gathering
the notes I have, and sketching the approach.
created 2004-04-05-18:21 -0700 (pdt) by orcmid
$$Author: Orcmid $
$$Date: 05-02-11 16:48 $
$$Revision: 10 $