Orcmid's Lair

Writings W040400
Risks of Risk Speculation

0.00 2004-09-29 -12:00 -0700

In Spring 2003, I participated in an on-line Software Engineering class.  In the eighth and final week, we were assigned the following question for discussion [1]:

When large software systems are developed by many teams in many different locations, how should liabilities be assigned?  To answer the question, consider the example of the Ariane 5 Rocket.  In 1996, an unmanned Ariane 5 rocket launched by the European Space Agency exploded just forty seconds after its lift-off from Kourou, French Guiana.  The rocket was on its first voyage, after a decade of development with millions of lines of programming code costing $7 billion.  An inquiry found that the cause of the failure was a software error in the inertial reference system.  Specifically a 64 bit floating point number relating to the horizontal velocity of the rocket with respect to the platform was converted to a 16 bit signed integer. The number was larger than 32,768, the largest integer storable in a 16 bit signed integer, and thus the conversion failed.
http://www.ima.umn.edu/~arnold/disasters/ariane.html [Arnold2000]

I don't know how many times I have heard about the loss of Ariane 5 Flight 501 in this or similar terms.  Our Software Engineering text holds this incident up as the kind of failure that a successful Verification and Validation (V&V) program might have avoided [Sommerville2001:p.468].  The Inquiry Board concurred that the failure could have been prevented if there had been a full-dress simulated launch of Ariane 5 Flight 501 critical systems, including the flight profile [Lions1996: 3.1(s), 3.2].

On examining the complete available materials on the failure of Ariane 5 Flight 501, I made this remarkable discovery:

It is simply not the case that a programming error involving a numeric conversion (some call it an overflow) was the root cause of the failure of Ariane 5 Flight 501.  That's not what happened.

That's right.  There was no bug, no coding or implementation error.  The software did what it was designed to do.  This is not about a programming error.  It is about engineering and design failures, borne out in the identification of remedies [ESA1996].

There was indeed a chain of events that led to a software-detected out-of-range condition and automatic shutdown of the inertial reference system.  This exposed a serious integration bug and failure mode that led the rocket to disintegrate, and the disintegration triggered (by non-computer electromechanical means) the pyrotechnic destruction of the launcher and its payload.

Understanding how the failure was induced, and why the software was being used under conditions it was not designed for and was not even required for, is far more valuable than trivializing the failure as something that could have been caught by proper debugging or attention to proper numerical computation.  Those are certainly important practices, but they are not the valuable lesson of Ariane 5 Flight 501.

My interest here is to examine this established software myth from the following perspective:

  1. What actually happened?
  2. What were the findings of the Inquiry Board?
  3. Why is this a success for software engineering?
  4. What leads us to portray this event as a lesson about programming and stupid bugs?

-- Dennis E. Hamilton
Seattle, Washington
2004 April 5


1. What Happened?

2. What Are the Findings?

3. How Did Software Engineering Succeed?

4. Why Do We Trivialize This Lesson?

References

End Notes


1. What Happened?

 

References

[Arnold2000]
Arnold, Douglas N.  The Explosion of the Ariane 5.  Education Related Materials page.  Institute for Mathematics and Its Applications.  University of Minnesota.  (Minneapolis: 2000 August 23).   Posted on the web at <http://www.ima.umn.edu/~arnold/disasters/ariane.html>.
     The Ariane 501 flight failure description is found in a collection of "disasters attributable to bad numerics" where the explosion is described as "ultimately the consequence of a simple overflow."  There is in fact no such finding in the Board of Inquiry Report [Lions1996] cited in this article, and the three quotations are from different parts of the report and in a different sequence [2].  
     
[ESA1996]
ESA.  Ariane 501 - Presentation of Inquiry Board report.  Press Release 33-1996.  European Space Agency.  (Paris: 1996 July 23).  Published on-line at <http://www.esa.int/export/esaLA/Pr_33_1996_p_EN.html>.  The PDF version of [Lions1996] is linked from this page.
   
[Lions1996]
Lions, Jacques-Louis (Chairman).  Ariane 5 Flight 501 Failure.  Report by the Inquiry Board.  (Paris: 1996 July 19).  PDF version at <http://ravel.esrin.esa.it/docs/esa-x-1819eng.pdf>.  An HTML version of the report, without the presentation slides and diagrams, has been preserved by Douglas Arnold at <http://www.ima.umn.edu/~arnold/disasters/ariane5rep.html>.
   
[Sommerville2001]
Sommerville, Ian.  Software Engineering, ed.6.  Addison-Wesley (Boston: 2001).  ISBN 0-201-39815-X.  "It is not uncommon for verification and validation to take up more than 50 percent of the total development costs for critical-software systems.  This cost is, of course, justified if an expensive system failure is avoided.  For example, in 1996 a mission-critical software system on the Ariane 5 rocket failed and several satellites were destroyed.  The consequential loss was hundreds of millions of dollars." section 2.1, Critical system validation, p.468.


   

End Notes

[1] The statement of the assignment includes text similar to that in [Arnold2000].
The portion spanning from "An inquiry found that" to "and thus the conversion failed" is a direct quotation, except that Arnold correctly gives the largest 16-bit signed-integer value as 32767 (the largest value fitting in 15 bits).
   
[2] The Inquiry Board makes the following statements in different parts of the report [Lions1996]:
(a) "On 4 June 1996, the maiden flight of the Ariane 5 launcher ended in a failure. Only about 40 seconds after initiation of the flight sequence, at an altitude of about 3700 m, the launcher veered off its flight path, broke up and exploded." -- quoted from the Foreword
(b) "The internal SRI software exception was caused during execution of a data conversion from 64-bit floating point to 16-bit signed integer value. The floating point number which was converted had a value greater than what could be represented by a 16-bit signed integer." -- quoted from section 2.1, Chain of Technical Events.  That passage continues: "This resulted in an Operand Error. The data conversion instructions (in Ada code) were not protected from causing an Operand Error, ... ."  We would now say that the situation caused an exception to be thrown.  The SRI was designed to treat that kind of exception as fatal. The very next passage identifies the problem with the exception being thrown where and when it occurred after lift-off: "The error occurred in a part of the software that only performs alignment of the strap-down inertial platform. This software module computes meaningful results only before lift-off. As soon as the launcher lifts off, this function serves no purpose."
(c) "The failure of the Ariane 501 was caused by the complete loss of guidance and attitude information 37 seconds after start of the main engine ignition sequence (30 seconds after lift-off). This loss of information was due to specification and design errors in the software of the inertial reference system." -- quoted from section 3.2, Cause of the Failure.  This short section has only this to add: "The extensive reviews and tests carried out during the Ariane 5 Development Programme did not include adequate analysis and testing of the inertial reference system or of the complete flight control system, which could have detected the potential failure."
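
The conversion described in note [2](b) can be illustrated with a minimal sketch.  It is written in Python rather than the original Ada, and it is not taken from the flight software: the names OperandError, to_int16, convert_unprotected and convert_protected are hypothetical, chosen only to show the difference between a conversion whose out-of-range error propagates and one that is handled locally.  The clamping in convert_protected is one possible handling strategy assumed for illustration, not something the report prescribes.

```python
import math

INT16_MIN = -(2 ** 15)     # -32768
INT16_MAX = 2 ** 15 - 1    # 32767, the largest 16-bit signed integer


class OperandError(Exception):
    """Hypothetical stand-in for the Operand Error named in the report."""


def to_int16(value: float) -> int:
    """Convert a float to a 16-bit signed integer, raising when it cannot fit."""
    if not math.isfinite(value):
        raise OperandError(f"{value!r} is not a finite number")
    truncated = int(value)  # int() truncates toward zero
    if not INT16_MIN <= truncated <= INT16_MAX:
        raise OperandError(f"{value!r} does not fit in a 16-bit signed integer")
    return truncated


def convert_unprotected(value: float) -> int:
    """No handler here: an OperandError propagates to the caller."""
    return to_int16(value)


def convert_protected(value: float) -> int:
    """Illustrative only: handle the error locally by clamping to the range."""
    try:
        return to_int16(value)
    except OperandError:
        if math.isnan(value):
            raise  # there is no sensible bound to clamp a NaN to
        return INT16_MAX if value > 0 else INT16_MIN
```

In this sketch, to_int16(32767.0) returns 32767, while convert_unprotected(32768.0) raises OperandError.  What happens to an exception that nobody handles is a system design decision, separate from the conversion itself; as noted in [2](b), the SRI was designed to treat that kind of exception as fatal.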

0.00 2004-04-05 Create Basic Article Structure (orcmid)
Bill Anderson called and pointed out one more place where the folklore about the Ariane 501 failure being a software problem has come up once again, and we want to have something somewhere to discuss the strangeness of that persistent claim, when the failure was quite different than that.  I begin by gathering the notes I have, and sketching the approach.

created 2004-04-05-18:21 -0700 (pdt) by orcmid
$$Author: Orcmid $
$$Date: 05-02-11 16:48 $
$$Revision: 10 $
