Orcmid's Lair

Welcome to Orcmid's Lair, the playground for family connections, pastimes, and scholarly vocation -- the collected professional and recreational work of Dennis E. Hamilton

2005-12-31

Ariane 501: A "History's Worst Software Bug" Provides the Wrong Lesson

Wired News: History's Worst Software Bugs.  Well, there you have it: the destruction of Ariane Flight 501 on June 4, 1996, is attributed to one of history’s ten worst software bugs.

On researching this incident for a class assignment in May 2003, I learned that there was no programming error behind the loss of Ariane Flight 501.  I keep wanting to write the perfect unravelling of this myth.  Meanwhile, someone whose work I admire greatly has fallen for it.  Drat.

Here’s a summary that I’ve pulled together about the chain of events and its import for system engineering:

It is simply not the case that a programming error involving a numeric conversion (some call it an overflow) was the root cause of the failure of Ariane 5 Flight 501.  That's not what happened.

That's right.  There was no bug.  There was no coding or software-implementation error.  The software did what it was designed to do and what it was agreed that it should do. 

The programming-error story has so strongly taken its place in the software technocracy's popular wisdom that the most important and actionable lessons are completely obscured.

There was indeed a chain of events that doomed the flight:

  • an out-of-range data condition in a calculation that wasn't even needed,
  • the by-design throwing of an uncaught exception,
  • and the automatic shutdown of the launch vehicle's active and backup inertial reference systems. 

As a result of the unanticipated failure mode and a diagnostic message erroneously treated as data, the guidance system ordered a violent attitude correction.  The ensuing disintegration of the over-stressed vehicle triggered the pyrotechnic destruction of the launcher and its payload.
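To make that chain concrete, here is a minimal Python sketch of how the pieces interact.  It is not the flight software (which was written in Ada) and nothing in it comes from the inquiry report: the 16-bit range, the diagnostic pattern, and every name (`to_int16`, `InertialReferenceUnit`, `guidance_correction`, `simulate`) are assumptions chosen for illustration.  What it shows is that each part can behave exactly as specified while the combination still fails.

```python
from dataclasses import dataclass

# Assumed, illustrative values; not taken from the real system.
INT16_MIN, INT16_MAX = -32768, 32767
DIAGNOSTIC_WORD = 0x7FFF  # hypothetical diagnostic pattern placed on the bus


class OperandError(Exception):
    """Raised when a value does not fit the target range."""


def to_int16(value: float) -> int:
    """Convert to a 16-bit integer, raising rather than wrapping or saturating."""
    result = int(value)
    if not INT16_MIN <= result <= INT16_MAX:
        raise OperandError(f"{value} is out of range")
    return result


@dataclass
class InertialReferenceUnit:
    name: str
    running: bool = True

    def step(self, alignment_value: float, attitude: int) -> int:
        """Return the word this unit puts on the bus for one cycle."""
        if not self.running:
            return DIAGNOSTIC_WORD
        try:
            # A calculation whose result is not needed in flight.
            to_int16(alignment_value)
        except OperandError:
            # By design: an exception is treated as a unit fault,
            # so the unit shuts down and reports a diagnostic.
            self.running = False
            return DIAGNOSTIC_WORD
        return attitude


def guidance_correction(bus_word: int, expected_attitude: int) -> int:
    """The flight computer treats every bus word as attitude data."""
    return expected_attitude - bus_word


def simulate(alignment_value: float, attitude: int = 10):
    """Run one cycle with identical active and backup units."""
    active = InertialReferenceUnit("active")
    backup = InertialReferenceUnit("backup")
    words = [unit.step(alignment_value, attitude) for unit in (active, backup)]
    # Fall back to the backup when the active unit is down. Both units run
    # the same software on the same input, so the backup fails identically.
    bus_word = words[0] if active.running else words[1]
    return active.running, backup.running, guidance_correction(bus_word, attitude)
```

With an in-range input, `simulate(100.0)` leaves both units running and asks for no correction.  With `simulate(50000.0)`, both units shut down on the same input, and the diagnostic word is read as an enormous attitude error.  No single function in the sketch is "buggy" against its own stated contract; the loss comes from how the contracts were combined.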

This is not about a programming error.  It is about system-engineering and design failures.  There was not even a finding of liability, and that's important to understand as well.

How the Ariane developers regarded software defects and subsystem failures is more important.  It is also important to appreciate how the particular software came to be used under conditions it was not designed for and wasn't even required for.  These were not matters to be solved by improved debugging methods, attention to numerical-computation edge cases, using a programming language other than Ada, or changing programmer-level development methodology.  All of those, as desirable as they might be, are insufficient to have saved Ariane Flight 501. 

The 1996 Board of Inquiry nailed the important lessons and the appropriate remedies. 

It becomes interesting to consider why we are so satisfied with the myth instead of the reality:

  1. What actually happened?
  2. What did we make it mean?
  3. How do we get the most-valuable lesson?
  4. Why do we find the myth so much more preferable?

I’ll be filling in those details and speculations in updates to my Ariane 501: A Costly Risk Myth analysis.  Meanwhile, I want to put a flag on this on-going play before it becomes more-deeply chiselled into the mythos of software engineering.  The current draft has a bibliography that you can use to check my analysis. 

Also, I invite you to compare the description above with the other “History’s Worst” entries and notice how this chain of events differs from those nine other incidents.  There are similarities in the diagnostic message treated as data (a hidden failure-protocol defect) and in some other facets, but none of the others involves an exception that was intentionally allowed in an approved design.

{tags: HonorTagAdvocate Ariane501 software bugs software engineering technology myths}

Updated 2006-01-01T18:02Z: I noticed a double wording and took the opportunity to clean up some other passages in small ways, along with the addition of tags.  The main point remains.  The Ariane Flight 501 failure is a rich case study of the ways that our complex system integrations can fail, even after extensive periods of successful operation.  Its lessons apply far beyond the struggle to eradicate unanticipated computer-arithmetic-boundary (or buffer-allowance) overflows.
Again 2006-01-01T22:29Z: A miswording is corrected.

