Reliability of Shuttle Destruct System
Reliability of Shuttle Destruct System - Commission vs. Omission
Challenger Disaster - Shuttle computers
SRBs and Challenger
SRBs and Challenger - Solid Propellants and What the Computers Should Monitor
 

----------------------------------------------------------------------

Date: 28 Jan 86 14:06:00 CDT
From: "MARTIN J. MOORE" <mooremj@eglin-vax>
Subject: Reliability of Shuttle Destruct System [LONG]
To: "risks" <risks@sri-csl>
Reply-To: "MARTIN J. MOORE" <mooremj@eglin-vax>

Copyright (c) 1986 Martin J. Moore          [COMMENT: READERS -- PLEASE OBSERVE
                                             THE RESTRICTIONS ON THIS MESSAGE
                                             AT THE END OF THE MESSAGE.  PGN]

> From: Peter G. Neumann <Neumann@SRI-CSL.ARPA>
> For those of you who haven't heard, the Challenger blew up this morning...
> One unvoiced concern from the RISKS point of view is the presence on each
> shuttle of a semi-automatic self-destruct mechanism.  Hopefully that
> mechanism cannot be accidentally triggered.  [COMMENT: I did not intend
                                              to imply that as the cause --
                                              only to raise concern about the
                                              safety of such mechanisms.  PGN]

Peter, I assume that you are talking about the Range Safety Command Destruct
System, which is used to destroy errant missiles launched from Cape Canaveral.
From 1980 to 1983 I was the lead programmer/analyst on the ground portions of
that system, and I am the primary author of the software which translates the
closing of destruct switches into the RF destruct signals sent to the vehicle.
I think I can address the question of whether the system can be accidentally
triggered; worrying about that gave me nightmares off and on for months
while I was on the project.  I'd like to tell you a little about the system
and why I think the answer is No.  Note that my information is now three years
old, and some details may have changed; there may also be minor errors in
detail due to lapses in my memory, which isn't as good as my computer's!

On board the vehicle, there are five destruct receivers: one on the external
tank (ET) and two on each of the solid rocket boosters (SRBs).  There is no
receiver or destruct ordnance on the Orbiter; it is effectively just an
airplane.  The casing of each SRB is mined with HMX, a high explosive; the ET
contains a small pyrotechnic device which causes its load of liquid hydrogen
and liquid oxygen to combine and combust.  The receivers and explosives are
connected such that the receipt of four proper ARM sequences followed by
a proper FIRE sequence by any of the receivers will explode the ordnance.

The ARM sequence and FIRE sequence must come from the ground; they cannot be
generated aboard the vehicle.  These sequences are transmitted on a frequency
which is reserved, at all times, for this purpose and this purpose alone.
There are several transmitters around the Eastern Test Range which can be used
to transmit the codes.  These transmitters have a power of 10 kW (continuous
wave).  The ARM and FIRE sequences consist of thirteen tone pairs (different
for each command and changed for each launch).  There are eight possible
tones, resulting in 28 possible tone pairs; thus, there are (28^13) or
slightly over 6.5E18 possible sequences.
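
[A quick check of the arithmetic above, as a minimal Python sketch.  It
 assumes a tone pair is an unordered pair of two distinct tones drawn from
 the eight, which is what yields 28.  The variable names are illustrative
 only and do not come from the system described.]

```python
from math import comb

TONES = 8            # distinct tones available (from the text)
PAIRS_PER_SEQ = 13   # tone pairs in one ARM or FIRE sequence (from the text)

# Assumption: a tone pair is two different tones, order ignored.
tone_pairs = comb(TONES, 2)

# Each of the 13 positions can hold any one of the tone pairs.
sequences = tone_pairs ** PAIRS_PER_SEQ

print(tone_pairs)          # 28
print(f"{sequences:.2e}")  # 6.50e+18
```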

The Range Safety Officer has two switches labeled "ARM" and "DESTRUCT".
When he throws a switch, it generates an interrupt in the central processor
(there are actually two central processors running and receiving all inputs,
but only one is on-line at any time; in case of software or hardware error
the backup is switched in.  And yes, they have different power sources.)
The central program checks for the correct code on each of two different
hardware lines (the correct code is different for each line); if correct,
and all criteria are met to allow the sequence to be sent, the central program
requests the tone pairs for that sequence from another processor.  That
processor (like everything else in the system, actually redundant processors)
has only one function: to store and deliver those tone pairs.  The processor
resides in a special vault and can only be accessed in order to program the
tone pairs (which are highly classified) before each launch.  The data line
between the central processor and the storage processor is electrically
connected ONLY when the ARM or DESTRUCT switch is actually thrown; this
prevents a wild program from retrieving the tone pairs.

When the central program has retrieved the tone pairs, it formats a message
to the currently selected remote transmitter.  As the final step before
sending the message, the program checks the switch hardware one more time
to make sure the command is, in fact, requested.  If so, the program sends
the message to the site on two modems (with different power supplies and
geographically diverse communications paths) and, after sending the message,
erases the tone pairs from its memory.  The remote site, until this time,
does not know the tone pairs.  When the site receives and validates the
message, it sends a
request for confirmation back to the central processor.  When Central
receives this request, it checks the switch hardware again and retrieves a
fresh copy of the tone pairs from the storage processor to make sure that the
site got the correct tone pairs.  If all these checks pass, Central issues
a go-ahead message to the site, which then (if the message is validated)
actually transmits the sequence to the vehicle.  During this sequence of
messages, if any message fails, it is retransmitted, with a check of the
switch hardware before each transmission.
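
[The message flow described above can be sketched as a toy Python
 simulation.  This is only an illustration of the sequence of checks, not
 the actual software: every class, function, and value here (Switch,
 ToneVault, Site, send_command, the placeholder "pairs") is hypothetical,
 and the redundancy, the two hardware code lines, message validation, and
 retransmission are all left out.]

```python
from dataclasses import dataclass, field


@dataclass
class Switch:
    """Stand-in for the ARM/DESTRUCT switch hardware."""
    closed: bool = False


@dataclass
class ToneVault:
    """Stand-in for the storage processor: readable only while the switch is closed."""
    switch: Switch
    _pairs: tuple = ("pair-1", "pair-2")  # placeholder data, not real codes

    def fetch(self):
        # Models the data line being connected only when the switch is thrown.
        if not self.switch.closed:
            raise PermissionError("data line is disconnected")
        return self._pairs


@dataclass
class Site:
    """Stand-in for a remote transmitter site."""
    pending: tuple = None
    transmitted: list = field(default_factory=list)

    def receive(self, pairs):
        self.pending = pairs   # the site learns the pairs only at this point
        return pairs           # confirmation request echoes what was received

    def go_ahead(self):
        self.transmitted.append(self.pending)
        self.pending = None


def send_command(switch, vault, site):
    """Return True if the sequence was transmitted, False if a check failed."""
    if not switch.closed:          # switch must actually be thrown
        return False
    pairs = vault.fetch()
    if not switch.closed:          # final check before sending
        return False
    echoed = site.receive(pairs)
    pairs = None                   # Central erases its copy after sending
    if not switch.closed:          # re-check when confirmation is requested
        return False
    if vault.fetch() != echoed:    # compare against a fresh copy
        return False
    site.go_ahead()
    return True


switch = Switch()
vault = ToneVault(switch)
site = Site()
print(send_command(switch, vault, site))  # False
switch.closed = True
print(send_command(switch, vault, site))  # True
```

[In this sketch, as in the description, the stored pairs cannot be read at
 all unless the switch is closed, so a runaway caller gets an error rather
 than the data.]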

Let's look at some areas that could cause an accidental trigger:

1. Failure of switch hardware.  This would take at least six circuits failing
   to the "1" state, while 12 others connected to them would have to NOT fail.

2. Central software error.  There is a lot of reliability checking, details of
   which are too long to repeat here; but even if there is a hole through it,
   the central program cannot get the tone pairs unless the switch is thrown!

3. Site software error.  Doesn't have the tone pairs until sent by Central.

4. Destruct receiver failure.  I didn't work with this directly (being
   strictly on the ground side) but everything I've seen makes them look
   very reliable and fail-safe.

5. External sabotage.  A hostile agent would have to (1) steal the tone pairs,
   and (2) overpower our 10 kW CW transmitters which are saturating the
   destruct receivers with a 70 dB margin.  Alternatively, if someone tried
   to overpower the central area, I think they would fail.  Security is TIGHT
   around the central control area;  I don't think I can go into detail without
   upsetting NASA and the Air Force.

6. Internal sabotage.  One thing I did was to imagine that I was a saboteur
   and think of a way that I could program in a Trojan Horse to send a false
   command.  Eventually, the system was such that I could not do it.  NASA
   also hired an independent contractor to perform reliability analyses.
   NOBODY can send a command except the Range Safety Officer when he throws
   the switch.

The Challenger explosion was NOT caused by the Range Safety system, either
intentionally or accidentally.

I am really sorry about the length of this message, but I wanted to get all of
that in.  All information contained herein is UNOFFICIAL and furnished for
information purposes only.  It is in no way official information from my
employer (RCA), the U.S. Air Force, NASA, or any other government agency.

Due to the sensitive nature of this incident, this article is not for
reproduction or retransmittal without the express permission of the author.
Permission is hereby granted to Peter G. Neumann to include this material
in the RISKS electronic mail digest.

                                      Martin J. Moore
                                      mooremj@eglin-vax.arpa

[MARTIN: MANY THANKS FOR THIS EXTRAORDINARY MESSAGE.
 READERS: PLEASE OBSERVE THE ABOVE CAVEAT SCRUPULOUSLY.  PGNeumann]

----------------------------------------------------------------------

Received: from eglin-vax.ARPA ... Mon 17 Mar 86 17:26:49-PST
Date: 0  0 00:00:00 CDT
From: "MARTIN J. MOORE" <mooremj@eglin-vax>
Subject: Commission vs. Omission
To: "risks" <risks@sri-csl>

Dave Parnas's points regarding the shuttle destruct system are well taken.
The policy, stated informally, was that "it better work if we need it --
but it absolutely better NOT 'work' when we DON'T need it" which generated the
extreme emphasis on preventing what Dave calls "risks of commission."  I feel
that the risk of commission on the destruct system is extremely small, while
the risk of omission is somewhat higher, although still small.  During
validation testing and in every pre-launch checkout, we performed "exhaustive"
checks -- "exhaustive" meaning that we tried every combination of
  [(2 central computers) * (6 remote sites) * (2 computers per site)
      * (2 transmitters per site) * (2 comm paths to each site)
      * (2 possible commands in various sequences)].
Yeah, this takes a *LONG* time (with practice, we got it down
to several hours if everything went smoothly).  On one occasion during
validation testing, we did find a software error which only manifested on a
particular (central computer/comm path/remote computer/unusual command
sequence) combination.  Exhaustive tests *are* necessary.
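
[The size of that test matrix can be illustrated with a short Python
 sketch.  The counts come from the list above; the labels are made up, and
 treating the "2 possible commands" as ARM and DESTRUCT is an assumption.
 The sketch counts each command once per configuration, so it does not
 cover the "various sequences" of commands, which would multiply the total
 further.]

```python
from itertools import product

# Counts are from the message; all labels are hypothetical.
central_computers = ["central-A", "central-B"]
remote_sites = [f"site-{n}" for n in range(1, 7)]
site_computers = ["site-cpu-1", "site-cpu-2"]
transmitters = ["xmtr-1", "xmtr-2"]
comm_paths = ["path-1", "path-2"]
commands = ["ARM", "DESTRUCT"]  # assumed to be the two commands

# One test case per combination of every option.
test_matrix = list(product(central_computers, remote_sites, site_computers,
                           transmitters, comm_paths, commands))

print(len(test_matrix))  # 192
```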

I have often wondered why the emphasis was to prevent errors of commission
over errors of omission (not to say that we wanted either kind, but errors
of commission were definitely considered to be worse!).  An erroneous
destruct would cost the lives of the flight crew and the Orbiter itself, and
possibly cause damage on the ground if it occurred early in the flight (e.g.,
windows blown out).  An erroneous non-destruct, in the worst case (if
the ET were to detonate near the crowded spectator area on the NASA
causeway), could cause the loss of TENS OF THOUSANDS of lives.  Certainly
this is worse than an erroneous destruct.  I believe there may be a
subconscious feeling that an erroneous destruct means the difference between
a success and a disaster, while an erroneous non-destruct means the
difference between a disaster and a worse disaster.  Subjectively, that
difference is not as great as the first, although objectively it may be much
greater.
                                     Martin Moore

<The usual disclaimers.  I'm too tired to type in the whole silly thing.>

        [By the way, Dave Parnas suggested the following example to
         illustrate his message in RISKS-2.28:]

         "Consider elevators.  Consider how much easier it is to prevent the
         floor indicator from saying "13" than to assure that the floor
         indicator will always give the actual floor that the elevator is
         on.  The risk of indicating "13" can be gotten acceptably low by
         eliminating "13" from the set of indicator lights.  The risk of
         indicating an incorrect floor or not indicating the current floor
         is much harder to eliminate."  [Dave Parnas]

----------------------------------------------------------------------

Date: Tue 4 Feb 86 12:34:09-EST
From: Marc Vilain <MVILAIN@G.BBN.COM>
Subject: Shuttle computers
To: risks@SRI-CSL.ARPA

The following is excerpted from this Sunday's New York Times.  It may
be somewhat old news to some, but does a good job of summarizing much of
the evidence and arguments surrounding the Challenger's computers.

           SHUTTLE EXPERTS DOUBT COMPUTERS COULD DETECT FIRE
                           By David E. Sanger

   The computers and sensors that guided the flight of the space shuttle
Challenger appear not to have been programmed to detect flames burning
through the sides of a solid-fuel booster rocket, experts familiar with the
shuttle system said yesterday.

   Their comments came as evidence accumulated that the right-side booster
began to fail as much as 10 seconds before the explosion that destroyed the
craft, as reported yesterday in the New York Times.

   Even if the sensors had picked up the first signs of fire, safety
measures built into the system to protect the astronauts would have
prevented the shedding of the giant external fuel tank that exploded soon
after, NASA officials and the computers' designers said.

                            Only From Pilot

   That command could have come only from the pilot, and officials said they
doubted even that could have saved the crew.
   ...
   Experts who have studied the shuttle's computer system say it was not
programmed to separate the orbiter automatically from its fuel supply in
part because of the fears that faulty sensor readings could cause the
computers to abort a mission unnecessarily, risking the lives of the crew.

                      Preparation for Emergencies

   Still the possibility that there were signs of trouble as long as 10
seconds before the explosion raised some questions yesterday about the
enormously complex equipment that guides the shuttle.
   ...
   "The possibility that a booster might burn through could well have
been a failure mode that was never considered," said Alfred Spector, a
Carnegie-Mellon professor who two years ago conducted a study of the
computer system guiding the shuttle.

   NASA officials said little publicly in response to the report that
data sent from the shuttle showed a sudden drop in the power of the
right booster rocket about 10 seconds before the spacecraft exploded.

   But computer experts said the computer's response to such a power drop
may have been executed flawlessly.  The program, they said, was primarily
designed to correct for the effects of an uneven rocket thrust by swiveling
engine nozzles to the side, keeping the shuttle on course.  Sources close to
the situation say that the ground data show that the nozzles had in fact
swiveled to one side.

   In the absence of other danger signals, however, the computer would not
have searched for the cause of the power loss.  And the initial signals
apparently indicated only a 4 percent decrease in thrust, a figure that the
computer, or the cabin crew and officials at the Johnson Space Center in
Houston, may have judged did not indicate a serious problem.
   ...
   [End of excerpt]

----------------------------------------------------------------------

To: risks@sri-csl.arpa
cc: space@s1-b.arpa
Subject: SRBs and Challenger
Date: Mon, 03 Feb 86 21:06:59 -0800
From: Mike Iglesias <iglesias@UCI.EDU>

According to this morning's LA Times:

 - Early shuttle flights had sensors on the SRBs to monitor performance,
   but they were removed to save weight when it was felt that the SRBs
   were performing well.  The sensors monitored pressure, temperature
   and vibration in the SRBs.

 - Two Rockwell officials familiar with the NASA inquiry said that NASA
   data shows that the 3 main engines experienced a power loss just
   before the explosion.  The power loss was noted between one-tenth and
   one one-hundredth of a second before the explosion.  The SRB that
   probably caused the explosion suffered a 3% loss of power (about
   100,000 pounds of thrust) seconds before.

 - NASA noted that even if there were sensors on the SRBs, little can
   be done to save the crew if there is a problem during the first 2
   minutes of the flight.  They might be able to jettison the SRBs,
   but it would be difficult to stay clear of them and the external
   tank.  And another NASA spokesman said later that the crews don't
   train for that maneuver, and that NASA documents state that such
   an escape is possible only after the SRBs have completed firing.
   The shuttle would have a near-impossible task of ditching in the
   ocean if it was able to steer clear of the SRBs and the ET.

 - Other Rockwell sources said that telemetry shows that the external
   tank experienced an increase in pressure in both the oxygen and
   hydrogen tanks, and that pressure relief valves in both tanks
   popped to decrease some of the pressure.
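
[A small arithmetic aside on the second item above, as a Python sketch: if
 a 3% loss corresponds to about 100,000 pounds of thrust, the two reported
 figures together imply a total of roughly 3.3 million pounds for that SRB.
 This is only a consistency check on the numbers as reported, not a figure
 taken from the article.]

```python
loss_fraction = 0.03   # 3% loss of power, as reported
loss_lbf = 100_000     # about 100,000 pounds of thrust, as reported

# Total thrust implied by the two reported figures.
implied_total = loss_lbf / loss_fraction

print(f"{implied_total:,.0f}")  # 3,333,333
```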

Could the crew have survived had they known about the problem?  Who knows?
Maybe, if they had known about the SRB problem in time, if they had been
able to get away from the SRBs and the ET, and been able to ditch successfully
in the ocean.  That's a lot of ifs...

I wonder if NASA is going to think twice about removing sensors after this...

Mike Iglesias
University of California, Irvine

----------------------------------------------------------------------

Date: Mon, 3 Feb 86 07:14:54 pst
From: malloy@nprdc.arpa (Sean Malloy)
To: RISKS@SRI-CSL
Subject: Solid Propellants and What the Computers Should Monitor

 >Date: Sun, 2 Feb 86 14:08:17 est
 >From: mikemcl@nrl-csr (Mike McLaughlin)
 >Subject:  SRBs and What the Computers Should Monitor

 >Another result could be that the errant jet impinged on the main fuel tank,
 >heating, penetrating, and igniting the fuel load. (It might be able to ignite
 >it without penetrating the tank structure.)  *This should be quickly
 >detectable by excursions in tank pressure.*  Reaction times, even of
 >computers, might not be fast enough to make any difference in the outcome.

 >I believe that both of the above could have been detected with instrumentation
 >that was certainly on board.  Additional (or existing?) instrumentation could
 >detect temperature changes in SRB and fuel tank skins, torques on SRB mounts,
 >abnormal "seismic" vibrations within the SRB structure, abnormal "plumes",
 >etc.

One of the points that was brought up during the broadcasts the day of the
disaster was that the telemetry tapes were going to have to be analyzed to
determine if there was any indication as to what happened.  The temperature
data for the external tank was specifically mentioned as one of the
telemetry streams that was NOT fed to a display in either the launch control
area or Mission Control. The NASA spokesman explained that there was so much
information coming in that a decision had to be made to limit what the
launch control personnel had to pay attention to.

This brings up a much more subtle problem in risk evaluation -- what data is
considered relevant to the task at hand? A line has to be drawn between
significant and extraneous data, based on the processing capacity of the
system/personnel interpreting the data. NASA had decided that the ET
temperatures were not of immediate use to the launch control personnel, and
simply recorded the data.  In the previous 24 shuttle launches, they were
right; in this case, they were wrong. In the future, they probably will have
someone monitoring that data. What also has to be considered in the decision
is what can be done on the basis of a given stream of data. I don't know how
long the ET temperatures would have been elevated before the explosion, so I
don't know whether there would have been time to recognize the problem,
identify the source, and jettison the SRBs. If you can show that there won't
be enough time to react properly, then giving someone responsibility for
making the right decision in that situation is asking someone to volunteer
to have a nervous breakdown.

In retrospect, there should have been immediate scrutiny of the SRB
performance. Looking at the pictures of the exhaust trails after the
explosion, one of the SRBs is looping away from the blast apparently
undamaged, while the trail from the other proceeds straight for a
short distance, then peters out abruptly. Why would one survive
unscathed while the other one was badly damaged unless something
happened with or adjacent to the SRB? Hindsight is always 20/20.

 Sean Malloy
 (malloy@nprdc.arpa)

----------------------------------------------------------------------