----------------------------------------------------------------------
Date: 28 Jan 86 14:06:00 CDT
From: "MARTIN J. MOORE" <mooremj@eglin-vax>
Subject: Reliability of Shuttle Destruct System [LONG]
To: "risks" <risks@sri-csl>
Reply-To: "MARTIN J. MOORE" <mooremj@eglin-vax>
Copyright (c) 1986 Martin J. Moore
   [COMMENT: READERS -- PLEASE OBSERVE THE RESTRICTIONS ON THIS MESSAGE
   AT THE END OF THE MESSAGE. PGN]
> From: Peter G. Neumann <Neumann@SRI-CSL.ARPA>
> For those of you who haven't heard, the Challenger blew up this morning...
> One unvoiced concern from the RISKS point of view is the presence on each
> shuttle of a semi-automatic self-destruct mechanism. Hopefully that
> mechanism cannot be accidentally triggered.
   [COMMENT: I did not intend to imply that as the cause -- only to raise
   concern about the safety of such mechanisms. PGN]
Peter, I assume that you are talking about the Range Safety Command Destruct
System, which is used to destroy errant missiles launched from Cape Canaveral.
From 1980 to 1983 I was the lead programmer/analyst on the ground portions of
that system, and I am the primary author of the software which translates the
closing of destruct switches into the RF destruct signals sent to the vehicle.
I think I can address the question of whether the system can be accidentally
triggered; worrying about that gave me nightmares off and on for months
while I was on the project. I'd like to tell you a little about the system
and why I think the answer is No. Note that my information is now three years
old, and some details may have changed; there may also be minor errors in
detail due to lapses in my memory, which isn't as good as my computer's!
On board the vehicle, there are five destruct receivers: one on the external
tank (ET) and two on each of the solid rocket boosters (SRBs). There is no
receiver or destruct ordnance on the Orbiter; it is effectively just an
airplane. The casing of each SRB is mined with HMX, a high explosive; the ET
contains a small pyrotechnic device which causes its load of liquid hydrogen
and liquid oxygen to combine and combust. The receivers and explosives are
connected such that the receipt of four proper ARM sequences followed by
a proper FIRE sequence by any of the receivers will explode the ordnance.
The ARM sequence and FIRE sequence must come from the ground; they cannot be
generated aboard the vehicle. These sequences are transmitted on a frequency
which is reserved, at all times, for this purpose and this purpose alone.
There are several transmitters around the Eastern Test Range which can be used
to transmit the codes. These transmitters have a power of 10 kW (continuous
wave). The ARM and FIRE sequences consist of thirteen tone pairs (different
for each command and changed for each launch). There are eight possible
tones, resulting in 28 possible tone pairs; thus, there are (28^13) or
slightly over 6.5E18 possible sequences, only one of which is correct for a
given command on a given launch.
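The arithmetic above can be checked in a few lines. This is only a sketch: it assumes that a tone pair is an unordered choice of two distinct tones out of the eight (which is what yields the 28 quoted above) and that each of the thirteen positions is chosen independently. The variable names are illustrative and do not come from the actual system.

```python
from math import comb

TONES = 8            # distinct tones available
PAIRS_PER_SEQ = 13   # tone pairs in one ARM or FIRE sequence

# Assumption: a pair is two distinct tones, order irrelevant.
tone_pairs = comb(TONES, 2)

# Each of the 13 positions can hold any of the 28 pairs.
sequences = tone_pairs ** PAIRS_PER_SEQ

print(tone_pairs)          # 28
print(f"{sequences:.2e}")  # 6.50e+18
```

A blind guess at one launch's ARM sequence would therefore succeed with probability of roughly 1.5E-19 per attempt, before any of the other safeguards are considered.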
The Range Safety Officer has two switches labeled "ARM" and "DESTRUCT".
When he throws a switch, it generates an interrupt in the central processor
(there are actually two central processors running and receiving all inputs,
but only one is on-line at any time; in case of software or hardware error
the backup is switched in. And yes, they have different power sources.)
The central program checks for the correct code on each of two different
hardware lines (the correct code is different for each line); if correct,
and all criteria are met to allow the sequence to be sent, the central program
requests the tone pairs for that sequence from another processor. That
processor (like everything else in the system, actually redundant processors)
has only one function: to store and deliver those tone pairs. The processor
resides in a special vault and can only be accessed in order to program the
tone pairs (which are highly classified) before each launch. The data line
between the central processor and the storage processor is electrically
connected ONLY when the ARM or DESTRUCT switch is actually thrown; this
prevents a wild program from retrieving the tone pairs.
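The interlock idea in the last sentence can be imitated in software, although the real protection described above is electrical, not a program check. The sketch below is purely illustrative: the class `StorageProcessor`, the `line_connected` callable and the placeholder tone-pair values are all hypothetical names invented for this example.

```python
class StorageProcessor:
    """Holds the tone pairs; answers only while the data line is connected."""

    def __init__(self, tone_pairs, line_connected):
        self._tone_pairs = dict(tone_pairs)
        # Callable standing in for the electrical state of the data line.
        self._line_connected = line_connected

    def fetch(self, command):
        # The real interlock is a physical connection; this check only imitates it.
        if not self._line_connected():
            raise ConnectionError("data line open: no switch is thrown")
        return self._tone_pairs[command]


switches = {"ARM": False, "DESTRUCT": False}
storage = StorageProcessor(
    {"ARM": ["placeholder-arm"], "DESTRUCT": ["placeholder-destruct"]},
    lambda: any(switches.values()),  # line is closed only while a switch is thrown
)

# A "wild program" asking for the tone pairs with no switch thrown gets nothing.
try:
    storage.fetch("ARM")
    blocked = False
except ConnectionError:
    blocked = True

switches["ARM"] = True            # the Range Safety Officer throws ARM
arm_pairs = storage.fetch("ARM")  # now the request can be served
```

The point of the design is that the gate lives outside the program that might misbehave, so a software fault in Central cannot open it.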
When the central program has retrieved the tone pairs, it formats a message
to the currently selected remote transmitter. As the final step before
sending the message, the program checks the switch hardware one more time
to make sure the command is, in fact, requested. If so, the program sends
the message to the site on two modems (with different power supplies and
geographically diverse communications paths) and, after sending the message,
erases the tone pairs from its memory.
The remote site, until this time, does not know the tone pairs. When
the site receives and validates the message, it sends a request for
confirmation back to the central processor. When Central receives this
request, it checks the switch hardware again and retrieves a fresh copy of
the tone pairs from the storage processor to make sure that the site got the
correct tone pairs. If all these checks pass, Central issues a go-ahead
message to the site, which then (if the message is validated) actually
transmits the sequence to the vehicle. During this sequence of messages, if
any message fails, it is retransmitted, with a check of the switch hardware
before each transmission.
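The message exchange described above can be summarized as a short sketch. It is a simplification under stated assumptions, not the actual software: it assumes the site's confirmation request echoes back the tone pairs it received so that Central can compare them with a fresh copy, it leaves out retransmission, message validation and the redundant hardware, and every name in it (`send_command`, `Site`, `PAIRS`) is hypothetical.

```python
class Site:
    """Stand-in for a remote transmitter site."""

    def __init__(self):
        self.command = None
        self.pairs = None
        self.transmitted = []

    def receive(self, command, pairs):
        self.command, self.pairs = command, list(pairs)

    def request_confirmation(self):
        # Assumption: the request carries what the site received.
        return self.pairs

    def go_ahead(self):
        self.transmitted.append(self.command)


def send_command(command, switch_thrown, fetch_pairs, site):
    """Return True if the site was told to transmit, False otherwise."""
    if not switch_thrown(command):
        return False
    pairs = fetch_pairs(command)

    # Final check of the switch hardware before the message leaves Central.
    if not switch_thrown(command):
        return False
    site.receive(command, pairs)
    del pairs  # Central erases its copy after sending

    # Site asks for confirmation; Central re-checks the switch and re-fetches.
    echoed = site.request_confirmation()
    if not switch_thrown(command):
        return False
    if echoed != fetch_pairs(command):
        return False

    site.go_ahead()
    return True


PAIRS = {"ARM": ["a1", "a2"], "DESTRUCT": ["d1", "d2"]}
site = Site()
sent_idle = send_command("ARM", lambda c: False, PAIRS.__getitem__, site)
sent_live = send_command("ARM", lambda c: True, PAIRS.__getitem__, site)
print(sent_idle, sent_live, site.transmitted)  # False True ['ARM']
```

Note that the switch is consulted three times in this sketch, once per step at which Central acts, which mirrors the repeated hardware checks in the description.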
Let's look at some areas that could cause an accidental trigger:
1. Failure of switch hardware. This would take at least six circuits failing
   to the "1" state, while 12 others connected to them would have to NOT fail.
2. Central software error. There is a lot of reliability checking, details of
   which are too long to repeat here; but even if there is a hole through it,
   the central program cannot get the tone pairs unless the switch is thrown!
3. Site software error. Doesn't have the tone pairs until sent by Central.
4. Destruct receiver failure. I didn't work with this directly (being
   strictly on the ground side) but everything I've seen makes them look
   very reliable and fail-safe.
5. External sabotage. A hostile agent would have to (1) steal the tone pairs,
   and (2) overpower our 10 kW CW transmitters which are saturating the
   destruct receivers with a 70 dB margin. Alternatively, if someone tried
   to overpower the central area, I think they would fail. Security is TIGHT
   around the central control area; I don't think I can go into detail without
   upsetting NASA and the Air Force.
6. Internal sabotage. One thing I did was to imagine that I was a saboteur
   and think of a way that I could program in a Trojan Horse to send a false
   command. Eventually, the system was such that I could not do it. NASA
   also hired an independent contractor to perform reliability analyses.
   NOBODY can send a command except the Range Safety Officer when he throws
   the switch.
The Challenger explosion was NOT caused by the Range Safety system, either
intentionally or accidentally.
I am really sorry about
the length of this message, but I wanted to get all of
that in. All information
contained herein is UNOFFICIAL and furnished for
information purposes
only. It is in no way official information from my
employer (RCA), the
U.S. Air Force, NASA, or any other government agency.
Due to the sensitive
nature of this incident, this article is not for
reproduction or retransmittal
without the express permission of the author.
Permission is hereby
granted to Peter G. Neumann to include this material
in the RISKS electronic
mail digest.
Martin J. Moore
mooremj@eglin-vax.arpa
[MARTIN: MANY THANKS
FOR THIS EXTRAORDINARY MESSAGE.
READERS: PLEASE
OBSERVE THE ABOVE CAVEAT SCRUPULOUSLY. PGNeumann]
----------------------------------------------------------------------
----------------------------------------------------------------------
Received: from eglin-vax.ARPA ... Mon 17 Mar 86 17:26:49-PST
Date: 0 0 00:00:00 CDT
From: "MARTIN J. MOORE" <mooremj@eglin-vax>
Subject: Commission vs. Omission
To: "risks" <risks@sri-csl>
Dave Parnas's points
regarding the shuttle destruct system are well taken.
The policy, stated informally,
was that "it better work if we need it --
but it absolutely better
NOT 'work' when we DON'T need it" which generated the
extreme emphasis on
preventing what Dave calls "risks of commission." I feel
that the risk of commission
on the destruct system is extremely small, while
the risk of omission
is somewhat higher, although still small. During
validation testing and
in every pre-launch checkout, we performed "exhaustive"
checks -- "exhaustive"
meaning that we tried every combination of
[(2 central computers)
* (6 remote sites) * (2 computers per site)
* (2 transmitters per site) * (2 comm paths to each site)
* (2 possible commands in various sequences)].
Yeah, this takes a *LONG* time (with practice, we got it down to several
hours if everything went smoothly). On one occasion during validation
testing, we did find a software error which only manifested on a particular
(central computer/comm path/remote computer/unusual command sequence)
combination. Exhaustive tests *are* necessary.
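The size of that test matrix is easy to enumerate. The following sketch builds every hardware configuration from the counts quoted above; the component labels are made up for the example, and the four command orderings are only an assumption standing in for "various sequences", whose real list is not given here.

```python
from itertools import product

central_computers = ["central-A", "central-B"]
remote_sites = [f"site-{n}" for n in range(1, 7)]
site_computers = ["computer-1", "computer-2"]
transmitters = ["tx-1", "tx-2"]
comm_paths = ["path-1", "path-2"]

# Assumption: a handful of orderings stands in for "various sequences".
command_sequences = [
    ("ARM",),
    ("ARM", "DESTRUCT"),
    ("DESTRUCT",),
    ("DESTRUCT", "ARM"),
]

# Every hardware combination: 2 * 6 * 2 * 2 * 2.
configurations = list(
    product(central_computers, remote_sites, site_computers,
            transmitters, comm_paths)
)
# Every hardware combination paired with every command ordering.
test_cases = list(product(configurations, command_sequences))

print(len(configurations))  # 96
print(len(test_cases))      # 384
```

Even with only four orderings the matrix runs to several hundred cases, which is consistent with a checkout measured in hours, and a fault tied to one specific combination would be found only by a run that actually visits that combination.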
I have often wondered
why the emphasis was to prevent errors of commission
over errors of omission
(not to say that we wanted either kind, but errors
of commission were definitely
considered to be worse!). An erroneous
destruct would cost the lives of the flight crew and the Orbiter, and
possibly cause damage on the ground if it occurred early in the flight
(e.g., windows blown out).
An erroneous non-destruct, in the worst case (if
the ET were to detonate
near the crowded spectator area on the NASA
causeway), could cause
the loss of TENS OF THOUSANDS of lives. Certainly
this is worse than an
erroneous destruct. I believe there may be a
subconscious feeling
that an erroneous destruct means the difference between
a success and a disaster,
while an erroneous non-destruct means the
difference between a
disaster and a worse disaster. Subjectively, that
difference is not as
great as the first, although objectively it may be much
greater.
Martin Moore
<The usual disclaimers. I'm too tired to type in the whole silly thing.>
[By the way, Dave Parnas suggested the following example to
illustrate his message in RISKS-2.28:]
"Consider elevators. Consider how much easier it is to prevent the
floor indicator from saying "13" than to assure that the floor
indicator will always give the actual floor that the elevator is
on. The risk of indicating "13" can be gotten acceptably low by
eliminating "13" from the set of indicator lights. The risk of
indicating an incorrect floor or not indicating the current floor
is much harder to eliminate." [Dave Parnas]
----------------------------------------------------------------------
----------------------------------------------------------------------
Date: Tue 4 Feb 86 12:34:09-EST
From: Marc Vilain <MVILAIN@G.BBN.COM>
Subject: Shuttle computers
To: risks@SRI-CSL.ARPA
The following is excerpted
from this Sunday's New York Times. It may
be somewhat old news
to some, but does a good job of summarizing much of
the evidence and arguments
surrounding the Challenger's computers.
SHUTTLE EXPERTS DOUBT COMPUTERS COULD DETECT FIRE
By David E. Sanger
The computers and sensors that guided the flight of the space shuttle
Challenger appear not to have been programmed to detect flames burning
through the sides of a solid-fuel booster rocket, experts familiar with the
shuttle system said yesterday.
Their comments
came as evidence accumulated that the right-side booster
began to fail as much
as 10 seconds before the explosion that destroyed the
craft, as reported yesterday
in the New York Times.
Even if
the sensors had picked up the first signs of fire, safety
measures built into
the system to protect the astronauts would have
prevented the shedding
of the giant external fuel tank that exploded soon
after, NASA officials
and the computers' designers said.
Only From Pilot
That command
could have come only from the pilot, and officials said they
doubted even that could
have saved the crew.
...
Experts
who have studied the shuttle's computer system say it was not
programmed to separate
the orbiter automatically from its fuel supply in
part because of the
fears that faulty sensor readings could cause the
computers to abort a
mission unnecessarily, risking the lives of the crew.
Preparation for Emergencies
Still the
possibility that there were signs of trouble as long as 10
seconds before the explosion
raised some questions yesterday about the
enormously complex equipment
that guides the shuttle.
...
"The possibility
that a booster might burn through could well have
been a failure mode
that was never considered," said Alfred Spector, a
Carnegie-Mellon professor
who two years ago conducted a study of the
computer system guiding
the shuttle.
NASA officials
said little publicly in response to the report that
data sent from the shuttle
showed a sudden drop in the power of the
right booster rocket
about 10 seconds before the spacecraft exploded.
But computer
experts said the computer's response to such a power drop
may have been executed
flawlessly. The program, they said, was primarily
designed to correct
for the effects of an uneven rocket thrust by swiveling
engine nozzles to the
side, keeping the shuttle on course. Sources close to
the situation say that
the ground data show that the nozzles had in fact
swiveled to one side.
In the absence
of other danger signals, however, the computer would not
have searched for the
cause of the power loss. And the initial signals
apparently indicated
only a 4 percent decrease in thrust, a figure that the
computer, or the cabin
crew and officials at the Johnson Space Center in
Houston, may have judged
did not indicate a serious problem.
...
[End of
excerpt]
----------------------------------------------------------------------
----------------------------------------------------------------------
To: risks@sri-csl.arpa
cc: space@s1-b.arpa
Subject: SRBs and Challenger
Date: Mon, 03 Feb 86 21:06:59 -0800
From: Mike Iglesias <iglesias@UCI.EDU>
According to this morning's LA Times:
- Early shuttle flights had sensors on the SRBs to monitor performance,
  but they were removed to save weight when it was felt that the SRBs
  were performing well. The sensors monitored pressure, temperature
  and vibration in the SRBs.
- Two Rockwell officials familiar with the NASA inquiry said that NASA
  data shows that the 3 main engines experienced a power loss just
  before the explosion. The power loss was noted between one-tenth and
  one one-hundredth of a second before the explosion. The SRB that
  probably caused the explosion suffered a 3% loss of power (about
  100,000 pounds of thrust) seconds before.
- NASA noted that even if there were sensors on the SRBs, little can
  be done to save the crew if there is a problem during the first 2
  minutes of the flight. They might be able to jettison the SRBs,
  but it would be difficult to stay clear of them and the external
  tank. And another NASA spokesman said later that the crews don't
  train for that maneuver, and that NASA documents state that such
  an escape is possible only after the SRBs have completed firing.
  The shuttle would have a near-impossible task of ditching in the
  ocean if it was able to steer clear of the SRBs and the ET.
- Other Rockwell sources said that telemetry shows that the external
  tank experienced an increase in pressure in both the oxygen and
  hydrogen tanks, and that pressure relief valves in both tanks
  popped to decrease some of the pressure.
Could the crew have survived
had they known about the problem? Who knows?
Maybe, if they had known
about the SRB problem in time, if they had been
able to get away from
the SRBs and the ET, and been able to ditch successfully
in the ocean.
That's a lot of ifs...
I wonder if NASA is going to think twice about removing sensors after this...
Mike Iglesias
University of California,
Irvine
----------------------------------------------------------------------
----------------------------------------------------------------------
Date: Mon, 3 Feb 86 07:14:54 pst
From: malloy@nprdc.arpa (Sean Malloy)
To: RISKS@SRI-CSL
Subject: Solid Propellants and What the Computers Should Monitor
>Date: Sun, 2 Feb 86 14:08:17 est
>From: mikemcl@nrl-csr (Mike McLaughlin)
>Subject: SRBs and What the Computers Should Monitor
>Another result could be that the errant jet impinged on the main fuel tank,
>heating, penetrating, and igniting the fuel load. (It might be able to ignite
>it without penetrating the tank structure.) *This should be quickly
>detectable by excursions in tank pressure.* Reaction times, even of computers,
>might not be fast enough to make any difference in the outcome.
>I believe that both of the above could have been detected with instrumentation
>that was certainly on board. Additional (or existing?) instrumentation could
>detect temperature changes in SRB and fuel tank skins, torques on SRB mounts,
>abnormal "seismic" vibrations within the SRB structure, abnormal "plumes",
>etc.
One of the points that
was brought up during the broadcasts the day of the
disaster was that the
telemetry tapes were going to have to be analyzed to
determine if there was
any indication as to what happened. The temperature
data for the external
tank was specifically mentioned as one of the
telemetry streams that
was NOT fed to a display in either the launch control
area or Mission Control.
The NASA spokesman explained that there was so much
information coming in
that a decision had to be made to limit what the
launch control personnel
had to pay attention to.
This brings up a much
more subtle problem in risk evaluation -- what data is
considered relevant
to the task at hand? A line has to be drawn between
significant and extraneous
data, based on the processing capacity of the
system/personnel interpreting
the data. NASA had decided that the ET
temperatures were not
of immediate use to the launch control personnel, and
simply recorded the
data. In the previous 24 shuttle launches, they were
right; in this case,
they were wrong. In the future, they probably will have
someone monitoring that
data. What also has to be considered in the decision
is what can be done
on the basis of a given stream of data. I don't know how
long the ET temperatures
would have been elevated before the explosion, so I
don't know whether there
would have been time to recognize the problem,
identify the source,
and jettison the SRBs. If you can show that there won't
be enough time to react
properly, then giving someone responsibility for
making the right decision
in that situation is asking someone to volunteer
to have a nervous breakdown.
In retrospect, there
should have been immediate scrutiny of the SRB
performance. Looking
at the pictures of the exhaust trails after the
explosion, one of the
SRBs is looping away from the blast apparently
undamaged, while the
trail from the other proceeds straight for a
short distance, then
peters out abruptly. Why would one survive
unscathed while the
other one was badly damaged unless something
happened with or adjacent
to the SRB? Hindsight is always 20/20.
Sean Malloy
(malloy@nprdc-arpa)
----------------------------------------------------------------------