Cybernetic Entomology: Clock Drift and Coordinated Action
Hard real-time systems are called hard for a reason. It is not easy to get them right.
In 1990 and 1991, US soldiers deployed as part of Operation Desert Shield relied for their lives on protection from Scud missiles provided by Patriot anti-missile batteries. While in almost all respects the Patriot system is far more advanced than the Scud and entirely capable of providing the required protection, it was designed as a mobile anti-aircraft system and, in Desert Shield, was deployed to defend stationary installations against ballistic missiles. The system failed at Dhahran, Saudi Arabia, on February 25, 1991, when a Patriot battery failed to track and destroy an incoming Scud missile. The missile hit a barracks and 28 soldiers lost their lives.
How did this happen?
Clock drift in a US MIM-104 Patriot battery caused its internal clock to lose about 1/3 of a second during 100 hours of operation. When a radar warned it of an incoming Scud missile, the Patriot battery looked at the part of the sky where the missile had been 1/3 of a second previously, found no missile, and discarded the radar information as an error. It failed to launch an interceptor and the incoming missile got through.
But wait a minute. Clock drift, in mil-spec hardware? Mil-spec hardware is manufactured to high standards of reliability and robustness; you wouldn’t expect clock drift to be an issue. As it turns out, this clock drift was a programming error. The Patriot system keeps its time as a 24-bit fixed-point number, and the clock was updated ten times a second. One-tenth has no exact finite representation in binary, just as one-third has none in decimal. The number that was actually being added to the count of seconds ten times a second was 209715/2097152, which was as close as they could get within the constraints of their representation. The radar used a different time system; it had its time constantly updated from a GPS satellite configuration.
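To see where that constant comes from, here is a minimal sketch using exact rational arithmetic. The 21 fractional bits are an assumption inferred from the denominator 2097152 = 2**21 quoted above; how the real 24-bit word was divided is not stated here. The helper name `truncate_to_fixed_point` is made up for illustration.

```python
from fractions import Fraction

# Assumption: 21 fractional bits, inferred from the denominator 2097152 = 2**21.
FRACTION_BITS = 21
SCALE = 2 ** FRACTION_BITS


def truncate_to_fixed_point(value: Fraction, scale: int = SCALE) -> Fraction:
    """Chop value down to the largest multiple of 1/scale that does not exceed it."""
    return Fraction(value.numerator * scale // value.denominator, scale)


tenth = Fraction(1, 10)
stored = truncate_to_fixed_point(tenth)   # what the clock actually adds per tick
error_per_tick = tenth - stored           # how far each tick falls short

print("stored increment:", stored)                         # 209715/2097152
print("shortfall per tick:", float(error_per_tick), "s")
print("shortfall per second:", error_per_tick * 10)        # ten ticks per second
```

The shortfall per tick is tiny, roughly a tenth of a microsecond, which is why it is easy to dismiss until it is multiplied by millions of ticks.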
This difference hadn’t affected operations before that point because it is small and the Patriot’s time was set every time the system came up. But in Dhahran, the soldiers were under constant threat for an extended period and had kept their Patriot battery up for four full days, which meant it had been four full days since the clock was set. Each tick fell short of a true tenth of a second by 1/10 - 209715/2097152 = 0.2/2097152 seconds, so at ten ticks per second the Patriot system lost 2/2097152 seconds every second, even when its hardware was functioning flawlessly. After 100 hours (360,000 seconds), that adds up to 720000/2097152 seconds, or just over 1/3 of a second. So after 100 hours of continuous operation, the next incoming Scud got a free pass.
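The accumulation can be reproduced in a few lines. This is only a sketch of the arithmetic in the paragraph above, not a model of the real software; the function name `drift_after` is hypothetical.

```python
from fractions import Fraction

TICK_INCREMENT = Fraction(209715, 2097152)  # value added per tick, from the text
TICKS_PER_SECOND = 10


def drift_after(hours: int) -> Fraction:
    """Seconds by which the truncated clock lags true time after `hours` of uptime."""
    true_seconds = hours * 3600
    counted_seconds = true_seconds * TICKS_PER_SECOND * TICK_INCREMENT
    return true_seconds - counted_seconds


for hours in (1, 8, 24, 100):
    print(f"{hours:3d} h uptime: clock is behind by {float(drift_after(hours)):.4f} s")

# After 100 hours the lag is 720000/2097152 s, about 0.3433 s.
```

Because the drift grows linearly with uptime, a battery that is rebooted every few hours never accumulates enough lag to matter, which matches the account above.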
The Patriot was originally designed as an anti-aircraft system, and aircraft are usually a lot slower than missiles. A slow target moves only a short distance during a small clock discrepancy, so such discrepancies would not usually cause the Patriot battery to lose track of its target entirely. The Patriot was also designed for mobile deployments in which it was important to evade detection and in which it would not have been in operation for more than a few hours at a time. Had it been used for the purpose for which it was designed, this would not have been a problem.
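The effect of target speed is simple multiplication: the battery looks where the target was one clock error ago, so the miss distance is speed times clock error. The sketch below uses the 100-hour drift from the text; the two speeds are illustrative round numbers chosen for this example, not figures from the article or from any specification.

```python
CLOCK_ERROR_S = 720000 / 2097152  # drift after 100 hours, from the text

# Hypothetical target speeds in metres per second (assumptions for illustration only).
ASSUMED_SPEEDS_M_PER_S = {
    "subsonic aircraft": 250.0,
    "ballistic missile": 1700.0,
}


def position_error_m(speed_m_per_s: float, clock_error_s: float = CLOCK_ERROR_S) -> float:
    """Distance a target travels during the clock error."""
    return speed_m_per_s * clock_error_s


for name, speed in ASSUMED_SPEEDS_M_PER_S.items():
    print(f"{name}: looking about {position_error_m(speed):.0f} m away from the target")
```

Under these assumed speeds the same clock error puts the search window tens of metres off for the aircraft and several hundred metres off for the missile.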
But when there’s a need in the field and something is the best thing available to fill that need, it gets used in that capacity. In Desert Shield the need was to protect civilian targets and non-mobile military installations from Scuds, and at Dhahran, Saudi Arabia, on February 25, 1991, the system had been in operation for four days.
Once the Patriot system was deployed in this drastically different capacity, its shortcomings were noticed, not by American forces, who were entirely familiar with the Patriot and had accepted its behavior as normal, but by Israeli forces, to whom it was issued as part of Desert Shield. The Israelis, being less familiar with the system, paid more attention to it. In particular, they hooked up their data recorders and observed statistical patterns. Just two weeks before the incident, a bug report had been filed based on Israeli data that showed inaccuracy as a function of time since boot. The problem was analyzed and a software update was made available. But it’s difficult to arrange transportation and software distribution in a combat zone, and the update reached Dhahran literally one day too late.
Checklist items:
- Software patches when needed are good – but not as good as getting it right.
- When deploying something in a capacity different from that for which it was designed, pay special attention to bug reports filed by new users. They are more likely to notice new problems.
- When different parts of a system need a precise agreement about something (in this case time), make sure that they are setting it from the same source and updating it from the same source or at least in the same way.
- If different parts of the system are not updating from the actual same source, then make sure that there is a feedback channel between them that forces adjustments to bring them back into agreement.
- When dealing with fractions, you have to pay attention to numerical methods. Numerical methods books are mostly about round-off errors and techniques to combat them or minimize their effects. In this case, basic numerical methods would have told you to do one of these things:
  - Update 8 or 16 times per second rather than 10, since 1/8 and 1/16 are exact in binary.
  - Represent time as a whole number of units that divide a tenth of a second evenly, such as a count of ticks or of milliseconds.
  - Add a small adjustment every tenth tick (here, 2/2097152 of a second).
  - Or use a wider representation for time, depending on the degree of acceptable error. Nobody should ever have the job of explaining “degree of acceptable error” to 28 grieving families, but acceptable, in these terms, would be clock drift of a tenth of a second or so over an entire year of uptime. That could have been achieved by using a 32-bit rather than a 24-bit counter.
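The numeric remedies in the checklist can be compared with exact arithmetic. This is a sketch under stated assumptions: the clock is modelled as a fixed-point counter that adds a truncated per-tick increment, 21 fractional bits stand in for the 24-bit clock (inferred from the denominator 2097152), and the 32-bit variant is assumed to spend all 8 extra bits on the fraction (29 fractional bits). The function `drift` and its parameters are hypothetical names. The remaining remedy, counting whole ticks as an integer, is exact by construction and needs no simulation.

```python
from fractions import Fraction


def drift(hours: int, ticks_per_second: int, fraction_bits: int,
          compensate: bool = False) -> Fraction:
    """Seconds lost by a fixed-point clock that adds a truncated increment each tick."""
    scale = 2 ** fraction_bits
    step = scale // ticks_per_second             # truncated per-tick increment
    seconds = hours * 3600
    raw = seconds * ticks_per_second * step      # total accumulated by repeated addition
    if compensate:
        # Once per second (every tenth tick at 10 Hz), add back what truncation dropped.
        raw += seconds * (scale - ticks_per_second * step)
    return seconds - Fraction(raw, scale)


YEAR_HOURS = 24 * 365

print("10 Hz, 21 bits, 100 h:      ", float(drift(100, 10, 21)))
print("8 Hz, 21 bits, 100 h:       ", float(drift(100, 8, 21)))
print("16 Hz, 21 bits, 100 h:      ", float(drift(100, 16, 21)))
print("10 Hz + adjustment, 100 h:  ", float(drift(100, 10, 21, compensate=True)))
print("10 Hz, 29 bits, one year:   ", float(drift(YEAR_HOURS, 10, 29)))
```

Under these assumptions the power-of-two tick rates and the periodic adjustment remove the drift entirely, while the wider fraction only shrinks it, to a little over a tenth of a second per year.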