Cybernetic Entomology: Failover that Fails.
By 1990, AT&T had deployed a new kind of telephone switch that sent out a message on a control channel when it failed. A switch receiving that message would take over the failed switch’s load.
This worked reasonably well when switches failed for other reasons, but the design had not been thought through for the case in which a switch failed because it was overloaded.
Naturally, a switch attempting to take over the load of an overloaded switch would also overload. To this day no one knows which of these switches was the first to experience overload, but on January 15, 1990, the entire AT&T long-distance system went down in a cascading failure, ultimately overloaded by its own overload messages. It had to be restarted completely from zero.
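The cascade is easy to reproduce in a toy model. The sketch below is not a model of AT&T's actual switching software: the function name, the ring-order handoff, and the load and capacity numbers are all assumptions invented for illustration. It only shows why handing a failed switch's whole load to a neighbor is harmless after an ordinary crash and fatal after an overload.

```python
def simulate_naive_failover(loads, capacity, crashed=()):
    """Return the order in which switches fail under naive failover.

    Hypothetical model: every failed switch hands its entire load to the
    next surviving switch in ring order, with no check for headroom.
    """
    loads = list(loads)
    alive = [True] * len(loads)
    failed_order = []
    # Start with switches that crashed for unrelated reasons, then add
    # any switch that is already over capacity.
    pending = list(crashed) + [
        i for i, load in enumerate(loads)
        if load > capacity and i not in crashed
    ]
    while pending:
        i = pending.pop(0)
        if not alive[i]:
            continue
        alive[i] = False
        failed_order.append(i)
        survivors = [j for j in range(len(loads)) if alive[j]]
        if not survivors:
            break  # nothing left to take over the load
        # Next surviving switch after i, wrapping around the ring.
        target = next((j for j in survivors if j > i), survivors[0])
        loads[target] += loads[i]
        loads[i] = 0
        if loads[target] > capacity:
            pending.append(target)  # the helper is now overloaded too
    return failed_order


# An ordinary crash is absorbed by the neighbor.
crash_only = simulate_naive_failover([50, 10, 10, 10], capacity=100, crashed=[0])
# An overload is passed down the line until nothing is left.
overload = simulate_naive_failover([120, 60, 60, 60], capacity=100)
```

In the first call only the crashed switch is lost. In the second, every switch fails in turn, because each takeover pushes the next switch past the same limit that killed the previous one.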
Checklist items:
- When building something that is supposed to be a failover or backup, be sure that it is not subject to failure from the same causes that induced the original failure.
- When building a system that is supposed to help recover from failures, be sure that nothing it does contributes to or is impaired by that kind of failure. The example above shows this in a system that attempts to recover from traffic overload by sending more traffic; it also appears in systems that try to recover from memory allocation failures by calling routines that must themselves allocate memory.
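One way to honor both checklist items for the overload case is to have the backup accept only as much load as it has spare capacity for, and to drop the rest. This is a minimal sketch under assumed names and units (`takeover_with_headroom`, abstract load numbers), not a description of how any real switch does it.

```python
def takeover_with_headroom(own_load, failed_load, capacity):
    """Accept only as much of a failed peer's load as fits.

    Returns (new_own_load, shed_load). Hypothetical helper: the shed
    portion is dropped rather than forwarded, so recovering from an
    overload never creates more load somewhere else.
    """
    headroom = max(0, capacity - own_load)
    accepted = min(failed_load, headroom)
    return own_load + accepted, failed_load - accepted


# A backup at load 60 with capacity 100 takes 40 units and sheds 80.
new_load, shed = takeover_with_headroom(60, 120, 100)
```

Shedding load means some calls are lost, but the backup stays up. The same reasoning applies to the memory case: whatever the recovery path needs should be reserved before the failure, not requested during it.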