I’m starting a semi-regular feature on this blog dedicated to computer bugs. When I have written a lot, it will become a book.
Making anything that is both very large and very complex also robust and reliable is hard. Software that represents hundreds (and sometimes thousands) of man-years of human effort, all of it devoted to managing complexity through programming languages, is by any meaningful measure both large and complex.
There are a lot of large, complex things in the world whose robustness and reliability we rely on for our lives and livelihoods. We build bridges, buildings, dams, power plants, vehicles, sewer systems, power distribution systems, waste treatment plants, communications networks, legal systems, and medical equipment the failure of which could cost many lives and/or enormous amounts of money.
And we need to be able to build software at least as reliable as those systems. For the professionals in the industry, the quality of the software we produce is already something we rely on for our livelihoods. For our customers, it is what they rely on for their entertainment, their communications, their productivity, usually their own livelihoods, and sometimes their lives as well. And for everybody, software is increasingly a necessary part of the operation and/or construction of all the other large, complex public systems that we are all forced to rely on for our lives and livelihoods.
Doing good software QA has so far seemed to be an art that exceeds human understanding. No matter how many things you remember to check for, the next bug will always be one you didn’t check for. Bugs always appear in new contexts and new systems, and their variety seems infinite. There just isn’t a checklist you can run down to see that your software is secure or bug-free.
But I started thinking about that, in the context of a flight instructor who had explained, in detail, that the pre-flight checklists pilots and crews use before takeoff are figuratively written in blood. Every item on those lists is an item that has failed or been left undone on a previous flight, usually with fatal results. These items are on the checklists because specific people died in specific crashes as a result of not checking them. Aviation checklists are a pilot’s way of trying not to die from a mistake another pilot has already made.
Aviation checklists will never be a way of certifying the complete safety of aircraft. Failures of design, maintenance, and pilot alertness can render any aircraft unsafe. So can sabotage, enemy action in wartime, hijackings, bird strikes, obstructed runways, re-entering satellites, and random falling cows. No matter what you check for, a new aircraft crash will be caused by something you didn’t check for. Malfunctions and failures always appear in new contexts and new systems, and their variety seems infinite. There just isn’t a checklist you can run down to see that your aircraft is safe.
But the checklists help. A pilot who checks to make sure of the things that are known to have killed other pilots, or helped them to survive, is in fact much less likely to repeat fatal mistakes. Things caught as pilots go down the checklists have saved a lot of lives. And a pilot who is aware of the things on the checklist is more aware of how his or her plane works and what conditions must be met for its safe operation in general.
That leads me to an approach to software QA that I don’t think I’ve seen much attention devoted to. Software QA books are mostly written in the form of “ought to’s” and “shoulds” and “I think this practice will reduce this kind of error” and other such propositions. We need to do that kind of thinking, but we need to do it in the context of a body of specific, citable evidence in the form of cases, against which that advice can be checked. I can’t easily think of anyone who has sat down and looked at a big set of real evidence – meaning specific bugs that actually happened in the context of specific software – and tried to draw from them a list of items to check.
Instead, people have attempted to extract very broad principles and patterns that apply to all bugs, and in some cases the result has been on the level of personal opinions and platitudes. This is because the topic is so large and complex that confirmation bias can confirm any pattern people expect to find; people see in it whatever they’re prepared, or hope, to see. It takes very disciplined thought, or obsessive attention to minutiae, or both, for a human being to see instead what is actually there.
So, I’m proceeding to the other end of the spectrum from the kind of software QA books I’ve seen in the past. I’m not going to try to deal with the data as a huge undifferentiated mass, and present a general system or methodology that should help with everything. I’m not the kind of super-genius who could look at all the data and understand it all at once, and see what’s there instead of what I expect to see, and develop a methodology to find or eliminate all bugs. Even if I did understand all the data at once, I don’t believe that I could derive from it an underlying principle that explains or can be used to prevent the origins of all bugs, because I don’t really believe such a principle exists.
So instead of trying to be that brilliant in terms of a big picture, I’m going to concentrate what may be obsessive attention on minutiae. I’m going to try to deal with the data as individual cases, and try to extract at most a few specific applicable lessons from each case. I’m going to try to make a meaningful checklist.
This proceeds at least in part from my belief that reviewing cases of real bugs, while thinking about what caused them and what could have warned of their emergence, will help people understand how and why bugs happen in a broader sense, and think more broadly about how their software works and what conditions have to hold for it to work well. Checklists make planes safer, I believe, at least as much because they promote a more general understanding of the aircraft, its design, and its needs as because of the individual items they remind people to check. In a similar vein, this is a work that hopes to promote a more general understanding of software and how errors happen to it and affect it, as much as a checklist of specific items and lessons.
I hope to help raise the bar of basic knowledge to the point where we can have more meaningful, informed, evidence-based discourse about the broad software QA principles everyone tries to derive, or about general practices to help us avoid the circumstances and conditions in which bugs arise.
So: Welcome to my Journal of Cybernetic Entomology. I hope you like bugs, because we’re going to meet a lot of them.