Cybernetic Entomology: Valgrind versus Debian.
Valgrind is a software development tool that runs programs under instrumentation and analyzes what they actually do, with the aim of making it easier to develop correct software. Among other things, it issues warnings about code that relies on the contents of uninitialized memory.
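For readers who haven’t used it, here is a minimal sketch of the kind of code Memcheck (Valgrind’s default tool) objects to; the file and variable names are mine, not from any real project:

    #include <stdio.h>

    int main(void)
    {
        int x;   /* deliberately never initialized */

        /* This branch depends on an uninitialized value. Build with
         * "gcc -g uninit.c" and run "valgrind ./a.out", and Memcheck
         * reports something along the lines of "Conditional jump or
         * move depends on uninitialised value(s)". */
        if (x > 0)
            printf("looks positive\n");
        return 0;
    }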
On 14 May 2006, a Debian maintainer, relying on Valgrind without really understanding what it was complaining about, made a mistake that crippled OpenSSL’s otherwise reasonably good random number generator. The bug shipped in the Debian distribution for two years; the fix was announced on 13 May 2008.
There were two calls to a routine named MD_Update in the file md_rand.c. This routine took a pointer to a memory buffer as input and updated the state of the random number generator by mixing in the contents of that buffer. One of them was in the routine ssleay_rand_add, which was specifically for adding reliably unpredictable values to the state of the RNG, and the other was in the routine ssleay_rand_bytes, which was supposed to fill a buffer with RNG output.
In terms of random number generation, what MD_Update did was something you want to do whenever you get the opportunity; it never hurts. And every time ssleay_rand_bytes received a buffer it was supposed to fill with unpredictable numbers, that buffer might (or might not) already contain something. So there was an opportunity, and ssleay_rand_bytes called MD_Update to take advantage of it: it read the contents of the buffer and, whatever they were, mixed them into the state of the random number generator.
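To make the pattern concrete, here is a toy paraphrase of the idea – emphatically not the real md_rand.c; pool_state and md_update are stand-ins I invented for OpenSSL’s hash-based pool and its MD_Update call:

    #include <stddef.h>
    #include <stdint.h>

    /* Toy stand-ins for OpenSSL's hash-based pool and MD_Update. */
    static uint64_t pool_state = 0x9e3779b97f4a7c15ULL;

    static void md_update(const unsigned char *buf, size_t n)
    {
        /* Fold every input byte into the state. With a decent mixer,
         * feeding in bytes of unknown quality can only add uncertainty
         * about the state, never remove it. */
        for (size_t i = 0; i < n; i++)
            pool_state = (pool_state ^ buf[i]) * 0x100000001b3ULL;
    }

    static void rand_bytes(unsigned char *buf, size_t n)
    {
        md_update(buf, n);   /* opportunistic harvest: buf may be
                                uninitialized, and that is fine */

        for (size_t i = 0; i < n; i++) {
            /* Toy output step; the real code hashes the pool. */
            pool_state = pool_state * 6364136223846793005ULL
                       + 1442695040888963407ULL;
            buf[i] = (unsigned char)(pool_state >> 56);
        }
    }

    int main(void)
    {
        unsigned char out[16];   /* uninitialized on purpose */
        rand_bytes(out, sizeof out);
        return out[0] & 1;       /* use the output so nothing is elided */
    }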
Valgrind really didn’t like this. From the point of view of code calling for random numbers, the contents of the buffer before it got filled didn’t matter. It made no sense to developers to initialize a buffer before filling it with random numbers, and of course they didn’t. So the fact that the buffer was getting read, and that the value read from it actually affected further computation, looked like exactly the sort of thing Valgrind was supposed to warn about. Valgrind complained, bitterly, every time it ran a program that called ssleay_rand_bytes to fill a buffer with random numbers, because ssleay_rand_bytes would then call MD_Update to mix whatever contents might already be in the buffer into the random state.
When generating unpredictable numbers, every little bit helps. Mixing the contents of that uninitialized buffer into the random state was not a mistake. Removing the call would mean doing without that particular potential source of unpredictability, but that wouldn’t have been a serious sacrifice: if you wanted to shut Valgrind up so you could find messages about actual errors, you could just comment out that line. In fact the call was wrapped in conditional compilation so it wouldn’t be built when compiling the system for memory checkers, and it was marked with a comment that says “Valgrind complains”, acknowledging that the complaint was known and normal. The decision at Debian to delete that call was sort of random, but not completely unreasonable: they wanted to shut Valgrind up so that the messages it produced would actually signal errors, and they decided that for logistical reasons the conditional compilation flag wasn’t enough.
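In the terms of my toy sketch above, that escape hatch looked roughly like this; the guard macro’s name is my assumption, and the comment paraphrases the one described above:

    /* rand_bytes again, with the mix compiled out for checker builds. */
    static void rand_bytes_guarded(unsigned char *buf, size_t n)
    {
    #ifndef PURIFY            /* define when building for a memory checker */
        md_update(buf, n);    /* Valgrind complains */
    #endif
        for (size_t i = 0; i < n; i++) {
            pool_state = pool_state * 6364136223846793005ULL
                       + 1442695040888963407ULL;
            buf[i] = (unsigned char)(pool_state >> 56);
        }
    }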
But, as I said, there were two calls. One was invoked whenever a buffer got passed to ssleay_rand_bytes to be filled with random data, just to opportunistically harvest whatever might already be in it. The other, in ssleay_rand_add, was on a critical path: it was how the RNG got its primary loading of unpredictable state. It was called with pointers to buffers known to be filled with high-quality unpredictable data read from hardware sensors, network timings, et cetera.
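Continuing the toy sketch, that second call site would look something like this – again a paraphrase, not the real routine:

    /* Toy analogue of ssleay_rand_add, the pool's primary seeding path.
     * Unlike the opportunistic mix in rand_bytes, callers pass buffers
     * deliberately filled with genuine entropy. */
    static void rand_add(const unsigned char *buf, size_t n)
    {
        md_update(buf, n);    /* load-bearing: delete this, and the pool
                                 never receives its real entropy */
    }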
And, in just one place – by an actual mistake – this second call was also being invoked with an uninitialized buffer. Here Valgrind’s message really did signal a mistake, of exactly the kind Debian’s developers were afraid they might miss in all the noise generated by the other call, and which they were in fact missing in all that noise. But nobody knew that yet.
So, there was a lot of debate and discussion on the public OpenSSL development list about the wisdom and utility of uninitialized data as a source of entropy, and an eventual decision by Debian to remove the call that was offending Valgrind at literally every use of OpenSSL to fill a buffer with unpredictable numbers. Meanwhile, the actual maintainers of OpenSSL had abandoned the list where this debate was taking place; they had found dealing with the public and the endless questions and suggestions of the uninformed masses too annoying, and had moved their own discussion to another list, where they never heard any of it.
And after a decision was made without their input, a Debian developer went into the file intending to delete the call that Valgrind was complaining about. When he had deleted the call that generated most of the complaints, in ssleay_rand_bytes, Valgrind was still complaining about a call at another line, in the different routine ssleay_rand_add – one that looked identical to the call he had just deleted, except that it lacked the conditional compilation and the note that “Valgrind complains.” So he deleted that one too.
After that, the only thing going into the RNG state was its very first initialization – the process ID number. And the process ID number was only 15 bits.
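To make the arithmetic concrete: 15 bits means 2^15 = 32,768 possible starting states, and in practice attack toolkits simply pregenerated every key the broken generator could emit. A back-of-the-envelope sketch, with a hypothetical loop body:

    #include <stdio.h>

    int main(void)
    {
        long candidates = 1L << 15;   /* PIDs were capped at 2^15 */

        for (long pid = 0; pid < candidates; pid++) {
            /* Hypothetical step: regenerate the key a machine seeded
             * with this PID would produce, and test it against the
             * target. */
        }
        printf("%ld candidate states: a search measured in minutes, "
               "not lifetimes\n", candidates);
        return 0;
    }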
It is hard to overstate the severity of this bug in security terms. I’ll try to explain briefly just how bad it really was.
In the first place, it meant that OpenSSL’s random number generation was ridiculously weak on Debian builds. That much was a given. It meant that for two years, any SSH or SSL session negotiated with a Debian or Ubuntu machine on either end could be cracked easily – and that affected even people using certificates with good keys, if the server OR the client had a broken RNG. But it also meant that keys actually generated on Debian builds were trivially weak. That, in turn, meant that the SSL certificates issued to everybody who had generated their keys on Debian machines – at hundreds to thousands of dollars each – contained ridiculously weak keys. And that, in turn, meant that every SSL connection made using those certificates, even between machines whose OpenSSL implementations were fine, was also easily breakable, exposing billions of dollars worth of commercial and customer data. So, in a few short sentences, we’ve gone from “weak random number generator” to “invalidated or compromised pretty much every SSL key generated on or for affected systems for two years, and every SSL connection made with those keys” to “and even exposed the sessions of people using GOOD certificates.”
The short version of the story is that this bug essentially destroyed the security of the entire SSL infrastructure, for a lot longer than two years. It is now eight years later, and there are STILL certificates in use – generated on systems whose OpenSSL configurations were fine – whose keys can be recovered from just one recorded SSL session in which the certificate was used by a Debian machine.
There is just no way to know how much of what was stolen during those two years – or during the years afterward, while a lot of those ridiculously weak or compromised keys remained in use – was stolen as a result of this bug. If there were a way to know, then probably everybody involved in Debian’s OpenSSL maintenance would have been sued out of existence, and everybody on OpenSSL’s primary team, at the least, would have had to defend their defection to a private list against charges of criminal negligence. This didn’t afflict just a few systems with a shortage of memory or a lack of good entropy sources; this was literally every Debian and Ubuntu build for two years. This is the kind of bug where the monetary damages, if only they could be proven, easily run into billions of dollars and possibly into trillions. And when our knowledge of how to track, account for, and above all prove this kind of damage is more complete, lawsuits of that magnitude are going to become the normal response to bugs of this magnitude.
So what are some of the take-home lessons?
- Analysis tools, static or dynamic, are not a substitute for thinking.
- Code review is vitally important for changes to security code.
- In a case where you really don’t care what the value of data is, it is not a bug to read it uninitialized. Such cases are rare but not impossible.
- A decision to make a “minor” change in security-related software can result in changes that aren’t minor at all if it is applied without completely understanding what is being changed and what it’s used for.
- If you develop something that has a public development mailing list, you should be aware of the discussion there. If you can’t spare the developer time, find a volunteer or an employee willing to stay abreast of it and bring anything they think is important to your attention.
- Assuming that one call to a function can be harmlessly deleted because you’ve carefully determined that an entirely different call to the same function can be harmlessly deleted is a fallacy.
- When you’re deleting a source of error-message “noise” so that you can find real errors, messages that keep appearing after the deletion are more likely to be reports of real errors than signs that you haven’t yet gotten rid of all the “noise.”