A few posts ago I wrote about the possibility of using big data for code reuse. On the same page, I mentioned that the resulting system could also be used for optimization. At least one person who reads my words here asked me, "But why is that so important?"
And as I was reading one of John Shutt’s articles on his blog, I worked out how to express this. John was writing about how abstraction works in mathematics. In mathematics, it is not uncommon that you have three or four or five different ideas which are related to each other such that given any one of them you can infer or construct any of the others. And if you then infer or construct the original from that, you really and truly do arrive back at the original. But that’s mathematics. It doesn’t work that way in computer science. John believes that this is a problem with how we have designed our means of abstraction. I believe that this is a problem with the way we implement our means of abstraction. Because of the way we build functionality out of other functionality, there is an additive bias in our means of abstraction; that is, to get from one idea to another we have to add code. There just isn’t any means of writing negative amounts of code when you have an abstraction provided as a primitive and you only need a small subset of it.
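To make the additive bias concrete, here is a minimal sketch in Python, with made-up Polar, Cartesian, and PolarAgain classes, assuming polar coordinates happen to be the primitive we were handed. Mathematically the two representations are interchangeable and a round trip lands you back exactly where you started; in code, every leg of the trip is another layer added on top of what was already there.

```python
import math


class Polar:
    """The 'primitive' representation we were given."""
    def __init__(self, r, theta):
        self.r = r
        self.theta = theta


class Cartesian:
    """Derived from Polar: we had to ADD this code to get it."""
    def __init__(self, polar):
        self._p = polar                      # still carrying the Polar underneath

    @property
    def x(self):
        return self._p.r * math.cos(self._p.theta)

    @property
    def y(self):
        return self._p.r * math.sin(self._p.theta)


class PolarAgain:
    """'Back' to polar, derived from Cartesian: more code added, none removed."""
    def __init__(self, cart):
        self._c = cart                       # a Polar inside a Cartesian inside this

    @property
    def r(self):
        return math.hypot(self._c.x, self._c.y)

    @property
    def theta(self):
        return math.atan2(self._c.y, self._c.x)


p = Polar(1.0, math.pi / 4)
p2 = PolarAgain(Cartesian(p))
# p2 denotes "the same" point as p, but it is not the original primitive:
# every access now goes through two extra layers of code and arithmetic.
print(p.r, p.theta)
print(p2.r, p2.theta)
```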
If it’s a library rather than a primitive, then it’s possible to pick out just the part you want; but to do so you have to break the encapsulation principle that is fundamental to the way we control complexity, and you have to break it hard. You’d be picking apart someone’s implementation, cutting out the things you don’t need, and putting the remaining parts back together. That means you’d need to understand the details of its implementation and get elbows-deep in modifying or cutting its interfaces as well as its private data members. It is practically doctrine that nobody should ever have to touch the private implementation of a well-written object class in order to use it for anything, because that separation is fundamental to modular programming. If we dive into that code with the idea that maybe we can do better by cutting code out rather than adding it, we’re abandoning our primary methodology for controlling complexity.
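Here is a sketch of that dilemma, using a made-up Statistics class as a stand-in for any well-encapsulated library component, and assuming the only piece we actually want from it is the mean.

```python
class Statistics:
    """A well-behaved class: simple interface, private implementation."""
    def __init__(self, samples):
        self.__samples = list(samples)                      # private by convention
        self.__sum = sum(self.__samples)
        self.__sq_sum = sum(x * x for x in self.__samples)  # only needed for variance

    def mean(self):
        return self.__sum / len(self.__samples)

    def variance(self):
        m = self.mean()
        return self.__sq_sum / len(self.__samples) - m * m


# Using it "additively" is easy, but we pay for machinery we never asked for:
s = Statistics([1.0, 2.0, 3.0])
print(s.mean())   # the sum-of-squares bookkeeping above was wasted effort

# To carve out just the mean, we would have to read and dissect the private
# implementation (__sum, __sq_sum, ...), which is exactly what the
# encapsulation doctrine tells us we should never need to do.
```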
Computer Science also has fixed primitives. Even if three things are conceptually the same and a mathematician could pick any one of them as the starting point for abstraction, a computer scientist does not have that luxury. In a computer there is only one ‘primitive’ in terms of which everything else is constructed, and that is the machine language the CPU actually executes (assembly language, if you want its human-readable spelling). But we don’t usually use systems where that’s visible, or in reach. We use what are laughably termed "high level programming languages" that provide a different set of primitives, which might possibly have been built on assembly language but which were probably built on the primitives provided by yet another "high level" programming language instead. And we can’t usually pick apart the primitives provided by our language environments, at least not from within the programming languages; if we could, then we could easily violate the conditions that other "primitives" in the same set depend on for stability or validity.
Finally, computer science cares about more than just semantics. We also have to measure performance requirements against abstraction costs, and every path of abstraction has a different cost in performance and space and complexity. So if we start with assembly language, derive A from it, then derive B from A and C from B, we reach A through one layer of abstraction, B through two, and C through three. Each layer has a cost in time, effort, and complexity, and the total cost depends on the path you took to get there. The odds are that C will solve an equally complex problem much more slowly than A, because we reached C by adding more code than we used to reach A.
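A toy Python illustration of the layering cost, assuming each "abstraction" is just a wrapper that adds its own small bookkeeping before delegating downward; layer_a, layer_b, and layer_c are invented names standing in for A, B, and C. All three compute the same thing, C just gets there through more layers.

```python
import timeit


def primitive_add(x, y):
    # stands in for what the machine gives us directly
    return x + y


def layer_a(x, y):
    # A: one layer above the primitive, with its own small bookkeeping
    assert isinstance(x, int) and isinstance(y, int)
    return primitive_add(x, y)


def layer_b(x, y):
    # B: derived from A, repeating the same kind of bookkeeping
    assert isinstance(x, int) and isinstance(y, int)
    return layer_a(x, y)


def layer_c(x, y):
    # C: derived from B
    assert isinstance(x, int) and isinstance(y, int)
    return layer_b(x, y)


# Semantically identical results; operationally, C pays for three layers
# of calls and checks where A pays for one.
print(timeit.timeit(lambda: layer_a(2, 3), number=200_000))
print(timeit.timeit(lambda: layer_c(2, 3), number=200_000))
```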
But that’s a complicated statement of the situation. Here’s a simple one.
CS has an additive bias in its abstraction mechanisms, and because that bias has real costs, CS has to care about it. Math doesn’t.
In CS, we have to worry about three things that mathematics doesn’t have to. First, we have to worry about efficiency in time and space and effort, because every layer of abstraction costs time and space and effort. If our primitive numbers are complex values, and we derive real numbers from them via abstraction and then complex numbers via abstraction over those reals, we are using at least twice the representation space we need and wasting at least 75% of our effort. Even with identical semantics, no computer scientist could possibly be satisfied that we have really reached the "same thing" again.
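Here is a sketch of that round trip, assuming the language hands us complex numbers as the primitive (Python's built-in complex plays that role) and using made-up Real and Complex2 classes for the derived layers.

```python
class Real:
    """A 'real number' abstracted from the complex primitive: we store a whole
    complex value and promise never to look at its imaginary part."""
    def __init__(self, value):
        self._c = complex(value, 0.0)        # two floats stored, one actually used

    def value(self):
        return self._c.real

    def __add__(self, other):
        return Real(self._c.real + other._c.real)


class Complex2:
    """A 'complex number' rebuilt on top of Real: two Reals, i.e. four floats
    underneath, to represent what the primitive already did with two."""
    def __init__(self, re, im):
        self.re = Real(re)
        self.im = Real(im)

    def __add__(self, other):
        return Complex2((self.re + other.re).value(),
                        (self.im + other.im).value())


z = Complex2(1.0, 2.0)    # semantically the "same thing" as complex(1, 2),
w = Complex2(3.0, 4.0)    # but carrying twice the representation it needs
total = z + w
print(total.re.value(), total.im.value())   # 4.0 6.0
```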
That brings us to the second thing we need to worry about, which is managing abstraction complexity. Mathematicians can show a very complex relationship between two things and then treat the second thing as a primitive unto itself. CS can do something similar with a very complicated implementation of the second thing, but that complicated implementation keeps on costing us overhead long after the discovery is made.
We try to reduce complexity by providing a simple interface and hiding information about the implementation from other parts of the program, in order to make an abstraction that can be used additively. As a way of managing complexity, it (at least sort of) works.
Third, we have to worry about our additive bias. We can easily see when we can add code to get to something we want, but it is very hard to take unwanted code away in order to get to something we want. In fact the whole OO paradigm is based on trying to make programs by simple addition of code rather than by allowing the deeper integration and understanding that might result in taking code away, because that deeper integration is complex, and we have to manage complexity or go stark raving bonkers.
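A small Python sketch of that bias, with made-up collection classes: extending by adding code is the paved road, while "removing" inherited behaviour isn’t really possible, and the best we can do is add more code that blocks code we still carry around.

```python
class Collection:
    def __init__(self):
        self._items = []

    def add(self, item):
        self._items.append(item)

    def remove(self, item):
        self._items.remove(item)

    def __len__(self):
        return len(self._items)


class CountingCollection(Collection):
    """Additive extension: trivial, encouraged, idiomatic."""
    def __init__(self):
        super().__init__()
        self.adds = 0

    def add(self, item):
        self.adds += 1
        super().add(item)


class GrowOnlyCollection(Collection):
    """A 'smaller' abstraction: we wanted less functionality, but the only way
    to express that is to add code that cancels code we still carry around."""
    def remove(self, item):
        raise NotImplementedError("grow-only collection")
```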
So this is the pain and vicious cycle of types as seen in CS. We care, and care deeply, about something which, by emulating the abstractive pattern of mathematics as closely as possible, we both waste and make ourselves blind to wasting.
Adding code is "just what you do" when you’re faced with a problem. But by the time they’ve both built nine abstractions each on top of the last, a mathematician is making perfectly good progress and a computer programmer gets fired because his or her code is running like ants stuck in honey.
To achieve parity with mathematics in abstractive power, computer science needs a tool that fully integrates instead of fully separating and that eliminates redundant and unnecessary infrastructure. In this example, it would give us complex numbers that truly are the same thing no matter what pile of abstractions we went through to get them, and real-number object code that actually does less than the complex-number code it’s derived from, rather than doing more and ignoring the unwanted parts.
But that would be the mythical "sufficiently smart compiler", and the undecidability results that start with Turing’s Halting Problem tell us it can’t exist in full generality. It’s a problem.
I proposed the "Big Data" approach to this not because I think it’s going to solve the Halting Problem. It won’t. The Halting Problem is real and eternal and infinite. But we can collect and distribute solutions to finite subparts of that problem, and it would do our field a lot of good. To the extent that we are successful, abstraction in programming will become a lot more like abstraction in math.