
Big Data and Code Reuse.

Code reuse is highly desirable. The idea that, instead of cooking something up from scratch, you can just draw the right bits and pieces off the shelf, already written, debugged, and working, is very attractive and can make us very productive when we can actually do it.

Very few programmers can reliably do it. The way it usually works is that a programmer has personal familiarity with some codebase from earlier work and can draw on that resource to reuse code. But nobody has that kind of in-depth knowledge of a sufficiently large base of code to be able to just go and get the right thing more than a tiny fraction of the time. When they do know of exactly the right code, licensing issues mean it may or may not be the right thing legally for the current project. And if it’s the right code semantically and the right code legally, then it’s usually for the wrong system, or for the wrong version of the system you want it to run on. When your software vendor makes constant changes to your infrastructure, the code that solved a problem three years ago has often spoiled like fresh fruit before it can meaningfully be reused. And if it’s the right semantics, and the right license, and the right system, then it usually introduces dependencies that conflict with the rest of the program you’re working on or else contribute heavily to its bloat by duplicating infrastructure. That duplication of infrastructure creates more coding costs in the long run, because then you have to maintain the duplicated infrastructure, which multiplies the cost of adding every feature and of fixing every bug.

So in practice, most code gets rewritten many times instead of being written once by some guy in Peoria and then used whenever and wherever it’s needed, from Peru to Kamchatka. At a guess, probably less than one percent of all program code ever written has actually been reused. And that one percent is what we call the documented libraries of a development system. Python, for example, has one of the most comprehensive documented standard libraries around.

The problem is mainly code librarianship. Properly documenting a library is hard. Properly choosing which code, and how much of it, to document and include is hard. With a “reasonable” effort at code librarianship you can get decent “coverage” of a topic, meaning you can make your library useful enough that it’s easy to write code that uses it to accomplish particular tasks. But supporting full code reuse wherever applicable, with a library that actually contains and documents the most popular few thousand, or few tens of thousands, of the individual tasks people want to do with a particular type of functionality, seems impossible. It is cost prohibitive – both in the programming effort to create it and in the effort of organizing and documenting the code. If a company tried to produce such a library, it would go broke before the first copy sold. And then you have another problem, on the user side: people searching for the code they want have to be able to find it, and exact indexing gets exponentially harder with linear increases in the semantic complexity of the indexed code.

So it’s easy for people to find code related to what they want, but still very difficult or impossible to find exactly the code they want. Since the necessary search is exacting, and would have to be a search over code semantics rather than keywords, existing search engines solve a related problem but not exactly the problem we need solved. Natural-language descriptions of code’s semantics always require close and careful reading to determine whether the code is in fact what you’re looking for, and often leave the question undetermined anyway. Trying to search for exacting semantics by keyword-matching natural-language descriptions is a lot like trying to find a needle in a pile of iron filings when all you have is a magnet.

Even in cases where the most popular few tens of thousands of things are actually fully coded and working (which has happened with CPAN and CRAN), finding exactly the right thing is difficult, because the indexing cannot, with any reasonable effort, be made comprehensive or precise enough.

Code librarianship is hard. The reason the kind of exact semantic search we want for real code reuse simply doesn’t exist is that the code itself is one of the most concise ways to express the exact semantics we’re searching for! In other words, if you have the needed search string, then you already have the object you’re searching for. And, in fact, because the same semantics can be expressed in many ways, the search string you have is likely to be one that is not used to index the thing you’re looking for, even though it could be.

The result is that a lot of code is just mindless duplication of other code that the implementor could not find or didn’t even bother to look for. A lot of good reusable libraries (indeed, perhaps most) do not in fact get reused because nobody can find them.

There is hope, however. One of the features of so-called “Big Data” is that, if we design well, indexes can be built from content and use rather than by hand.

The idea I have in mind is fairly simple: what if the fastest and easiest way to find exactly the code you need were to just start writing it? If the IDE automatically searches its library for matching code as you write, it can automatically find the library code that, with a big enough library, you are inevitably rewriting.

Thus, you could write a routine, or even just the first few lines of a routine (say, queue insertion), and the code-librarian subsystem would look at it, search its library for matches, and say, “Hey, it looks like QUEUE_INSERT from the Fodunk library; would you like a compatible QUEUE_CREATE, QUEUE_DELETE, QUEUE_HEAD, and QUEUE_DEL_HEAD from the same library?” And if you say yes, it provides them. And if the code-librarian subsystem doesn’t find any matches, you go ahead and implement the routine yourself, and you haven’t wasted your precious human time searching.
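To make that concrete, here’s a minimal sketch, in Python, of what such a code-librarian lookup could look like. Everything in it is hypothetical: the structural fingerprint (which matches shape, not real semantics), the LIBRARY_INDEX, and the register and suggest helpers are illustrations only, with the Fodunk routines as stand-ins.

```python
import ast

def fingerprint(source):
    """Reduce a snippet to a structural fingerprint: the sequence of AST
    node types, ignoring identifier names and literal values."""
    return "/".join(type(node).__name__ for node in ast.walk(ast.parse(source)))

# Hypothetical library index: fingerprint -> (library, routine, companions).
LIBRARY_INDEX = {}

def register(library, routine, source, companions):
    LIBRARY_INDEX[fingerprint(source)] = (library, routine, companions)

register(
    "Fodunk", "QUEUE_INSERT",
    "def queue_insert(q, item):\n    q.append(item)\n",
    ["QUEUE_CREATE", "QUEUE_DELETE", "QUEUE_HEAD", "QUEUE_DEL_HEAD"],
)

def suggest(user_source):
    """What the IDE would do as you type: look the fingerprint up and, on a
    hit, offer the matching routine plus its companions."""
    hit = LIBRARY_INDEX.get(fingerprint(user_source))
    if hit is None:
        return None  # no match: go ahead and write it yourself
    library, routine, companions = hit
    return (f"Hey, it looks like {routine} from the {library} library; "
            f"would you like {', '.join(companions)} as well?")

# The user starts writing their own queue insertion, with different names...
print(suggest("def my_enqueue(queue, x):\n    queue.append(x)\n"))
```

A real matcher would of course have to look past structure to behavior; the point of the sketch is only the shape of the interaction: type a few lines, get an offer or silence, and lose no time either way.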

There will always be some code that a semantic matcher won’t be able to make a ruling about (due to the halting problem, the incompleteness theorems, and so on). But if coverage steadily increases, then the percentage of such failures can be driven arbitrarily low.

And there is a way to make the coverage steadily increase. You could have the IDE talking over the network to a library code server in addition to locally installed libraries. When debugging, the semantic analyzer will frequently report both matches and failures to match. Semantic matches indicate similarity of intent, and semantic mismatches mixed with semantic matches indicate likely bugs.

For example, if a program has a queue_insert that matches a known queue_insert function, and a queue_delete that does not, then it is likely that either there is a bug in the queue_delete function or the semantic match has failed. When the programmer resolves the issue, either it is useful to him because he has found a bug in his own code, or it is useful to many other programmers because he has either found a bug in the library or extended the capabilities of the semantic matcher. In the latter two cases, the IDE can report the resolution back to the library code server, putting that valuable information to work for all users of the system.
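Here’s a rough sketch of that heuristic, assuming (hypothetically) that the semantic analyzer hands back a simple matched/not-matched verdict for each routine, and that library routines are grouped into named families; flag_suspects and all the names in it are made up for illustration.

```python
def flag_suspects(match_results, families):
    """Given per-routine match results against library code, report routines
    whose siblings matched but which did not: each one is either a local bug
    or a gap in the matcher, worth reporting to the library code server."""
    suspects = []
    for family, members in families.items():
        matched = [m for m in members if match_results.get(m) is True]
        unmatched = [m for m in members if match_results.get(m) is False]
        if matched and unmatched:
            suspects.append((family, sorted(unmatched)))
    return suspects

# queue_insert matched the library's queue family, queue_delete did not:
results = {"queue_insert": True, "queue_delete": False, "parse_config": False}
families = {"Fodunk/queue": ["queue_insert", "queue_delete"]}
print(flag_suspects(results, families))  # [('Fodunk/queue', ['queue_delete'])]
```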

If the programmer identifies two routines as having identical semantics despite visible differences (i.e., if semantic analysis failed to find the match), then the semantic analyzer can be extended with the new information (subject to extensive fuzz testing, because the programmer might be wrong) and get the analysis right the next time, for the next user, a thousand miles away on the following day. Fuzz testing can never make us absolutely certain of equivalence, but it can at least reassure us that any non-equivalence is limited to some arbitrarily tiny fraction of inputs.
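A minimal sketch of that fuzz-testing step, assuming the two routines are pure functions of their inputs; counterexample, the dedup routines, and the input generator are all hypothetical, and a real fuzzer would be far more sophisticated about generating inputs.

```python
import random

def counterexample(f, g, gen_input, trials=10_000):
    """Look for an input where f and g disagree; return it, or None if no
    disagreement was found in `trials` random attempts. This can never prove
    equivalence, only fail to refute it."""
    for _ in range(trials):
        x = gen_input()
        if f(x) != g(x):
            return x
    return None

# Two visibly different routines the programmer claims are equivalent:
def dedup_a(items):
    return sorted(set(items))

def dedup_b(items):
    out = []
    for item in sorted(items):
        if not out or out[-1] != item:
            out.append(item)
    return out

gen = lambda: [random.randint(0, 9) for _ in range(random.randint(0, 8))]
print(counterexample(dedup_a, dedup_b, gen))  # None: no disagreement found
```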

Second, if the code in an IDE is being written under an open-source license, then the code library server can index that code. And if a routine becomes widespread (i.e., if a sufficiently large number of known projects have identical routines in them), then it can become part of the standard library under an open-source license, whereupon IDEs across the world detect the semantic match between user code and library code and offer to replace the existing routines in those projects with a library call.

Finally, a semantic matcher could stay alert for opportunities to unify code; with an index of semantic identities, it is uniquely equipped to detect duplication of code within a project as well as duplication of code already in the libraries. Thus, it could provide a valuable tool for refactoring away the ‘bloat’ that comes along with duplicated infrastructure when code actually is reused.
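One last sketch: the same kind of structural fingerprint can be turned inward to flag duplication within a single project. Again, the fingerprint only captures shape rather than true semantics, and a real matcher would need to be much smarter, but the bookkeeping would look roughly like this.

```python
import ast
from collections import defaultdict

def duplicate_groups(source):
    """Group every function definition in a file by structural fingerprint
    and report the groups with more than one member."""
    groups = defaultdict(list)
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.FunctionDef):
            shape = "/".join(type(n).__name__ for n in ast.walk(node))
            groups[shape].append(node.name)
    return [names for names in groups.values() if len(names) > 1]

project = """
def enqueue_job(jobs, job):
    jobs.append(job)

def add_task(tasks, task):
    tasks.append(task)

def average(xs):
    return sum(xs) / len(xs)
"""
print(duplicate_groups(project))  # [['enqueue_job', 'add_task']]
```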