Unicode may as well be regarded as a fact of life at this point in history. It exists, it is an accepted standard, it will most likely exist for a long time, and its basic structure is not likely to change. In a lot of ways Unicode is a very sloppy compromise between what was needed and what was easy to transition existing systems to. It was accumulated by assimilating previously existing standards, preserving all the quirks and special cases of those standards, rather than designed from the beginning to address the same needs while having properties that would have made it much easier to work with.
For several reasons, Unicode is doomed to never, ever, not in a billion years, reach a final version, so nothing can ever be written about it as a fully stable entity. Still, whether we like it or not, it is a far, far better and more reliable way of encoding a comprehensive character set than anything we had prior to its development.
I’m going to write today about using Unicode in programming languages, both as source code and as data. It is missing some features that would have made it better for programming and it has some features that make it actively bad for programming. Still, its vocabulary of characters is compelling for source code, for three reasons.
- It includes mathematical symbols that are widely understood by programmers and engineers and have unambiguous established meanings that are useful in programming languages.
- It allows people who could not do so before to write identifiers and syntax in their own languages (the short sketch after this list shows both this point and the previous one in practice).
- Because you will have to have the code to handle Unicode as data anyway, you might as well handle it as source code. This might amount to some sort of programmer’s version of Stockholm Syndrome, but there it is.
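As a quick illustration of the first two points, here is a tiny sketch. Python 3 is one language that already accepts Unicode letters in identifiers (PEP 3131), so a mathematically named constant and a German-named variable are both perfectly legal source code; the names themselves are just made up for the example.

```python
# Python 3 accepts Unicode letters in identifiers (PEP 3131),
# so mathematical names and native-language names are both legal.
π = 3.141592653589793            # GREEK SMALL LETTER PI as an identifier
radius = 2.0
flächeninhalt = π * radius ** 2  # "area", in German
print(flächeninhalt)             # 12.566370614359172
```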
On the negative side, it has a whole bunch of problems.
- First, there are a lot of codepoints that are not in use. They do not correspond to any kind of valid meaning. And these codepoints are not in a single large unallocated area – although there is one. Instead, there are hundreds, or possibly thousands, of relatively tiny unallocated areas containing a few codepoints or a few dozen codepoints each, scattered unpredictably among a like number of allocated areas containing a few dozen or a few score codepoints each. That means that the code necessary to iterate through valid codepoints must occupy more than a few tens of bytes, and the data necessary to do so must occupy hundreds or thousands of bytes. The implication is that you can’t have a program that iterates through codepoints without constantly consulting a table that doesn’t fit in cache, and using code that doesn’t fit in a single cache line. That will slow down some operations by a factor of ten or more, depending on how complex they are relative to the trivial task of just getting the codepoint to start work on and how many other things there are putting pressure on cache space. (The codepoint-census sketch after this list shows the table lookup this forces.)
- Second, there are a lot of lookalike characters. Can you tell the difference at a glance between ‘A’ and ‘Α’? Or between ‘B’ and ‘Β’? The first pair of characters, by the way, are our familiar LATIN CAPITAL LETTER A at U+41 and GREEK CAPITAL LETTER ALPHA at U+391, and the second pair is LATIN CAPITAL LETTER B at U+42 and GREEK CAPITAL LETTER BETA at U+392. There are literally thousands of pairs of visually identical characters like that in Unicode. That means hackers and people of evil intent can write millions of different identifiers that look exactly alike (the lookalike sketch after this list makes the comparison concrete). Think about what that means. The familiar identifier that’s been imported from some known and safe library is not necessarily the identifier you’re looking at when you see its image. Instead, you may be seeing an identifier you never knew about, which is the name of a routine you don’t know exists, getting called and doing something malicious, invisibly, from inside the code whose security you’re trying to audit.
- Third, there are multiple different ways to express the same character. A Unicode codepoint designating a character can be followed by any number of codepoints designating combining accents, variant selectors, etc. But many popular combinations of variants and accented characters are also available as single codepoints. Thus the same character can be represented by different sequences or different numbers of codepoints. This ambiguity, unlike the one addressed in the previous point, can be remedied without limiting the range of available ‘real’ identifiers, simply by requiring code to be in a single normalization form (the normalization sketch after this list shows both the ambiguity and the remedy). But if it’s important to you to process each character only once in iterating over the set, it means more code complexity and table dependency whenever you’re iterating through characters.
- Fourth, the character set is infinite. While there are a lot of Unicode codepoints that designate characters, as I said each can be followed by any number of combining codepoints designating variant selectors and combining accents. Each of these combinations is also a character. You can’t iterate over the complete character set in less time than it will take the sun to explode, no matter how fast a processor you have. There are several different things this can mean in a programming language. First, you can simply regard the character set as being infinite, like the integers, and just accept that. Second, you can restrict your source code to characters that can be expressed as a single codepoint, or as two or three codepoints, which allows your character set to be finite but extremely large. Third, you can even restrict yourself further without feeling much pain. Java, for example, restricts its source code to only those characters which can be expressed as a single codepoint among the first 65536 codepoints, and it’s not all that bad a restriction.
- Fifth, there are four standard normalization forms for Unicode and they represent two different character sets — NFC and NFD are two different ways of representing the entire infinite set of characters, while NFKC and NFKD are two different ways of representing a reduced, but still infinite, subset of those characters. The full character set (that represented by NFC and NFD) is worse for source code in terms of having vastly more visually ambiguous characters. However, the restricted set (that represented by NFKC and NFKD) still contains enough visually ambiguous pairs to utterly destroy our ability to reliably detect differences between identifiers. Unfortunately, it also excludes many of the mathematical characters that were part of the motivation for adopting Unicode for source code in the first place.
- Sixth, things we have traditionally considered as character operations — for example changing from lowercase to uppercase in cased alphabets — are not closed on the sets of characters that can be represented as some particular number of codepoints. Unicode provides uppercase characters whose lowercase can be constructed only by means of combining characters, and vice versa. It would have been very simple to include the opposite-case image of every cased character in the set using the same number of codepoints, especially since it was necessary anyway to define a representable opposite case for each, but remember that Unicode was accumulated, not designed. Because the opposite-case characters weren’t single codepoints in the broken standards it accumulated, they aren’t assigned single codepoints in the final product, and people who write code must now suffer the complexity, and consumers of code the inefficiency and wasted time and power that it mandates, forever. This means, effectively, that there can be no real difference in representation between strings and individual characters. Every character operation now must keep track of the length of the character in codepoints in the same way that string operations previously tracked the length of strings in characters. Worse, some of the most common operations on strings have now become potentially length-changing operations. That means you now have a painful choice to make: you can store strings in contiguous memory for simplicity of representation and interoperability, but that will mean that case operations on individual characters in strings are no longer guaranteed to be constant-time operations. Instead, they become operations that may require reading and re-copying every character between that point and the end of the string, which makes them linear in the length of the string. This raises the exponent in many polynomial-time algorithms on strings by one. The alternative to that pain is to accept the pain of a more complex string representation, which costs you code, cache misses, bugs until you get it worked out, and, most importantly, interoperability. Your more complex representation will almost certainly not be the representation that some other language is using, and therefore you cannot simply pass a buffer to a string routine that’s been defined in some other language. In contrast to this situation, checking the case or switching to uppercase or lowercase in ASCII, because ASCII was actually designed rather than accumulated, is a single instruction on most architectures. You don’t even need to look it up in a table. All you have to do with a value in the alphabetic range is to check, set, or clear a single bit (0x20). (The case sketch after this list shows both the one-bit ASCII flip and a Unicode case mapping that changes the codepoint count.)
- There are three standard encodings — UTF-8, UTF-16, and UTF-32 — and the two wider ones each come in two byte orders, LE and BE. Here is another painful choice. If you pick UTF-32, you get random access for codepoints, but it costs you compactness of representation. If you pick UTF-8, you get compactness in the usual case, but lose random access for codepoints. If you pick UTF-16, you have neither random access for codepoints nor the most compact representation, but you have random access for the most commonly used codepoints and a reasonably compact representation. Losing random access for codepoints means you don’t know where the Nth codepoint is located in the buffer without reading the buffer from the beginning, so in principle accessing the Nth codepoint takes time linear in N. This would raise the exponent of many polynomial algorithms, except that in practice it can usually be made constant-time because you are usually accessing characters sequentially. Ultimately, most language implementers make their choice based on what they want to be interoperable with (the encoding sketch after this list shows the size trade-off on a short string).
- Both Big-Endian and Little-Endian byte orders can be selected by putting a “byte order mark” – which is not a character – into the sequence of characters. In the absence of a BOM, text is taken as Big-Endian. This means that each of the wider encodings has three different serializations to recognize (a BOM followed by either byte order, or no BOM and big-endian), multiplied across the four normalization forms. While I heartily recommend Big-Endian with no BOM for all internal representation and for default I/O, you will at minimum have to be able to read every form. In practice, this means you will have to get code libraries to do the job for you and absorb a fairly substantial amount of complexity to use them. It’s less work than implementing them yourself, but it’s still onerous and unnecessary complexity.
- The libraries almost universally deal in codepoints rather than characters, which means they are lower-level than any reasonable representation of characters and strings. So you will probably still need to write substantial code if you want them to be a good fit with the semantics you want characters in your language to have. Furthermore, because of braindead applications out there that followed the libraries’ lead and treated codepoints as characters, and also because of other developers who failed to correctly implement the vast and unnecessary complexity of Unicode, you will find in the wild many invalid sequences of codepoints — that is, sequences that do not have any clear relationship to sequences of characters. At the very least you will find strings that are not in any normalization form, but it gets worse than that. For example, you will find strings that start with combining codepoints, strings that contain noncharacters, and strings that contain control codes. You will find misplaced BOMs and switches between UTF representations and byte orders in midstring. You will find applications that allow strings to be split in mid-character, or sometimes even in mid-codepoint, recombine them in different orders, and output the result, leaving your runtime to deal with the stir-fried fragments. Even the complexity of representing codepoints was too much for everyone to get right; you will find strings that contain broken sequences of bytes which do not decode according to any UTF/byte ordering and cannot even be a legitimate subsequence of anything that might. Furthermore, you cannot simply take the high road in the expectation that these problems ought to be fixed and that failing to interoperate with broken implementations is the right thing to do; a certain popular operating system renowned for its horrific quality, which shall not be named in this document, allows such broken sequences as ostensibly Unicode filenames! This means either you must allow explicitly wrong strings, or you can’t represent filenames as strings on that OS (the decoding sketch after this list shows one common way runtimes smuggle such bytes through).
- There is no standard encoding that provides random access on characters. While the UTF-XX encodings (for XX = 8, 16, 32) go out of their way to make it obvious where the boundaries between codepoints are (at least if you have a well-formed string they do, but you won’t always), that doesn’t help much because each character can be some non-uniform number of codepoints long. If you need the Nth character, you will always have to read from the beginning of the buffer to know that the character you have just found is in fact the Nth character (the scanning sketch after this list shows this in miniature). As with codepoints, this would in principle raise the exponent of many polynomial-time algorithms, except that in practice you are often processing characters sequentially. As with codepoints, clever representations can save you, but will do so at the cost of interoperability.
- The usual response of programmers has been to pretend that there is no difference between codepoints and characters, even though many of the assigned codepoints represent nothing like a character (as any literate non-programmer thinks of characters). This was a bad choice when the first programmers to implement Unicode made it. It was a simpler choice, quicker to complete, and a lot more like the character sets we knew, but we made that choice when we didn’t understand the Unicode standard yet, and we were wrong. Since then we’ve been either not allowed to correct it because it would break code that depends on that initial wrong choice, too lazy to correct it, or, sometimes, shamefully compelled to repeat it for the sake of being interoperable with products defined by our own earlier foolishness. The results of this bad choice are both ridiculous and illogical – if you allow splitting strings between codepoints rather than between characters, and sensibly normalize all strings, then when you put them back together you will often wind up with a different string than you started with (the round-trip sketch after this list demonstrates exactly this).
- All of this is especially problematic for Lispy languages and other homoiconic languages. Because the language of code is the same as the language for data and can be manipulated by code in the same ways, if we include Unicode semantics in our language for data, we necessarily include it in our language for code, and that means we cannot meaningfully adopt most sensible restrictions on Unicode for the sake of making it easier to deal with in source code. For example, it would be reasonable and desirable to define your own normalization form for source code in order to exclude lookalike characters. But if you are building a Lispy language, that means you have to do a horrible thing: you have to break the symmetry between code and data. Your “source code normalization,” whenever code is turned back into data, will result in potentially mangled data; capital Alpha and capital Beta in the inputs will have turned into capital A and capital B in the outputs.
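Since several of the points above are easier to see in running code, here are the small sketches referred to in the list. They are in Python purely because its standard library happens to ship the relevant Unicode tables; nothing about the argument depends on the language, and the sample strings and helper names are only illustrations. To start, the codepoint-census sketch: you cannot tell assigned codepoints from unassigned ones without consulting a sizable table, which is exactly what unicodedata.category does on every iteration here.

```python
import sys
import unicodedata

# Count assigned codepoints by consulting the category table;
# "Cn" is the category for codepoints with no assigned character.
assigned = sum(
    1
    for cp in range(sys.maxunicode + 1)
    if unicodedata.category(chr(cp)) != "Cn"
)
print(f"{assigned} of {sys.maxunicode + 1} codepoints are assigned")
```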
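The lookalike sketch: two identifiers that render identically in most fonts but are entirely different strings to a compiler.

```python
import unicodedata

latin = "AB"              # U+0041, U+0042
greek = "\u0391\u0392"    # GREEK CAPITAL ALPHA, GREEK CAPITAL BETA

print(latin == greek)     # False, even though they look identical on screen
for ch in latin + greek:
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")
```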
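The normalization sketch: the same character expressed as one codepoint and as two, the four normalization forms applied to it, and NFKC folding a mathematical character away.

```python
import unicodedata

single = "\u00E9"      # é as a single codepoint
combined = "e\u0301"   # 'e' followed by COMBINING ACUTE ACCENT

print(single == combined)                          # False: different codepoint sequences
print(unicodedata.normalize("NFC", single) ==
      unicodedata.normalize("NFC", combined))      # True: equal once normalized

for form in ("NFC", "NFD", "NFKC", "NFKD"):
    print(form, len(unicodedata.normalize(form, single)))   # 1, 2, 1, 2 codepoints

print(unicodedata.normalize("NFKC", "\u2115"))     # DOUBLE-STRUCK CAPITAL N folds to plain 'N'
```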
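The case sketch: in ASCII the two cases differ by a single bit, while a full Unicode case mapping can change how many codepoints a character occupies.

```python
# ASCII: 'A' is 0x41 and 'a' is 0x61, so one bit (0x20) flips the case.
print(chr(ord("A") | 0x20), chr(ord("a") & ~0x20))   # a A

# Unicode: case mapping is not closed over single codepoints.
print(len("\u0130"), len("\u0130".lower()))   # 1 2: İ lowercases to 'i' + COMBINING DOT ABOVE
print("\u00DF".upper())                       # SS: ß uppercases to two codepoints
```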
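The encoding sketch: one short string in each of the three encodings, using the big-endian, BOM-less variants for the multi-byte ones.

```python
text = "na\u00EFve \u2115"   # "naïve ℕ": ASCII letters, a Latin-1-range letter, a math symbol

for enc in ("utf-8", "utf-16-be", "utf-32-be"):
    data = text.encode(enc)
    print(f"{enc:9}  {len(data):2} bytes for {len(text)} codepoints")

# Prints 10, 14, and 28 bytes for 7 codepoints. Only fixed-width UTF-32 gives
# O(1) codepoint indexing; the other two need a scan or an auxiliary index.
```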
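The decoding sketch: one widely used compromise for byte sequences that do not decode at all. CPython itself uses this trick (the surrogateescape error handler of PEP 383) for OS filenames, smuggling undecodable bytes through as lone surrogates so they can at least round-trip; whether that counts as allowing explicitly wrong strings is exactly the dilemma described above.

```python
raw = b"report\xff\xfe.txt"   # not valid UTF-8
text = raw.decode("utf-8", errors="surrogateescape")

print(ascii(text))            # the bad bytes survive as lone surrogates U+DCFF, U+DCFE
print(text.encode("utf-8", errors="surrogateescape") == raw)   # True: round-trips exactly
```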
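The scanning sketch: finding the Nth character means walking from the start of the string. The helper below is only a crude approximation of real grapheme segmentation (UAX #29 has many more rules); it simply lumps a base codepoint together with any combining marks that follow it.

```python
import unicodedata

def nth_character(s: str, n: int) -> str:
    """Return the n-th (0-based) character, approximating a character as a base
    codepoint plus any trailing combining marks. A sketch, not real UAX #29."""
    i, count = 0, -1
    while i < len(s):
        j = i + 1
        while j < len(s) and unicodedata.combining(s[j]):
            j += 1                      # absorb trailing combining marks
        count += 1
        if count == n:
            return s[i:j]
        i = j                           # no shortcut: every lookup scans from the front
    raise IndexError(n)

print(ascii(nth_character("cafe\u0301!", 3)))   # 'e\u0301': one character, two codepoints
```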
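And the round-trip sketch: normalization does not commute with concatenation, so splitting between codepoints inside a character and normalizing the pieces gives a different result than normalizing the whole, which is the reassembly surprise described above.

```python
import unicodedata

def nfc(s: str) -> str:
    return unicodedata.normalize("NFC", s)

s = "e\u0301"                   # one character, two codepoints (decomposed)
left, right = s[:1], s[1:]      # a split between codepoints, inside the character

print(ascii(nfc(s)))                        # '\xe9': the whole string normalizes to é
print(ascii(nfc(left) + nfc(right)))        # 'e\u0301': the normalized pieces do not
print(nfc(s) == nfc(left) + nfc(right))     # False
```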
So, okay, that’s a rundown of the problems. I see that I’m at a bit over two thousand words now, so I think I’m going to stop here for today. Thanks if you’ve read this far. In later posts, I’ll talk about a few sets of choices motivated by all of these considerations, and even start posting some code. Until then, it’s a little something to think about.