Daily Archives: 10 August, 2014

Ncursesw and Unicode

The version of the popular ncurses library that handles wide characters, or Unicode, is surprisingly hard to get working correctly with C programs. This article is intended to be a checklist for developers so that they can effectively use the library. This is material I learned in programming a roguelike game, but it’s useful to everybody who wants to use ncursesw with a full Unicode repertoire.

As with most development articles, this will be a bit too specific in terms of platform. I wrote it on a development machine running Debian Linux. To the extent that your platform is different, there are likely to be important things I don’t know about getting this library working there.

First, you have to be using a UTF-8 locale (mine is en_US.UTF-8; I imagine others will have different choices). Type ‘locale’ at a shell prompt to be sure.

Second, you have to have a term program that can display non-ASCII characters. Most of them can handle that these days, but there are still a few holdouts. rxvt-unicode and konsole, popular term programs on Linux, are both good.

Third, you have to use a console font which contains glyphs for the non-ASCII characters that you use. Again, most default console fonts can handle that these days, but it’s still another gotcha, and if you routinely pick some random blambot font to use on the console you’re likely to miss out.

Try typing a non-ASCII character at the console prompt just to make sure you see it. If you don’t know how to type non-ASCII characters from the keyboard, that’s beyond the scope of what’s covered here and you’ll need to go and read some documentation and possibly set some keyboard preferences. Anyway, if you see it, then you’ve got the first, second, and third things covered.

Fourth, you have to have ncurses configured to deal with wide characters. For most Linux distributions, that means: Your ncurses distribution is based on version 5.4 or later (mine is 5.9) but NOT on version 11. I have no idea where version 11 came from, but it’s definitely a fork based on a pre-5.4 ncurses version, and hasn’t got the Unicode extensions. Also, you must have the ‘ncursesw’ versions, which are configured and compiled for wide characters.

How this works depends on your distribution, but for Debian, you have to get both the ‘ncursesw’ package to run ncurses programs that use wide characters and the ‘ncursesw-dev’ package to compile them. The current versions are packaged as libncursesw5 and libncursesw5-dev.

But there’s an apparent packaging mistake where the wide-character dev package, ncursesw-dev, does not contain any documentation for the wide-character functions. If you want the man pages for the wide-character curses functions, you must also install ncurses-dev, which comes with a “wrong” version of ncurses that doesn’t have the wide-character functions. Don’t think too much about why anyone would do this; you’ll only break your head. The short version of the story is that you pretty much have to install ncurses, ncurses-dev, ncursesw, and ncursesw-dev, all at the same time, and then just be very very careful about not ever using the library versions that don’t actually have the wide character functions in them.

Fifth, your program has to call setlocale(LC_ALL, "") immediately after it starts up, before it starts curses or does any I/O. If it doesn’t call setlocale, your program will remain in the ‘C’ locale, which assumes that the terminal cannot display any characters outside the ASCII set. If you do any input or output, or start curses, before calling setlocale, you force your runtime to commit to some settings before it knows the locale, and a later setlocale call won’t have all of the desired effects. When that happens, your program is likely to print ASCII transliterations for characters outside the ASCII range.

Sixth, you have to #define _XOPEN_SOURCE_EXTENDED in your source before any library #include statements. The wide character curses functions are part of a standard called the X/Open standard, and preprocessing conditionals check this symbol to see whether your program expects to use that standard. If this symbol is found, and you’ve included the right headers (see item Seven), then the preprocessor will configure the headers you include to actually contain declarations for the documented wide-character functions. But it’s not just the ‘curses’ headers that depend on it; you will get bugs and linking problems with other libraries if you have this symbol defined for some includes but not others, so put it before all include statements.

Unfortunately, the _XOPEN_SOURCE_EXTENDED macro is not mentioned in the man pages of many of the functions that won’t compile or link if you don’t define it. You’d have to hunt through a bunch of not-very-obviously related ‘see also’ pages before you find one that mentions it, and then it might not be clear that it relates to the function you were interested in. Trust me, it does. Without this macro, you can use the right headers and still find that there are no wide-curses declarations in them to link to.

Seventh, you have to include the right header file rather than the one the documentation tells you to include. This isn’t a joke. The man page tells you that you have to include “curses.h” to get any of the wide-character functions working, but the header that actually contains the wide-character function declarations is “ncursesw/curses.h”. I hope this gets fixed soon, but it’s been this way for several years, so some idiot may think this isn’t a bug.

Eighth, you have to pass the -lncursesw option (as opposed to -lncurses) when you’re linking your executable. Earlier versions of gcc had a bug and could not use the options -Werror or -Wall at the same time as -lncursesw; the symptom was that the prototypes for the wide-character functions would not be found. This appears to be fixed in the current version of gcc.

Ninth, use the wide-character versions of everything, not just a few things. This is harder than it ought to be, because the library doesn’t issue link warnings to warn you about mixing functionality, things work about halfway okay if you mix them, and the documentation doesn’t specifically say which of the things it recommends won’t work correctly with wide characters. That means cchar_t rather than chtype, wide video attributes rather than standard video attributes, and setcchar rather than OR to combine attributes with character information.

Use cchar_t rather than chtype, and all the other choices that implies, meaning add_wch and friends rather than addch etc. Just look at the man pages for each function, and if it requires a chtype argument or returns a chtype result, be assured that it’s the wrong thing to use. A cchar_t is a record type that contains colorpair information, video attributes, and a short Unicode string. The only thing about the Unicode string that affects your display is the first spacing character, which must also be the first character. So the rest of the string is pretty useless until someone implements a term program that handles Unicode combining characters, but you still have to build a null-terminated Unicode string to make a cchar_t.

Use the new WA_* video attributes rather than the older A_* video attributes. That is, WA_STANDOUT rather than A_STANDOUT, WA_UNDERLINE rather than A_UNDERLINE, and so on. The WA_* attributes are of the newly defined attr_t type and have their bits aligned correctly for using in cchar_t rather than chtype. On my platform, attr_t is an unsigned long int. If you have code that casts video attributes to or from int or short int, it will fail with wide video attributes. This implies using the functions that take an attr_t, such as attr_set and wattr_set, rather than attrset and wattrset, which take an int.

Use get_wch rather than getch to get input from the keyboard. If the keyboard driver delivers Unicode characters, you want the whole character rather than just the last 8 bits of it, right? More importantly, if you use getch, it will be hard to distinguish the values of Unicode characters from the keycode values that can be returned when someone presses, e.g., the delete key or the end key. With get_wch, the return value tells you which of the two you got.

Use setcchar to combine character, wide video attributes and colorpair number together into a cchar_t. Your existing curses code probably uses logical OR. The documentation says you can use OR, but the documentation is talking about single ASCII characters, chtype and narrow attributes rather than Unicode strings, cchar_t and wide attributes, and it will definitely do the wrong thing if you try to use it here. You can still use logical OR to combine wide video attributes, but don’t attempt to combine them with the character values or with narrow attributes. Note that the color pair number can be converted into a video attribute using the COLOR_PAIR(n) macro provided by ncurses, and can then be correctly combined with wide or narrow video attributes.

Now, if you jumped through all the hoops, you can compile and use an ncursesw application with support for Unicode characters.