In a previous post, I mentioned that I would be posting code to do with implementing Unicode in a programming language, along with some more commentary. Well, this is that post.
The first question, of course, is why an ‘implementation’ is needed at all. We don’t ‘implement’ ASCII, after all; we just type it and the computer understands it. In theory, our computers now understand Unicode in the same way, so we can just give them programs that include characters not found in ASCII and everything should work.
Nice theory. There are a couple of problems with this view of the universe. One is on the desk in front of me, and probably on a different desk in front of you. It’s called a keyboard. Mine sure as heck doesn’t have a key for each Unicode character; I don’t think anyone else’s does either. We can access maybe 200 characters depending on our alt-gr configuration — but most people don’t know how to type more than about 120 of them, if that. So, our first problem is that we can’t type more than a few characters.
My approach is to implement universal escape codes. We’ve all used programming languages that have ‘escape syntax’ for putting things into, e.g., strings and regular expressions that would otherwise be interpreted as syntax that does undesirable things to those constructs (for example, a quote mark has to be escaped in a string, or it’ll be read as closing the string). Those are syntax-sensitive escape codes; they exist to overcome syntax problems, and a slightly different set is required for each specialized syntax. By contrast, I’m implementing universal, or syntax-neutral, escape codes. These are universal in that they stand for the same character, regardless of whether they appear in a comment or in a string or in an identifier. So if someone uses one of these escape codes for a quotation mark inside a string, say, it will be read as the end of the string. They are simply a way to input characters, and no more than that.
So here’s some code to do that.
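#include <cassert>        // assert
#include <cwchar>         // WEOF
#include <sstream>        // std::wistringstream, std::wostringstream
#include <string>         // std::wstring
#include <unordered_map>  // std::unordered_map
class escapist{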
public:
escapist();
// initializer.
void enter_standard_escapes();
int unescape_single(std::wstring buffer);
std::wstring escape_single(int input);
std::wstring escape_single_absolute(int testchar);
private:
void enter_escape(std::wstring escseq, wchar_t standsfor);
// maps every escape sequence to a corresponding character
std::unordered_map< std::wstring, wchar_t > esc2char;
// maps characters to their preferred escape sequence.
std::unordered_map< wchar_t, std::wstring > char2esc;
};
// constructs an empty escapist object.
escapist::escapist(){
// std::wcout << L"allocating escapist \n";
}
void escapist::enter_escape(std::wstring escapeseq, wchar_t standsfor){
std::wstring bescapeseqb;
// every universal escape sequence starts and ends with a backslash,
// and otherwise contains only alphabetic characters. These lines
// are just adding a backslash to the alpha strings to render them
// in the right form for universal escape sequences.
bescapeseqb += L"\\";
bescapeseqb += escapeseq;
bescapeseqb += L"\\";
int checkval;
wchar_t charcheck;
// every escape sequence is unique; therefore every insertion in
// esc2char will succeed. But we're checking and asserting anyway.
checkval = esc2char.count(bescapeseqb);
assert(checkval == 0);
esc2char[bescapeseqb] = standsfor;
checkval = esc2char.count(bescapeseqb);
assert(checkval == 1);
charcheck = esc2char[bescapeseqb];
assert(charcheck == standsfor);
// reverse lookups otoh key on the character, which is not unique;
// therefore we check, and enter something only if nothing is currently
// entered. This makes the FIRST escape sequence entered be the one that
// is used for output.
if (char2esc.count(standsfor) == 0)
char2esc[standsfor] = bescapeseqb;
checkval = char2esc.count(standsfor);
assert(checkval == 1);
}
void escapist::enter_standard_escapes(){
// note: where multiple aliases exist for the same character, any of
// them can be used to input the character. If ascii-only rendering
// of code is requested, the system will use only the alias entered
// first as its output form. This motivates a few deviations from
// alphabetic ordering below, but otherwise this list is alphabetic
// (treating all capitals as preceding all lower-case letters) by
// alias.
// the values of the escape sequences are two characters longer than
// the string provided as an argument, and must be shorter than
// MAXESCAPE.
// These are thirty-two characters.
// 12345678901234567890123456789012
enter_escape( L"AElig", 198); // AE ligature
enter_escape( L"AND", 8743); // logical AND operator
enter_escape( L"Aacute", 193); // A with acute
enter_escape( L"Acirc", 194); // A with circumflex
enter_escape( L"Agrave", 192); // A with grave
enter_escape( L"Alpha", 913); // Greek letter (against better judgement; visually ambiguous with 'A')
enter_escape( L"Aring", 197); // A with ring above
enter_escape( L"Atilde", 195); // A with tilde
enter_escape( L"Auml", 196); // A with umlaut
enter_escape( L"Beta", 914); // Greek letter
enter_escape( L"Ccedil", 199); // C with cedilla
enter_escape( L"Chi", 935); // Greek letter
enter_escape( L"Dagger", 8225); // double dagger
enter_escape( L"Delta", 916); // Greek letter
enter_escape( L"ETH", 208); // capital Eth (nordic/icelandic)
enter_escape( L"Eacute", 201); // E with acute accent
enter_escape( L"Ecirc", 202); // E with circumflex
enter_escape( L"Egrave", 200); // E with grave accent
enter_escape( L"Epsilon", 917); // Greek letter
enter_escape( L"Eta", 919); // Greek letter
enter_escape( L"Euml", 203); // E with dieresis/umlaut
enter_escape( L"Gamma", 915); // Greek letter
enter_escape( L"Iacute", 205); // I with acute accent
enter_escape( L"Icirc", 206); // I with circumflex
enter_escape( L"Igrave", 204); // I with grave accent
enter_escape( L"Iota", 921); // Greek letter
enter_escape( L"Iuml", 207); // I with dieresis/umlaut
enter_escape( L"Kappa", 922); // Greek letter
enter_escape( L"Lambda", 923); // Greek letter
enter_escape( L"Mu", 924); // Greek letter
enter_escape( L"NAND", 8892); // logical NAND sign, AND with overbar
enter_escape( L"NOR", 8893); // logical NOR sign, OR with overbar
enter_escape( L"NOT", 172); // logical NOT operator
enter_escape( L"Ntilde", 209); // capital N with tilde
enter_escape( L"Nu", 925); // Greek letter
enter_escape( L"OElig", 338); // OE ligature
enter_escape( L"OR", 8744); // logical OR operator
enter_escape( L"Oacute", 211); // O with acute accent
enter_escape( L"Ocirc", 212); // O with circumflex
enter_escape( L"Ograve", 210); // O with grave accent
enter_escape( L"Omega", 937); // Greek letter
enter_escape( L"Omicron", 927); // Greek letter
enter_escape( L"Oslash", 216); // O with slash
enter_escape( L"Otilde", 213); // O with Tilde
enter_escape( L"Ouml", 214); // O with dieresis/umlaut
enter_escape( L"Phi", 934); // Greek letter
enter_escape( L"Pi", 928); // Greek letter
enter_escape( L"Prime", 8243); // double prime, for inches, seconds,etc. (name from XML)
enter_escape( L"Psi", 936); // Greek letter
enter_escape( L"Rho", 929); // Greek letter
enter_escape( L"Sigma", 931); // Greek letter
enter_escape( L"THORN", 222); // capital Thorn,icelandic
enter_escape( L"Tau", 932); // Greek letter
enter_escape( L"Theta", 920); // Greek letter
enter_escape( L"Uacute", 218); // U with acute accent
enter_escape( L"Ucirc", 219); // U with circumflex
enter_escape( L"Ugrave", 217); // U with grave accent
enter_escape( L"Upsilon", 933); // Greek letter
enter_escape( L"Uuml", 220); // U with dieresis/umlaut
enter_escape( L"Xi", 926); // Greek letter
enter_escape( L"XOR", 8891); // XOR sign, OR with underbar
enter_escape( L"Yacute", 221); // Y with acute accent
enter_escape( L"Yuml", 376); // Y with dieresis/umlaut
enter_escape( L"Zeta", 918); // Greek letter
enter_escape( L"bell", 7); // console bell
enter_escape( L"alarm", 7); // alarm, aka bell
enter_escape( L"a", 7); // alarm, from traditional c inline escapes
enter_escape( L"aacute", 225); // a with acute
enter_escape( L"acirc", 226); // a with circumflex
enter_escape( L"ack", 6); // acknowledge
enter_escape( L"acute", 180); // acute accent
enter_escape( L"aelig", 230); // ae ligature
enter_escape( L"agrave", 224); // a with grave
enter_escape( L"bel", 7); // bell, rendering in ascii tables
enter_escape( L"alefsym", 8501); // first transfinite cardinal
enter_escape( L"alpha", 945); // Greek letter
enter_escape( L"amp", '&'); // ampersand, unnecessary but exists in XML
enter_escape( L"angle", 8736); // angle symbol
enter_escape( L"apos", 39); // apostrophe (decimal 39, hex 27; decimal 27 is the escape character)
enter_escape( L"aring", 229); // a with ring above
enter_escape( L"approxequal", 8773); // approximately equal to
enter_escape( L"almostequal", 8776); // almost equal to
// note, XML got this wrong. They have a named entity &asymp; which
// they bill as 'asymptotically equal to', but give the code for
// 'almost equal' above. Unicode has a different character which
// means asymptotic equality, given below. I'm handling it by NOT
// supporting the XML-derived name \asymp\, which may surprise users
// to some extent. The other choices were to repeat the mistake for
// compatibility, or correct the mistake and subject the user to even
// more (more unpleasant) surprise.
enter_escape( L"asympequal", 8771); // asymptotically equal to
enter_escape( L"atilde", 227); // a with tilde
enter_escape( L"auml", 228); // a with umlaut
enter_escape( L"bdquo", 8222); // double low-9 quotation mark
enter_escape( L"because", 8757); // because sign, inverted therefore sign
enter_escape( L"beta", 946); // Greek letter
enter_escape( L"brvbar", 166); // broken vertical bar
enter_escape( L"backspace", 8); // backspace
enter_escape( L"bs", 8); // backspace
enter_escape( L"b", 8); // backspace
enter_escape( L"circledequal", 8860); // bitwise equality, circled equals
enter_escape( L"circleddash", 8861); // circled dash
enter_escape( L"circleddot", 8857); // bitwise AND operator, circled dot
enter_escape( L"circledtimes", 8855); // bitwise NAND operator, circled times
enter_escape( L"circledplus", 8853); // bitwise XOR operator, circled plus
enter_escape( L"circledminus", 8854); // bitwise NOT operator, circled minus
enter_escape( L"circledslash", 8856); // bitwise OR operator, circled slash
enter_escape( L"circledring", 8858); // bitwise NOR operator, circled ring
enter_escape( L"circledstar", 8859); // bitwise XNOR operator, circled asterisk
enter_escape( L"bullet", 8226); // punctuation
enter_escape( L"can", 24); // cancel, from ASCII tables
enter_escape( L"ccedil", 231); // c with cedilla
enter_escape( L"cedil", 184); // cedilla
enter_escape( L"cent", 162); // cents
enter_escape( L"checkmark", 10003); // checkmark
enter_escape( L"crossoff", 10007); // 'x' mark for completion
enter_escape( L"chi", 967); // Greek letter
enter_escape( L"circ", 710); // modifier letter circumflex accent
enter_escape( L"clubs", 9827); // club card suit
enter_escape( L"cong", 8773); // approximately equal to
enter_escape( L"complement", 8705); // complement of
enter_escape( L"contains", 8715); // contains as member
enter_escape( L"contourintegral", 8750); // integral sign on a circle
enter_escape( L"copy", 169); // copyright sign
enter_escape( L"crarr", 8629); // down arrow with corner left, "carriage return arrow"
enter_escape( L"cuberoot", 8731); // cube root symbol
enter_escape( L"curren", 164); // currency
enter_escape( L"dArr", 8659); // downward double arrow
enter_escape( L"dagger", 8224); // dagger mark, asterisk variant
enter_escape( L"darr", 8595); // down arrow
enter_escape( L"dc1", 17); // device control 1
enter_escape( L"dc2", 18); // device control 2
enter_escape( L"dc3", 19); // device control 3
enter_escape( L"dc4", 20); // device control 4
enter_escape( L"degrees", 176); // degrees
enter_escape( L"deg", 176); // degrees
enter_escape( L"del", 127); // delete character
enter_escape( L"delta", 948); // Greek letter
enter_escape( L"diamonds", 9830); // diamond card suit
enter_escape( L"diams", 9830); // diamond card suit
enter_escape( L"divide", 247); // division sign
enter_escape( L"doesnotcontain", 8716);
enter_escape( L"doesnotdivide", 8740);
enter_escape( L"dottedequal", 8784); // approaches the limit, dotted equality
enter_escape( L"approaches", 8784); // approaches the limit, dotted equality
enter_escape( L"dottedminus", 8760); // dotted minus sign
enter_escape( L"dottedplus", 8724); // dotted plus sign
enter_escape( L"doubleasterisk", 8273); // double asterisk, vertically aligned
enter_escape( L"doubledagger", 8225); // double dagger, asterisk variant
enter_escape( L"doubleexcl", 8252); // double exclamation mark
enter_escape( L"doubleintegral", 8748); // double integral sign
enter_escape( L"doubleprime", 8243); // double prime, for inches, seconds,etc. (name from XML)
enter_escape( L"doublequestion", 8263); // double question mark
enter_escape( L"dle", 16); // data link escape
enter_escape( L"eacute", 233); // e with acute accent
enter_escape( L"ecirc", 234); // e with circumflex
enter_escape( L"egrave", 232); // e with grave accent
enter_escape( L"ellipsis", 8943); // triple-dot ellipsis
enter_escape( L"em", 25); // end of medium, from ASCII tables
enter_escape( L"empty", 8709); // empty set, null set
enter_escape( L"emsp", 8195); // em space
enter_escape( L"enq", 5); // enquire
enter_escape( L"ensp", 8194); // en space
enter_escape( L"eot", 4); // end of transmission
enter_escape( L"epsilon", 949); // Greek letter
enter_escape( L"equivalent", 8801); // identical, triple-bar sign
enter_escape( L"equiv", 8801); // identical with
enter_escape( L"esc", 27); // escape character
enter_escape( L"eta", 951); // Greek letter
enter_escape( L"etb", 23); // end transmission block
enter_escape( L"eth", 240); // lowercase icelandic eth
enter_escape( L"etx", 3); // end of text
enter_escape( L"euml", 235); // e with dieresis/umlaut
enter_escape( L"exist", 8707); // there exists
enter_escape( L"fallingellipsis", 8945); // ellipsis from upper left to bottom right
enter_escape( L"formfeed", 12); // form feed
enter_escape( L"ff", 12); // form feed
enter_escape( L"f", 12); // form feed
enter_escape( L"forall", 8704); // forall quantification operator
enter_escape( L"fourthroot", 8732); // fourth root symbol
enter_escape( L"frac12", 189); // fraction 1/2
enter_escape( L"frac14", 188); // fraction 1/4
enter_escape( L"frac34", 190); // fraction 3/4
enter_escape( L"frasl", 8260); // fraction slash
enter_escape( L"fs", 28); // file separator
enter_escape( L"gamma", 947); // Greek letter
enter_escape( L"greaterorequal", 8805); // greater or equal to
enter_escape( L"ge", 8805); // greater or equal to, stupid name from XML
enter_escape( L"gs", 29); // group separator, from ASCII tables
enter_escape( L"gt", '>'); // greater-than, probably unneeded but exists in XML
enter_escape( L"hArr", 8660); // double ended double horizontal arrow
enter_escape( L"harr", 8596); // double ended horizontal arrow
enter_escape( L"hearts", 9829); // heart card suit, valentine symbol
enter_escape( L"hellip", 8230); // horizontal ellipsis
enter_escape( L"iacute", 237); // i with acute accent
enter_escape( L"icirc", 238); // i with circumflex
enter_escape( L"iexcl", 161); // inverted exclamation mark
enter_escape( L"igrave", 236); // i with grave accent
enter_escape( L"image", 8465); // blackletter cap I/imaginary part
enter_escape( L"increment", 8710); // increment sign
enter_escape( L"infinity", 8734); // infinity symbol, lemniscate
enter_escape( L"infin", 8734); // infinity symbol, lemniscate, abbreviation from XML
enter_escape( L"integral", 8747); // integral sign
enter_escape( L"int", 8747); // integral sign, ambiguous name from XML
enter_escape( L"intersection", 8745); // Intersection operator
enter_escape( L"interrobang", 8253); // Interrobang
enter_escape( L"cap", 8745); // Intersection operator, stupid name from XML
enter_escape( L"iota", 953); // Greek letter
enter_escape( L"iquest", 191); // inverted question mark
enter_escape( L"isin", 8712); // is an element of, is a member of
enter_escape( L"elementof", 8712); // is an element of, is a member of
enter_escape( L"iuml", 239); // i with dieresis/umlaut
enter_escape( L"kappa", 954); // Greek letter
enter_escape( L"lArr", 8656); // left double arrow/implication sign
enter_escape( L"lambda", 955); // Greek letter
enter_escape( L"lang", 9001); // left pointing angle bracket
enter_escape( L"laquo", 171); // left double angle quote, left guillemet
enter_escape( L"larr", 8592); // left arrow
enter_escape( L"lceil", 8968); // left ceiling, upstile
enter_escape( L"ldquo", 8220); // left double quote
enter_escape( L"lessorequal", 8804); // less than or equal to
enter_escape( L"le", 8804); // less than or equal to, stupid name from XML
enter_escape( L"lfloor", 8970); // left floor, downstile
enter_escape( L"lowast", 8727); // low asterisk, asterisk operator
enter_escape( L"loz", 9674); // lozenge shape
enter_escape( L"lrm", 8206); // left to right mark
enter_escape( L"lt", '<'); // less-than character, 60, probably unneeded but exists in XML
enter_escape( L"lsaquo", 8249); // left-pointing single angle quote
enter_escape( L"lsquo", 8216); // left single quote
enter_escape( L"macr", 175); // macron, APL overbar
enter_escape( L"mdash", 8212); // em dash
enter_escape( L"measuredangle", 8737); // measured angle
enter_escape( L"micr", 181); // micro sign
enter_escape( L"middot", 183); // middle dot
enter_escape( L"minus", 8722); // subtraction operator, against better judgement; visually ambiguous with hyphen.
enter_escape( L"minusplus", 8723); // like plusminus, but minus symbol is on top of plus in this one
enter_escape( L"mu", 956); // Greek letter
enter_escape( L"muchlessthan", 8810); // much less than, double less than sign
enter_escape( L"muchgreaterthan", 8811); // much greater than, double greater than sign
enter_escape( L"nabla", 8711); // backward difference
enter_escape( L"nak", 21); // negative acknowledge
enter_escape( L"nbsp", 160); // nonbreaking space
enter_escape( L"ndash", 8211); // en dash
enter_escape( L"newline", 10); // unix-standard newline character
enter_escape( L"n", 10); // newline character, from traditional inline escapes
enter_escape( L"ni", 8715); // contains as member, stupid name from XML
enter_escape( L"not", 172); // Logical NOT operator, not sign
enter_escape( L"notapproxequal", 8775); // not approximately equal to
enter_escape( L"notalmostequal", 8777); // not almost equal to
enter_escape( L"notasympequal", 8772); // not asymptotically equal to
enter_escape( L"notequal", 8800); // not equal to
enter_escape( L"notequivalent", 8802); // not identical to, crossed triple-bar sign
enter_escape( L"ne", 8800); // not equal to, stupid name from XML
enter_escape( L"notexist", 8708); // there does not exist
enter_escape( L"notin", 8713); // not a member of
enter_escape( L"notelement", 8713); // not an element of
enter_escape( L"notlessthan", 8814); // not less than (8714 is the small element-of sign)
enter_escape( L"notgreaterthan", 8815); // not greater than (8715 is contains-as-member)
enter_escape( L"notparallel", 8742); // not parallel to, crossed double vertical bar
enter_escape( L"notsubset", 8836); // not a subset of
enter_escape( L"nsub", 8836); // not a subset of, stupid name from XML
enter_escape( L"notsuperset", 8837); // not a superset of
enter_escape( L"nsup", 8837); // not a superset of, stupid name for symmetry with nsub from XML.
enter_escape( L"ntilde", 241); // n with tilde
enter_escape( L"nu", 957); // Greek letter
enter_escape( L"null", 0); // null
enter_escape( L"nul", 0); // null
enter_escape( L"oacute", 243); // o with acute accent
enter_escape( L"ocirc", 244); // o with circumflex
enter_escape( L"oelig", 339); // oe ligature
enter_escape( L"ograve", 242); // o with grave accent
enter_escape( L"oline", 8254); // spacing overline
enter_escape( L"omega", 969); // Greek letter
enter_escape( L"omicron", 959); // Greek letter
enter_escape( L"oplus", 8853); // circled plus sign, bitwise XOR (same character as circledplus), name from XML
enter_escape( L"ordf", 170); // feminine ordinal indicator
enter_escape( L"ordm", 186); // masculine ordinal indicator
enter_escape( L"oslash", 248); // o with slash
enter_escape( L"otilde", 245); // o with Tilde
enter_escape( L"otimes", 8855); // circled multiplication sign, bitwise NAND (same character as circledtimes)
enter_escape( L"ouml", 246); // o with dieresis/umlaut
enter_escape( L"paragraph", 182); // pilcrow / paragraph sign
enter_escape( L"para", 182); // pilcrow / paragraph sign
enter_escape( L"parallel", 8741); // parallel to, double vertical bar
enter_escape( L"part", 8706); // partial differential sign
enter_escape( L"permille", 8240); // permille sign
enter_escape( L"permil", 8240); // permille sign
enter_escape( L"perp", 8869); // up tack, perpendicular to
enter_escape( L"phi", 966); // Greek letter
enter_escape( L"pi", 960); // Greek letter
enter_escape( L"pilcrow", 182); // pilcrow / paragraph sign
enter_escape( L"piv", 982); // Greek letter
enter_escape( L"plusminus", 177); // plus or minus
enter_escape( L"plusmn", 177); // plus or minus
enter_escape( L"pound", 163); // pounds sterling
enter_escape( L"powerset", 8472); // capital script P, power set symbol
enter_escape( L"prime", 8242); // for feet, minutes, etc.
enter_escape( L"prod", 8719); // product operator, looks like Pi
enter_escape( L"prop", 8733); // proportional to
enter_escape( L"propersubset", 8842); // proper subset of, subset with not equal to
enter_escape( L"propersuperset", 8843); // proper superset of, superset with not equal to
enter_escape( L"psi", 968); // Greek letter
enter_escape( L"quot", L'"'); // quotation mark, probably unnecessary but exists in XML
enter_escape( L"return", 13); // dos/win end of line character
enter_escape( L"r", L'\r'); // return, 13
enter_escape( L"rArr", 8658); // right double arrow/implication sign
enter_escape( L"rang", 9002); // right pointing angle bracket
enter_escape( L"raquo", 187); // right double angle quote, right guillemet
enter_escape( L"rarr", 8594); // right arrow
enter_escape( L"rceil", 8969); // right ceiling
enter_escape( L"rdquo", 8221); // right double quote
enter_escape( L"real", 8476); // blackletter cap R/real part
enter_escape( L"refmark", 8251); // reference mark, asterisk variant
enter_escape( L"reg", 174); // registered sign
enter_escape( L"rfloor", 8971); // right floor
enter_escape( L"rho", 961); // Greek letter
enter_escape( L"rightangle", 8735); // right angle symbol
enter_escape( L"risingellipsis", 8944); // ellipsis from bottom left to upper right
enter_escape( L"rlm", 8207); // right to left mark
enter_escape( L"rs", 30); // record separator
enter_escape( L"rsaquo", 8250); // right-pointing single angle quote
enter_escape( L"rsquo", 8217); // right single quote
enter_escape( L"sbquo", 8218); // single low-9 quotation mark
enter_escape( L"scaron", 353); // s with caron
enter_escape( L"sdot", 8901); // dot operator, symbol dot
enter_escape( L"sect", 167); // section sign
enter_escape( L"shy", 173); // soft hyphen
enter_escape( L"si", 15); // shift in
enter_escape( L"sigma", 963); // Greek letter
enter_escape( L"sigmaf", 962); // Greek letter, final (word-ending) form of sigma
enter_escape( L"sim", 8764); // similar to, similar to ~
enter_escape( L"so", 14); // shift out
enter_escape( L"soh", 1); // start of header
enter_escape( L"space", L' '); // space character
enter_escape( L"spades", 9824); // spade card suit
enter_escape( L"squareroot", 8730); // square root symbol, radical sign
enter_escape( L"sphericalangle", 8738); // spherical angle sign
enter_escape( L"sqrt", 8730); // square root symbol, radical sign
enter_escape( L"suchthat", 8739); // such that, APL stile, divides, dental click, visually ambiguous with vertical bar
enter_escape( L"divides", 8739); // such that, APL stile, divides, dental click, visually ambiguous with vertical bar
enter_escape( L"APLstile", 8739); // such that, APL stile, divides, dental click, visually ambiguous with vertical bar
enter_escape( L"radic", 8730); // square root symbol, radical sign
enter_escape( L"star", 9733); // star, asterisk variant
enter_escape( L"stx", 2); // start of text
enter_escape( L"subsetof", 8834); // subset of,
enter_escape( L"sub", 8834); // subset of, stupid name from XML
enter_escape( L"subsetorequal", 8838); // subset of or equal to
enter_escape( L"sube", 8838); // subset of or equal to, stupid name from XML
enter_escape( L"subst", 26); // substitute
enter_escape( L"sum", 8721); // sum operator, but looks like Sigma
enter_escape( L"supersetof", 8835); // superset of
enter_escape( L"sup", 8835); // superset of, stupid name from XML
enter_escape( L"sup1", 185); // superscript 1
enter_escape( L"sup2", 178); // superscript 2
enter_escape( L"sup3", 179); // superscript 3
enter_escape( L"supersetorequal", 8839); // superset of or equal to
enter_escape( L"supe", 8839); // superset of or equal to, stupid name from XML
enter_escape( L"syn", 22); // synchronization char
enter_escape( L"szlig", 223); // german sz ligature, "sharp s"
enter_escape( L"tab", 9); // tab
enter_escape( L"t", 9); // tab, from c traditional inline escapes
enter_escape( L"tau", 964); // Greek letter
enter_escape( L"therefore", 8756); // therefore, three-dot proof sign
enter_escape( L"there4", 8756); // therefore, three-dot proof sign, stupid name from XML
enter_escape( L"theta", 952); // Greek letter
enter_escape( L"thinsp", 8201); // thin space
enter_escape( L"thorn", 254); // thorn, icelandic
enter_escape( L"tilde", 732); // modifier letter small tilde
enter_escape( L"times", 215); // multiplication sign
enter_escape( L"trade", 8482); // trademark sign
enter_escape( L"tripleasterisk", 8258); // triple-asterisk mark
enter_escape( L"asterism", 8258); // triple-asterisk mark
enter_escape( L"tripleintegral", 8749); // triple integral sign
enter_escape( L"tripleprime", 8244); // triple prime
enter_escape( L"uArr", 8657); // upward double arrow
enter_escape( L"uacute", 250); // u with acute accent
enter_escape( L"uarr", 8593); // up arrow
enter_escape( L"ucirc", 251); // u with circumflex
enter_escape( L"ugrave", 249); // u with grave accent
enter_escape( L"uml", 168); // umlaut/dieresis
enter_escape( L"union", 8746); // Union operator
enter_escape( L"cup", 8746); // Union operator, stupid name from XML
enter_escape( L"upsih", 978); // Greek letter
enter_escape( L"upsilon", 965); // Greek letter
enter_escape( L"us", 31); // unit separator
enter_escape( L"uuml", 252); // u with dieresis/umlaut
enter_escape( L"verticalellipsis",8942); // vertical ellipsis
enter_escape( L"vtab", 11); // vertical tab
enter_escape( L"v", 11); // vertical tab
enter_escape( L"weierp", 8472); // capital script P, power set symbol
enter_escape( L"xi", 958); // Greek letter
enter_escape( L"yacute", 253); // y with acute accent
enter_escape( L"yen", 165); // Yen sign
enter_escape( L"yuml", 255); // y with dieresis/umlaut
enter_escape( L"zeta", 950); // Greek letter
enter_escape( L"zwj", 8205); // zero width joiner
enter_escape( L"zwnj", 8204); // zero width nonjoiner
}
int escapist::unescape_single(std::wstring buffer){
    // We have a possible UEC in the buffer. We check it against the
    // escape table. If it is a UEC return the character. Else return
    // WEOF.
    auto named = esc2char.find(buffer);
    if (named != esc2char.end())
        return(named->second);
    // Not a named escape, so try to read it as a hex escape of the form
    // \U+XXXX\ : a backslash, 'u' or 'U', a plus sign, one or more hex
    // digits, and a closing backslash. This is a stupid way to parse a
    // hex unicode escape, but it won't throw an exception, it won't
    // return a wrong answer if the input isn't what we expect, it's very
    // clear to trace and understand, and it'll do what I need done 'till
    // I figure out the right way to do it.
    // The shortest possible hex escape, \U+X\, is five characters long.
    if (buffer.size() < 5) return(WEOF);
    size_t last = buffer.size() - 1;
    if (buffer[0] != L'\\' || buffer[last] != L'\\') return(WEOF);
    if (buffer[1] != L'u' && buffer[1] != L'U') return(WEOF);
    if (buffer[2] != L'+') return(WEOF);
    // no Unicode codepoint needs more than six hex digits, and capping
    // the length keeps codepoint from overflowing.
    if (last - 3 > 6) return(WEOF);
    int codepoint = 0;
    // only the characters between the '+' and the closing backslash are
    // digits; the closing backslash itself must not be parsed as one.
    for (size_t count = 3; count < last; count++){
        wchar_t next = buffer[count];
        codepoint *= 16;
        if (next >= L'0' && next <= L'9') codepoint += (next - L'0');
        else if (next >= L'A' && next <= L'F') codepoint += (10 + next - L'A');
        else if (next >= L'a' && next <= L'f') codepoint += (10 + next - L'a');
        else return(WEOF);
    }
    return(codepoint);
}
// returns a string giving a UEC for a character if one has been
// registered. Otherwise returns an empty string.
std::wstring escapist::escape_single(int testchar){
std::wstring retval;
if (char2esc.count(testchar) > 0)
retval += char2esc[testchar];
return(retval);
}
// returns a string giving a UEC for a character, regardless of
// whether one has been registered.
std::wstring escapist::escape_single_absolute(int testchar){
std::wostringstream escape;
std::wstring reg = escape_single(testchar);
if (reg.length() == 0)
escape << L"\\U+" << std::hex << testchar << '\\';
else
escape << reg;
return(escape.str());
}
Here is my escapist class, an instance of which is a member of every lexer object. Simply explained, it’s two hash tables: one mapping escape codes to wide characters, and one mapping wide characters to escape codes. The first is keyed by escape code, so you can alias all the escape codes you want and all of them will work. The second is keyed by character, so you can have only one escape code per character. This is handled simply, by having the first escape code entered be the one that is returned when the character gets escaped.
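To make the first-alias-wins rule concrete, here is a reduced sketch of the two tables. The names alias_tables, enter and make_demo_tables are made up for illustration and are not part of the escapist class; the only thing carried over from the real code is the idea of leaving the reverse table alone when the character already has an entry.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>

// Hypothetical stand-in for the escapist's two tables (illustration only).
struct alias_tables {
    std::unordered_map<std::wstring, wchar_t> esc2char;  // escape -> character
    std::unordered_map<wchar_t, std::wstring> char2esc;  // character -> preferred escape

    void enter(const std::wstring& name, wchar_t standsfor) {
        // wrap the bare name in backslashes, as enter_escape does
        std::wstring seq = L"\\" + name + L"\\";
        // every alias is usable for input
        esc2char[seq] = standsfor;
        // emplace does nothing if the key already exists,
        // so the first alias entered stays the output form
        char2esc.emplace(standsfor, seq);
    }
};

// Registers two aliases for the infinity sign, long form first.
alias_tables make_demo_tables() {
    alias_tables t;
    t.enter(L"infinity", 8734);
    t.enter(L"infin", 8734);
    return t;
}
```

With these tables, both \infinity\ and \infin\ look up to character 8734, while looking up 8734 in char2esc gives back \infinity\, the alias that was entered first.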
escapist::enter_standard_escapes just establishes a whole bunch of convenience names for characters that people will probably use a lot. For any other Unicode character that someone wants to stick into their source code, there are also escape codes consisting solely of ‘U+’ and a Unicode hex codepoint set between two backslashes, such as \U+3BB\ for the lowercase Greek letter lambda.
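The hex form is easy to check by hand, so here is a small standalone sketch of reading one. It is not the escapist's own routine: the function name parse_hex_escape, the -1 return value for "not a hex escape", and the cap of six hex digits are all assumptions made for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: returns the codepoint named by an escape of the
// form \U+XXXX\ (or \u+XXXX\), or -1 if the text is not such an escape.
long parse_hex_escape(const std::wstring& s) {
    const std::size_t n = s.size();
    // backslash, U, plus, 1 to 6 hex digits, backslash: 5 to 10 characters
    if (n < 5 || n > 10) return -1;
    if (s[0] != L'\\' || s[n - 1] != L'\\') return -1;
    if (s[1] != L'u' && s[1] != L'U') return -1;
    if (s[2] != L'+') return -1;

    long codepoint = 0;
    // the digits sit between the '+' and the closing backslash
    for (std::size_t i = 3; i + 1 < n; ++i) {
        const wchar_t c = s[i];
        int digit;
        if (c >= L'0' && c <= L'9') digit = c - L'0';
        else if (c >= L'A' && c <= L'F') digit = 10 + (c - L'A');
        else if (c >= L'a' && c <= L'f') digit = 10 + (c - L'a');
        else return -1;
        codepoint = codepoint * 16 + digit;
    }
    // anything above U+10FFFF is outside the Unicode range
    return codepoint <= 0x10FFFF ? codepoint : -1;
}
```

Under these assumptions \U+3BB\ gives 955, the same value the table registers for \lambda\, and a named escape such as \lambda\ itself is rejected because its second character is not a U.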
These escape codes simply don’t exist after they’re read; the source expressed with escape codes is exactly the same code as the source code expressed with wide characters. Once the system has read it, it doesn’t care what form it read it in. Thus, the escape codes are not part of the source; they’re simply a means of entering the source. That means, when you type the closing backslash on a form like \lambda\, the form disappears and the lowercase Greek letter lambda itself goes into your source code.
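Going the other way, the comments in the listing mention an ASCII-only rendering of code. A minimal sketch of what that could look like follows. The function ascii_render is hypothetical, it takes a reverse table shaped like char2esc as a parameter, and it makes two assumptions the post does not state: ASCII characters pass through untouched, and a literal backslash is written as a hex escape so that it cannot be mistaken for the start of an escape code.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <unordered_map>

// Hypothetical helper: renders text using only ASCII characters,
// replacing everything else with an escape code.
std::wstring ascii_render(const std::wstring& text,
                          const std::unordered_map<wchar_t, std::wstring>& char2esc) {
    std::wostringstream out;
    for (wchar_t c : text) {
        const unsigned long code = static_cast<unsigned long>(c);
        // assumption: plain ASCII other than backslash is copied as-is
        if (code < 128 && c != L'\\') {
            out << c;
            continue;
        }
        auto named = char2esc.find(c);
        if (named != char2esc.end())
            out << named->second;                          // preferred name
        else
            out << L"\\U+" << std::hex << code << L"\\";   // hex fallback
    }
    return out.str();
}
```

Given a table that only knows \lambda\, a string containing lambda and the infinity sign comes out with \lambda\ for the first and \U+221e\ for the second; reading that text back through the escape tables would produce the original wide characters again.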
The rest of the magic is in lexer::getchar(), which calls the “unescape_single” routine and manages the lexer’s queue of characters read but not yet parsed. Usually, it just checks its queue to see if it has a character to return, and returns the first such character if so, removing it from the queue. If the queue is empty, then it reads a character, and if it’s not a backslash, returns it. Finding a backslash requires some reading ahead, either to a closing backslash, an I/O error, or the maximum length an escape sequence can be. So until one of those things happens, characters are read into the queue. If there’s a closing backslash, we call unescape_single to try to match it with an escape code. Otherwise, or if there’s no match, lexer::getchar just deletes the initial backslash from the queue and returns it, leaving the rest of the queue to be handed out by later calls.
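The lexer itself isn’t shown in this post, so the following is only a sketch of the queue handling described above, not the real lexer::getchar. The class name sketch_lexer, the lookup callback standing in for escapist::unescape_single, and the value chosen for max_escape_len are all assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <cwchar>
#include <deque>
#include <functional>
#include <sstream>
#include <string>

// Assumed limit on the length of an escape, backslashes included.
const std::size_t max_escape_len = 34;

class sketch_lexer {
public:
    // lookup stands in for unescape_single: it returns the character for
    // a complete escape such as \lambda\, or WEOF if there is no match.
    sketch_lexer(std::wistream& in,
                 std::function<std::wint_t(const std::wstring&)> lookup)
        : in_(in), lookup_(lookup) {}

    std::wint_t getchar() {
        // characters already read but not yet handed out come first
        if (!pending_.empty()) {
            const wchar_t c = pending_.front();
            pending_.pop_front();
            return c;
        }
        const std::wint_t c = in_.get();
        if (c != L'\\') return c;  // ordinary character, or WEOF at end of input

        // read ahead to a closing backslash, end of input, or the length limit
        std::wstring candidate(1, L'\\');
        bool closed = false;
        while (!closed && candidate.size() < max_escape_len) {
            const std::wint_t next = in_.get();
            if (next == WEOF) break;
            candidate += static_cast<wchar_t>(next);
            closed = (next == L'\\');
        }
        if (closed) {
            const std::wint_t decoded = lookup_(candidate);
            if (decoded != WEOF) return decoded;  // the escape disappears
        }
        // no match: return the backslash, queue everything after it
        pending_.assign(candidate.begin() + 1, candidate.end());
        return L'\\';
    }

private:
    std::wistream& in_;
    std::function<std::wint_t(const std::wstring&)> lookup_;
    std::deque<wchar_t> pending_;
};
```

One consequence of this reading of the description: a backslash that closes a failed escape is handed out from the queue as an ordinary character, so it does not get a second chance to open a new escape. Whether the real lexer behaves that way is not something the post settles.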
So, anyway; the escapist is a very simple object. It didn’t even need a custom constructor.