In a previous post, I mentioned that I would be posting code to do with implementing Unicode in a programming language, along with some more commentary. Well, this is that post.
The first question, of course, is why an ‘implementation’ is needed at all. We don’t ‘implement’ ASCII, after all; we just type it and the computer understands it. In theory, our computers now understand Unicode in the same way, so we can just give them programs that include characters not found in ASCII and everything should work.
Nice theory. There are a couple of problems with this view of the universe. One is on the desk in front of me, and probably on a different desk in front of you. It’s called a keyboard. Mine sure as heck doesn’t have a key for each Unicode character; I don’t think anyone else’s does either. We can access maybe 200 characters depending on our alt-gr configuration — but most people don’t know how to type more than about 120 of them, if that. So, our first problem is that we can’t type more than a few characters.
My approach is to implement universal escape codes. We’ve all used programming languages that have ‘escape syntax’ for putting things into, e.g., strings and regular expressions that would otherwise be interpreted as syntax that does undesirable things to those constructs (for example, a quote mark has to be escaped in a string, or it’ll be read as closing the string). Those are syntax-sensitive escape codes; they exist to overcome syntax problems, and a slightly different set is required for each specialized syntax. By contrast, I’m implementing universal, or syntax-neutral, escape codes. These are universal in that they stand for the same character, regardless of whether they appear in a comment or in a string or in an identifier. So if someone uses one of these escape codes for a quotation mark inside a string, say, it will be read as the end of the string. They are simply a way to input characters, and no more than that.
So here’s some code to do that.
#include <cassert>
#include <cwchar>
#include <sstream>
#include <string>
#include <unordered_map>

class escapist{
public:
    escapist();  // initializer.
    void enter_standard_escapes();
    int unescape_single(std::wstring buffer);
    std::wstring escape_single(int input);
    std::wstring escape_single_absolute(int testchar);

private:
    void enter_escape(std::wstring escseq, wchar_t standsfor);

    // maps every escape sequence to a corresponding character
    std::unordered_map< std::wstring, wchar_t > esc2char;

    // maps characters to their preferred escape sequence.
    std::unordered_map< wchar_t, std::wstring > char2esc;
};

// constructs an empty escapist object.
escapist::escapist(){
    // std::wcout << L"allocating escapist \n";
}

void escapist::enter_escape(std::wstring escapeseq, wchar_t standsfor){
    std::wstring bescapeseqb;
    // every universal escape sequence starts and ends with a backslash,
    // and otherwise contains only alphabetic characters.  These lines
    // are just adding a backslash to the alpha strings to render them
    // in the right form for universal escape sequences.
    bescapeseqb += L"\\";
    bescapeseqb += escapeseq;
    bescapeseqb += L"\\";

    int checkval;
    wchar_t charcheck;

    // every escape sequence is unique; therefore every insertion in
    // esc2char will succeed.  But we're checking and asserting anyway.
    checkval = esc2char.count(bescapeseqb);
    assert(checkval == 0);
    esc2char[bescapeseqb] = standsfor;
    checkval = esc2char.count(bescapeseqb);
    assert(checkval == 1);
    charcheck = esc2char[bescapeseqb];
    assert(charcheck == standsfor);

    // reverse lookups otoh key on the character, which is not unique;
    // therefore we check, and enter something only if nothing is currently
    // entered.  This makes the FIRST escape sequence entered be the one that
    // is used for output.
    if (char2esc.count(standsfor) == 0) char2esc[standsfor] = bescapeseqb;
    checkval = char2esc.count(standsfor);
    assert(checkval == 1);  // was "checkval = 1", an assignment that always passed
}

void escapist::enter_standard_escapes(){
    // note: where multiple aliases exist for the same character, any of
    // them can be used to input the character.
    // If ascii-only rendering of code is requested, the system will use
    // only the alias entered first as its output form.  This motivates a
    // few deviations from alphabetic ordering below, but otherwise this
    // list is alphabetic (treating all capitals as preceding all
    // lower-case letters) by alias.

    // the values of the escape sequences are two characters longer than
    // the string provided as an argument, and must be shorter than
    // MAXESCAPE.

    // These are thirty-two characters.
    // 12345678901234567890123456789012
    enter_escape( L"AElig", 198);  // AE ligature
    enter_escape( L"AND", 8743);  // logical AND operator
    enter_escape( L"Aacute", 193);  // A with acute
    enter_escape( L"Acirc", 194);  // A with circumflex
    enter_escape( L"Agrave", 192);  // A with grave
    enter_escape( L"Alpha", 913);  // Greek letter (against better judgement; visually ambiguous with 'A')
    enter_escape( L"Aring", 197);  // A with ring above
    enter_escape( L"Atilde", 195);  // A with tilde
    enter_escape( L"Auml", 196);  // A with umlaut
    enter_escape( L"Beta", 914);  // Greek letter
    enter_escape( L"Ccedil", 199);  // C with cedilla
    enter_escape( L"Chi", 935);  // Greek letter
    enter_escape( L"Dagger", 8225);  // double dagger
    enter_escape( L"Delta", 916);  // Greek letter
    enter_escape( L"ETH", 208);  // capital Eth (nordic/icelandic)
    enter_escape( L"Eacute", 201);  // E with acute accent
    enter_escape( L"Ecirc", 202);  // E with circumflex
    enter_escape( L"Egrave", 200);  // E with grave accent
    enter_escape( L"Epsilon", 917);  // Greek letter
    enter_escape( L"Eta", 919);  // Greek letter
    enter_escape( L"Euml", 203);  // E with dieresis/umlaut
    enter_escape( L"Gamma", 915);  // Greek letter
    enter_escape( L"Iacute", 205);  // I with acute accent
    enter_escape( L"Icirc", 206);  // I with circumflex
    enter_escape( L"Igrave", 204);  // I with grave accent
    enter_escape( L"Iota", 921);  // Greek letter
    enter_escape( L"Iuml", 207);  // I with dieresis/umlaut
    enter_escape( L"Kappa", 922);  // Greek letter
    enter_escape( L"Lambda", 923);  // Greek letter
    enter_escape( L"Mu", 924);  // Greek letter
    enter_escape( L"NAND", 8892);  // logical NAND sign, AND with overbar
    enter_escape( L"NOR", 8893);  // logical NOR sign, OR with overbar
    enter_escape( L"NOT", 172);  // logical NOT operator
    enter_escape( L"Ntilde", 209);  // capital N with tilde
    enter_escape( L"Nu", 925);  // Greek letter
    enter_escape( L"OElig", 338);  // OE ligature
    enter_escape( L"OR", 8744);  // logical OR operator
    enter_escape( L"Oacute", 211);  // O with acute accent
    enter_escape( L"Ocirc", 212);  // O with circumflex
    enter_escape( L"Ograve", 210);  // O with grave accent
    enter_escape( L"Omega", 937);  // Greek letter
    enter_escape( L"Omicron", 927);  // Greek letter
    enter_escape( L"Oslash", 216);  // O with slash
    enter_escape( L"Otilde", 213);  // O with Tilde
    enter_escape( L"Ouml", 214);  // O with dieresis/umlaut
    enter_escape( L"Phi", 934);  // Greek letter
    enter_escape( L"Pi", 928);  // Greek letter
    enter_escape( L"Prime", 8243);  // double prime, for inches, seconds, etc. (name from XML)
    enter_escape( L"Psi", 936);  // Greek letter
    enter_escape( L"Rho", 929);  // Greek letter
    enter_escape( L"Sigma", 931);  // Greek letter
    enter_escape( L"THORN", 222);  // capital Thorn, icelandic
    enter_escape( L"Tau", 932);  // Greek letter
    enter_escape( L"Theta", 920);  // Greek letter
    enter_escape( L"Uacute", 218);  // U with acute accent
    enter_escape( L"Ucirc", 219);  // U with circumflex
    enter_escape( L"Ugrave", 217);  // U with grave accent
    enter_escape( L"Upsilon", 933);  // Greek letter
    enter_escape( L"Uuml", 220);  // U with dieresis/umlaut
    enter_escape( L"Xi", 926);  // Greek letter
    enter_escape( L"XOR", 8891);  // XOR sign, OR with underbar
    enter_escape( L"Yacute", 221);  // Y with acute accent
    enter_escape( L"Yuml", 376);  // Y with dieresis/umlaut
    enter_escape( L"Zeta", 918);  // Greek letter
    enter_escape( L"bell", 7);  // console bell
    enter_escape( L"alarm", 7);  // traditional c inline escape
    enter_escape( L"a", 7);  // alarm, aka bell
    enter_escape( L"aacute", 225);  // a with acute
    enter_escape( L"acirc", 226);  // a with circumflex
    enter_escape( L"ack", 6);  // acknowledge
    enter_escape( L"acute", 180);  // acute accent
    enter_escape( L"aelig", 230);  // ae ligature
    enter_escape( L"agrave", 224);  // a with grave
    enter_escape( L"bel", 7);  // bell, rendering in ascii tables
    enter_escape( L"alefsym", 8501);  // first transfinite cardinal
    enter_escape( L"alpha", 945);  // Greek letter
    enter_escape( L"amp", '&');  // ampersand, unnecessary but exists in XML
    enter_escape( L"angle", 8736);  // angle symbol
    enter_escape( L"apos", 39);  // apostrophe (decimal 39 is hex 27; decimal 27 is the escape character)
    enter_escape( L"aring", 229);  // a with ring above
    enter_escape( L"approxequal", 8773);  // approximately equal to
    enter_escape( L"almostequal", 8776);  // almost equal to

    // note, XML got this wrong.  They have a named entity &asymp; which
    // they bill as 'asymptotically equal to', but give the code for
    // 'almost equal' above.  Unicode has a different character which
    // means asymptotic equality, given below.  I'm handling it by NOT
    // supporting the XML-derived name \asymp\, which may surprise users
    // to some extent.  The other choices were to repeat the mistake for
    // compatibility, or correct the mistake and subject the user to even
    // more (more unpleasant) surprise.
    enter_escape( L"asympequal", 8771);  // asymptotically equal to
    enter_escape( L"atilde", 227);  // a with tilde
    enter_escape( L"auml", 228);  // a with umlaut
    enter_escape( L"bdquo", 8222);  // double low-9 quotation mark
    enter_escape( L"because", 8757);  // because sign, inverted therefore sign
    enter_escape( L"beta", 946);  // Greek letter
    enter_escape( L"brvbar", 166);  // broken vertical bar
    enter_escape( L"backspace", 8);  // backspace
    enter_escape( L"bs", 8);  // backspace
    enter_escape( L"b", 8);  // backspace
    enter_escape( L"circledequal", 8860);  // bitwise equality, circled equals
    enter_escape( L"circleddash", 8861);  // circled dash
    enter_escape( L"circleddot", 8857);  // bitwise AND operator, circled dot
    enter_escape( L"circledtimes", 8855);  // bitwise NAND operator, circled times
    enter_escape( L"circledplus", 8853);  // bitwise XOR operator, circled plus
    enter_escape( L"circledminus", 8854);  // bitwise NOT operator, circled minus
    enter_escape( L"circledslash", 8856);  // bitwise OR operator, circled slash
    enter_escape( L"circledring", 8858);  // bitwise NOR operator, circled ring
    enter_escape( L"circledstar", 8859);  // bitwise XNOR operator, circled asterisk
    enter_escape( L"bullet", 8226);  // punctuation
    enter_escape( L"can", 24);  // cancel, shown as CAN in ASCII tables
    enter_escape( L"ccedil", 231);  // c with cedilla
    enter_escape( L"cedil", 184);  // cedilla
    enter_escape( L"cent", 162);  // cents
    enter_escape( L"checkmark", 10003);  // checkmark
    enter_escape( L"crossoff", 10007);  // 'x' mark for completion
    enter_escape( L"chi", 967);  // Greek letter
    enter_escape( L"circ", 710);  // modifier letter circumflex accent
    enter_escape( L"clubs", 9827);  // club card suit
    enter_escape( L"cong", 8773);  // approximately equal to
    enter_escape( L"complement", 8705);  // complement of
    enter_escape( L"contains", 8715);  // contains as member
    enter_escape( L"contourintegral", 8750);  // integral sign on a circle
    enter_escape( L"copy", 169);  // copyright sign
    enter_escape( L"crarr", 8629);  // down arrow with corner left, "carriage return arrow"
    enter_escape( L"cuberoot", 8731);  // cube root symbol
    enter_escape( L"curren", 164);  // currency
    enter_escape( L"dArr", 8659);  // downward double arrow
    enter_escape( L"dagger", 8224);  // dagger mark, asterisk variant
    enter_escape( L"darr", 8595);  // down arrow
    enter_escape( L"dc1", 17);  // device control 1
    enter_escape( L"dc2", 18);  // device control 2
    enter_escape( L"dc3", 19);  // device control 3
    enter_escape( L"dc4", 20);  // device control 4
    enter_escape( L"degrees", 176);  // degrees
    enter_escape( L"deg", 176);  // degrees
    enter_escape( L"del", 127);  // delete character
    enter_escape( L"delta", 948);  // Greek letter
    enter_escape( L"diamonds", 9830);  // diamond card suit
    enter_escape( L"diams", 9830);  // diamond card suit
    enter_escape( L"divide", 247);  // division sign
    enter_escape( L"doesnotcontain", 8716);
    enter_escape( L"doesnotdivide", 8740);
    enter_escape( L"dottedequal", 8784);  // approaches the limit, dotted equality
    enter_escape( L"approaches", 8784);  // approaches the limit, dotted equality
    enter_escape( L"dottedminus", 8760);  // dotted minus sign
    enter_escape( L"dottedplus", 8724);  // dotted plus sign
    enter_escape( L"doubleasterisk", 8273);  // double asterisk, vertically aligned
    enter_escape( L"doubledagger", 8225);  // double dagger, asterisk variant
    enter_escape( L"doubleexcl", 8252);  // double exclamation mark
    enter_escape( L"doubleintegral", 8748);  // double integral sign
    enter_escape( L"doubleprime", 8243);  // double prime, for inches, seconds, etc. (name from XML)
    enter_escape( L"doublequestion", 8263);  // double question mark
    enter_escape( L"dle", 16);  // data link escape
    enter_escape( L"eacute", 233);  // e with acute accent
    enter_escape( L"ecirc", 234);  // e with circumflex
    enter_escape( L"egrave", 232);  // e with grave accent
    enter_escape( L"ellipsis", 8943);  // triple-dot ellipsis
    enter_escape( L"em", 25);  // end of medium, shown as EM in ASCII tables
    enter_escape( L"empty", 8709);  // empty set, null set
    enter_escape( L"emsp", 8195);  // em space
    enter_escape( L"enq", 5);  // enquire
    enter_escape( L"ensp", 8194);  // en space
    enter_escape( L"eot", 4);  // end of transmission
    enter_escape( L"epsilon", 949);  // Greek letter
    enter_escape( L"equivalent", 8801);  // identical, triple-bar sign
    enter_escape( L"equiv", 8801);  // identical with
    enter_escape( L"esc", 27);  // escape character
    enter_escape( L"eta", 951);  // Greek letter
    enter_escape( L"etb", 23);  // end transmission block
    enter_escape( L"eth", 240);  // lowercase icelandic eth
    enter_escape( L"etx", 3);  // end of text
    enter_escape( L"euml", 235);  // e with dieresis/umlaut
    enter_escape( L"exist", 8707);  // there exists
    enter_escape( L"fallingellipsis", 8945);  // ellipsis from upper left to bottom right
    enter_escape( L"formfeed", 12);  // form feed
    enter_escape( L"ff", 12);  // form feed
    enter_escape( L"f", 12);  // form feed
    enter_escape( L"forall", 8704);  // forall quantification operator
    enter_escape( L"fourthroot", 8732);  // fourth root symbol
    enter_escape( L"frac12", 189);  // fraction 1/2
    enter_escape( L"frac14", 188);  // fraction 1/4
    enter_escape( L"frac34", 190);  // fraction 3/4
    enter_escape( L"frasl", 8260);  // fraction slash
    enter_escape( L"fs", 28);  // file separator
    enter_escape( L"gamma", 947);  // Greek letter
    enter_escape( L"greaterorequal", 8805);  // greater or equal to
    enter_escape( L"ge", 8805);  // greater or equal to, stupid name from XML
    enter_escape( L"gs", 29);  // group separator, from ASCII tables
    enter_escape( L"gt", '>');  // greater-than, probably unneeded but exists in XML
    enter_escape( L"hArr", 8660);  // double ended double horizontal arrow
    enter_escape( L"harr", 8596);  // double ended horizontal arrow
    enter_escape( L"hearts", 9829);  // heart card suit, valentine symbol
    enter_escape( L"hellip", 8230);  // horizontal ellipsis
    enter_escape( L"iacute", 237);  // i with acute accent
    enter_escape( L"icirc", 238);  // i with circumflex
    enter_escape( L"iexcl", 161);  // inverted exclamation mark
    enter_escape( L"igrave", 236);  // i with grave accent
    enter_escape( L"image", 8465);  // blackletter cap I/imaginary part
    enter_escape( L"increment", 8710);  // increment sign
    enter_escape( L"infinity", 8734);  // infinity symbol, lemniscate
    enter_escape( L"infin", 8734);  // infinity symbol, lemniscate, abbreviation from XML
    enter_escape( L"integral", 8747);  // integral sign
    enter_escape( L"int", 8747);  // integral sign, ambiguous name from XML
    enter_escape( L"intersection", 8745);  // Intersection operator
    enter_escape( L"interrobang", 8253);  // Interrobang
    enter_escape( L"cap", 8745);  // Intersection operator, stupid name from XML
    enter_escape( L"iota", 953);  // Greek letter
    enter_escape( L"iquest", 191);  // inverted question mark
    enter_escape( L"isin", 8712);  // is an element of, is a member of
    enter_escape( L"elementof", 8712);  // is an element of, is a member of
    enter_escape( L"iuml", 239);  // i with dieresis/umlaut
    enter_escape( L"kappa", 954);  // Greek letter
    enter_escape( L"lArr", 8656);  // left double arrow/implication sign
    enter_escape( L"lambda", 955);  // Greek letter
    enter_escape( L"lang", 9001);  // left pointing angle bracket
    enter_escape( L"laquo", 171);  // left double angle quote, left guillemet
    enter_escape( L"larr", 8592);  // left arrow
    enter_escape( L"lceil", 8968);  // left ceiling, upstile
    enter_escape( L"ldquo", 8220);  // left double quote
    enter_escape( L"lessorequal", 8804);  // less than or equal to
    enter_escape( L"le", 8804);  // less than or equal to, stupid name from XML
    enter_escape( L"lfloor", 8970);  // left floor, downstile
    enter_escape( L"lowast", 8727);  // low asterisk, asterisk operator
    enter_escape( L"loz", 9674);  // lozenge shape
    enter_escape( L"lrm", 8206);  // left to right mark
    enter_escape( L"lt", '<');  // less-than character, 60, probably unneeded but exists in XML
    enter_escape( L"lsaquo", 8249);  // left-pointing single angle quote
    enter_escape( L"lsquo", 8216);  // left single quote
    enter_escape( L"macr", 175);  // macron, APL overbar
    enter_escape( L"mdash", 8212);  // em dash
    enter_escape( L"measuredangle", 8737);  // measured angle
    enter_escape( L"micr", 181);  // micro sign
    enter_escape( L"middot", 183);  // middle dot
    enter_escape( L"minus", 8722);  // subtraction operator, against better judgement; visually ambiguous with hyphen.
    enter_escape( L"minusplus", 8723);  // like plusminus, but minus symbol is on top of plus in this one
    enter_escape( L"mu", 956);  // Greek letter
    enter_escape( L"muchlessthan", 8810);  // much less than, double less than sign
    enter_escape( L"muchgreaterthan", 8811);  // much greater than, double greater than sign
    enter_escape( L"nabla", 8711);  // backward difference
    enter_escape( L"nak", 21);  // negative acknowledge
    enter_escape( L"nbsp", 160);  // nonbreaking space
    enter_escape( L"ndash", 8211);  // en dash
    enter_escape( L"newline", 10);  // unix-standard newline character
    enter_escape( L"n", 10);  // newline character, from traditional inline escapes
    enter_escape( L"ni", 8715);  // contains as member, stupid name from XML
    enter_escape( L"not", 172);  // Logical NOT operator, not sign
    enter_escape( L"notapproxequal", 8775);  // not approximately equal to
    enter_escape( L"notalmostequal", 8777);  // not almost equal to
    enter_escape( L"notasympequal", 8772);  // not asymptotically equal to
    enter_escape( L"notequal", 8800);  // not equal to
    enter_escape( L"notequivalent", 8802);  // not identical to, crossed triple-bar sign
    enter_escape( L"ne", 8800);  // not equal to, stupid name from XML
    enter_escape( L"notexist", 8708);  // there does not exist
    enter_escape( L"notin", 8713);  // not a member of
    enter_escape( L"notelement", 8713);  // not an element of
    enter_escape( L"notlessthan", 8814);  // not less than (was 8714, which is 'small element of')
    enter_escape( L"notgreaterthan", 8815);  // not greater than (was 8715, which is 'contains as member')
    enter_escape( L"notparallel", 8742);  // not parallel to, crossed double vertical bar
    enter_escape( L"notsubset", 8836);  // not a subset of
    enter_escape( L"nsub", 8836);  // not a subset of, stupid name from XML
    enter_escape( L"notsuperset", 8837);  // not a superset of
    enter_escape( L"nsup", 8837);  // not a superset of, stupid name for symmetry with nsub from XML.
    enter_escape( L"ntilde", 241);  // n with tilde
    enter_escape( L"nu", 957);  // Greek letter
    enter_escape( L"null", 0);  // null
    enter_escape( L"nul", 0);  // null
    enter_escape( L"oacute", 243);  // o with acute accent
    enter_escape( L"ocirc", 244);  // o with circumflex
    enter_escape( L"oelig", 339);  // oe ligature
    enter_escape( L"ograve", 242);  // o with grave accent
    enter_escape( L"oline", 8254);  // spacing overline
    enter_escape( L"omega", 969);  // Greek letter
    enter_escape( L"omicron", 959);  // Greek letter
    enter_escape( L"oplus", 8853);  // circled plus sign, bitwise OR, name from XML
    enter_escape( L"ordf", 170);  // feminine ordinal indicator
    enter_escape( L"ordm", 186);  // masculine ordinal indicator
    enter_escape( L"oslash", 248);  // o with slash
    enter_escape( L"otilde", 245);  // o with Tilde
    enter_escape( L"otimes", 8855);  // circled multiplication sign, bitwise XOR
    enter_escape( L"ouml", 246);  // o with dieresis/umlaut
    enter_escape( L"paragraph", 182);  // pilcrow / paragraph sign
    enter_escape( L"para", 182);  // pilcrow / paragraph sign
    enter_escape( L"parallel", 8741);  // parallel to, double vertical bar
    enter_escape( L"part", 8706);  // partial differential sign
    enter_escape( L"permille", 8240);  // permille sign
    enter_escape( L"permil", 8240);  // permille sign
    enter_escape( L"perp", 8869);  // up tack, perpendicular to
    enter_escape( L"phi", 966);  // Greek letter
    enter_escape( L"pi", 960);  // Greek letter
    enter_escape( L"pilcrow", 182);  // pilcrow / paragraph sign
    enter_escape( L"piv", 982);  // Greek letter
    enter_escape( L"plusminus", 177);  // plus or minus
    enter_escape( L"plusmn", 177);  // plus or minus
    enter_escape( L"pound", 163);  // pounds sterling
    enter_escape( L"powerset", 8472);  // capital script P, power set symbol
    enter_escape( L"prime", 8242);  // for feet, minutes, etc.
    enter_escape( L"prod", 8719);  // product operator, looks like Pi
    enter_escape( L"prop", 8733);  // proportional to
    enter_escape( L"propersubset", 8842);  // proper subset of, subset with not equal to
    enter_escape( L"propersuperset", 8843);  // proper superset of, superset with not equal to
    enter_escape( L"psi", 968);  // Greek letter
    enter_escape( L"quot", L'"');  // quotation mark, probably unnecessary but exists in XML
    enter_escape( L"return", 13);  // dos/win end of line character
    enter_escape( L"r", L'\r');  // return, 13
    enter_escape( L"rArr", 8658);  // right double arrow/implication sign
    enter_escape( L"rang", 9002);  // right pointing angle bracket
    enter_escape( L"raquo", 187);  // right double angle quote, right guillemet
    enter_escape( L"rarr", 8594);  // right arrow
    enter_escape( L"rceil", 8969);  // right ceiling
    enter_escape( L"rdquo", 8221);  // right double quote
    enter_escape( L"real", 8476);  // blackletter cap R/real part
    enter_escape( L"refmark", 8251);  // reference mark, asterisk variant
    enter_escape( L"reg", 174);  // registered sign
    enter_escape( L"rfloor", 8971);  // right floor
    enter_escape( L"rho", 961);  // Greek letter
    enter_escape( L"rightangle", 8735);  // right angle symbol
    enter_escape( L"risingellipsis", 8944);  // ellipsis from bottom left to upper right
    enter_escape( L"rlm", 8207);  // right to left mark
    enter_escape( L"rs", 30);  // record separator
    enter_escape( L"rsaquo", 8250);  // right-pointing single angle quote
    enter_escape( L"rsquo", 8217);  // right single quote
    enter_escape( L"sbquo", 8218);  // single low-9 quotation mark
    enter_escape( L"scaron", 353);  // s with caron
    enter_escape( L"sdot", 8901);  // dot operator, symbol dot
    enter_escape( L"sect", 167);  // section sign
    enter_escape( L"shy", 173);  // soft hyphen
    enter_escape( L"si", 15);  // shift in
    enter_escape( L"sigma", 963);  // Greek letter (the two sigma values were swapped)
    enter_escape( L"sigmaf", 962);  // Greek letter, final form of sigma
    enter_escape( L"sim", 8764);  // similar to, similar to ~
    enter_escape( L"so", 14);  // shift out
    enter_escape( L"soh", 1);  // start of header
    enter_escape( L"space", L' ');  // space character
    enter_escape( L"spades", 9824);  // spade card suit
    enter_escape( L"squareroot", 8730);  // square root symbol, radical sign
    enter_escape( L"sphericalangle", 8738);  // spherical angle sign
    enter_escape( L"sqrt", 8730);  // square root symbol, radical sign
    enter_escape( L"suchthat", 8739);  // such that, APL stile, divides, dental click, visually ambiguous with vertical bar
    enter_escape( L"divides", 8739);  // such that, APL stile, divides, dental click, visually ambiguous with vertical bar
    enter_escape( L"APLstile", 8739);  // such that, APL stile, divides, dental click, visually ambiguous with vertical bar
    enter_escape( L"radic", 8730);  // square root symbol, radical sign
    enter_escape( L"star", 9733);  // star, asterisk variant
    enter_escape( L"stx", 2);  // start of text
    enter_escape( L"subsetof", 8834);  // subset of
    enter_escape( L"sub", 8834);  // subset of, stupid name from XML
    enter_escape( L"subsetorequal", 8838);  // subset of or equal to
    enter_escape( L"sube", 8838);  // subset of or equal to, stupid name from XML
    enter_escape( L"subst", 26);  // substitute
    enter_escape( L"sum", 8721);  // sum operator, but looks like Sigma
    enter_escape( L"supersetof", 8835);  // superset of
    enter_escape( L"sup", 8835);  // superset of, stupid name from XML
    enter_escape( L"sup1", 185);  // superscript 1
    enter_escape( L"sup2", 178);  // superscript 2
    enter_escape( L"sup3", 179);  // superscript 3
    enter_escape( L"supersetorequal", 8839);  // superset of or equal to
    enter_escape( L"supe", 8839);  // superset of or equal to, stupid name from XML
    enter_escape( L"syn", 22);  // synchronization char
    enter_escape( L"szlig", 223);  // german sz ligature, "sharp s"
    enter_escape( L"tab", 9);  // tab
    enter_escape( L"t", 9);  // tab, from c traditional inline escapes
    enter_escape( L"tau", 964);  // Greek letter
    enter_escape( L"therefore", 8756);  // therefore, three-dot proof sign
    enter_escape( L"there4", 8756);  // therefore, three-dot proof sign, stupid name from XML
    enter_escape( L"theta", 952);  // Greek letter
    enter_escape( L"thinsp", 8201);  // thin space
    enter_escape( L"thorn", 254);  // thorn, icelandic
    enter_escape( L"tilde", 732);  // modifier letter small tilde
    enter_escape( L"times", 215);  // multiplication sign
    enter_escape( L"trade", 8482);  // trademark sign
    enter_escape( L"tripleasterisk", 8258);  // triple-asterisk mark
    enter_escape( L"asterism", 8258);  // triple-asterisk mark
    enter_escape( L"tripleintegral", 8749);  // triple integral sign
    enter_escape( L"tripleprime", 8244);  // triple prime
    enter_escape( L"uArr", 8657);  // upward double arrow
    enter_escape( L"uacute", 250);  // u with acute accent
    enter_escape( L"uarr", 8593);  // up arrow
    enter_escape( L"ucirc", 251);  // u with circumflex
    enter_escape( L"ugrave", 249);  // u with grave accent
    enter_escape( L"uml", 168);  // umlaut/dieresis
    enter_escape( L"union", 8746);  // Union operator
    enter_escape( L"cup", 8746);  // Union operator, stupid name from XML
    enter_escape( L"upsih", 978);  // Greek letter
    enter_escape( L"upsilon", 965);  // Greek letter
    enter_escape( L"us", 31);  // unit separator
    enter_escape( L"uuml", 252);  // u with dieresis/umlaut
    enter_escape( L"verticalellipsis", 8942);  // vertical ellipsis
    enter_escape( L"vtab", 11);  // vertical tab
    enter_escape( L"v", 11);  // vertical tab
    enter_escape( L"weierp", 8472);  // capital script P, power set symbol
    enter_escape( L"xi", 958);  // Greek letter
    enter_escape( L"yacute", 253);  // y with acute accent
    enter_escape( L"yen", 165);  // Yen sign
    enter_escape( L"yuml", 255);  // y with dieresis/umlaut
    enter_escape( L"zeta", 950);  // Greek letter
    enter_escape( L"zwj", 8205);  // zero width joiner
    enter_escape( L"zwnj", 8204);  // zero width nonjoiner
}

int escapist::unescape_single(std::wstring buffer){
    // We have a possible UEC in the buffer.  We check it against the
    // escape table.  If it is a UEC return the character.  Else return
    // WEOF.
    if (esc2char.count(buffer) > 0) return(esc2char[buffer]);

    // Otherwise try the hex form: backslash, 'u' or 'U', '+', one or more
    // hex digits, backslash.  The shortest such form is five characters,
    // and rejecting anything shorter keeps codepoint from being returned
    // without ever having been set.
    if (buffer.size() < 5) return(WEOF);

    int codepoint = 0;
    bool valid = true;

    // This is a stupid way to read/parse a hex unicode escape, but it
    // won't throw an exception, it won't return a wrong answer if the
    // input isn't what we expect, it's very clear to trace and
    // understand, and it'll do what I need done 'till I figure out
    // the right way to do it.
    for (size_t count = 0; valid && count < buffer.size(); count++){
        wchar_t next = buffer[count];
        if (count == 0 || count + 1 == buffer.size()){
            // the opening and closing backslashes.
            if (next != L'\\') valid = false;
        }
        else if (count == 1){
            if (next != L'u' && next != L'U') valid = false;
        }
        else if (count == 2){
            if (next != L'+') valid = false;
        }
        else{
            // everything between the '+' and the closing backslash must
            // be a hex digit.  (The closing backslash used to fall into
            // this branch too, which rejected every hex escape.)
            codepoint *= 16;
            if (next >= L'0' && next <= L'9') codepoint += (next - L'0');
            else if (next >= L'A' && next <= L'F') codepoint += (10 + next - L'A');
            else if (next >= L'a' && next <= L'f') codepoint += (10 + next - L'a');
            else valid = false;
            // nothing above 0x10FFFF is a Unicode code point; stopping
            // here also keeps a long run of digits from overflowing.
            if (codepoint > 0x10FFFF) valid = false;
        }
    }
    if (valid) return(codepoint);
    return(WEOF);
}

// returns a string giving a UEC for a character if one has been
// registered.  Otherwise returns an empty string.
std::wstring escapist::escape_single(int testchar){
    std::wstring retval;
    if (char2esc.count(testchar) > 0) retval += char2esc[testchar];
    return(retval);
}

// returns a string giving a UEC for a character, regardless of
// whether one has been registered.
std::wstring escapist::escape_single_absolute(int testchar){
    std::wostringstream escape;
    std::wstring reg = escape_single(testchar);
    if (reg.length() == 0) escape << L"\\U+" << std::hex << testchar << L'\\';
    else escape << reg;
    return(escape.str());
}
Here is my escapist class, some instance of which is a member of any lexer object. Simply explained, it’s two hash tables: one mapping escape codes to wide characters, and one mapping wide characters to escape codes. The first is keyed by escape code, so you can alias all the escape codes you want and all of them will work. The second is keyed by character, so you can have only one escape code per character. This is handled simply, by having the first escape code entered be the one that is returned when the character gets escaped for output.
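To make the first-alias-wins rule concrete, here is a cut-down sketch of the same two-table idea. AliasTable and make_bell_table are names made up for this illustration and are not part of the escapist class; the three bell aliases are entered in the same order as in the list above.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>

// Hypothetical, cut-down stand-in for the escapist class: two tables,
// one per lookup direction.
struct AliasTable {
    std::unordered_map<std::wstring, wchar_t> esc2char;  // many names -> one character
    std::unordered_map<wchar_t, std::wstring> char2esc;  // one character -> preferred name

    void enter(const std::wstring& name, wchar_t ch) {
        const std::wstring key = L"\\" + name + L"\\";
        assert(esc2char.count(key) == 0);  // each escape sequence is entered only once
        esc2char[key] = ch;
        char2esc.emplace(ch, key);         // emplace keeps an existing entry, so the first alias wins
    }
};

// Builds a table holding three aliases for the bell character (7).
inline AliasTable make_bell_table() {
    AliasTable t;
    t.enter(L"bell", 7);
    t.enter(L"alarm", 7);
    t.enter(L"a", 7);
    return t;
}
```

All three names read back as character 7, but only \bell\ is ever written out, because it was entered first.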
escapist::enter_standard_escapes just establishes a whole bunch of convenience names for characters that people will probably use a lot. There are also escape codes consisting solely of ‘U+’ and a Unicode hex codepoint set between two backslashes, such as \U+3BB\ for lowercase lambda, for any other Unicode character that someone wants to stick into their source code.
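For the hex form on its own, a minimal sketch of a parser might look like this. parse_hex_escape and hex_escape_examples are hypothetical names, and returning -1 for anything malformed is my choice for the sketch rather than what the class does (it returns WEOF).

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper, not from the class above: returns the code point
// named by a \U+hex\ escape, or -1 if the text does not have that shape.
inline long parse_hex_escape(const std::wstring& s) {
    // The shortest valid form, \U+0\ , is five characters long.
    if (s.size() < 5) return -1;
    if (s.front() != L'\\' || s.back() != L'\\') return -1;
    if (s[1] != L'u' && s[1] != L'U') return -1;
    if (s[2] != L'+') return -1;

    long codepoint = 0;
    // The digits sit between the '+' and the closing backslash.
    for (std::size_t i = 3; i + 1 < s.size(); ++i) {
        const wchar_t c = s[i];
        int digit;
        if (c >= L'0' && c <= L'9') digit = c - L'0';
        else if (c >= L'A' && c <= L'F') digit = 10 + (c - L'A');
        else if (c >= L'a' && c <= L'f') digit = 10 + (c - L'a');
        else return -1;
        codepoint = codepoint * 16 + digit;
        if (codepoint > 0x10FFFF) return -1;  // beyond the Unicode range
    }
    return codepoint;
}

// A few usage checks for the sketch.
inline void hex_escape_examples() {
    assert(parse_hex_escape(L"\\U+3BB\\") == 0x3BB);  // lowercase lambda
    assert(parse_hex_escape(L"\\u+41\\") == 65);      // 'A'
    assert(parse_hex_escape(L"\\lambda\\") == -1);    // a named escape, not a hex one
}
```

Stopping the digit loop one short of the end is what keeps the closing backslash from being treated as a digit.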
These escape codes simply don’t exist after they’re read; the source expressed with escape codes is exactly the same code as the source code expressed with wide characters. Once the system has read it, it doesn’t care what form it read it in. Thus, the escape codes are not part of the source; they’re simply a means of entering the source. That means, when you type the closing backslash on a form like \lambda\, the form disappears and the lowercase greek letter lambda itself goes into your source code.
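Going the other way, the comments in the code mention an ASCII-only rendering of source. A rough sketch of what that could look like, assuming a character-to-escape table shaped like char2esc, is below. to_ascii_form and ascii_form_example are hypothetical names, and the uppercase hex output is just this sketch's choice.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <unordered_map>

// Hypothetical helper: renders text using printable ASCII only. A
// registered alias is used where one exists, the \U+hex\ form otherwise.
inline std::wstring to_ascii_form(
        const std::wstring& text,
        const std::unordered_map<wchar_t, std::wstring>& char2esc) {
    std::wostringstream out;
    for (wchar_t c : text) {
        if (c >= 32 && c < 127) {  // printable ASCII passes through untouched
            out << c;
            continue;
        }
        const auto hit = char2esc.find(c);
        if (hit != char2esc.end()) {
            out << hit->second;
        } else {
            out << L"\\U+" << std::hex << std::uppercase
                << static_cast<unsigned long>(c) << L'\\';
        }
    }
    return out.str();
}

// Usage check: lambda has an alias, the euro sign (8364) does not.
inline void ascii_form_example() {
    const std::unordered_map<wchar_t, std::wstring> names{
        {wchar_t(955), L"\\lambda\\"}};
    const std::wstring text{wchar_t(955), L'x', wchar_t(8364)};
    assert(to_ascii_form(text, names) == L"\\lambda\\x\\U+20AC\\");
}
```

Note that a literal backslash in the text passes through unchanged here, so this sketch is not a guaranteed round trip for text that already contains backslashes.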
The rest of the magic is in lexer::getchar(), which calls the “unescape_single” routine and manages the lexer’s queue of characters read but not yet parsed. Usually, it just checks its queue to see if it has a character to return, and if so returns the first such character, removing it from the queue. If the queue is empty, then it reads a character, and if it’s not a backslash, returns it. Finding a backslash requires some reading ahead, either to a closing backslash, an I/O error, or the maximum length an escape sequence can be. So until one of those things happens, characters are read into the queue. If there’s a closing backslash, we call unescape_single to try to match it with an escape code. Otherwise, or if there’s no match, lexer::getchar just deletes the initial backslash from the queue and returns it.
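I haven't posted the lexer itself, so here is a stripped-down sketch of that queue-and-lookahead scheme. EscapeReader and decode_demo are hypothetical names, the lookup is a plain map rather than an escapist, and the limit of 32 is an assumption taken from the thirty-two-character ruler in the comments above.

```cpp
#include <cassert>
#include <cstddef>
#include <cwchar>
#include <deque>
#include <istream>
#include <sstream>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical reader illustrating the scheme; not the actual lexer class.
class EscapeReader {
public:
    using Table = std::unordered_map<std::wstring, wchar_t>;

    EscapeReader(std::wistream& in, Table table, std::size_t max_escape = 32)
        : in_(in), table_(std::move(table)), max_escape_(max_escape) {
        assert(max_escape_ >= 2);  // room for at least the two backslashes
    }

    // Returns the next character of the unescaped source, or WEOF at the end.
    std::wint_t get() {
        // Characters read ahead earlier are handed back first.
        if (!pending_.empty()) {
            const wchar_t queued = pending_.front();
            pending_.pop_front();
            return queued;
        }
        wchar_t c;
        if (!in_.get(c)) return WEOF;
        if (c != L'\\') return c;

        // Backslash: read ahead until a closing backslash, a failed read,
        // or the maximum escape length, whichever comes first.
        std::wstring candidate(1, c);
        bool closed = false;
        while (candidate.size() < max_escape_ && in_.get(c)) {
            candidate += c;
            if (c == L'\\') { closed = true; break; }
        }
        if (closed) {
            const auto hit = table_.find(candidate);
            if (hit != table_.end()) return hit->second;  // escape consumed whole
        }
        // No escape here: return the backslash and queue what followed it.
        pending_.assign(candidate.begin() + 1, candidate.end());
        return L'\\';
    }

private:
    std::wistream& in_;
    Table table_;
    std::size_t max_escape_;
    std::deque<wchar_t> pending_;
};

// Runs the reader over a string with a one-entry table (\lambda\ -> 955).
inline std::wstring decode_demo(const std::wstring& text) {
    std::wistringstream in(text);
    EscapeReader reader(in, {{L"\\lambda\\", wchar_t(955)}});
    std::wstring out;
    for (std::wint_t c = reader.get(); c != WEOF; c = reader.get())
        out += static_cast<wchar_t>(c);
    return out;
}
```

In this sketch, queued characters are returned without being rescanned, so a backslash that closed a failed match does not open a new escape. Whether a real lexer should rescan it is a design decision I'm leaving out of the sketch.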
So, anyway; the escapist is a very simple object. It didn’t even need a custom constructor.