Unicode, some more.

 

In a previous post, I mentioned that I would be posting code to do with implementing Unicode in a programming language, along with some more commentary. Well, this is that post.

 

The first question, of course, is why an ‘implementation’ is needed at all. We don’t ‘implement’ ASCII, after all; we just type it and the computer understands it. In theory, our computers now understand Unicode in the same way, so we can just give them programs that include characters not found in ASCII and they should work.

 

Nice theory.  There are a couple of problems with this view of the universe. One is on the desk in front of me, and probably on a different desk in front of you. It’s called a keyboard.  Mine sure as heck doesn’t have a key for each Unicode character; I don’t think anyone else’s does either.  We can access maybe 200 characters depending on our alt-gr configuration — but most people don’t know how to type more than about 120 of them, if that.  So, our first problem is that we can’t type more than a few characters.

 

My approach is to implement universal escape codes. We’ve all used programming languages that have ‘escape syntax’ for putting characters into, e.g., strings and regular expressions, where those characters would otherwise be interpreted as syntax and do undesirable things to those constructs (for example, a quote mark has to be escaped in a string, or it’ll be read as closing the string). Those are syntax-sensitive escape codes; they exist to overcome syntax problems, and a slightly different set is required for each specialized syntax. By contrast, I’m implementing universal, or syntax-neutral, escape codes. These are universal in that they stand for the same character regardless of whether they appear in a comment, a string, or an identifier. So if someone uses one of these escape codes for a quotation mark inside a string, say, it will be read as the end of the string. They are simply a way to input characters, and no more than that.
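
To make the ‘syntax-neutral’ idea concrete, here is a minimal sketch of an expansion pass that runs before any lexing and knows nothing about strings, comments, or identifiers. The function name expand_universal_escapes and the table passed to it are assumptions made for illustration; they are not part of the class shown in this post. The sketch also assumes that escape names may contain digits as well as letters, since names such as frac12 do.

```cpp
#include <cstddef>
#include <cwctype>
#include <string>
#include <unordered_map>

// Hypothetical helper, not part of the escapist class: expands every
// \name\ sequence found in `table`, ignoring the surrounding syntax.
std::wstring expand_universal_escapes(
    const std::wstring& src,
    const std::unordered_map<std::wstring, wchar_t>& table) {
  std::wstring out;
  std::size_t i = 0;
  while (i < src.size()) {
    if (src[i] == L'\\') {
      // Scan a candidate name: letters and digits up to a closing backslash.
      std::size_t j = i + 1;
      while (j < src.size() && std::iswalnum(src[j])) ++j;
      if (j > i + 1 && j < src.size() && src[j] == L'\\') {
        auto hit = table.find(src.substr(i, j - i + 1));
        if (hit != table.end()) {
          out += hit->second;  // known escape: emit the character itself
          i = j + 1;
          continue;
        }
      }
    }
    out += src[i];  // anything else passes through untouched
    ++i;
  }
  return out;
}
```

Given a table that maps \quot\ to a quotation mark, the input "a\quot\b" comes out with a real quotation mark between a and b, so a later lexer would see the string end right after the a. That is exactly the behaviour described above.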

 

So here’s some code to do that.

#include <cassert>
#include <cwchar>
#include <sstream>
#include <string>
#include <unordered_map>

class escapist{
public:
  escapist();

  // initializer. 
  void enter_standard_escapes();

  int unescape_single(std::wstring buffer);

  std::wstring escape_single(int input);

  std::wstring escape_single_absolute(int testchar);

private:

  void enter_escape(std::wstring escseq, wchar_t standsfor);

  // maps every escape sequence to a corresponding character 
  std::unordered_map< std::wstring, wchar_t > esc2char;

  // maps characters to their preferred escape sequence.
  std::unordered_map< wchar_t, std::wstring > char2esc;
};

// constructs an empty escapist object. 
escapist::escapist(){
  // std::wcout << L"allocating escapist \n"; 
} 

void escapist::enter_escape(std::wstring escapeseq, wchar_t standsfor){
  std::wstring bescapeseqb;

  // every universal escape sequence starts and ends with a backslash,
  // and otherwise contains only alphabetic characters.  These lines
  // are just adding a backslash to the alpha strings to render them
  // in the right form for universal escape sequences.

  bescapeseqb += L"\\";
  bescapeseqb += escapeseq;
  bescapeseqb += L"\\";

  int checkval;
  wchar_t charcheck;

  // every escape sequence is unique; therefore every insertion in 
  // esc2char will succeed. But we're checking and asserting anyway.

  checkval = esc2char.count(bescapeseqb);
  assert(checkval == 0);
  esc2char[bescapeseqb] = standsfor;  
  checkval = esc2char.count(bescapeseqb);
  assert(checkval == 1);
  charcheck = esc2char[bescapeseqb];
  assert(charcheck == standsfor);

  // reverse lookups otoh key on the character, which is not unique;
  // therefore we check, and enter something only if nothing is currently 
  // entered.  This makes the FIRST escape sequence entered be the one that
  // is used for output. 

  if (char2esc.count(standsfor) == 0)
    char2esc[standsfor] = bescapeseqb;
  checkval = char2esc.count(standsfor);
  assert(checkval == 1);
}
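
The ‘first alias wins’ rule is easier to see in isolation. What follows is a small sketch under assumptions: alias_table and its add member are hypothetical stand-ins for the two maps above, not the class from this post. It leans on the fact that unordered_map::emplace leaves an existing entry alone, so the reverse map keeps whichever alias was registered first.

```cpp
#include <string>
#include <unordered_map>

// Hypothetical stand-in for the two maps inside escapist.
struct alias_table {
  std::unordered_map<std::wstring, wchar_t> esc2char;  // every alias -> character
  std::unordered_map<wchar_t, std::wstring> char2esc;  // character -> preferred alias

  // Registers \name\ for ch; returns false if the alias already exists.
  bool add(const std::wstring& name, wchar_t ch) {
    const std::wstring seq = L"\\" + name + L"\\";
    if (!esc2char.emplace(seq, ch).second) return false;  // duplicate alias
    char2esc.emplace(ch, seq);  // no-op if ch already has an alias
    return true;
  }
};
```

Registering bell, alarm and a for character 7, in that order, leaves all three usable for input, while the reverse lookup for 7 yields \bell\.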

void escapist::enter_standard_escapes(){

  // note: where multiple aliases exist for the same character, any of
  // them can be used to input the character.  If ascii-only rendering
  // of code is requested, the system will use only the alias entered
  // first as its output form.  This motivates a few deviations from
  // alphabetic ordering below, but otherwise this list is alphabetic
  // (treating all capitals as preceding all lower-case letters) by
  // alias.

  // the values of the escape sequences are two characters longer than
  // the string provided as an argument, and must be shorter than
  // MAXESCAPE.

  //             These are thirty-two characters.
  //             12345678901234567890123456789012
  enter_escape( L"AElig",           198);    // AE ligature 
  enter_escape( L"AND",            8743);    // logical AND operator 
  enter_escape( L"Aacute",          193);    // A with acute 
  enter_escape( L"Acirc",           194);    // A with circumflex 
  enter_escape( L"Agrave",          192);    // A with grave 
  enter_escape( L"Alpha",           913);    // Greek letter (against better judgement; visually ambiguous with 'A')
  enter_escape( L"Aring",           197);    // A with ring above 
  enter_escape( L"Atilde",          195);    // A with tilde 
  enter_escape( L"Auml",            196);    // A with umlaut 
  enter_escape( L"Beta",            914);    // Greek letter 
  enter_escape( L"Ccedil",          199);    // C with cedilla 
  enter_escape( L"Chi",             935);    // Greek letter 
  enter_escape( L"Dagger",         8225);    // double dagger 
  enter_escape( L"Delta",           916);    // Greek letter 
  enter_escape( L"ETH",             208);    // capital Eth (nordic/icelandic) 
  enter_escape( L"Eacute",          201);    // E with acute accent 
  enter_escape( L"Ecirc",           202);    // E with circumflex 
  enter_escape( L"Egrave",          200);    // E with grave accent 
  enter_escape( L"Epsilon",         917);    // Greek letter 
  enter_escape( L"Eta",             919);    // Greek letter 
  enter_escape( L"Euml",            203);    // E with dieresis/umlaut
  enter_escape( L"Gamma",           915);    // Greek letter 
  enter_escape( L"Iacute",          205);    // I with acute accent 
  enter_escape( L"Icirc",           206);    // I with circumflex 
  enter_escape( L"Igrave",          204);    // I with grave accent 
  enter_escape( L"Iota",            921);    // Greek letter 
  enter_escape( L"Iuml",            207);    // I with dieresis/umlaut
  enter_escape( L"Kappa",           922);    // Greek letter 
  enter_escape( L"Lambda",          923);    // Greek letter 
  enter_escape( L"Mu",              924);    // Greek letter 
  enter_escape( L"NAND",           8892);    // logical NAND sign, AND with overbar
  enter_escape( L"NOR",            8893);    // logical NOR sign, OR with overbar
  enter_escape( L"NOT",             172);    // logical NOT operator
  enter_escape( L"Ntilde",          209);    // capital N with tilde 
  enter_escape( L"Nu",              925);    // Greek letter 
  enter_escape( L"OElig",           338);    // OE ligature 
  enter_escape( L"OR",             8744);    // logical OR operator
  enter_escape( L"Oacute",          211);    // O with acute accent 
  enter_escape( L"Ocirc",           212);    // O with circumflex 
  enter_escape( L"Ograve",          210);    // O with grave accent 
  enter_escape( L"Omega",           937);    // Greek letter 
  enter_escape( L"Omicron",         927);    // Greek letter 
  enter_escape( L"Oslash",          216);    // O with slash 
  enter_escape( L"Otilde",          213);    // O with Tilde 
  enter_escape( L"Ouml",            214);    // O with dieresis/umlaut
  enter_escape( L"Phi",             934);    // Greek letter 
  enter_escape( L"Pi",              928);    // Greek letter 
  enter_escape( L"Prime",          8243);    // double prime, for inches, seconds,etc. (name from XML)
  enter_escape( L"Psi",             936);    // Greek letter 
  enter_escape( L"Rho",             929);    // Greek letter 
  enter_escape( L"Sigma",           931);    // Greek letter 
  enter_escape( L"THORN",           222);    // capital Thorn,icelandic 
  enter_escape( L"Tau",             932);    // Greek letter 
  enter_escape( L"Theta",           920);    // Greek letter 
  enter_escape( L"Uacute",          218);    // U with acute accent 
  enter_escape( L"Ucirc",           219);    // U with circumflex 
  enter_escape( L"Ugrave",          217);    // U with grave accent 
  enter_escape( L"Upsilon",         933);     // Greek letter 
  enter_escape( L"Uuml",            220);     // U with dieresis/umlaut
  enter_escape( L"Xi",              926);     // Greek letter 
  enter_escape( L"XOR",             8891);    // XOR sign, OR with underbar   
  enter_escape( L"Yacute",           221);    // Y with acute accent 
  enter_escape( L"Yuml",             376);    // Y with dieresis/umlaut 
  enter_escape( L"Zeta",             918);    // Greek letter 
  enter_escape( L"bell",               7);    // console bell
  enter_escape( L"alarm",              7);    // traditional c inline escape
  enter_escape( L"a",                  7);    // alarm, aka bell
  enter_escape( L"aacute",           225);    // a with acute 
  enter_escape( L"acirc",            226);    // a with circumflex 
  enter_escape( L"ack",                6);    // acknowledge 
  enter_escape( L"acute",            180);    // acute accent 
  enter_escape( L"aelig",            230);    // ae ligature 
  enter_escape( L"agrave",           224);    // a with grave
  enter_escape( L"bel",                7);    // bell, rendering in ascii tables
  enter_escape( L"alefsym",         8501);    // first transfinite cardinal 
  enter_escape( L"alpha",            945);    // Greek letter 
  enter_escape( L"amp",              '&');    // ampersand, unnecessary but exists in XML
  enter_escape( L"angle",           8736);    // angle symbol 
  enter_escape( L"apos",              39);    // apostrophe (0x27 hex, 39 decimal)
  enter_escape( L"aring",            229);    // a with ring above 
  enter_escape( L"approxequal",     8773);    // approximately equal to 
  enter_escape( L"almostequal",     8776);    // almost equal to 

  // note, XML got this wrong.  They have a named entity &asymp; which
  // they bill as 'asymptotically equal to', but give the code for
  // 'almost equal' above.  Unicode has a different character which
  // means asymptotic equality, given below. I'm handling it by NOT
  // supporting the XML-derived name \asymp\, which may surprise users 
  // to some extent.  The other choices were to repeat the mistake for 
  // compatibility, or correct the mistake and subject the user to even
  // more (more unpleasant) surprise.

  enter_escape( L"asympequal",      8771);    // asymptotically equal to
  enter_escape( L"atilde",           227);    // a with tilde 
  enter_escape( L"auml",             228);    // a with umlaut 
  enter_escape( L"bdquo",           8222);    // double low-9 quotation mark
  enter_escape( L"because",         8757);    // because sign, inverted therefore sign
  enter_escape( L"beta",             946);    // Greek letter 
  enter_escape( L"brvbar",           166);    // broken vertical bar 
  enter_escape( L"backspace",          8);    // backspace
  enter_escape( L"bs",                 8);    // backspace 
  enter_escape( L"b",                  8);    // backspace
  enter_escape( L"circledequal",    8860);    // bitwise equality, circled equals
  enter_escape( L"circleddash",     8861);    // circled dash  
  enter_escape( L"circleddot",      8857);    // bitwise AND operator, circled dot
  enter_escape( L"circledtimes",    8855);    // bitwise NAND operator, circled times
  enter_escape( L"circledplus",     8853);    // bitwise XOR operator, circled plus
  enter_escape( L"circledminus",    8854);    // bitwise NOT operator, circled minus
  enter_escape( L"circledslash",    8856);    // bitwise OR operator, circled slash
  enter_escape( L"circledring",     8858);    // bitwise NOR operator, circled ring
  enter_escape( L"circledstar",     8859);    // bitwise XNOR operator, circled asterisk
  enter_escape( L"bullet",          8226);    // punctuation 
  enter_escape( L"can",               24);    // cancel
  enter_escape( L"ccedil",           231);    // c with cedilla 
  enter_escape( L"cedil",            184);    // cedilla 
  enter_escape( L"cent",             162);    // cents 
  enter_escape( L"checkmark",      10003);    // checkmark 
  enter_escape( L"crossoff",       10007);    // 'x' mark for completion
  enter_escape( L"chi",              967);    // Greek letter 
  enter_escape( L"circ",             710);    // modifier letter circumflex accent 
  enter_escape( L"clubs",           9827);    // club card suit 
  enter_escape( L"cong",            8773);    // approximately equal to 
  enter_escape( L"complement",      8705);    // complement of
  enter_escape( L"contains",        8715);    // contains as member 
  enter_escape( L"contourintegral", 8750);    // integral sign on a circle
  enter_escape( L"copy",             169);    // copyright sign 
  enter_escape( L"crarr",           8629);    // down arrow with corner left, "carriage return arrow" 
  enter_escape( L"cuberoot",        8731);    // cube root symbol
  enter_escape( L"curren",           164);    // currency 
  enter_escape( L"dArr",            8659);    // downward double arrow 
  enter_escape( L"dagger",          8224);    // dagger mark, asterisk variant 
  enter_escape( L"darr",            8595);    // down arrow 
  enter_escape( L"dc1",               17);    // device control 1 
  enter_escape( L"dc2",               18);    // device control 2 
  enter_escape( L"dc3",               19);    // device control 3 
  enter_escape( L"dc4",               20);    // device control 4 
  enter_escape( L"degrees",          176);    // degrees 
  enter_escape( L"deg",              176);    // degrees 
  enter_escape( L"del",              127);    // delete character 
  enter_escape( L"delta",            948);    // Greek letter 
  enter_escape( L"diamonds",        9830);    // diamond card suit 
  enter_escape( L"diams",           9830);    // diamond card suit 
  enter_escape( L"divide",           247);    // division sign 
  enter_escape( L"doesnotcontain",  8716);    // does not contain as member
  enter_escape( L"doesnotdivide",   8740);    // does not divide
  enter_escape( L"dottedequal",     8784);    // approaches the limit, dotted equality
  enter_escape( L"approaches",      8784);    // approaches the limit, dotted equality
  enter_escape( L"dottedminus",     8760);    // dotted minus sign
  enter_escape( L"dottedplus",      8724);    // dotted plus sign
  enter_escape( L"doubleasterisk",  8273);    // double asterisk, vertically aligned
  enter_escape( L"doubledagger",    8225);    // double dagger, asterisk variant 
  enter_escape( L"doubleexcl",      8252);    // double exclamation mark
  enter_escape( L"doubleintegral",  8748);    // double integral sign  
  enter_escape( L"doubleprime",     8243);    // double prime, for inches, seconds,etc. (name from XML)
  enter_escape( L"doublequestion",  8263);    // double question mark
  enter_escape( L"dle",               16);    // data link escape 
  enter_escape( L"eacute",           233);    // e with acute accent 
  enter_escape( L"ecirc",            234);    // e with circumflex 
  enter_escape( L"egrave",           232);    // e with grave accent 
  enter_escape( L"ellipsis",        8943);    // triple-dot ellipsis
  enter_escape( L"em",                25);    // end of medium
  enter_escape( L"empty",           8709);    // empty set, null set 
  enter_escape( L"emsp",            8195);    // em space 
  enter_escape( L"enq",                5);    // enquire 
  enter_escape( L"ensp",            8194);    // en space 
  enter_escape( L"eot",                4);    // end of transmission 
  enter_escape( L"epsilon",          949);    // Greek letter 
  enter_escape( L"equivalent",      8801);    // identical, triple-bar sign
  enter_escape( L"equiv",           8801);    // identical with
  enter_escape( L"esc",               27);    // escape character 
  enter_escape( L"eta",              951);    // Greek letter 
  enter_escape( L"etb",               23);    // end transmission block
  enter_escape( L"eth",              240);    // lowercase icelandic eth 
  enter_escape( L"etx",                3);    // end of text
  enter_escape( L"euml",             235);    // e with dieresis/umlaut
  enter_escape( L"exist",           8707);    // there exists 
  enter_escape( L"fallingellipsis", 8945);    // ellipsis from upper left to bottom right
  enter_escape( L"formfeed",          12);    // form feed
  enter_escape( L"ff",                12);    // form feed 
  enter_escape( L"f",                 12);    // form feed 
  enter_escape( L"forall",          8704);    // forall quantification operator 
  enter_escape( L"fourthroot",      8732);    // fourth root symbol
  enter_escape( L"frac12",           189);    // fraction 1/2 
  enter_escape( L"frac14",           188);    // fraction 1/4 
  enter_escape( L"frac34",           190);    // fraction 3/4 
  enter_escape( L"frasl",           8260);    // fraction slash 
  enter_escape( L"fs",                28);    // file separator
  enter_escape( L"gamma",            947);    // Greek letter 
  enter_escape( L"greaterorequal",  8805);    // greater or equal to
  enter_escape( L"ge",              8805);    // greater or equal to, stupid name from XML
  enter_escape( L"gs",                29);    // group separator, from ASCII tables
  enter_escape( L"gt",               '>');    // greater-than, probably unneeded but exists in XML
  enter_escape( L"hArr",            8660);    // double ended double horizontal arrow 
  enter_escape( L"harr",            8596);    // double ended horizontal arrow 
  enter_escape( L"hearts",          9829);    // heart card suit, valentine symbol 
  enter_escape( L"hellip",          8230);    // horizontal ellipsis 
  enter_escape( L"iacute",           237);    // i with acute accent 
  enter_escape( L"icirc",            238);    // i with circumflex 
  enter_escape( L"iexcl",            161);    // inverted exclamation mark 
  enter_escape( L"igrave",           236);    // i with grave accent 
  enter_escape( L"image",           8465);    // blackletter cap I/imaginary part
  enter_escape( L"increment",       8710);    // increment sign
  enter_escape( L"infinity",        8734);    // infinity symbol, lemniscate 
  enter_escape( L"infin",           8734);    // infinity symbol, lemniscate, abbreviation from XML
  enter_escape( L"integral",        8747);    // integral sign
  enter_escape( L"int",             8747);    // integral sign, ambiguous name from XML
  enter_escape( L"intersection",    8745);    // Intersection operator 
  enter_escape( L"interrobang",     8253);    // Interrobang
  enter_escape( L"cap",             8745);    // Intersection operator, stupid name from XML
  enter_escape( L"iota",             953);    // Greek letter 
  enter_escape( L"iquest",           191);    // inverted question mark 
  enter_escape( L"isin",            8712);    // is an element of, is a member of
  enter_escape( L"elementof",       8712);    // is an element of, is a member of
  enter_escape( L"iuml",             239);    // i with dieresis/umlaut
  enter_escape( L"kappa",            954);    // Greek letter 
  enter_escape( L"lArr",            8656);    // left double arrow/implication sign
  enter_escape( L"lambda",           955);    // Greek letter 
  enter_escape( L"lang",            9001);    // left pointing angle bracket 
  enter_escape( L"laquo",            171);    // left double angle quote, left guillemet 
  enter_escape( L"larr",            8592);    // left arrow 
  enter_escape( L"lceil",           8968);    // left ceiling, upstile 
  enter_escape( L"ldquo",           8220);    // left double quote 
  enter_escape( L"lessorequal",     8804);    // less than or equal to
  enter_escape( L"le",              8804);    // less than or equal to, stupid name from XML
  enter_escape( L"lfloor",          8970);    // left floor, downstile 
  enter_escape( L"lowast",          8727);    // low asterisk, asterisk operator 
  enter_escape( L"loz",             9674);    // lozenge shape 
  enter_escape( L"lrm",             8206);    // left to right mark 
  enter_escape( L"lt",               '<');    // less-than character, 60, probably unneeded but exists in XML
  enter_escape( L"lsaquo",          8249);    // left-pointing single angle quote 
  enter_escape( L"lsquo",           8216);    // left single quote 
  enter_escape( L"macr",             175);    // macron, APL overbar 
  enter_escape( L"mdash",           8212);    // em dash 
  enter_escape( L"measuredangle",   8737);    // measured angle
  enter_escape( L"micro",            181);    // micro sign 
  enter_escape( L"middot",           183);    // middle dot 
  enter_escape( L"minus",           8722);    // subtraction operator, against better judgement; visually ambiguous with hyphen. 
  enter_escape( L"minusplus",       8723);    // like plusminus, but minus symbol is on top of plus in this one 
  enter_escape( L"mu",               956);    // Greek letter 
  enter_escape( L"muchlessthan",    8810);    // much less than, double less than sign 
  enter_escape( L"muchgreaterthan", 8811);    // much greater than, double greater than sign 
  enter_escape( L"nabla",           8711);    // backward difference 
  enter_escape( L"nak",               21);    // negative acknowledge 
  enter_escape( L"nbsp",             160);    // nonbreaking space 
  enter_escape( L"ndash",           8211);    // en dash
  enter_escape( L"newline",           10);    // unix-standard newline character 
  enter_escape( L"n",                 10);    // newline character, from traditional inline escapes
  enter_escape( L"ni",              8715);    // contains as member, stupid name from XML
  enter_escape( L"not",              172);    // Logical NOT operator, not sign 
  enter_escape( L"notapproxequal",  8775);    // not approximately equal to 
  enter_escape( L"notalmostequal",  8777);    // not almost equal to 
  enter_escape( L"notasympequal",   8772);    // not asymptotically equal to
  enter_escape( L"notequal",        8800);    // not equal to 
  enter_escape( L"notequivalent",   8802);    // not identical to, crossed triple-bar sign  
  enter_escape( L"ne",              8800);    // not equal to, stupid name from XML
  enter_escape( L"notexist",        8708);    // there does not exist
  enter_escape( L"notin",           8713);    // not a member of 
  enter_escape( L"notelement",      8713);    // not an element of
  enter_escape( L"notlessthan",     8814);    // not less than
  enter_escape( L"notgreaterthan",  8815);    // not greater than
  enter_escape( L"notparallel",     8742);    // not parallel to, crossed double vertical bar
  enter_escape( L"notsubset",       8836);    // not a subset of 
  enter_escape( L"nsub",            8836);    // not a subset of, stupid name from XML
  enter_escape( L"notsuperset",     8837);    // not a superset of   
  enter_escape( L"nsup",            8837);    // not a superset of, stupid name for symmetry with nsub from XML.
  enter_escape( L"ntilde",           241);    // n with tilde 
  enter_escape( L"nu",               957);    // Greek letter 
  enter_escape( L"null",               0);    // null
  enter_escape( L"nul",                0);    // null 
  enter_escape( L"oacute",           243);    // o with acute accent 
  enter_escape( L"ocirc",            244);    // o with circumflex 
  enter_escape( L"oelig",            339);    // oe ligature 
  enter_escape( L"ograve",           242);    // o with grave accent 
  enter_escape( L"oline",           8254);    // spacing overline 
  enter_escape( L"omega",            969);    // Greek letter 
  enter_escape( L"omicron",          959);    // Greek letter 
  enter_escape( L"oplus",           8853);    // circled plus sign, bitwise XOR, name from XML
  enter_escape( L"ordf",             170);    // feminine ordinal indicator 
  enter_escape( L"ordm",             186);    // masculine ordinal indicator 
  enter_escape( L"oslash",           248);    // o with slash 
  enter_escape( L"otilde",           245);    // o with Tilde 
  enter_escape( L"otimes",          8855);    // circled multiplication sign, bitwise NAND, name from XML
  enter_escape( L"ouml",             246);    // o with dieresis/umlaut
  enter_escape( L"paragraph",        182);    // pilcrow / paragraph sign 
  enter_escape( L"para",             182);    // pilcrow / paragraph sign 
  enter_escape( L"parallel",        8741);    // parallel to, double vertical bar
  enter_escape( L"part",            8706);    // partial differential sign 
  enter_escape( L"permille",        8240);    // permille sign 
  enter_escape( L"permil",          8240);    // permille sign 
  enter_escape( L"perp",            8869);    // up tack, perpendicular to 
  enter_escape( L"phi",              966);    // Greek letter 
  enter_escape( L"pi",               960);    // Greek letter 
  enter_escape( L"pilcrow",          182);    // pilcrow / paragraph sign 
  enter_escape( L"piv",              982);    // Greek letter 
  enter_escape( L"plusminus",        177);    // plus or minus 
  enter_escape( L"plusmn",           177);    // plus or minus 
  enter_escape( L"pound",            163);    // pounds sterling 
  enter_escape( L"powerset",        8472);    // capital script P, power set symbol
  enter_escape( L"prime",           8242);    // for feet, minutes, etc. 
  enter_escape( L"prod",            8719);    // product operator, looks like Pi 
  enter_escape( L"prop",            8733);    // proportional to 
  enter_escape( L"propersubset",    8842);    // proper subset of, subset with not equal to
  enter_escape( L"propersuperset",  8843);    // proper superset of, superset with not equal to
  enter_escape( L"psi",              968);    // Greek letter 
  enter_escape( L"quot",            L'"');    // quotation mark, probably unnecessary but exists in XML
  enter_escape( L"return",            13);    // dos/win end of line character 
  enter_escape( L"r",              L'\r');    // return, 13 
  enter_escape( L"rArr",            8658);    // right double arrow/implication sign
  enter_escape( L"rang",            9002);    // right pointing angle bracket 
  enter_escape( L"raquo",            187);    // right double angle quote, right guillemet
  enter_escape( L"rarr",            8594);    // right arrow 
  enter_escape( L"rceil",           8969);    // right ceiling 
  enter_escape( L"rdquo",           8221);    // right double quote 
  enter_escape( L"real",            8476);    // blackletter cap R/real part 
  enter_escape( L"refmark",         8251);    // reference mark, asterisk variant
  enter_escape( L"reg",              174);    // registered sign 
  enter_escape( L"rfloor",          8971);    // right floor 
  enter_escape( L"rho",              961);    // Greek letter 
  enter_escape( L"rightangle",      8735);    // angle symbol 
  enter_escape( L"risingellipsis",  8944);    // ellipsis from bottom left to upper right
  enter_escape( L"rlm",             8207);    // right to left mark
  enter_escape( L"rs",                30);    // record separator
  enter_escape( L"rsaquo",          8250);    // right-pointing single angle quote 
  enter_escape( L"rsquo",           8217);    // right single quote 
  enter_escape( L"sbquo",           8218);    // single low-9 quotation mark
  enter_escape( L"scaron",           353);    // s with caron 
  enter_escape( L"sdot",            8901);    // dot operator, symbol dot 
  enter_escape( L"sect",             167);    // section sign 
  enter_escape( L"shy",              173);    // soft hyphen 
  enter_escape( L"si",                15);    // shift in 
  enter_escape( L"sigma",            963);    // Greek letter 
  enter_escape( L"sigmaf",           962);    // Greek letter, final sigma 
  enter_escape( L"sim",             8764);    // similar to, tilde operator 
  enter_escape( L"so",                14);    // shift out 
  enter_escape( L"soh",                1);    // start of header 
  enter_escape( L"space",           L' ');    // space character 
  enter_escape( L"spades",          9824);    // spade card suit 
  enter_escape( L"squareroot",      8730);    // square root symbol, radical sign 
  enter_escape( L"sphericalangle",  8738);    // spherical angle sign
  enter_escape( L"sqrt",            8730);    // square root symbol, radical sign 
  enter_escape( L"suchthat",        8739);    // such that, APL stile, divides, dental click, visually ambiguous with vertical bar
  enter_escape( L"divides",         8739);    // such that, APL stile, divides, dental click, visually ambiguous with vertical bar
  enter_escape( L"APLstile",        8739);    // such that, APL stile, divides, dental click, visually ambiguous with vertical bar
  enter_escape( L"radic",           8730);    // square root symbol, radical sign 
  enter_escape( L"star",            9733);    // star, asterisk variant
  enter_escape( L"stx",                2);    // start of text 
  enter_escape( L"subsetof",        8834);    // subset of 
  enter_escape( L"sub",             8834);    // subset of, stupid name from XML 
  enter_escape( L"subsetorequal",   8838);    // subset of or equal to
  enter_escape( L"sube",            8838);    // subset of or equal to, stupid name from XML
  enter_escape( L"subst",             26);    // substitute
  enter_escape( L"sum",             8721);    // sum operator, but looks like Sigma 
  enter_escape( L"supersetof",      8835);    // superset of 
  enter_escape( L"sup",             8835);    // superset of, stupid name from XML
  enter_escape( L"sup1",             185);    // superscript 1 
  enter_escape( L"sup2",             178);    // superscript 2 
  enter_escape( L"sup3",             179);    // superscript 3 
  enter_escape( L"supersetorequal", 8839);    // superset of or equal to
  enter_escape( L"supe",            8839);    // superset of or equal to, stupid name from XML
  enter_escape( L"syn",               22);    // synchronization char 
  enter_escape( L"szlig",            223);    // german sz ligature, "sharp s"
  enter_escape( L"tab",                9);    // tab 
  enter_escape( L"t",                  9);    // tab, from c traditional inline escapes
  enter_escape( L"tau",              964);    // Greek letter 
  enter_escape( L"therefore",       8756);    // therefore, three-dot proof sign
  enter_escape( L"there4",          8756);    // therefore, three-dot proof sign, stupid name from XML
  enter_escape( L"theta",            952);    // Greek letter 
  enter_escape( L"thinsp",          8201);    // thin space 
  enter_escape( L"thorn",            254);    // thorn, icelandic 
  enter_escape( L"tilde",            732);    // modifier letter small tilde 
  enter_escape( L"times",            215);    // multiplication sign 
  enter_escape( L"trade",           8482);    // trademark sign 
  enter_escape( L"tripleasterisk",  8258);    // triple-asterisk mark
  enter_escape( L"asterism",        8258);    // triple-asterisk mark
  enter_escape( L"tripleintegral",  8749);    // triple integral sign  
  enter_escape( L"tripleprime",     8244);    // triple prime
  enter_escape( L"uArr",            8657);    // upward double arrow 
  enter_escape( L"uacute",           250);    // u with acute accent 
  enter_escape( L"uarr",            8593);    // up arrow 
  enter_escape( L"ucirc",            251);    // u with circumflex 
  enter_escape( L"ugrave",           249);    // u with grave accent 
  enter_escape( L"uml",              168);    // umlaut/dieresis 
  enter_escape( L"union",           8746);    // Union operator 
  enter_escape( L"cup",             8746);    // Union operator, stupid name from XML
  enter_escape( L"upsih",            978);    // Greek letter 
  enter_escape( L"upsilon",          965);    // Greek letter 
  enter_escape( L"us",                31);    // unit separator
  enter_escape( L"uuml",             252);    // u with dieresis/umlaut
  enter_escape( L"verticalellipsis",8942);    // vertical ellipsis 
  enter_escape( L"vtab",              11);    // vertical tab 
  enter_escape( L"v",                 11);    // vertical tab 
  enter_escape( L"weierp",          8472);    // script capital P (Weierstrass p), sometimes used for power set
  enter_escape( L"xi",               958);    // Greek letter 
  enter_escape( L"yacute",           253);    // y with acute accent 
  enter_escape( L"yen",              165);    // Yen sign 
  enter_escape( L"yuml",             255);    // y with dieresis/umlaut 
  enter_escape( L"zeta",             950);    // Greek letter 
  enter_escape( L"zwj",             8205);    // zero width joiner 
  enter_escape( L"zwnj",            8204);    // zero width nonjoiner 
}


int escapist::unescape_single(std::wstring buffer){
  // We have a possible UEC in the buffer.  We check it against the
  // escape table.  If it is a UEC return the character. Else return
  // WEOF.

  if (esc2char.count(buffer) > 0)
    return(esc2char[buffer]);

  // Not a registered name, so try to read it as a hex escape of the
  // form \U+XXXX\ (or \u+XXXX\): backslash, 'U' or 'u', '+', one or
  // more hex digits, closing backslash.  The shortest valid form is
  // five characters long.  This is still a plain hand-rolled parse:
  // it won't throw an exception and it rejects anything unexpected.
  const size_t len = buffer.size();
  if (len < 5) return(WEOF);
  if (buffer[0] != L'\\' || buffer[len - 1] != L'\\') return(WEOF);
  if (buffer[1] != L'u' && buffer[1] != L'U') return(WEOF);
  if (buffer[2] != L'+') return(WEOF);

  int codepoint = 0;
  // Only the characters between "\U+" and the closing backslash are
  // hex digits; the closing backslash itself must not be parsed.
  for (size_t i = 3; i + 1 < len; i++){
    wchar_t next = buffer[i];
    int digit;
    if (next >= L'0' && next <= L'9') digit = next - L'0';
    else if (next >= L'A' && next <= L'F') digit = 10 + next - L'A';
    else if (next >= L'a' && next <= L'f') digit = 10 + next - L'a';
    else return(WEOF);
    codepoint = codepoint * 16 + digit;
    if (codepoint > 0x10FFFF) return(WEOF);  // past the end of Unicode
  }
  return(codepoint);
}


// returns a string giving a UEC for a character if one has been
// registered.  Otherwise returns an empty string.
std::wstring escapist::escape_single(int testchar){
  std::wstring retval;
  if (char2esc.count(testchar) > 0)
    retval += char2esc[testchar];
  return(retval);
}

// returns a string giving a UEC for a character, regardless of
// whether one has been registered.
std::wstring escapist::escape_single_absolute(int testchar){
  std::wostringstream escape;
  std::wstring reg = escape_single(testchar);
  if (reg.length() == 0)
    escape << L"\\U+" << std::hex << testchar << '\\';
  else 
    escape << reg;
  return(escape.str());  
}

 

Here is my escapist class, an instance of which is a member of every lexer object. Simply explained, it’s two hash tables: one mapping escape codes to wide characters, and one mapping wide characters to escape codes. The first is keyed by escape code, so you can alias all the escape codes you want and all of them will work. The second is keyed by character, so you can have only one escape code per character. This is handled simply, by having the first escape code entered be the one that is returned when the character gets escaped.
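
To make the two-table idea concrete, here is a minimal sketch of just that part. It is not the real escapist class: the class name escape_table and its methods enter, to_char and to_name are made up for illustration, and it assumes the "first name entered wins" rule described above.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>

// Hypothetical stand-in for the escapist's two tables; every name in
// this class is illustrative only.
class escape_table {
public:
  // Register an escape name for a code point.  Every name is stored in
  // the first table, but only the first name seen for a code point is
  // kept in the second.
  void enter(const std::wstring& name, int codepoint) {
    assert(!name.empty());
    esc2char[name] = codepoint;
    char2esc.emplace(codepoint, name);   // does nothing if the key exists
  }

  // Returns the code point for a name, or -1 if the name is unknown.
  int to_char(const std::wstring& name) const {
    auto it = esc2char.find(name);
    return it == esc2char.end() ? -1 : it->second;
  }

  // Returns the first name registered for a code point, or an empty
  // string if there is none.
  std::wstring to_name(int codepoint) const {
    auto it = char2esc.find(codepoint);
    return it == char2esc.end() ? std::wstring() : it->second;
  }

private:
  std::unordered_map<std::wstring, int> esc2char;   // keyed by escape name
  std::unordered_map<int, std::wstring> char2esc;   // keyed by character
};
```

With this sketch, entering "union" and then "cup" for 8746 makes both names resolve to 8746, while asking for the name of 8746 gives back "union", because it was entered first.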

 

escapist::enter_standard_escapes just establishes a whole bunch of convenience names for characters that people will probably use a lot. There are also escape codes consisting solely of ‘U+’ and a Unicode hex codepoint set between two backslashes, for any other Unicode character that someone wants to stick into their source code. For example, \U+3BB\ stands for the same lowercase lambda as \lambda\.

 

These escape codes simply don’t exist after they’re read; source expressed with escape codes is exactly the same code as source expressed with wide characters. Once the system has read it, it doesn’t care what form it read it in. Thus, the escape codes are not part of the source; they’re simply a means of entering the source. That means, when you type the closing backslash on a form like \lambda\, the form disappears and the lowercase Greek letter lambda itself goes into your source code.

 

The rest of the magic is in lexer::getchar(), which calls the “unescape_single” routine and manages the lexer’s queue of characters read but not yet parsed. Usually, it just checks its queue to see if it has a character to return, and returns the first such character if so, removing it from the queue. If the queue is empty, then it reads a character, and if it’s not a backslash, returns it. Finding a backslash requires some reading ahead, either to a closing backslash, an I/O error, or the maximum length an escape sequence can be. So until one of those things happens, characters are read into the queue. If there’s a closing backslash, we call unescape_single to try to match it with an escape code. Otherwise, or if there’s no match, lexer::getchar just removes the initial backslash from the queue and returns it.
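
Since the lexer itself isn’t shown here, the following is only a rough sketch of that read-ahead logic as described, not the actual lexer::getchar. The names sketch_lexer, next_char and lookup_fn are invented for the example. The lookup function stands in for unescape_single and is assumed to take the whole candidate including both backslashes. The length limit of 32 is a guess, and -1 is used in place of WEOF for "no match" and "end of input".

```cpp
#include <cstddef>
#include <deque>
#include <functional>
#include <sstream>
#include <string>
#include <utility>

// Hypothetical sketch of the read-ahead logic; not the real lexer class.
class sketch_lexer {
public:
  // Returns a code point, or a negative value when there is no match.
  using lookup_fn = std::function<int(const std::wstring&)>;

  sketch_lexer(std::wistream& in, lookup_fn unescape, std::size_t max_len = 32)
    : in_(in), unescape_(std::move(unescape)), max_len_(max_len) {}

  // Returns the next character, or -1 at end of input.
  int next_char() {
    // Characters left over from a failed escape are handed back first.
    if (!queue_.empty()) {
      wchar_t c = queue_.front();
      queue_.pop_front();
      return static_cast<int>(c);
    }
    wchar_t c;
    if (!in_.get(c)) return -1;
    if (c != L'\\') return static_cast<int>(c);

    // Opening backslash: read ahead until a closing backslash, the end
    // of input (or an I/O error), or the length limit.
    queue_.push_back(c);
    bool closed = false;
    while (!closed && queue_.size() < max_len_ && in_.get(c)) {
      queue_.push_back(c);
      closed = (c == L'\\');
    }
    if (closed) {
      int found = unescape_(std::wstring(queue_.begin(), queue_.end()));
      if (found >= 0) {
        queue_.clear();          // the escape form disappears entirely
        return found;
      }
    }
    // Not an escape: give back the backslash, keep the rest queued.
    queue_.pop_front();
    return static_cast<int>(L'\\');
  }

private:
  std::wistream& in_;
  lookup_fn unescape_;
  std::size_t max_len_;
  std::deque<wchar_t> queue_;
};
```

Fed a std::wistringstream holding a\lambda\b and a lookup that knows \lambda\, this sketch hands back 'a', then code point 955, then 'b'. An unknown form such as \no\ comes back one character at a time, starting with the backslash.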

So, anyway; the escapist is a very simple object. It didn’t even need a custom constructor.
