D 2017-10-11T23:01:13.166
L handling\sunicode
N text/x-markdown
P ddbe4b5d463909c0c250c3b2325fbb2f9716f1e4
U aku
W 2286
Up: [Notes](wiki?name=Notes)

# Unicode

* Input is generally expected to be UTF-8 encoded.
* Characters are __byte__-sequences, 1 to 4 bytes long.

# Engine

* The lexer engine takes bytes as input, not characters.

# Grammars

* To compensate for the byte-based engine, character-based grammars are rewritten into a byte-based form in which multi-byte characters are represented by rules reflecting their byte sequences.
* Character classes become alternatives of characters, which in turn are byte sequences.
* It is possible to optimize the above into a set of alternatives of sequences of byte-ranges.

See the implementations of [Normalization](https://core.tcl.tk/akupries/marpa/artifact/9afbcbfde5c5c546?ln=117-244) and [Reduction](https://core.tcl.tk/akupries/marpa/artifact/9afbcbfde5c5c546?ln=496-765). Normalization simplifies literals without touching their literal-ness, i.e. the result of normalizing a literal is still a literal. Reduction goes further and can break a literal apart into a collection of priority-rules representing sequences and alternates of simpler literals.

Regardless, at the bottom the engine has to support only bytes and byte-ranges, or even only bytes, with the ranges rewritten into alternations. Such an alternation is finite, with at most 256 branches for the full range [00-ff].

As a side effect we can support the full range of Unicode character classes, despite Tcl itself not supporting them.

__Note__: The current C runtime supports only bytes, and the grammar reducer targeting it breaks byte-ranges apart as well.

# Relevant references

* Russ Cox's [Regular Expression Matching in the Wild](https://swtch.com/~rsc/regexp/regexp3.html), see Step 3.
* Google's [RE2](https://github.com/google/re2) (also Russ Cox)
* BurntSushi's Rust crate for [utf8-ranges](https://github.com/BurntSushi/utf8-ranges)
* Lucene's [UTF-32 to UTF-8](https://github.com/apache/lucene-solr/blob/ae93f4e7ac6a3908046391de35d4f50a0d3c59ca/lucene/core/src/java/org/apache/lucene/util/automaton/UTF32ToUTF8.java)
* [Stefan](http://wiki.tcl.tk/44258)'s [Tcl/Regex re-implementation](https://chiselapp.com/user/stefank/repository/tclstuff/doc/trunk/www/regex.html)
* [Generic unicode table reader (python)](https://github.com/google/re2/blob/master/re2/unicode.py)
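
# Sketch: character range to byte-range sequences

The rewrite described under Grammars, from a range of characters to a set of alternatives of sequences of byte-ranges, can be illustrated with a short Python sketch. It follows the splitting approach used by the utf8-ranges crate listed in the references and is not taken from the marpa sources. All function names (`utf8_sequences`, `format_sequence`, `expand_to_bytes`) are hypothetical and exist only for this illustration.

```python
# Illustrative sketch only. Names are hypothetical, not part of marpa.
SURROGATE_LO, SURROGATE_HI = 0xD800, 0xDFFF
MAX_BY_LENGTH = (0x7F, 0x7FF, 0xFFFF)  # largest code point for 1, 2, 3 bytes


def utf8_sequences(lo, hi):
    """Yield tuples of (low_byte, high_byte) pairs covering code points lo..hi."""
    if lo > hi:
        return
    # Surrogates have no UTF-8 encoding, so cut them out of the range.
    if lo <= SURROGATE_HI and hi >= SURROGATE_LO:
        yield from utf8_sequences(lo, SURROGATE_LO - 1)
        yield from utf8_sequences(SURROGATE_HI + 1, hi)
        return
    # Split wherever the encoded length changes.
    for limit in MAX_BY_LENGTH:
        if lo <= limit < hi:
            yield from utf8_sequences(lo, limit)
            yield from utf8_sequences(limit + 1, hi)
            return
    # A pure ASCII range is a single byte-range.
    if hi <= 0x7F:
        yield ((lo, hi),)
        return
    # Split until each trailing byte position is either fixed or spans
    # the whole continuation range, so that the byte-wise ranges of the
    # two end points describe exactly the code points in between.
    for i in (1, 2, 3):
        mask = (1 << (6 * i)) - 1
        if (lo & ~mask) != (hi & ~mask):
            if lo & mask != 0:
                yield from utf8_sequences(lo, lo | mask)
                yield from utf8_sequences((lo | mask) + 1, hi)
                return
            if hi & mask != mask:
                yield from utf8_sequences(lo, (hi & ~mask) - 1)
                yield from utf8_sequences(hi & ~mask, hi)
                return
    first = chr(lo).encode("utf-8")
    last = chr(hi).encode("utf-8")
    yield tuple(zip(first, last))


def format_sequence(seq):
    """Render one sequence in the [00-ff] notation used on this page."""
    return "".join(
        "[%02X]" % a if a == b else "[%02X-%02X]" % (a, b) for a, b in seq
    )


def expand_to_bytes(seq):
    """Rewrite each byte-range into an explicit alternation of single bytes."""
    return [list(range(a, b + 1)) for a, b in seq]


if __name__ == "__main__":
    for seq in utf8_sequences(0x0000, 0x10FFFF):
        print(format_sequence(seq))
```

Run as a script, this prints one alternative per line for the full Unicode range, starting with `[00-7F]` and `[C2-DF][80-BF]`. `expand_to_bytes` shows the further step needed for an engine that supports only bytes: each range becomes an alternation of at most 256 single bytes.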