Marpa: Artifact [a6c2e0580e]

Artifact a6c2e0580ec0298807b4c908ad29f9fdee521a25:

Wiki page [handling unicode] by aku 2017-10-11 23:01:13.
Up: [Notes](wiki?name=Notes)

# Unicode

 * Generally expect the input in UTF-8 encoding.
 * Characters are __byte__-sequences, 1 to 4 bytes long.
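
As a small illustration of the 1 to 4 byte claim, the sketch below encodes a few sample characters and prints their byte sequences. The helper name `utf8_bytes` and the sample characters are made up for this example and are not part of Marpa.

```python
def utf8_bytes(char):
    """Return the UTF-8 encoding of one character as a list of byte values."""
    return list(char.encode("utf-8"))


# One sample per encoding length: U+0041, U+00E9, U+20AC, U+1F600.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

for ch in samples:
    encoded = utf8_bytes(ch)
    hex_bytes = " ".join(f"{b:02X}" for b in encoded)
    print(f"U+{ord(ch):04X} -> {len(encoded)} byte(s): {hex_bytes}")
```

The four samples encode to 1, 2, 3 and 4 bytes respectively, which is the full spread a byte-based engine has to deal with.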

# Engine

 * The lexer engine takes bytes as input, not characters.

# Grammars

 * To compensate for the byte-based engine above, character-based grammars are rewritten into a byte-based form in which multi-byte characters are represented by rules reflecting their byte sequences.
 * Character classes become alternatives of characters, which in turn are byte sequences.
 * It is possible to optimize the above into a set of alternatives of sequences of byte-ranges.
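
The rewrites in this list can be sketched roughly as follows. This is not the Marpa code. It is a minimal Python sketch, assuming the range-splitting approach described by the utf8-ranges crate listed in the references. All function names here (`class_to_alternatives`, `utf8_sequences`, `matches`) are hypothetical. The first function is the naive form (one byte sequence per character), the second is the optimized form (alternatives of sequences of byte ranges).

```python
def class_to_alternatives(chars):
    """Naive rewrite: one UTF-8 byte sequence per character of the class."""
    return [list(ch.encode("utf-8")) for ch in sorted(set(chars))]


def utf8_sequences(lo, hi):
    """Split the code point range [lo, hi] into sequences of byte ranges.

    Each result is a list of (low_byte, high_byte) pairs, one pair per
    byte position. Surrogates (U+D800..U+DFFF) are skipped.
    """
    if lo > hi:
        return []
    # Surrogates have no UTF-8 encoding: split around them.
    if lo <= 0xDFFF and hi >= 0xD800:
        return utf8_sequences(lo, 0xD7FF) + utf8_sequences(0xE000, hi)
    # Split where the encoded length changes (1 -> 2 -> 3 -> 4 bytes).
    for limit in (0x7F, 0x7FF, 0xFFFF):
        if lo <= limit < hi:
            return utf8_sequences(lo, limit) + utf8_sequences(limit + 1, hi)
    if hi <= 0x7F:
        return [[(lo, hi)]]
    # Split until every continuation byte covers a full, aligned sub-range.
    for i in (1, 2, 3):
        mask = (1 << (6 * i)) - 1
        if (lo & ~mask) != (hi & ~mask):
            if (lo & mask) != 0:
                return (utf8_sequences(lo, lo | mask)
                        + utf8_sequences((lo | mask) + 1, hi))
            if (hi & mask) != mask:
                return (utf8_sequences(lo, (hi & ~mask) - 1)
                        + utf8_sequences(hi & ~mask, hi))
    # Both ends now encode to the same length: pair the bytes up.
    first = chr(lo).encode("utf-8")
    last = chr(hi).encode("utf-8")
    return [list(zip(first, last))]


def matches(sequences, data):
    """True if the byte string `data` matches one of the range sequences."""
    return any(
        len(seq) == len(data)
        and all(low <= b <= high for (low, high), b in zip(seq, data))
        for seq in sequences
    )


# All 2-byte characters collapse into a single sequence of two byte ranges.
print(utf8_sequences(0x80, 0x7FF))  # [[(194, 223), (128, 191)]]
```

The naive form needs one alternative per character, so a class like U+0080..U+07FF would expand to 1920 byte sequences. The range form describes the same class with a single two-step sequence, `[C2-DF][80-BF]`.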

Implementations exist for both rewriting steps:
[Normalization](https://core.tcl.tk/akupries/marpa/artifact/9afbcbfde5c5c546?ln=117-244)
and [Reduction](https://core.tcl.tk/akupries/marpa/artifact/9afbcbfde5c5c546?ln=496-765).

The first simplifies literals without touching their literal-ness, i.e. the result of normalizing a literal is still a literal. The second goes further and can break a literal apart into a collection of priority rules representing sequences and alternations of simpler literals.

Regardless, at the bottom the engine has to support only bytes and byte-ranges, or even only bytes, with the ranges rewritten into alternations. Each such alternation is finite, with at most 256 alternatives for the full range [00-ff].
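
Rewriting a byte-range into an alternation of single bytes is a simple enumeration. The sketch below is only an illustration. The names `expand_byte_range` and `expand_sequence` are hypothetical, and a sequence is assumed to be a list of `(low, high)` byte pairs.

```python
def expand_byte_range(lo, hi):
    """Rewrite the byte range [lo, hi] into an alternation of single bytes."""
    if not 0 <= lo <= hi <= 0xFF:
        raise ValueError(f"not a valid byte range: {lo!r}-{hi!r}")
    return list(range(lo, hi + 1))


def expand_sequence(sequence):
    """Expand every (lo, hi) range in a sequence of byte ranges.

    The result has one alternation (a list of bytes) per byte position.
    """
    return [expand_byte_range(lo, hi) for lo, hi in sequence]


# A short range becomes four alternatives.
print([f"{b:02x}" for b in expand_byte_range(0x80, 0x83)])

# The worst case is the full range [00-ff].
print(len(expand_byte_range(0x00, 0xFF)))  # 256
```

Since no alternation can exceed 256 entries, an engine that only understands single bytes can still represent every range, at the cost of larger grammars.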

As a side effect, we can support the full range of Unicode character classes, despite Tcl itself not supporting them.
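
To illustrate why this works: a Unicode character class is just a set of code point ranges, and any such set can be rewritten into bytes as described above. The sketch below derives the ranges for one general category from Python's `unicodedata` tables. This is an assumption made for illustration only, not how Marpa obtains its tables, and the name `category_ranges` is hypothetical.

```python
import unicodedata


def category_ranges(category):
    """Return the sorted, coalesced code point ranges of a general category.

    Example category: "Nd" (decimal digits). The result is a list of
    inclusive (first, last) code point pairs.
    """
    ranges = []
    start = prev = None
    for cp in range(0x110000):
        if unicodedata.category(chr(cp)) != category:
            continue
        if prev is not None and cp == prev + 1:
            prev = cp  # extend the current run
            continue
        if start is not None:
            ranges.append((start, prev))  # close the previous run
        start = prev = cp
    if start is not None:
        ranges.append((start, prev))
    return ranges


digits = category_ranges("Nd")
print(f"{len(digits)} ranges, first: U+{digits[0][0]:04X}-U+{digits[0][1]:04X}")
```

The class includes ranges beyond U+FFFF. Once each range is turned into byte sequences, these are handled like any other character.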

__Note__: The current C runtime supports only bytes, and the grammar reducer targeting it therefore breaks byte-ranges apart as well.

# Relevant references

 * Russ Cox's [Regular Expression Matching in the Wild](https://swtch.com/~rsc/regexp/regexp3.html), see __Step 3__.
 * Google's [RE2](https://github.com/google/re2) (also Russ Cox)
 * BurntSushi's Rust crate for [utf8-ranges](https://github.com/BurntSushi/utf8-ranges)
 * Lucene's [UTF-32 to UTF-8](https://github.com/apache/lucene-solr/blob/ae93f4e7ac6a3908046391de35d4f50a0d3c59ca/lucene/core/src/java/org/apache/lucene/util/automaton/UTF32ToUTF8.java)
 * [Stefan](http://wiki.tcl.tk/44258)'s [Tcl/Regex re-implementation](https://chiselapp.com/user/stefank/repository/tclstuff/doc/trunk/www/regex.html)
 * [Generic Unicode table reader (Python)](https://github.com/google/re2/blob/master/re2/unicode.py)