Marpa

handling unicode

Unicode

Engine

Grammars

There are two implementations for processing literals: normalization and reduction.

Normalization simplifies literals without touching their literal-ness, i.e. the result of normalizing a literal is still a literal. Reduction goes further: it can break a literal apart into a collection of priority rules representing sequences and alternations of simpler literals.
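To make the idea of reduction concrete, here is a minimal sketch in Python. It is not the actual reducer: the node shapes (`"byte"`, `"seq"`, `"alt"`) and the function names `reduce_char` and `reduce_class` are hypothetical, chosen only for illustration. It shows a character literal becoming a sequence of byte literals, and a small character class becoming an alternation of such sequences.

```python
# Hypothetical node shapes, assumed only for this sketch:
#   ("byte", b)        a single byte literal
#   ("seq", [nodes])   the nodes matched one after another
#   ("alt", [nodes])   any one of the nodes


def reduce_char(codepoint):
    """Reduce a one-character literal to a sequence of byte literals."""
    # UTF-8 encoding of the code point; Python raises for surrogates.
    encoded = chr(codepoint).encode("utf-8")
    return ("seq", [("byte", b) for b in encoded])


def reduce_class(codepoints):
    """Reduce a character class, given as code points, to an alternation."""
    return ("alt", [reduce_char(cp) for cp in sorted(set(codepoints))])
```

For example, `reduce_char(0xE9)` (the character é) yields a sequence of the two byte literals C3 and A9. Enumerating every member of a class this way only works for small classes; larger ones need ranges.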

Either way, at the lowest level the engine has to support only bytes and byte ranges, or even only bytes, with each range rewritten into an alternation of the bytes it contains. Such an alternation is always finite: at most 256 alternatives, for the full range [00-ff].
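Rewriting a byte range into an alternation is a plain enumeration. A minimal sketch, with the hypothetical helper name `expand_byte_range` and alternatives represented simply as integers:

```python
def expand_byte_range(lo, hi):
    """Return the alternatives (single byte values) covered by [lo-hi]."""
    if not 0 <= lo <= hi <= 0xFF:
        raise ValueError("expected 0 <= lo <= hi <= 0xFF")
    return list(range(lo, hi + 1))


# The full range [00-ff] is the worst case.
full = expand_byte_range(0x00, 0xFF)
```

Here `len(full)` is 256, which is the upper bound on the size of any such alternation.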

As a side effect we can support the full range of Unicode character classes, even though Tcl itself does not support them.
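One way to see why this works: a range of code points can be turned into sequences of byte ranges using integer arithmetic alone, so the host language's string type never has to hold the characters. The sketch below uses the range-splitting technique familiar from UTF-8-aware regular expression engines. It is an assumption that something along these lines applies here, not a description of the actual reducer, and the name `utf8_sequences` is hypothetical.

```python
MAX_BY_LENGTH = (0x7F, 0x7FF, 0xFFFF)  # largest code point per UTF-8 length


def utf8_sequences(lo, hi):
    """Split code points lo..hi into sequences of (low, high) byte ranges."""
    if lo > hi:
        return []
    # Surrogates have no UTF-8 encoding; cut them out of the range.
    if lo <= 0xDFFF and hi >= 0xD800:
        return utf8_sequences(lo, 0xD7FF) + utf8_sequences(0xE000, hi)
    # Split where the encoded length changes.
    for limit in MAX_BY_LENGTH:
        if lo <= limit < hi:
            return utf8_sequences(lo, limit) + utf8_sequences(limit + 1, hi)
    if hi <= 0x7F:
        return [[(lo, hi)]]
    # Split until every trailing byte position covers a contiguous range.
    for i in (1, 2, 3):
        mask = (1 << (6 * i)) - 1
        if lo & ~mask != hi & ~mask:
            if lo & mask != 0:
                return (utf8_sequences(lo, lo | mask)
                        + utf8_sequences((lo | mask) + 1, hi))
            if hi & mask != mask:
                return (utf8_sequences(lo, (hi & ~mask) - 1)
                        + utf8_sequences(hi & ~mask, hi))
    # Both ends now encode to the same length; pair the bytes up.
    first = chr(lo).encode("utf-8")
    last = chr(hi).encode("utf-8")
    return [list(zip(first, last))]
```

Calling `utf8_sequences(0x0, 0x10FFFF)` produces nine sequences, from `[00-7F]` up to `[F4][80-8F][80-BF][80-BF]`, covering all of Unicode including the code points beyond the Basic Multilingual Plane.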

Note: The current C runtime supports only bytes, so the grammar reducer targeting it breaks byte ranges apart as well.

Relevant references