Many hyperlinks are disabled.
Use anonymous login to enable hyperlinks.

Overview

Artifact ID:	a6c2e0580ec0298807b4c908ad29f9fdee521a25
Page Name:	handling unicode
Date:	2017-10-11 23:01:13
Original User:	aku
Mimetype:	text/x-markdown
Parent:	ddbe4b5d463909c0c250c3b2325fbb2f9716f1e4 (diff)

Content

Up: Notes

Unicode

Generally expect the input in UTF-8 encoding.
Characters are byte-sequences, 1 to 4 bytes long.

Engine

The lexer engine takes bytes as input, not characters.

Grammars

To compensate for the engine above character-based grammars are rewritten into a byte-based form where multi-byte characters are represented by rules reflecting their byte sequences.
Character classes become alternatives of characters, which in turn are byte sequences.
It is possible to optimize the above into a set of alternatives of sequences of byte-ranges.

Implementations for Normalization and Reduction.

The first simplifies literals without touching their literal-ness, i.e. the result of normalizing a literal is still a literal. The second goes further, able to break a literal apart into a collection of priority-rules representing sequences and alternates of simpler literals.

Regardless, at the bottom the engine has to support only bytes and byte-ranges, or even only bytes, with the ranges rewritten into alternations. (Finite, at most 256 for a full range [00-ff]).

As a side effect we can support the full range of unicode character classes, despite Tcl itself not supporting them.

Note: The current C runtime supports only bytes and the grammar reducer targeting it breaks byte-ranges apart as well.

Relevant references

Russ Cox's Regular Expression Matching in the Wild, see Step 3.
Google's RE2 (also Russ Cox)
BurntSushi's Rust crate for utf8-ranges
Lucene's UTF-32 to UTF-8
Stefan's Tcl/Regex re-implementation
Generic unicode table reader (python)

Marpa

Update of ”handling unicode”

Unicode

Engine

Grammars

Relevant references