Up: Notes
Unicode
- Generally, the input is expected to be in UTF-8 encoding.
- Characters are byte sequences, 1 to 4 bytes long.
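
A quick illustration of the two points above (plain Python, not part of the engine): each character encodes to a sequence of one to four UTF-8 bytes.

```python
# Not project code: show that UTF-8 characters are byte sequences of 1-4 bytes.
for ch in ("A", "ß", "€", "𝄞"):
    utf8 = ch.encode("utf-8")
    print(f"U+{ord(ch):05X}  {len(utf8)} byte(s): {utf8.hex(' ')}")
```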
Engine
- The lexer engine takes bytes as input, not characters.
Grammars
- To compensate for the byte-oriented engine above, character-based grammars are rewritten into a byte-based form where multi-byte characters are represented by rules reflecting their byte sequences.
- Character classes become alternatives of characters, which in turn are byte sequences.
- It is possible to optimize the above into a set of alternatives of sequences of byte-ranges.
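
A minimal sketch of the naive rewrite described in this list, in Python with hypothetical helper names (the engine's actual data structures are not shown): every character in a class becomes one alternative, and each alternative is the sequence of that character's UTF-8 bytes.

```python
def char_to_byte_sequence(ch: str) -> list[int]:
    """A single character is represented by the sequence of its UTF-8 bytes."""
    return list(ch.encode("utf-8"))

def class_to_alternatives(chars: str) -> list[list[int]]:
    """A character class becomes an alternation, one byte sequence per member."""
    return [char_to_byte_sequence(c) for c in chars]

# The class [aä€] turns into three alternatives of 1, 2, and 3 bytes each.
for seq in class_to_alternatives("aä€"):
    print(" ".join(f"{b:02x}" for b in seq))
```

The optimization mentioned in the last bullet then collapses runs of adjacent alternatives into sequences of byte ranges; the references below describe ways of doing this.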
Relevant references
- Russ Cox's "Regular Expression Matching in the Wild" (see Step 3).
- Google's RE2 (also by Russ Cox).
- BurntSushi's utf8-ranges Rust crate.
- Lucene's UTF-32 to UTF-8 conversion.
- Stefan's Tcl/Regex re-implementation.
Regardless of the approach chosen, at the bottom the engine only has to support bytes and byte-ranges, or even just bytes, with the ranges rewritten into alternations (finite, at most 256 alternatives for the full range [00-ff]).
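
A minimal sketch of that last reduction (hypothetical names, plain Python): a byte range is a finite set, so an engine that only understands single bytes can still handle it as an explicit alternation.

```python
def byte_range_to_alternation(lo: int, hi: int) -> list[int]:
    """Expand a byte range [lo-hi] into the explicit list of alternative bytes."""
    return list(range(lo, hi + 1))

# The full range [00-ff] yields 256 alternatives; the UTF-8 continuation
# byte range [80-bf] yields 64.
print(len(byte_range_to_alternation(0x00, 0xFF)))  # 256
print(len(byte_range_to_alternation(0x80, 0xBF)))  # 64
```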
As a side effect, we can support the full range of Unicode character classes, despite Tcl itself not supporting them.