Up: Notes
Unicode
- Generally, the input is expected to be in UTF-8 encoding.
- Characters are byte sequences, 1 to 4 bytes long.
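
A quick illustration of the two points above (plain Python, not part of the engine): each character encodes to a sequence of one to four UTF-8 bytes.

```python
# Not project code: show that UTF-8 characters are byte sequences of 1-4 bytes.
for ch in ("A", "ß", "€", "𝄞"):
    utf8 = ch.encode("utf-8")
    print(f"U+{ord(ch):05X}  {len(utf8)} byte(s): {utf8.hex(' ')}")
```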
Engine
- The lexer engine takes bytes as input, not characters.
Grammars
- To compensate for the byte-oriented engine above, character-based grammars are rewritten into a byte-based form where multi-byte characters are represented by rules reflecting their byte sequences.
- Character classes become alternatives of characters, which in turn are byte sequences.
- It is possible to optimize the above into a set of alternatives of sequences of byte-ranges.
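
A minimal sketch of the naive rewrite described in this list, in Python with hypothetical helper names (the engine's actual data structures are not shown): every character in a class becomes one alternative, and each alternative is the sequence of that character's UTF-8 bytes.

```python
def char_to_byte_sequence(ch: str) -> list[int]:
    """A single character is represented by the sequence of its UTF-8 bytes."""
    return list(ch.encode("utf-8"))

def class_to_alternatives(chars: str) -> list[list[int]]:
    """A character class becomes an alternation, one byte sequence per member."""
    return [char_to_byte_sequence(c) for c in chars]

# The class [aä€] turns into three alternatives of 1, 2, and 3 bytes each.
for seq in class_to_alternatives("aä€"):
    print(" ".join(f"{b:02x}" for b in seq))
```

The optimization mentioned in the last bullet then collapses runs of adjacent alternatives into sequences of byte ranges; the references below describe ways of doing this.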
Relevant references
- Russ Cox's "Regular Expression Matching in the Wild" (see Step 3).
- Google's RE2 (also by Russ Cox).
- BurntSushi's utf8-ranges Rust crate.
- Lucene's UTF-32 to UTF-8 conversion.
- Stefan's Tcl/Regex re-implementation.
Regardless of the approach chosen, at the bottom the engine only has to support bytes and byte-ranges, or even just bytes, with the ranges rewritten into alternations (finite, at most 256 alternatives for the full range [00-ff]).
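
A minimal sketch of that last reduction (hypothetical names, plain Python): a byte range is a finite set, so an engine that only understands single bytes can still handle it as an explicit alternation.

```python
def byte_range_to_alternation(lo: int, hi: int) -> list[int]:
    """Expand a byte range [lo-hi] into the explicit list of alternative bytes."""
    return list(range(lo, hi + 1))

# The full range [00-ff] yields 256 alternatives; the UTF-8 continuation
# byte range [80-bf] yields 64.
print(len(byte_range_to_alternation(0x00, 0xFF)))  # 256
print(len(byte_range_to_alternation(0x80, 0xBF)))  # 64
```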
As a side effect, we can support the full range of Unicode character classes, despite Tcl itself not supporting them.