marpa_unicode - Marpa/Tcl - marpa::unicode
Welcome to Marpa/Tcl, a Tcl binding to the "libmarpa" parsing engine.
Please read the document Marpa/Tcl - Introduction to Marpa/Tcl, if you have not done so already. It provides an overview of the whole system.
This document describes a mainly internal package of Marpa/Tcl.
The package's commands provide access to information about unicode characters that is useful to parsers and parser generators, such as case folding classes and named character classes. Note for reading: Unknown terms and shorthands are explained in the Glossary.
This command takes a character class represented by an SCR and returns the equivalent ASBR representation.
An error will be thrown if the input is not a valid SCR.
This command takes a character class represented by an SCR and returns the equivalent ASSR representation.
An error will be thrown if the input is not a valid SCR.
This command takes a possible unicode codepoint and returns a list of codepoints in the BMP representing it.
The returned list will contain only the argument itself if the codepoint is in the BMP. For a codepoint in the SMP however the list will contain the two surrogates representing that codepoint.
An error will be thrown if the argument is not a valid codepoint.
    % marpa::unicode::2char 65
    65
    % marpa::unicode::2char [marpa::unicode::smp]
    55296 56320
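The split into surrogates described above can be sketched in plain Tcl. This is only an illustration of the standard surrogate arithmetic, not the package's implementation; the procedure name to_bmp_chars is hypothetical.

```tcl
# Hypothetical helper, not part of marpa::unicode: return the list of
# BMP codepoints representing the given codepoint.
proc to_bmp_chars {codepoint} {
    if {![string is integer -strict $codepoint]
        || $codepoint < 0 || $codepoint > 0x10FFFF} {
        error "not a valid codepoint: $codepoint"
    }
    # Codepoints in the BMP represent themselves.
    if {$codepoint <= 0xFFFF} {
        return [list $codepoint]
    }
    # Anything above the BMP is split into a high and a low surrogate,
    # each carrying 10 bits of the offset from 0x10000.
    set offset [expr {$codepoint - 0x10000}]
    set high   [expr {0xD800 + ($offset >> 10)}]
    set low    [expr {0xDC00 + ($offset & 0x3FF)}]
    return [list $high $low]
}

puts [to_bmp_chars 65]        ;# 65
puts [to_bmp_chars 0x10000]   ;# 55296 56320
```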
This command takes a possible unicode codepoint and returns a list of integers in the range [0...255] representing the UTF-8 encoding of that codepoint.
The exact details of this conversion are controlled by the list of optional flags given to the command. See the list below for the details.
An error will be thrown if the input value is not a valid codepoint.
Without any flags the conversion returns the standard UTF-8 encoding for all codepoints. The returned list has a length between 1 and 4.
With the flag cesu, codepoints in the SMP are internally converted to the two surrogates representing them, and the result is the concatenation of the surrogates' UTF-8 encodings. All other codepoints, i.e. those in the BMP, are converted normally. The returned list has a length between 1 and 3, or 6. This is called the CESU-8 encoding.
With the flag mutf, the codepoint 0 is converted as 0xC0 0x80. All other codepoints are converted normally. This is called the Modified UTF-8 encoding, often abbreviated to MUTF-8.
With both mutf and cesu, both MUTF-8 and CESU-8 are applied, as specified in the previous items. This is the encoding Tcl uses internally for its strings.
This is a shorthand for mutf cesu.
    % marpa::unicode::2utf 0 ;# "\0"
    0
    % marpa::unicode::2utf 0 mutf
    192 128
    % marpa::unicode::2utf 65 ;# "A"
    65
    % marpa::unicode::2utf [marpa::unicode::bmp] ;# "\uFFFF"
    239 191 191
    % marpa::unicode::2utf [marpa::unicode::smp] ;# "\U0010000"
    240 144 128 128
    % marpa::unicode::2utf [marpa::unicode::smp] cesu
    237 160 128 237 176 128
    # 55296.... # 56320....
    % marpa::unicode::2utf 55296 ;# "\uD800"
    237 160 128
    % marpa::unicode::2utf 56320 ;# "\uDC00"
    237 176 128
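For readers who want to see where these byte values come from, the following is a minimal sketch of the three encodings in plain Tcl. It is an assumption-laden illustration and not the package's code: the procedure names utf8_bytes and to_utf are hypothetical, and only the flags mutf and cesu shown in the examples are handled.

```tcl
# Hypothetical helper: standard UTF-8 bytes of a codepoint, as integers.
proc utf8_bytes {cp} {
    if {$cp < 0x80} {
        return [list $cp]
    } elseif {$cp < 0x800} {
        return [list [expr {0xC0 | ($cp >> 6)}] \
                    [expr {0x80 | ($cp & 0x3F)}]]
    } elseif {$cp < 0x10000} {
        return [list [expr {0xE0 | ($cp >> 12)}] \
                    [expr {0x80 | (($cp >> 6) & 0x3F)}] \
                    [expr {0x80 | ($cp & 0x3F)}]]
    }
    return [list [expr {0xF0 | ($cp >> 18)}] \
                [expr {0x80 | (($cp >> 12) & 0x3F)}] \
                [expr {0x80 | (($cp >> 6) & 0x3F)}] \
                [expr {0x80 | ($cp & 0x3F)}]]
}

# Hypothetical stand-in for the conversion: flags are passed as extra words.
proc to_utf {cp args} {
    if {![string is integer -strict $cp] || $cp < 0 || $cp > 0x10FFFF} {
        error "not a valid codepoint: $cp"
    }
    # MUTF-8: codepoint 0 becomes the overlong pair 0xC0 0x80.
    if {("mutf" in $args) && $cp == 0} {
        return {192 128}
    }
    # CESU-8: encode both surrogates separately and concatenate.
    if {("cesu" in $args) && $cp > 0xFFFF} {
        set offset [expr {$cp - 0x10000}]
        set high   [expr {0xD800 + ($offset >> 10)}]
        set low    [expr {0xDC00 + ($offset & 0x3FF)}]
        return [concat [utf8_bytes $high] [utf8_bytes $low]]
    }
    return [utf8_bytes $cp]
}

puts [to_utf 0 mutf]          ;# 192 128
puts [to_utf 0x10000]         ;# 240 144 128 128
puts [to_utf 0x10000 cesu]    ;# 237 160 128 237 176 128
```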
This command takes a character class represented by an ASBR and returns a multi-line string containing a human-readable form of the same.
If the boolean argument compact is either not specified or false, one-element ranges will be padded with spaces to vertically align the ranges across alternatives.
    % marpa::unicode::asbr-format { {{192 192} {128 128}} {{1 16}} {{33 45}} }
    [c0] [80]
    |[01-10]
    |[21-2d]
This command takes a character class represented by an ASSR and returns a multi-line string containing a human-readable form of the same.
If the boolean argument compact is either not specified or false, one-element ranges will be padded with spaces to vertically align the ranges across alternatives.
    % marpa::unicode::assr-format { {{0 65535}} {{55296 56319} {56320 57343}} }
    [0000-ffff]
    |[d800-dbff][dc00-dfff]
This command, marpa::unicode::bmp, returns the highest codepoint still in the BMP.
This command takes a unicode codepoint and returns its primary case.
An error will be thrown if the argument is not a valid codepoint.
    % marpa::unicode::data::fold/c 97 ;# "a"
    65 ;# "A"
    % marpa::unicode::data::fold/c 65 ;# "A"
    65 ;# "A"
This command takes a unicode codepoint and returns its CES.
An error will be thrown if the argument is not a valid codepoint.
    % marpa::unicode::data::fold 97 ;# "a"
    65 97 ;# "A" "a"
    % marpa::unicode::data::fold 65 ;# "A"
    65 97 ;# "A" "a"
This command takes a list of unicode codepoints and returns a list of their primary cases. The mapping from argument to result is 1:1.
Note that both argument and result are a limited form of SCR, i.e. one which does not contain codepoint ranges.
An error will be thrown if the argument contains invalid codepoints.
    % marpa::unicode::fold/c {97 66} ;# "aB"
    65 66 ;# "AB"
This command, marpa::unicode::max, returns the highest codepoint supported by Unicode.
This command takes a character class represented by an SCR and returns its normalized complement.
If the optional flag smp is specified and true, the argument is assumed to be fully in the SMP, and the resulting complement will be limited to the SMP as well. The result of applying SMP mode to classes reaching into the BMP is undefined.
An error will be thrown if the input is not a valid SCR.
    % marpa::unicode::negate-class {}
    {0 1114111}
    % marpa::unicode::negate-class {{0 65535}}
    {65536 1114111}
    % marpa::unicode::negate-class {{0 1114111}}
    {}
    % marpa::unicode::negate-class 0
    {1 1114111}
    % marpa::unicode::negate-class 1
    0 {2 1114111}
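One way to picture the complement is as the gaps between the ranges of the class. The sketch below walks an already normalized SCR and collects those gaps. It is a simplified illustration under stated assumptions, not the package's algorithm: the names make_item and complement_scr are hypothetical, the input must already be normalized, and the smp mode is not modelled.

```tcl
# Hypothetical helper: a range of size 1 is emitted as a plain codepoint.
proc make_item {start end} {
    if {$start == $end} { return $start }
    return [list $start $end]
}

# Hypothetical sketch: complement of a normalized SCR over 0...max.
proc complement_scr {scr {max 1114111}} {
    set result {}
    set next 0   ;# lowest codepoint not yet known to be in the class
    foreach item $scr {
        if {[llength $item] == 1} {
            set start $item
            set end   $item
        } else {
            lassign $item start end
        }
        # Everything between the previous range and this one is a gap.
        if {$start > $next} {
            lappend result [make_item $next [expr {$start - 1}]]
        }
        set next [expr {$end + 1}]
    }
    # The tail after the last range belongs to the complement as well.
    if {$next <= $max} {
        lappend result [make_item $next $max]
    }
    return $result
}

puts [complement_scr {}]            ;# {0 1114111}
puts [complement_scr {{0 65535}}]   ;# {65536 1114111}
puts [complement_scr 1]             ;# 0 {2 1114111}
```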
This command takes a character class represented by an SCR and returns the equivalent normalized SCR. If the argument was already normalized the result will be identical to it.
An error will be thrown if the input is not a valid SCR.
    % marpa::unicode::norm-class {}

    % marpa::unicode::norm-class {1 2 3 4}
    {{1 4}}
    % marpa::unicode::norm-class {{10 20} {0 15}}
    {0 20}
    % marpa::unicode::norm-class {10 4 3 20 0}
    0 {3 4} 10 20
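Normalization as described above amounts to sorting the elements by start point and merging whatever overlaps or touches. The following plain-Tcl sketch illustrates the idea under the assumption that the input is a valid SCR; the procedure name normalize_scr is hypothetical and the code is not taken from the package.

```tcl
# Hypothetical sketch: normalize an SCR (assumed valid) by sort-and-merge.
proc normalize_scr {scr} {
    # Treat every single codepoint as a range of size 1.
    set ranges {}
    foreach item $scr {
        if {[llength $item] == 1} {
            lappend ranges [list $item $item]
        } else {
            lappend ranges $item
        }
    }
    # Sort by start point, then merge overlapping or adjacent ranges.
    set merged {}
    foreach range [lsort -integer -index 0 $ranges] {
        lassign $range start end
        if {[llength $merged]} {
            lassign [lindex $merged end] lastStart lastEnd
            if {$start <= $lastEnd + 1} {
                if {$end > $lastEnd} {
                    lset merged end [list $lastStart $end]
                }
                continue
            }
        }
        lappend merged [list $start $end]
    }
    # Emit ranges of size 1 as plain codepoints again.
    set result {}
    foreach range $merged {
        lassign $range start end
        if {$start == $end} {
            lappend result $start
        } else {
            lappend result $range
        }
    }
    return $result
}

puts [normalize_scr {{10 20} {0 15}}]   ;# {0 20}
puts [normalize_scr {10 4 3 20 0}]      ;# 0 {3 4} 10 20
```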
This command takes a Tcl character and returns the integer value of its unicode codepoint. If a multi-character string is provided to the command the result will be the conversion of the first character in that string.
    % marpa::unicode::point \0
    0
    % marpa::unicode::point A
    65
    % marpa::unicode::point Apple
    65
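For comparison, core Tcl offers the same character-to-codepoint conversion through the %c conversion of scan, and the reverse through format. The snippet below uses only these built-in commands and makes no claim about how the package implements its own command.

```tcl
# Built-in Tcl: %c converts between a character and its codepoint.
puts [scan A %c]       ;# 65
puts [scan Apple %c]   ;# 65, only the first character is converted
puts [format %c 65]    ;# A
```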
This command, marpa::unicode::smp, returns the first codepoint in the SMP.
This command takes a character class represented by an SCR and returns its case-expanded form. The result is normalized.
An error will be thrown if the argument is not a valid SCR.
    % marpa::unicode::unfold {66 99}
    {66 67} {98 99}
This command takes the possible name of a unicode character class and returns a boolean flag indicating if this class is directly supported by Tcl (true), or not (false).
    % marpa::unicode::data::cc::have-tcl arabic
    0
    % marpa::unicode::data::cc::have-tcl xdigit
    1
This command takes the possible name of a unicode character class and returns a boolean flag indicating if this is a known unicode character class (true), or not (false).
    % marpa::unicode::data::cc::have arabic
    1
    % marpa::unicode::data::cc::have foo
    0
    % marpa::unicode::data::cc::have xdigit
    1
This command returns a list containing the names of the known unicode character classes.
Note that the package not only knows the standard named unicode character classes, but for any such class C also "C:bmp" and "C:smp", which are C intersected with (i.e. limited to) the BMP and SMP respectively.
This command takes the name of a unicode character class and returns the SCR representing that class. The result is normalized.
An error is thrown if the argument is not a known unicode character class.
As an extension the command further accepts names of the form %foo where foo is a known unicode character class. In these cases the result is the normalized SCR of the specified character class, after case expansion.
    % marpa::unicode::data::cc::ranges adlam
    {0x1E900 0x1E94A} {0x1E950 0x1E959} {0x1E95E 0x1E95F}
This command returns a list containing the names of the character classes directly supported by Tcl itself (via string is).
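As a reminder of what "directly supported by Tcl" means in practice, the built-in string is command tests a string against one of its own class names. The snippet uses core Tcl only; which class names the package reports is not shown here.

```tcl
# Built-in Tcl: "string is <class>" checks every character of the string.
# -strict makes the empty string fail instead of trivially succeed.
puts [string is xdigit -strict 7f]   ;# 1
puts [string is alpha  -strict a1]   ;# 0
puts [string is digit  -strict 42]   ;# 1
```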
Character classes (abbreviated CC) are simply sets of unicode codepoints.
The most trivial representation would be a list of these codepoints. Given the space this would need for larger classes, this package uses a number of more compressed representations, starting with SCR (horizontal compression of ranges) and continuing to ASBR and ASSR (vertical compression of ranges in aligned byte/codepoint positions of the element representations).
Beyond the operations on such unnamed character classes, which are represented just by values, this package also knows a number of named character classes, the definitions of which are extracted from the Unicode standard. These classes represent various categories of codepoints on the one hand, like alpha, blank, control, etc., and the known unicode scripts on the other, like arabic, braille, canadian_aboriginal, etc.
A set of codepoint (ranges) (short for set of codepoints and codepoint ranges, abbreviated to SCR) is the main data structure used to represent unicode character classes of any kind, named or not.
It is a Tcl list whose elements are a mix of codepoints and codepoint ranges.
The codepoints are represented by integer numbers in the range [0...marpa::unicode::max]. Numbers outside of that range are not codepoints and a list containing such is not a valid SCR.
The ranges are represented by 2-element lists (pairs) of codepoints, the start and the end of the range, inclusive. Beyond having to be valid codepoints, the start must not be greater than the end of the range. A pair violating this rule is not a valid range, and a list containing such a pair is not a valid SCR.
A normalized SCR is defined as an SCR which contains no duplicate elements and no overlapping or adjacent ranges, and whose elements are sorted in ascending integer order by their start point. Note: That the previous sentence talks only about ranges does not exclude single codepoints. For normalization and other purposes codepoints can simply be treated as ranges of size 1, where start and end points are identical.
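The validity rules for an SCR given above can be written down as a small checker. This is a sketch for illustration only; the procedure name is_valid_scr is hypothetical, and the upper limit is passed as a plain number (1114111) rather than queried from the package.

```tcl
# Hypothetical sketch: check a value against the SCR validity rules.
proc is_valid_scr {scr {max 1114111}} {
    # The value as a whole has to be a proper Tcl list.
    if {[catch {llength $scr}]} { return 0 }
    foreach item $scr {
        # Each element is a codepoint (1 word) or a range (2 words).
        if {[catch {llength $item} size]} { return 0 }
        if {$size != 1 && $size != 2} { return 0 }
        foreach point $item {
            if {![string is integer -strict $point]
                || $point < 0 || $point > $max} {
                return 0
            }
        }
        # The start of a range must not be greater than its end.
        if {$size == 2 && [lindex $item 0] > [lindex $item 1]} {
            return 0
        }
    }
    return 1
}

puts [is_valid_scr {1 {3 9} 20}]   ;# 1
puts [is_valid_scr {{9 3}}]        ;# 0, start greater than end
puts [is_valid_scr {1114112}]      ;# 0, beyond the highest codepoint
```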
A set of alternatives of sequences of byte-ranges (abbreviated to ASBR) is an alternative (sic!) and (usually) more compact representation of character classes.
This is the go-to representation of character classes for the byte-oriented engine provided by marpa::runtime::c, as it can be directly mapped to the grammar rules for such.
It makes use of the fact that the UTF-8 encoding of unicode codepoints maps each codepoint to a sequence of bytes, and then compresses the resulting "alternatives of byte sequences" by merging adjacent alternatives differing in a single position into a new alternative where the differing bytes become a single byte range (this is where "adjacent" is critical).
Such a compression can be done very efficiently (in time and space) by generating the alternatives in sorted order and then comparing the newly generated against the last processed.
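The merge step can be sketched as follows. The code takes byte sequences that are already generated in sorted order, compares each new alternative against the last one processed, and keeps merging as long as the two differ in exactly one position whose ranges are adjacent. This is a simplified reading of the description above, not the package's implementation; the names merge_alternatives and compress_sequences are hypothetical.

```tcl
# Hypothetical helper: merge alternative b into a. Both are lists of
# {start end} byte ranges. Returns the merged alternative, or an empty
# list if the two cannot be merged.
proc merge_alternatives {a b} {
    if {[llength $a] != [llength $b]} { return {} }
    set diffAt -1
    for {set i 0} {$i < [llength $a]} {incr i} {
        if {[lindex $a $i] ne [lindex $b $i]} {
            if {$diffAt >= 0} { return {} }   ;# differs in two positions
            set diffAt $i
        }
    }
    if {$diffAt < 0} { return $a }            ;# identical alternatives
    lassign [lindex $a $diffAt] aStart aEnd
    lassign [lindex $b $diffAt] bStart bEnd
    if {$bStart != $aEnd + 1} { return {} }   ;# ranges are not adjacent
    return [lreplace $a $diffAt $diffAt [list $aStart $bEnd]]
}

# Hypothetical sketch: compress sorted byte sequences into an ASBR.
proc compress_sequences {sequences} {
    set result {}
    foreach bytes $sequences {
        # A fresh alternative consists of one-byte ranges only.
        set alt {}
        foreach byte $bytes { lappend alt [list $byte $byte] }
        # Merge with the last alternative for as long as that is possible.
        while {[llength $result]} {
            set merged [merge_alternatives [lindex $result end] $alt]
            if {![llength $merged]} break
            set result [lrange $result 0 end-1]
            set alt $merged
        }
        lappend result $alt
    }
    return $result
}

# UTF-8 bytes of the codepoints 128, 129, 192, 193.
puts [compress_sequences {{194 128} {194 129} {195 128} {195 129}}]
# {{194 195} {128 129}}
```

Note how the last input sequence first extends the byte range of its own alternative and only then allows the two alternatives to merge in the leading position.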
A set of alternatives of sequences of surrogate-ranges (abbreviated to ASSR) is an alternative (sic!) and (usually) more compact representation of character classes.
The basic principle of structure and generation is the same as for ASBR. In contrast however the elements of the alternatives are codepoints limited to the BMP, with SMP characters in the argument represented as pairs of surrogates before compression.
This is the go-to representation of character classes for the character-oriented engine provided by marpa::runtime::tcl, due to its Tcl-imposed limitation to the BMP, as it can be directly mapped to the grammar rules for such.
The main references needed are https://en.wikipedia.org/wiki/Unicode and https://en.wikipedia.org/wiki/UTF-8.
BMP is short for the Basic Multi-lingual Plane of unicode.
The BMP runs from codepoint 0 to codepoint marpa::unicode::bmp, inclusive.
Characters in the BMP can be encoded by UTF-8 into a sequence of one to three bytes.
SMP is short for the Supplemental Multi-lingual Planes of unicode. Starting just after the BMP they cover the remainder of unicode.
The SMP runs from codepoint marpa::unicode::smp to marpa::unicode::max, inclusive.
Characters in the SMP can be encoded by UTF-8 into a sequence of 4 bytes. For systems limited to the BMP (like Tcl) such characters can also be encoded as a pair of BMP surrogate characters. This allows encoding them in six bytes of a pseudo UTF-8 encoding. This is called the CESU-8 encoding. (See also the description of the flags for marpa::unicode::2utf).
CES is shorthand for case-equivalent set.
Each unicode codepoint C is associated with a set of codepoints representing the different cases of the same character. These are the case-equivalent codepoints of C. C itself is of course a member of this set.
As an example, the CES of 97 is {65, 97}. In more human-readable terms, the character "a" has the variants "A" and "a", upper- and lower-case.
As a second example, codepoint 38, i.e. "&", is a codepoint whose CES contains only itself.
Note that a CES is a character class.
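A rough feel for case-equivalent sets can be had with core Tcl alone. The sketch below approximates a CES from the built-in upper- and lower-case mappings. That approximation is an assumption made for illustration: the package works from Unicode case folding data, which can yield larger sets, and the sketch is only meant for codepoints in the BMP. The name approx_ces is hypothetical.

```tcl
# Hypothetical sketch: approximate the CES of a BMP codepoint using
# Tcl's own case mappings. The result is sorted, so its first element
# is the smallest-numbered member of the set.
proc approx_ces {codepoint} {
    set char    [format %c $codepoint]
    set members [list $codepoint]
    foreach variant [list [string toupper $char] [string tolower $char]] {
        # Guard against mappings which are not a single character.
        if {[string length $variant] == 1} {
            lappend members [scan $variant %c]
        }
    }
    return [lsort -integer -unique $members]
}

puts [approx_ces 97]   ;# 65 97
puts [approx_ces 38]   ;# 38
```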
For each unicode codepoint, the smallest-numbered codepoint in its CES is called its primary case.
Case expansion is the process of replacing each codepoint of a larger structure with its CES.
A string of characters becomes a sequence of character classes, making it case-independent.
A character class becomes a possibly larger character class.
This document, and the package it describes, will undoubtedly contain bugs and other problems. Please report such at the Marpa/Tcl Tracker. Please report any ideas for enhancements you may have for either package and/or documentation as well.
aycock, basic multi-lingual plane, bmp, case expansion, cesu-8, character classes, class canonicalisation, class complement, class negation, class normalization, code point, context free grammar, document processing, earley, horspool, joop leo, lexing, libmarpa, mutf-8, nigel horspool, parsing, primary case, regex, smp, supplemental multi-lingual planes, surrogate, table parsing, unicode, utf-8
Copyright © 2015-present Andreas Kupries
Copyright © 2018-present Documentation, Andreas Kupries