marpa_unicode(n) 1 doc "Marpa/Tcl, a binding to libmarpa"

Name

marpa_unicode - Marpa/Tcl - marpa::unicode

Description

Welcome to Marpa/Tcl, a Tcl binding to the "libmarpa" parsing engine.

Please read the document Marpa/Tcl - Introduction to Marpa/Tcl, if you have not done so already. It provides an overview of the whole system.

Audience

This document describes a mainly internal package of Marpa/Tcl.

The package commands provide access to information about unicode characters that is useful to parsers and parser generators, e.g. case folding classes and named character classes. Note for reading: Unknown terms and shorthands are explained in the Glossary.

API

marpa::unicode::2asbr scr

This command takes a character class represented by an SCR and returns the equivalent ASBR representation.

An error will be thrown if the input is not a valid SCR.

marpa::unicode::2assr scr

This command takes a character class represented by an SCR and returns the equivalent ASSR representation.

An error will be thrown if the input is not a valid SCR.

marpa::unicode::2char codepoint

This command takes a possible unicode codepoint and returns a list of codepoints in the BMP representing it.

The returned list will contain only the argument itself if the codepoint is in the BMP. For a codepoint in the SMP however the list will contain the two surrogates representing that codepoint.

An error will be thrown if the argument is not a valid codepoint.

% marpa::unicode::2char 65
65
% marpa::unicode::2char [marpa::unicode::smp]
55296 56320
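
The split into surrogates follows the standard UTF-16 construction. The following is a minimal sketch in plain Tcl of what such a conversion can look like. The helper name to_bmp_units is hypothetical and not part of marpa::unicode, and the package's actual implementation may differ.

```tcl
# Hypothetical helper, NOT part of marpa::unicode.
# Returns the BMP codepoints representing the given codepoint:
# the codepoint itself, or its pair of surrogates.
proc to_bmp_units {codepoint} {
    if {![string is integer -strict $codepoint] ||
        $codepoint < 0 || $codepoint > 0x10FFFF} {
        error "invalid codepoint: $codepoint"
    }
    if {$codepoint <= 0xFFFF} {
        # Already in the BMP, nothing to split.
        return [list [expr {$codepoint + 0}]]
    }
    # Standard UTF-16 construction: 20 bits, 10 per surrogate.
    set offset [expr {$codepoint - 0x10000}]
    set high   [expr {0xD800 + ($offset >> 10)}]
    set low    [expr {0xDC00 + ($offset & 0x3FF)}]
    return [list $high $low]
}

puts [to_bmp_units 65]        ;# 65
puts [to_bmp_units 0x10000]   ;# 55296 56320
puts [to_bmp_units 0x10FFFF]  ;# 56319 57343
```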

marpa::unicode::2utf codepoint ?flags?

This command takes a possible unicode codepoint and returns a list of integers in the range [0...255] representing the UTF-8 encoding of that codepoint.

The exact details of this conversion are controlled by the list of optional flags given to the command. See the list below for the details.

An error will be thrown if the input value is not a valid codepoint.

(no flags)

The conversion returns the standard UTF-8 encoding for all codepoints. The returned list has a length between 1 and 4.

cesu

Codepoints in the SMP are internally converted to the two surrogates representing them, and the result is the concatenation of the surrogates' UTF-8 encodings. All other codepoints, i.e. those in the BMP, are converted normally. The returned list has a length between 1 and 3, or 6. This is called the CESU-8 encoding.

mutf

The codepoint 0 is converted as 0xC0 0x80. All other codepoints are converted normally. This is called the Modified UTF-8 encoding, often abbreviated to MUTF-8.

mutf cesu

Both MUTF-8 and CESU-8 are applied, as specified in the previous items. This is the encoding Tcl uses internally for its strings.

tcl

This is a shorthand for mutf cesu.

% marpa::unicode::2utf 0 ;# "\0"
0
% marpa::unicode::2utf 0 mutf
192 128
% marpa::unicode::2utf 65 ;# "A"
65
% marpa::unicode::2utf [marpa::unicode::bmp] ;# "\uFFFF"
239 191 191
% marpa::unicode::2utf [marpa::unicode::smp] ;# "\U0010000"
240 144 128 128
% marpa::unicode::2utf [marpa::unicode::smp] cesu
237 160 128 237 176 128
# 55296.... # 56320....
% marpa::unicode::2utf 55296 ;# "\uD800"
237 160 128
% marpa::unicode::2utf 56320 ;# "\uDC00"
237 176 128
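
The effect of the flags can be illustrated with a small stand-alone sketch in plain Tcl. The procedures utf8_bytes and encode are hypothetical names chosen for this illustration. They are not the package's implementation and perform no validation of their input.

```tcl
# Hypothetical sketch, NOT the marpa::unicode implementation.

# Plain UTF-8 encoding of one codepoint, as a list of byte values.
proc utf8_bytes {cp} {
    if {$cp < 0x80} {
        return [list [expr {$cp + 0}]]
    } elseif {$cp < 0x800} {
        return [list [expr {0xC0 | ($cp >> 6)}] \
                     [expr {0x80 | ($cp & 0x3F)}]]
    } elseif {$cp < 0x10000} {
        return [list [expr {0xE0 | ($cp >> 12)}] \
                     [expr {0x80 | (($cp >> 6) & 0x3F)}] \
                     [expr {0x80 | ($cp & 0x3F)}]]
    }
    return [list [expr {0xF0 | ($cp >> 18)}] \
                 [expr {0x80 | (($cp >> 12) & 0x3F)}] \
                 [expr {0x80 | (($cp >> 6) & 0x3F)}] \
                 [expr {0x80 | ($cp & 0x3F)}]]
}

# Encoding under the optional flags mutf, cesu, and tcl.
proc encode {cp {flags {}}} {
    if {"tcl" in $flags} { set flags {mutf cesu} }
    if {($cp == 0) && ("mutf" in $flags)} {
        # MUTF-8: NUL becomes an overlong two-byte sequence.
        return {192 128}
    }
    if {($cp > 0xFFFF) && ("cesu" in $flags)} {
        # CESU-8: encode both surrogates separately and concatenate.
        set off [expr {$cp - 0x10000}]
        return [concat \
            [utf8_bytes [expr {0xD800 + ($off >> 10)}]] \
            [utf8_bytes [expr {0xDC00 + ($off & 0x3FF)}]]]
    }
    return [utf8_bytes $cp]
}

puts [encode 0 mutf]         ;# 192 128
puts [encode 0x10000]        ;# 240 144 128 128
puts [encode 0x10000 cesu]   ;# 237 160 128 237 176 128
```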

marpa::unicode::asbr-format asbr ?compact?

This command takes a character class represented by an ASBR and returns a multi-line string containing a human-readable form of the same.

If the boolean argument compact is either not specified or false, one-element ranges are padded with spaces to vertically align the ranges across alternatives.

% marpa::unicode::asbr-format {
    {{192 192} {128 128}}
    {{1 16}} {{33 45}}
}
[c0]   [80]
|[01-10]
|[21-2d]

marpa::unicode::assr-format assr ?compact?

This command takes a character class represented by an ASSR and returns a multi-line string containing a human-readable form of the same.

If the boolean argument compact is either not specified or false, one-element ranges are padded with spaces to vertically align the ranges across alternatives.

% marpa::unicode::assr-format {
    {{0 65535}}
    {{55296 56319} {56320 57343}}
}
 [0000-ffff]
|[d800-dbff][dc00-dfff]

marpa::unicode::bmp

This command returns the highest codepoint still in the BMP.

marpa::unicode::data::fold/c codepoint

This command takes a unicode codepoint and returns its primary case.

An error will be thrown if the argument is not a valid codepoint.

% marpa::unicode::data::fold/c 97 ;# "a"
65                           ;# "A"
% marpa::unicode::data::fold/c 65 ;# "A"
65                           ;# "A"

marpa::unicode::data::fold codepoint

This command takes a unicode codepoint and returns its CES.

An error will be thrown if the argument is not a valid codepoint.

% marpa::unicode::data::fold 97 ;# "a"
65 97                      ;# "A" "a"
% marpa::unicode::data::fold 65 ;# "A"
65 97                      ;# "A" "a"

marpa::unicode::fold/c codes

This command takes a list of unicode codepoints and returns a list of their primary cases. The mapping from argument to result is 1:1.

Note that both argument and result are a limited form of SCR, i.e. one which does not contain codepoint ranges.

An error will be thrown if the argument contains invalid codepoints.

% marpa::unicode::fold/c {97 66} ;# "aB"
65 66                            ;# "AB"

marpa::unicode::max

This command returns the highest codepoint supported by Unicode.


marpa::unicode::negate-class scr ?smp?

This command takes a character class represented by an SCR and returns its normalized complement.

If the optional flag smp is specified and true, the argument is assumed to be fully in the SMP, and the resulting complement is limited to the SMP as well. The result of applying SMP mode to classes reaching into the BMP is undefined.

An error will be thrown if the input is not a valid SCR.

% marpa::unicode::negate-class {}
{0 1114111}
% marpa::unicode::negate-class {{0 65535}}
{65536 1114111}
% marpa::unicode::negate-class {{0 1114111}}
{}
% marpa::unicode::negate-class 0
{1 1114111}
% marpa::unicode::negate-class 1
0 {2 1114111}
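
The complement can be computed with a single pass over a normalized class, collecting the gaps between consecutive elements. The following plain Tcl sketch shows the idea. The names negate_scr and as_element are hypothetical, the input is assumed to be an already normalized SCR, and the smp flag is not modelled.

```tcl
# Hypothetical sketch, NOT the marpa::unicode implementation.

# A one-element range is written as a plain codepoint.
proc as_element {s e} {
    if {$s == $e} { return $s }
    return [list $s $e]
}

# Complement of a NORMALIZED SCR within 0 ... max.
proc negate_scr {scr {max 1114111}} {
    set result {}
    set next 0   ;# first codepoint not yet covered by the input
    foreach item $scr {
        if {[llength $item] == 1} {
            set s $item
            set e $item
        } else {
            lassign $item s e
        }
        if {$s > $next} {
            # Gap before this element belongs to the complement.
            lappend result [as_element $next [expr {$s - 1}]]
        }
        set next [expr {$e + 1}]
    }
    if {$next <= $max} {
        lappend result [as_element $next $max]
    }
    return $result
}

puts [negate_scr {}]            ;# {0 1114111}
puts [negate_scr {{0 65535}}]   ;# {65536 1114111}
puts [negate_scr 1]             ;# 0 {2 1114111}
```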

marpa::unicode::norm-class scr

This command takes a character class represented by an SCR and returns the equivalent normalized SCR. If the argument was already normalized the result will be identical to it.

An error will be thrown if the input is not a valid SCR.

% marpa::unicode::norm-class {}
% marpa::unicode::norm-class {1 2 3 4}
{1 4}
% marpa::unicode::norm-class {{10 20} {0 15}}
{0 20}
% marpa::unicode::norm-class {10 4 3 20 0}
0 {3 4} 10 20
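
Normalization amounts to sorting the elements by start point and merging everything that overlaps or touches. A minimal plain Tcl sketch of this, under the assumption that the input is a valid SCR (no validation is done), follows. The name norm_scr is hypothetical and not part of the package.

```tcl
# Hypothetical sketch, NOT the marpa::unicode implementation.
proc norm_scr {scr} {
    # Treat every element as a {start end} pair.
    set ranges {}
    foreach item $scr {
        if {[llength $item] == 1} {
            lappend ranges [list $item $item]
        } else {
            lappend ranges $item
        }
    }
    # Sort by start point, then merge overlapping or adjacent ranges.
    set merged {}
    foreach r [lsort -integer -index 0 $ranges] {
        lassign $r s e
        if {[llength $merged]} {
            lassign [lindex $merged end] ps pe
            if {$s <= $pe + 1} {
                lset merged end [list $ps [expr {max($pe, $e)}]]
                continue
            }
        }
        lappend merged [list $s $e]
    }
    # Write one-element ranges as plain codepoints again.
    set result {}
    foreach r $merged {
        lassign $r s e
        if {$s == $e} {
            lappend result $s
        } else {
            lappend result [list $s $e]
        }
    }
    return $result
}

puts [norm_scr {{10 20} {0 15}}]   ;# {0 20}
puts [norm_scr {10 4 3 20 0}]      ;# 0 {3 4} 10 20
```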

marpa::unicode::point character

This command takes a Tcl character and returns the integer value of its unicode codepoint. If a multi-character string is provided to the command the result will be the conversion of the first character in that string.

% marpa::unicode::point \0
0
% marpa::unicode::point A
65
% marpa::unicode::point Apple
65

marpa::unicode::smp

This command returns the first codepoint in the SMP.


marpa::unicode::unfold scr

This command takes a character class represented by an SCR and returns its case-expanded form. The result is normalized.

An error will be thrown if the argument is not a valid SCR.

% marpa::unicode::unfold {66 99}
{66 67} {98 99}

marpa::unicode::data::cc::have-tcl ccname

This command takes the possible name of a unicode character class and returns a boolean flag indicating if this class is directly supported by Tcl (true), or not (false).

% marpa::unicode::data::cc::have-tcl arabic
0
% marpa::unicode::data::cc::have-tcl xdigit
1

marpa::unicode::data::cc::have ccname

This command takes the possible name of a unicode character class and returns a boolean flag indicating if this is a known unicode character class (true), or not (false).

% marpa::unicode::data::cc::have arabic
1
% marpa::unicode::data::cc::have foo
0
% marpa::unicode::data::cc::have xdigit
1

marpa::unicode::data::cc::names

This command returns a list containing the names of the known unicode character classes.

Note that the package not only knows the standard named unicode character classes, but for any such class C also "C:bmp" and "C:smp", which are C intersected with (i.e. limited to) the BMP and SMP respectively.
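
Limiting a class to the BMP or SMP is an intersection with a single codepoint range. The plain Tcl sketch below illustrates this on a made-up class. The name clip_scr and the example class are hypothetical, and how the package actually derives the ":bmp" and ":smp" classes is not specified here.

```tcl
# Hypothetical sketch, NOT the marpa::unicode implementation.
# Intersects an SCR with the inclusive codepoint range lo ... hi.
proc clip_scr {scr lo hi} {
    set result {}
    foreach item $scr {
        if {[llength $item] == 1} {
            set s $item
            set e $item
        } else {
            lassign $item s e
        }
        # Clamp the element to the requested window.
        set s [expr {max($s, $lo)}]
        set e [expr {min($e, $hi)}]
        if {$s > $e} continue   ;# element lies fully outside
        if {$s == $e} {
            lappend result $s
        } else {
            lappend result [list $s $e]
        }
    }
    return $result
}

# Made-up example class reaching into both BMP and SMP.
set class {{48 57} {65296 65305} {120782 120831}}
puts [clip_scr $class 0 65535]         ;# {48 57} {65296 65305}
puts [clip_scr $class 65536 1114111]   ;# {120782 120831}
```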

marpa::unicode::data::cc::ranges ccname

This command takes the name of a unicode character class and returns the SCR representing that class. The result is normalized.

An error is thrown if the argument is not a known unicode character class.

As an extension the command further accepts names of the form %foo where foo is a known unicode character class. In these cases the result is the normalized SCR of the specified character class, after case expansion.

% marpa::unicode::data::cc::ranges adlam
{0x1E900 0x1E94A} {0x1E950 0x1E959} {0x1E95E 0x1E95F}

marpa::unicode::data::cc::tcl-names

This command returns a list containing the names of the character classes directly supported by Tcl itself (via string is).

Datastructures

Character Class - CC

Character classes (abbreviated CC) are simply sets of unicode codepoints.

The most trivial representation would be a list of these codepoints. Because such a list needs a lot of space for larger classes, this package uses a number of more compressed representations, from SCR (horizontal compression into ranges) to ASBR and ASSR (vertical compression of ranges in aligned byte/codepoint positions of the element representations).

Beyond the operations on such unnamed character classes, which are represented purely by values, this package also knows a number of named character classes, the definitions of which are extracted from the Unicode standard. These classes represent various categories of codepoints on the one hand, like alpha, blank, control, etc., and the known unicode scripts on the other, like arabic, braille, canadian_aboriginal, etc.

Set of Codepoint (Ranges) - SCR

A set of codepoint (ranges) (short for set of codepoints and codepoint ranges, abbreviated to SCR) is the main data structure used to represent unicode character classes of any kind, named or not.

It is a Tcl list whose elements are a mix of codepoints and codepoint ranges.

The codepoints are represented by integer numbers in the range [0...marpa::unicode::max]. Numbers outside of that range are not codepoints and a list containing such is not a valid SCR.

The ranges are represented by 2-element lists (pairs) of codepoints, the start and the end of the range, inclusive. Beyond both having to be valid codepoints, the start must not be greater than the end of the range. A pair violating this rule is not a valid range, and a list containing such a pair is not a valid SCR.

A normalized SCR is defined as an SCR which contains no duplicate elements and no overlapping or adjacent ranges, and whose elements are sorted in ascending integer order by their start point. Note: That the previous sentence talks only about ranges does not exclude the codepoints. For normalization and other purposes codepoints can simply be treated as ranges of size 1, where start and end points are identical.
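
The validity rules above can be written down as a small predicate. This is a sketch in plain Tcl under the assumption that the highest codepoint is 1114111, as shown in the negate-class examples. The names is_codepoint and is_scr are hypothetical, and the package itself throws errors instead of returning a flag.

```tcl
# Hypothetical sketch, NOT the marpa::unicode implementation.

# True for integers in the range 0 ... 1114111.
proc is_codepoint {x} {
    expr {[string is integer -strict $x] && $x >= 0 && $x <= 1114111}
}

# True if the value is a list of codepoints and {start end} pairs
# with start <= end. Normalization is NOT required for validity.
proc is_scr {value} {
    if {[catch {llength $value}]} { return 0 }   ;# not even a list
    foreach item $value {
        if {[catch {llength $item} n]} { return 0 }
        if {$n == 1} {
            if {![is_codepoint [lindex $item 0]]} { return 0 }
        } elseif {$n == 2} {
            lassign $item s e
            if {![is_codepoint $s] || ![is_codepoint $e] || $s > $e} {
                return 0
            }
        } else {
            return 0
        }
    }
    return 1
}

puts [is_scr {10 {3 4} 0}]   ;# 1, valid although not normalized
puts [is_scr {{4 3}}]        ;# 0, start greater than end
puts [is_scr {1114112}]      ;# 0, beyond the highest codepoint
```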

Alternatives of Sequences of Byte-Ranges - ASBR

A set of alternatives of sequences of byte-ranges (abbreviated to ASBR) is an alternative (sic!) and (usually) more compact representation of character classes.

This is the go-to representation of character classes for the byte-oriented engine provided by marpa::runtime::c, as it can be directly mapped to the grammar rules for such.

It makes use of the fact that the UTF-8 encoding maps each unicode codepoint to a sequence of bytes. The resulting "alternatives of byte sequences" are then compressed by merging adjacent alternatives that differ in a single position into a new alternative, in which the differing bytes become a single byte range (this is where "adjacent" is critical).

Such a compression can be done very efficiently (in time and space) by generating the alternatives in sorted order and then comparing the newly generated against the last processed.
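
The following plain Tcl sketch illustrates this style of compression. It is an illustration under assumptions, not the package's code: the names utf8, try_merge, and asbr are hypothetical, the input is a plain list of codepoints instead of an SCR, and after each merge the last alternative is compared against its predecessor again so that completed blocks can fold together.

```tcl
# Hypothetical sketch, NOT the marpa::unicode implementation.

# Plain UTF-8 encoding of one codepoint, as a list of byte values.
proc utf8 {cp} {
    if {$cp < 0x80} { return [list [expr {$cp + 0}]] }
    if {$cp < 0x800} {
        return [list [expr {0xC0 | ($cp >> 6)}] \
                     [expr {0x80 | ($cp & 0x3F)}]]
    }
    if {$cp < 0x10000} {
        return [list [expr {0xE0 | ($cp >> 12)}] \
                     [expr {0x80 | (($cp >> 6) & 0x3F)}] \
                     [expr {0x80 | ($cp & 0x3F)}]]
    }
    return [list [expr {0xF0 | ($cp >> 18)}] \
                 [expr {0x80 | (($cp >> 12) & 0x3F)}] \
                 [expr {0x80 | (($cp >> 6) & 0x3F)}] \
                 [expr {0x80 | ($cp & 0x3F)}]]
}

# Merge alternative b into a if both differ in exactly one position
# and the differing ranges are adjacent. Returns {} if not mergeable.
proc try_merge {a b} {
    if {[llength $a] != [llength $b]} { return {} }
    set diff -1
    for {set i 0} {$i < [llength $a]} {incr i} {
        if {[lindex $a $i] ne [lindex $b $i]} {
            if {$diff >= 0} { return {} }   ;# second difference
            set diff $i
        }
    }
    if {$diff < 0} { return $a }            ;# identical
    lassign [lindex $a $diff] as ae
    lassign [lindex $b $diff] bs be
    if {$bs != $ae + 1} { return {} }       ;# not adjacent
    return [lreplace $a $diff $diff [list $as $be]]
}

# Build the alternatives in sorted order, merging as we go.
proc asbr {codepoints} {
    set alts {}
    foreach cp [lsort -integer -unique $codepoints] {
        set new {}
        foreach byte [utf8 $cp] { lappend new [list $byte $byte] }
        lappend alts $new
        # Fold the last alternative into its predecessor while possible.
        while {[llength $alts] >= 2} {
            set m [try_merge [lindex $alts end-1] [lindex $alts end]]
            if {$m eq {}} break
            set alts [lreplace $alts end-1 end $m]
        }
    }
    return $alts
}

puts [asbr {1 2 3 4}]         ;# {{1 4}}
puts [asbr {65 66 192 193}]   ;# {{65 66}} {{195 195} {128 129}}
```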

Alternatives of Sequences of Surrogate-Ranges - ASSR

A set of alternatives of sequences of surrogate-ranges (abbreviated to ASSR) is an alternative (sic!) and (usually) more compact representation of character classes.

The basic principle of structure and generation is the same as for ASBR. In contrast however the elements of the alternatives are codepoints limited to the BMP, with SMP characters in the argument represented as pairs of surrogates before compression.

This is the go-to representation of character classes for the character-oriented engine provided by marpa::runtime::tcl, due to its Tcl-imposed limitation to the BMP, as it can be directly mapped to the grammar rules for such.
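
To give a feeling for the shape of an ASSR, the plain Tcl sketch below converts a single codepoint range into alternatives. Note that it computes the result directly from the surrogates of the range's end points, which is a different technique than the generate-and-merge approach described above. The names surrogates and assr_of_range are hypothetical and not part of the package.

```tcl
# Hypothetical sketch, NOT the marpa::unicode implementation.

# High and low surrogate of an SMP codepoint.
proc surrogates {cp} {
    set off [expr {$cp - 0x10000}]
    list [expr {0xD800 + ($off >> 10)}] [expr {0xDC00 + ($off & 0x3FF)}]
}

# ASSR-style alternatives for the inclusive codepoint range s ... e.
proc assr_of_range {s e} {
    set alts {}
    if {$s <= 0xFFFF} {
        # BMP part: one alternative with a single range.
        lappend alts [list [list $s [expr {min($e, 0xFFFF)}]]]
        set s 0x10000
    }
    if {$e < $s} { return $alts }
    lassign [surrogates $s] hs ls
    lassign [surrogates $e] he le
    if {$hs == $he} {
        lappend alts [list [list $hs $hs] [list $ls $le]]
        return $alts
    }
    if {$ls > 0xDC00} {
        # Partial block under the first high surrogate.
        lappend alts [list [list $hs $hs] [list $ls 57343]]
        incr hs
    }
    set tail {}
    if {$le < 0xDFFF} {
        # Partial block under the last high surrogate.
        set tail [list [list [list $he $he] [list 56320 $le]]]
        incr he -1
    }
    if {$hs <= $he} {
        # Complete blocks in between share the full low range.
        lappend alts [list [list $hs $he] {56320 57343}]
    }
    return [concat $alts $tail]
}

puts [assr_of_range 0 1114111]
# {{0 65535}} {{55296 56319} {56320 57343}}
puts [assr_of_range 65536 65537]
# {{55296 55296} {56320 56321}}
```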

Glossary

Unicode

The main references needed are https://en.wikipedia.org/wiki/Unicode and https://en.wikipedia.org/wiki/UTF-8.

BMP

The Basic Multi-lingual Plane of unicode.

The BMP runs from codepoint 0 to codepoint marpa::unicode::bmp, inclusive.

Characters in the BMP are encoded by UTF-8 into a sequence of one to three bytes.

SMP

The Supplemental Multi-lingual Planes of unicode. Starting just after the BMP they cover the remainder of unicode.

The SMP runs from codepoint marpa::unicode::smp to marpa::unicode::max, inclusive.

Characters in the SMP are encoded by UTF-8 into a sequence of 4 bytes. For systems limited to the BMP (like Tcl) such characters can also be encoded as a pair of BMP surrogate characters. This allows encoding them in six bytes of a pseudo UTF-8 encoding, called CESU-8. (See also the description of the flags for marpa::unicode::2utf).

CES

Shorthand for case-equivalent set.

case-equivalent set

Each unicode codepoint C is associated with a set of codepoints representing the different cases of the same character. These are the case-equivalent codepoints of C. C itself is of course a member of this set.

As an example, the CES of 97 is {65, 97}. In more human-readable terms, the character "a" has the variants "a" and "A", lower- and upper-case.

As a second example, codepoint 38, i.e. "&", is a codepoint whose CES contains only itself.

Note that a CES is a character class.

primary case

For each unicode codepoint, the smallest-numbered codepoint in its CES is called its primary case.

case expansion

This is the process of replacing each codepoint of a larger structure with its CES.

A string of characters becomes a sequence of character classes, making it case-independent.

A character class becomes a possibly larger character class.
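
The expansion of a string can be approximated in plain Tcl with the core case conversion commands. This is only a rough sketch: Tcl's string toupper and string tolower do not necessarily yield the full case-equivalent sets of the Unicode data the package uses, and the names ces_approx and expand_string are hypothetical.

```tcl
# Hypothetical sketch, NOT the marpa::unicode implementation.

# Approximate CES of a codepoint via Tcl's own case conversion.
proc ces_approx {cp} {
    set ch [format %c $cp]
    set codes {}
    foreach variant [list $ch [string toupper $ch] [string tolower $ch]] {
        # Ignore variants which are not a single character.
        if {[string length $variant] != 1} continue
        scan $variant %c code
        lappend codes $code
    }
    return [lsort -integer -unique $codes]
}

# Case expansion of a string: one character class per character.
proc expand_string {text} {
    set classes {}
    foreach ch [split $text {}] {
        scan $ch %c code
        lappend classes [ces_approx $code]
    }
    return $classes
}

puts [ces_approx 97]          ;# 65 97
puts [expand_string "a&B"]    ;# {65 97} 38 {66 98}
```

The first element of each class in the result is the smallest codepoint of the set, i.e. what the glossary calls the primary case.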

Bugs, Ideas, Feedback

This document, and the package it describes, will undoubtedly contain bugs and other problems. Please report such at the Marpa/Tcl Tracker. Please report any ideas for enhancements you may have for either package and/or documentation as well.

Keywords

aycock, basic multi-lingual plane, bmp, case expansion, cesu-8, character classes, class canonicalisation, class complement, class negation, class normalization, code point, context free grammar, document processing, earley, horspool, joop leo, lexing, libmarpa, mutf-8, nigel horspool, parsing, primary case, regex, smp, supplemental multi-lingual planes, surrogate, table parsing, unicode, utf-8