This module provides an alternative regular expression facility which, unlike Tcl's built-in one, does not convert strings to Unicode in order to match. I implemented this matcher as part of my Tcl Unicode reorganization. NOTE: see section "Restrictions" below.
The best source that I've come across for writing regular expressions matchers is Russ Cox's fantastic series of articles. The implementation here closely follows his guides.
One point worth pointing out that Cox only alludes to in his articles is the recursive behavior of addthread (called "follow" in my code). He notes that, in order to schedule threads in the correct order for the next round to prioritize SPLIT's first destination, addthread should process jumps recursively and add all immediately reachable threads eagerly.
In practice, if you read his code you discover that addthread also processes SAVE. This is because a jump might occur after the SAVE, and if so, that jump destination should also be scheduled immediately.
I've generalized this notion to make a clearer (to me) split between the "execute" and "addthread" (renamed "follow") loops. Now follow will always advance as far as possible, even handling MATCH instructions, until it becomes clear that it needs to read another character from the matched string, at which point it cedes control back to execute.
This leaves only three instructions for the execute loop itself: CHR, ANY, and BRACKET (the last one not implemented in Cox's sample code). Before reading the first character, and after that, after reading every single character, it calls follow to advance the match as far as possible without reading input.
Another difference is that execute also informs follow whether it has reached the end of the input. This way follow can process the END instruction, and also drop a thread when no match has been found rather than return to execute, upon end of input.
This division emphasizes the principle that the threads run in lockstep in regards to consuming characters; while looping through the thread list, the only decision to make is whether to continue the thread or drop, given the current character. All logic to process individual threads beyond that is handled by follow.
The invariant that every thread scheduled will be ready to consume a character also simplifies the code a little.
This regex implementation is split into two parts. The front end, in Tcl, parses (using combinators) and compiles regexes to a compact (especially compared to Cox's sample code) bytecode representation. The backend, written in C, only executes bytecode.
The compiler procedure is memoized so compilation only occurs once for each static occurence of a regular expression.
Before execution, the C engine will verify that the bytecode is safe to run—for eample, that it won't cause access to out-of-bounds memory. After validation the bytecode along with some summary information useful for execution (number of instructions, maximum slot value, etc.) will be stored in the internal representation of the object.
Whereas Cox's compiler directly outputs code in one pass, patching jump addresses as soon as they become available, this regex first converts the AST to an intermediate representation silimar to continuation-passing style. The regex compiler in Tcl then performs a couple of simple optimizations. First, jumps to jumps are eliminated. Then, it will try to combine basic blocks into "traces"—sorting basic blocks in such a way that a jump is immediately followed by its label so that the jump can be deleted.
Another (tiny) optimization is that, for (re)*, rather than generating a SPLIT instruction and then looping back to the first SPLIT, we duplicate the SPLIT.
This is only a proof-of-concept implementation, not production quality. The following features are missing:
In addition, this regex will currently not run with the default 16-bit Tcl_UniChar (see here for an explanation), though it would be easy to port.
C source: regex.c.txt, Tcl source: regex.tcl.txt. You'll also need util.tcl.txtand tcl_parser.tcl.txt