Number of submatches of a regular expression

Chili development started because I found a bug at the core of Code Highlighter, and wanted to fix it. The bug was inside the snippet used to count the number of submatches of a given regular expression.

That number is central to the working of a clever parsing engine, which matches one big expression against the target once, rather than matching many smaller expressions many times.

The big expression is built by alternating the smaller ones, so that each of them becomes a submatch of the big expression, like in this example: (A)|(B)|(C).

But that is just the tip of the iceberg, because each of the smaller expressions can in turn have many submatches, which add up to the total number of submatches returned in an array as the result of the big match.

If a match is found, then submatches[0], which holds the global match, is certainly not empty, and neither is submatches[x], where x is the index of the first smaller expression that matched.

In the above example, if the number of submatches of A, B, and C is always 0, then submatches[1] would be not empty if A matched, else submatches[2] would be not empty if B matched, else submatches[3] would be not empty if C matched.

But if the number of submatches of A was nA, of B nB, and of C nC, then the x for (A) would be 1, for (B) it would be (1+nA)+1, and for (C) it would be (1+nA)+(1+nB)+1.

Now we have all the info required for detecting the smaller expression that matched the target: look at submatches[1], then at submatches[2+nA], then at submatches[3+nA+nB]. The first of them that is not empty is the one that matched.
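As a rough sketch of the index arithmetic above: the rules, their names, and the helpers buildBigExpression and whichMatched are all made up for illustration, and the submatch counts are written by hand. Modern engines leave a group that did not take part in the match as undefined, so that is what the sketch tests for.

```javascript
// Hypothetical rules: a regex source plus its hand-counted number of submatches.
var rules = [
  { name: "A", source: "(\\d+)-(\\d+)", groups: 2 }, // nA = 2
  { name: "B", source: "[a-z]+",        groups: 0 }, // nB = 0
  { name: "C", source: "#(\\w+)",       groups: 1 }  // nC = 1
];

// Build (A)|(B)|(C) and store, for each rule, the index x of its wrapping group.
function buildBigExpression(rules) {
  var parts = [];
  var x = 1; // submatches[0] is the global match
  for (var i = 0; i < rules.length; i++) {
    parts.push("(" + rules[i].source + ")");
    rules[i].index = x;
    x += 1 + rules[i].groups; // skip the wrapper and its inner submatches
  }
  return new RegExp(parts.join("|"));
}

// Return the name of the first rule whose wrapping group took part in the match.
function whichMatched(big, rules, target) {
  var submatches = big.exec(target);
  if (!submatches) return null;
  for (var i = 0; i < rules.length; i++) {
    if (submatches[rules[i].index] !== undefined) return rules[i].name;
  }
  return null;
}

var big = buildBigExpression(rules);
console.log(rules[0].index, rules[1].index, rules[2].index); // 1, 2+nA, 3+nA+nB
console.log(whichMatched(big, rules, "hello"));
```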

The number of open parentheses is related to the number of submatches. However, it is not exactly that number, due to the following exceptions:

  • parentheses are also used for temporary grouping
  • parentheses can be escaped, and considered part of the target
  • the escaping device can be escaped

Instead of trying to count the open parentheses by means of only one expression that accounts for the above exceptions, I’ve found it’s cleaner to use three separate steps.

  1. re = X.replace( /\\./g, "%" )
    this removes any escaped character
  2. re = re.replace( /\[.*?\]/g, "%" )
    this removes any character class
  3. nX = ( re.match( /\((?!\?)/g ) || [] ).length
    this matches all the open parentheses not followed by a “?”

In particular:

  1. This step disables any escaped backslash or open parenthesis (as well as any other escaped character, but I don’t care). This way I’m done with the issue of escaping based on the use of the backslash sign. The X represents the regular expression under examination.
  2. This step disables any open parenthesis inside any character class (as well as any other character inside any character class, but I don’t care). In fact those open parentheses could have been written without escaping them, because they are escaped by default.
  3. This step is just the classical short definition of what describes a submatch in a regular expression. nX represents the number of submatches of X.
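The three steps can be wrapped into one small function. This is a minimal sketch: the name countSubmatches is made up here, and X is assumed to be the source of the regular expression as a string (for example someRegExp.source).

```javascript
// Count the submatches (capturing groups) of the regular expression source X.
function countSubmatches(X) {
  // 1. Disable every escaped character, including \\ and \(
  var re = X.replace(/\\./g, "%");
  // 2. Disable every character class, so a "(" inside [...] is not counted
  re = re.replace(/\[.*?\]/g, "%");
  // 3. Count the open parentheses that are not followed by "?"
  return (re.match(/\((?!\?)/g) || []).length;
}

console.log(countSubmatches("(a)|(b)"));   // two plain groups
console.log(countSubmatches("(?:a)(b)"));  // (?:a) is only a temporary grouping
console.log(countSubmatches("\\(a\\)"));   // escaped parentheses are literals
console.log(countSubmatches("[(](x)"));    // the "(" in the class is a literal
```

One caveat, given how step 3 is defined: newer engines also accept named groups written as (?<name>...), which do produce a submatch but start with "(?", so this count would miss them.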

A JavaScript CRC32

Here is a micro JavaScript library for computing the CRC32 of a string.

After importing the crc32.js file, a function with this signature

will be available in the global scope.

If you supply only the str (which is mandatory), it returns the CRC32 code of that string (using 0 as a start, which is a de facto standard), and if you also supply the crc, it will use that number as a start. This comes in handy for chaining crc32 calls, maybe in a loop, in fact:
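A minimal sketch of a function that behaves as described, not the contents of crc32.js. The name crc32, the table-driven approach with the common reflected polynomial 0xEDB88320, and the unsigned return value are assumptions made for this illustration. Only the low byte of each character code is used, so it is meant for plain ASCII strings.

```javascript
// Precompute the 256-entry lookup table for the polynomial 0xEDB88320.
function makeCrcTable() {
  var table = [];
  for (var n = 0; n < 256; n++) {
    var c = n;
    for (var k = 0; k < 8; k++) {
      c = (c & 1) ? (0xEDB88320 ^ (c >>> 1)) : (c >>> 1);
    }
    table[n] = c >>> 0;
  }
  return table;
}
var CRC_TABLE = makeCrcTable();

// str is mandatory; crc is the optional starting value (0 when omitted).
function crc32(str, crc) {
  crc = (crc || 0) ^ -1;
  for (var i = 0; i < str.length; i++) {
    crc = (crc >>> 8) ^ CRC_TABLE[(crc ^ str.charCodeAt(i)) & 0xFF];
  }
  return (crc ^ -1) >>> 0; // force an unsigned 32-bit result
}

// Chaining in a loop: feed each result back in as the next start.
var chunks = ["1234", "5", "6789"];
var running = 0;
for (var i = 0; i < chunks.length; i++) {
  running = crc32(chunks[i], running);
}
console.log(running === crc32("123456789")); // true
```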

A JavaScript heredoc

JavaScript lacks PHP’s concept of heredoc:

Another way to delimit strings is by using heredoc syntax (<<<). One should provide an identifier after <<<, then the string, and then the same identifier to close the quotation.

Here is a little function that can be used to emulate it. But it only works in IE (sigh). If you know how to do it also in Firefox, I’ll be very glad to update this post.

Really it should be called herescript, because fn (the mandatory argument) is a function declaration, so the text must be syntactically correct JavaScript code, not just plain text. This feature is very convenient when I want to defer the execution of some statements while having them highlighted by my editor, but it bothers a little when I want just a plain text.

To solve this, the function also accepts the optional arguments from and top. If I supply two numbers, the heredoc text will be all the lines between from included and top excluded. In this case I can supply also an optional separator to put in between lines. If I supply two strings, the heredoc text will be all the characters between from and top, both excluded.
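A rough sketch of how such a function could look, assuming it reads the source text through fn.toString(). The name hereDoc, the zero-based line numbering (line 0 is the line holding the function keyword), and the newline default for the separator are guesses made for this illustration. It relies on the engine returning the function source verbatim, comments included, which current engines do but older ones (and minified code) may not.

```javascript
// Hypothetical heredoc emulation built on Function.prototype.toString.
function hereDoc(fn, from, top, separator) {
  var source = fn.toString();
  if (typeof from === "number" && typeof top === "number") {
    // Lines between from (included) and top (excluded).
    var lines = source.split("\n").slice(from, top);
    return lines.join(separator === undefined ? "\n" : separator);
  }
  if (typeof from === "string" && typeof top === "string") {
    // Characters between the two substrings, both excluded.
    var start = source.indexOf(from);
    if (start < 0) return "";
    start += from.length;
    var end = source.indexOf(top, start);
    return end < 0 ? "" : source.slice(start, end);
  }
  // No delimiters: return the body of the function as script text.
  return source.slice(source.indexOf("{") + 1, source.lastIndexOf("}"));
}

// Plain text hidden in a comment, so it need not be valid JavaScript.
var myText = function () {/*
Hello
World
*/};

console.log(hereDoc(myText, 1, 3, " "));      // by line number
console.log(hereDoc(myText, "/*\n", "\n*/")); // by substrings
```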

myHereDoc1: when the content is a script

myHereDoc2: when the content is text, delimited by line number

myHereDoc3: when the content is text, delimited by substrings

To try all these examples together, we can write this:

to get this:

But cha-chaaa, here goes a bonus. Using a heredoc like this:

the same result is achieved, but the JavaScript code is much cleaner: