Number of submatches of a regular expression

Chili development started because I found a bug at the core of Code Highlighter, and wanted to fix it. The bug was inside the snippet used to count the number of submatches of a given regular expression.

That number is central to the working of a clever parsing engine, based upon the possibility of matching once a big expression against the target, rather than matching many times smaller expressions.

The big expression is built by alternating the smaller ones, so that each of them becomes a submatch of the big expression, like in this example: (A)|(B)|(C).

But that is just the tip of the iceberg, because each of the smaller expressions can in turn have many submatches, which add up to the total number of submatches returned in an array as the result of the big match.

If a match is found, then submatches[0], which holds the global match is certainly not empty, as not empty must also be submatches[x], being x the index of the first smaller expression that matched.

In the above example, if the number of submatches of A, B, and C be always 0, then submatches[1] would be not empty if A matched, else submatches[2] would be not empty if B matched, else submatches[3] would be not empty if C matched.

But if the number of submatches of A was nA, of B nB, and of C nC, then the x for (A) would be 1, for (B) it would be (1+nA)+1, and for C it would be (1+nA)+(1+nB)+1.

Now we have all the info required for detecting the smaller expression that matched the target by looking at submatches[1], else at submatches[2+nA], or else at submatches[3+nA+nB]. The first of them which is non empty is the one that matched.

The number of open parentheses is related to the number of submatches. However it is not exactly that number, due to the following exceptions

  • parentheses are also used for temporary grouping
  • parentheses can be escaped, and considered part of the target
  • the escaping device can be escaped

Instead of trying to count the open parentheses by means of only one expression that accounts for the above exceptions, I’ve found it’s cleaner to use three separate steps.

  1. re = X.replace( /\./g, “%” )
    this removes any escaped character
  2. re = re.replace( /[.*?]/g, “%” )
    this removes any character class
  3. nX = ( re.match( /((?!?)/g ) || [] ).length
    this matches all the open parentheses not followed by a “?”

In particular:

  1. This step disables any escaped backslash or open parenthesis (as well as any other escaped character, but I don’t care). This way I’m done with the issue of escaping based on the use of the backslash sign. The X represents the regular expression under examination
  2. This step disables any open parenthesis inside any character class (as well as any other character inside any character class, but I don’t care). In fact those open parenteses could have been written without escaping them, because they are escaped by default
  3. This step is just the classical short definition of what describes a submatch in a regular expression. nX represents the number of submatches of X

One Reply to “Number of submatches of a regular expression”

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.