How to count expected matches of a PHP regular expression

Counting expected matches of a PHP regular expression must consider 4 different ways of capturing groups, all possible at the same time: numbered, named, duplicate numbers and duplicate names.

Counting Numbered Groups

Numbered groups of a regular expression are introduced by the pattern ´(regex)´.

Subpatterns are delimited by parentheses (round brackets), which can be nested.

Opening parentheses are counted from left to right (starting from 1) to obtain numbers for the capturing subpatterns.

If an opening parenthesis is followed by a question mark and a colon, the subpattern does not do any capturing, and is not counted when computing the number of any subsequent capturing subpatterns.

Whitespace characters may never appear within special character sequences in a pattern, for example within the sequence (?( that introduces a conditional subpattern.

Example:

the ((?:red|white) (king|queen))

Of course escaped open parentheses are not to be counted. And open parentheses inside character classes (where escaping is automatic) are not to be counted either. To simplify, instead of one regular expression I use three, in such a specific order that the next regular expression takes advantage of the previous one.

{[ .numbered-groups | 1.hilite(=php=) ]}

Explicitly and implicitly escaped characters are erased all from the pattern, instead of only open parentheses, by replacing them with a % char. This is not a problem and actually reduces intricacies of the regular expression while preserving order and groups. Erasing explicitly escaped characters before implicitly escaped ones allows the pattern for finding the latter to ignore any escaped closing bracket.

Counting Named Groups

Named groups of a regular expression are introduced by the patterns ´(?<name>regex)´, ´(?’name’regex)´, or ´(?P<name>regex)´.

PCRE supports the use of named as well as numbered capturing parentheses. The names are just an additional way of identifying the parentheses, which still acquire numbers.

Example:

(?<date>(?<year>(dd)?dd)-(?<month>dd)-(?<day>dd))

This is quite simple.

{[ .named-groups | 1.hilite(=php=) ]}

Counting Groups with Duplicate Names

Duplicate names in a regular expression are introduced by the pattern ´(?J:regex)´ or ´(?J)regex´.

By default, a name must be unique within a pattern, but it is possible to relax this constraint by setting the PCRE_DUPNAMES option at compile time. (Duplicate names are also always permitted for subpatterns with the same number, set up as described in the previous section.)

Example:

(?J:(?<DN>Mon|Fri|Sun)(?:day)?|(?<DN>Tue)(?:sday)?|(?<DN>Wed)(?:nesday)?|(?<DN>Thu)(?:rsday)?|(?<DN>Sat)(?:urday)?)

Duplicate names are easy to spot by looping through the matches returned in the ´$named_groups´ array after matching the ´$find_named_groups´ pattern with the flags ´PREG_SET_ORDER | PREG_OFFSET_CAPTURE´.

{[ .duplicate-names | 1.hilite(=php=) ]}

Counting Groups with Duplicate Numbers

Duplicate numbers in a regular expression are introduced by the pattern ´(?|regex)´.

Perl 5.10 introduced a feature whereby each alternative in a subpattern uses the same numbers for its capturing parentheses.

Inside a ´(?|´ group, parentheses are numbered as usual, but the number is reset at the start of each branch. The numbers of any capturing parentheses that follow the subpattern start after the highest number used in any branch.

Example:

(?|(Sat)ur|(Sun))day

Groups with duplicate numbers (I’ll call them hellternations, for brevity) can also be nested and used together with any other available grouping. This is a real hell, mostly because of the branch reset.

For this reason, I now separate all previous groups counting from this one, into the following nice function. (I’ll call it CGIH, for brevity)

{[ .count-ignoring-hellternations | 1.hilite(=php=) ]}

If a pattern does not contain any hellternation, then CGIH will give the correct result. On the contrary, I’m going to apply it anyway, because I also need to count any groups outside of any hellternation.

In the general case, there will be non-hellternations as as well as hellternations, all of them sibling to each other at the same level. In that case I can apply CGIH to the current pattern, and later adjust that count for each hellternation. First I have to subtract the contribution (1) of the hellternation to the total, and then add the maximum count (2) of all of its alternatives. The number (1) is got by applying CGIH to the hellternation, and the number (2) by recursively applying all I said in this paragraph to each alternative.

{[ .count | 1.hilite(=php=) ]}

Utilities

The previous function is supported by the following utilities which both operate by counting balanced parentheses in a regular expression.

{[ .utilities | 1.hilite(=php=) ]}

Minimal Test

{[ .test-program | 1.hilite(=php=) ]}

type  = numbered
regex = the ((?:red|white) (king|queen))
count = 2

type  = named
regex = (?<date>(?<year>(dd)?dd)-(?<month>dd)-(?<day>dd))
count = 9

type  = duplicate names
regex = (?J:(?<DN>Mon|Fri|Sun)(?:day)?|(?<DN>Tue)(?:sday)?|(?<DN>Wed)(?:nesday)?|(?<DN>Thu)(?:rsday)?|(?<DN>Sat)(?:urday)?)
count = 6

type  = duplicate numbers
regex = (?|(Sat)ur|(Sun))day
count = 1

 

CategoriesUncategorizedTags

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.