How to fix a preg_match bug

This is the second time I try to fix the preg_match bug with a number 50887. The first time I tried it, less than two months ago, my solution was buggy too, and for this reason I have unpublished it.

There are many different scenarios that I didn’t take into account, but almost all of them were related to How to count expected matches of a PHP regular expression. The rest were essentially the same: I thought that the preg_match bug was affecting just the last optional match. Later I found instead that it affects really all the matches at the end, any number of them, be they optional or not. Due to this little misunderstanding I had to rewrite a big portion of the function.

{[ .fix | 1.hilite(=php=) ]}

For the function ando_preg_count_groups(…) see the post How to count expected matches of a PHP regular expression.

Minimal Tests

The following tests are based on $regex = ‘(?J:(?Mon|Fri|Sun)(?:day)?|(?Tue)(?:sday)?|(?Wed)(?:nesday)?|(?Thu)(?:rsday)?|(?Sat)(?:urday)?)’;

.

Test 1a

{[ .test-1a-program | 1.hilite(=php=) ]}

In the next comparison we see the same result, and it really is, because
the count of expected matches equals the count of returned matches.
Array
(
    [0] => Saturday
    [DN] => Sat
    [1] => 
    [2] => 
    [3] => 
    [4] => 
    [5] => Sat
)
Array
(
    [0] => Saturday
    [DN] => Sat
    [1] => 
    [2] => 
    [3] => 
    [4] => 
    [5] => Sat
)
Test 1b

{[ .test-1b-program | 1.hilite(=php=) ]}

In the next comparison we see above that the bug applies to all the matches at the end,
and below that my fix correctly returns all expected matches, including the empty ones.
Array
(
    [0] => Tuesday
    [DN] => Tue
    [1] => 
    [2] => Tue
)
Array
(
    [0] => Tuesday
    [DN] => Tue
    [1] => 
    [2] => Tue
    [3] => 
    [4] => 
    [5] => 
)
Test 1c

{[ .test-1c-program | 1.hilite(=php=) ]}

In the next comparison we see above the bug at its extreme,
and below that my fix works as well as in the other cases.
Array
(
    [0] => 
)
Array
(
    [0] => 
    [DN] => 
    [1] => 
    [2] => 
    [3] => 
    [4] => 
    [5] => 
)

The following tests are based on $regex = ‘(?|Saturday|Sun(day)?)’;.

Test 2a

{[ .test-2a-program | 1.hilite(=php=) ]}

Array
(
    [0] => Sun
)
Array
(
    [0] => Sun
    [1] => 
)
Test 2b

{[ .test-2b-program | 1.hilite(=php=) ]}

Array
(
    [0] => Sunday
    [1] => day
)
Array
(
    [0] => Sunday
    [1] => day
)
Test 2c

{[ .test-2c-program | 1.hilite(=php=) ]}

Array
(
    [0] => Saturday
)
Array
(
    [0] => Saturday
    [1] => 
)

The following tests are based on $regex = ‘(?|(Sat)ur(day)|Sun(day)?)’;.

Test 3a

{[ .test-2a-program | 1.hilite(=php=) ]}

Array
(
    [0] => Sun
)
Array
(
    [0] => Sun
    [1] => 
    [2] => 
)
Test 3b

{[ .test-2b-program | 1.hilite(=php=) ]}

Array
(
    [0] => Sunday
    [1] => day
)
Array
(
    [0] => Sunday
    [1] => day
    [2] => 
)
Test 3c

{[ .test-2c-program | 1.hilite(=php=) ]}

Array
(
    [0] => Saturday
    [1] => Sat
    [2] => day
)
Array
(
    [0] => Saturday
    [1] => Sat
    [2] => day
)

How to count expected matches of a PHP regular expression

Counting expected matches of a PHP regular expression must consider 4 different ways of capturing groups, all possible at the same time: numbered, named, duplicate numbers and duplicate names.

Counting Numbered Groups

Numbered groups of a regular expression are introduced by the pattern ´(regex)´.

Subpatterns are delimited by parentheses (round brackets), which can be nested.

Opening parentheses are counted from left to right (starting from 1) to obtain numbers for the capturing subpatterns.

If an opening parenthesis is followed by a question mark and a colon, the subpattern does not do any capturing, and is not counted when computing the number of any subsequent capturing subpatterns.

Whitespace characters may never appear within special character sequences in a pattern, for example within the sequence (?( that introduces a conditional subpattern.

Example:

the ((?:red|white) (king|queen))

Of course escaped open parentheses are not to be counted. And open parentheses inside character classes (where escaping is automatic) are not to be counted either. To simplify, instead of one regular expression I use three, in such a specific order that the next regular expression takes advantage of the previous one.

{[ .numbered-groups | 1.hilite(=php=) ]}

Explicitly and implicitly escaped characters are erased all from the pattern, instead of only open parentheses, by replacing them with a % char. This is not a problem and actually reduces intricacies of the regular expression while preserving order and groups. Erasing explicitly escaped characters before implicitly escaped ones allows the pattern for finding the latter to ignore any escaped closing bracket.

Counting Named Groups

Named groups of a regular expression are introduced by the patterns ´(?<name>regex)´, ´(?’name’regex)´, or ´(?P<name>regex)´.

PCRE supports the use of named as well as numbered capturing parentheses. The names are just an additional way of identifying the parentheses, which still acquire numbers.

Example:

(?<date>(?<year>(dd)?dd)-(?<month>dd)-(?<day>dd))

This is quite simple.

{[ .named-groups | 1.hilite(=php=) ]}

Counting Groups with Duplicate Names

Duplicate names in a regular expression are introduced by the pattern ´(?J:regex)´ or ´(?J)regex´.

By default, a name must be unique within a pattern, but it is possible to relax this constraint by setting the PCRE_DUPNAMES option at compile time. (Duplicate names are also always permitted for subpatterns with the same number, set up as described in the previous section.)

Example:

(?J:(?<DN>Mon|Fri|Sun)(?:day)?|(?<DN>Tue)(?:sday)?|(?<DN>Wed)(?:nesday)?|(?<DN>Thu)(?:rsday)?|(?<DN>Sat)(?:urday)?)

Duplicate names are easy to spot by looping through the matches returned in the ´$named_groups´ array after matching the ´$find_named_groups´ pattern with the flags ´PREG_SET_ORDER | PREG_OFFSET_CAPTURE´.

{[ .duplicate-names | 1.hilite(=php=) ]}

Counting Groups with Duplicate Numbers

Duplicate numbers in a regular expression are introduced by the pattern ´(?|regex)´.

Perl 5.10 introduced a feature whereby each alternative in a subpattern uses the same numbers for its capturing parentheses.

Inside a ´(?|´ group, parentheses are numbered as usual, but the number is reset at the start of each branch. The numbers of any capturing parentheses that follow the subpattern start after the highest number used in any branch.

Example:

(?|(Sat)ur|(Sun))day

Groups with duplicate numbers (I’ll call them hellternations, for brevity) can also be nested and used together with any other available grouping. This is a real hell, mostly because of the branch reset.

For this reason, I now separate all previous groups counting from this one, into the following nice function. (I’ll call it CGIH, for brevity)

{[ .count-ignoring-hellternations | 1.hilite(=php=) ]}

If a pattern does not contain any hellternation, then CGIH will give the correct result. On the contrary, I’m going to apply it anyway, because I also need to count any groups outside of any hellternation.

In the general case, there will be non-hellternations as as well as hellternations, all of them sibling to each other at the same level. In that case I can apply CGIH to the current pattern, and later adjust that count for each hellternation. First I have to subtract the contribution (1) of the hellternation to the total, and then add the maximum count (2) of all of its alternatives. The number (1) is got by applying CGIH to the hellternation, and the number (2) by recursively applying all I said in this paragraph to each alternative.

{[ .count | 1.hilite(=php=) ]}

Utilities

The previous function is supported by the following utilities which both operate by counting balanced parentheses in a regular expression.

{[ .utilities | 1.hilite(=php=) ]}

Minimal Test

{[ .test-program | 1.hilite(=php=) ]}

type  = numbered
regex = the ((?:red|white) (king|queen))
count = 2

type  = named
regex = (?<date>(?<year>(dd)?dd)-(?<month>dd)-(?<day>dd))
count = 9

type  = duplicate names
regex = (?J:(?<DN>Mon|Fri|Sun)(?:day)?|(?<DN>Tue)(?:sday)?|(?<DN>Wed)(?:nesday)?|(?<DN>Thu)(?:rsday)?|(?<DN>Sat)(?:urday)?)
count = 6

type  = duplicate numbers
regex = (?|(Sat)ur|(Sun))day
count = 1