I wanted to parse the host of a URL with a regular expression to get its third-level domain: {[.pattern | 1.hilite(=php=)]}
Let’s test the general case with http://www.dnr.state.oh.us/ {[.example-1 | 1.hilite(=php=)]}
array(2) { [0]=> string(19) "www.dnr.state.oh.us" [1]=> string(5) "state" }
Pretty good. And now let’s test the edge case with http://google.com/ {[.example-2 | 1.hilite(=php=)]}
array(1) { [0]=> string(10) "google.com" }
WTF, where is my empty submatch? Since when is an optional submatch not a submatch just because it’s empty?
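A minimal way to reproduce the behavior, using a hypothetical pattern made up for illustration rather than the one above: an unmatched optional group in the middle comes back as an empty string, while an unmatched optional group at the very end is dropped altogether.

```php
<?php
// Hypothetical pattern for illustration: two optional groups.
$pattern = '/^(a)?(b)?$/';

// Group 1 does not participate, but group 2 does,
// so group 1 is reported as an empty string.
preg_match($pattern, 'b', $middle);

// Group 2 does not participate and no later group matches,
// so it is left out of the result entirely.
preg_match($pattern, 'a', $trailing);

var_dump(count($middle));   // int(3)
var_dump(count($trailing)); // int(2)
```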
I googled it and found that a bug has already been filed. It was resolved as “won’t fix”! They say it’s for backward compatibility, but I cannot imagine how fixing it would break any older code.
- If I expect 3 submatches from my pattern but get only 2, then I know (because of the bug) that the missing submatch is the last one and that it’s an empty string. So I add it to the submatches array myself. Would a programmer do anything different to work around this bug?
- If the bug were fixed globally, my old code would always get 3 submatches from that pattern. My individual fix would never be triggered, and since the last submatch would have the same value (an empty string) as the one my fix would have added, I wouldn’t have any issue, apart from a bit of stale, unused code.
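The per-call fix described in the two points above can be sketched roughly like this. The names `match_padded` and `$groupCount` are made up for the illustration, and the sketch only handles numbered groups, not named ones.

```php
<?php
// Hypothetical helper: run preg_match and pad the result so that it always
// holds the full match plus $groupCount submatches.
function match_padded(string $pattern, string $subject, int $groupCount): array
{
    if (preg_match($pattern, $subject, $matches) !== 1) {
        return []; // no match (or an error): nothing to pad
    }
    // Trailing unmatched groups are missing, so append empty strings for them.
    return array_pad($matches, $groupCount + 1, '');
}

// Hypothetical pattern with an optional trailing group.
$padded = match_padded('/^([a-z]+)(\d+)?$/', 'abc', 2);
var_dump($padded);
```

If the trailing groups are already present, `array_pad` returns the array unchanged, which is why such a fix would simply become dead code once the underlying behavior changed.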
To cleanly fix it myself once and for all, I’ve written a wrapper, ando_preg_match, that has the same signature and returns the expected results.
EDIT: There were some bugs in my own fix to the preg_match bug. For the code, please see the new post.
In the edge case I now get:
array(2) { [0]=> string(10) "google.com" [1]=> string(0) "" }
Unfortunately the wrapper is more complex than I’d like, but PHP allows regular expressions with named groups, and they require a lot of additional code. Anyway, I’ve been able to do it all in a single function that can easily be dropped into any project.
Here is a test with a pattern with named groups, just in case you were wondering what it looks like {[.example-3 | 1.hilite(=php=)]}
array(3) { [0]=> string(10) "google.com" ["subdomain"]=> string(0) "" [1]=> string(0) "" }
Actually, this last example lets me show that my wrapper really returns the expected result. In fact, just by adding a final non-empty group to the previous pattern, the original, buggy preg_match works just fine {[.example-4 | 1.hilite(=php=)]}
array(4) { [0]=> string(10) "google.com" ["subdomain"]=> string(0) "" [1]=> string(0) "" [2]=> string(10) "google.com" }
Of course you’ll get the same result using the wrapper.
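As a side note, on PHP 7.4 or later there seems to be a shorter route than a wrapper: with the `PREG_UNMATCHED_AS_NULL` flag, every unmatched group, trailing ones included, is reported as `null`. The sketch below relies on that assumption. It is not the ando_preg_match wrapper, and the helper name and pattern are hypothetical.

```php
<?php
// Assumes PHP 7.4+, where PREG_UNMATCHED_AS_NULL also covers trailing groups.
// Hypothetical helper name, not part of the wrapper described in this post.
function match_nulls_as_empty(string $pattern, string $subject): array
{
    if (preg_match($pattern, $subject, $matches, PREG_UNMATCHED_AS_NULL) !== 1) {
        return [];
    }
    // Turn null (unmatched) into '' while keeping numeric and named keys.
    return array_map(fn($value) => $value ?? '', $matches);
}

// Hypothetical pattern with an optional, named trailing group.
$result = match_nulls_as_empty('/^([a-z]+)(?<num>\d+)?$/', 'abc');
var_dump($result);
```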