For the past few days, I’ve been trying to fix the oldest WordPress bug still open (#2691), which is about HTML comments being improperly formatted by the `wpautop` function while it converts double newlines to paragraphs. Looking at the function’s code, it’s easy to understand why the bug was never fixed: the code mostly works, and it’s quite cryptic, running about 30 regex matches and replacements in a row. At some point in the bug’s history, someone provided a patch based on the PHP DOM classes. Its author didn’t seem too enthusiastic about it, and I must say his code was not much neater than the original.
So I wanted to give it a shot, as a programming challenge. After a couple of weeks, I’m still undefeated, but quite close to giving up. I’ve already created many auxiliary classes, which I started publishing here but which are now collected under my Ando project on GitHub. I’m currently trying to validate HTML with a parser and a specification. The parser, which I thought was going to be the most difficult part, is almost done. The specification, which I thought I could avoid writing, is taking a long time because the source is quite confusing (to me) and uses convoluted prose in many places.
One of those places is the concept of the transparent content model:
Some elements are described as transparent; they have “transparent” in the description of their content model. The content model of a transparent element is derived from the content model of its parent element: the elements required in the part of the content model that is “transparent” are the same elements as required in the part of the content model of the parent of the transparent element in which the transparent element finds itself.
[…]
In some cases, where transparent elements are nested in each other, the process has to be applied iteratively.
[…]
When a transparent element has no parent, then the part of its content model that is “transparent” must instead be treated as accepting any flow content.
That’s it, verbatim. Of course, after translating it to English, that would read:
Transparent elements are those which have the word “transparent” in at least one part of their content model. A transparent part allows the same elements allowed by the closest ancestor element whose containing part is non-transparent. If no such ancestor exists then a transparent part allows any flow content.
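The plain-English rule above can be sketched in a few lines of Python. This is only an illustration under loud assumptions: the `TRANSPARENT` set and the `governing_ancestor` function are hypothetical names, and every listed element is treated as wholly transparent, ignoring the elements where only a part of the content model is transparent.

```python
from typing import Optional, Sequence

# Hypothetical, simplified set: elements treated here as wholly transparent.
# Elements with only a transparent *part* are deliberately left out.
TRANSPARENT = {"a", "ins", "del", "map", "canvas"}


def governing_ancestor(chain: Sequence[str]) -> Optional[str]:
    """Return the closest non-transparent element in the chain.

    `chain` lists element names from the outermost ancestor down to the
    transparent element itself. A result of None means that no such
    ancestor exists, so the transparent part accepts any flow content.
    """
    for name in reversed(chain):
        if name not in TRANSPARENT:
            return name
    return None


print(governing_ancestor(["body", "p", "ins", "a"]))  # p
print(governing_ancestor(["ins", "a"]))               # None
```

In the first call the nested `ins` and `a` are both skipped, which is the iterative case the specification mentions; in the second there is no non-transparent ancestor at all.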
But the problems with transparent elements do not end there. The second problem is that there is no list of all transparent elements: I had to “stumble” upon them one by one. Anyway, here they are.
element | content model |
---|---|
a | Transparent, but there must be no interactive content descendant. |
ins | Transparent. |
del | Transparent. |
object | Zero or more param elements, then, transparent. |
video | If the element has a src attribute: zero or more track elements, then transparent, but with no media element descendants. If the element does not have a src attribute: zero or more source elements, then zero or more track elements, then transparent, but with no media element descendants. |
audio | If the element has a src attribute: zero or more track elements, then transparent, but with no media element descendants. If the element does not have a src attribute: zero or more source elements, then zero or more track elements, then transparent, but with no media element descendants. |
map | Transparent. |
noscript | When scripting is disabled, in a head element: in any order, zero or more link elements, zero or more style elements, and zero or more meta elements. When scripting is disabled, not in a head element: transparent, but there must be no noscript element descendants. Otherwise: text that conforms to the requirements given in the prose. |
canvas | Transparent, but with no interactive content descendants except for a elements, img elements with usemap attributes, button elements, input elements whose type attribute are in the Checkbox or Radio Button states, input elements that are buttons, select elements with a multiple attribute or a display size greater than 1, sorting interface th elements, and elements that would not be interactive content except for having the tabindex attribute specified. |
Those definitions are again verbatim: their cumbersome prose is evident.
The third problem with transparent elements is that, by definition, their content model is inherited. So, for example, the description of the content model of the `a` element tells us nothing about its allowed content, except that interactive content is excluded. It means that an `a` element can wrap almost anything, as is also stated at the end of its chapter:
The `a` element may be wrapped around entire paragraphs, lists, tables, and so forth, even entire sections, so long as there is no interactive content within (e.g. buttons or other links).
The thing to note here is the possibility implied by the words “may be”. The fact that an `a` element is wrapped around those other elements doesn’t mean the result is valid HTML. In fact, to determine whether such a structure is valid, one has to follow the containment chain from that element outward, up to the first non-transparent part. Quite a challenge.
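To make the “may be” concrete, here is a rough Python sketch of such a check. Everything in it is an assumption made for illustration: the `FLOW`, `PHRASING` and `CONTENT_MODEL` tables are a tiny hypothetical excerpt rather than the real content models, the function names are invented, and the “no interactive content descendant” restriction on `a` is not checked at all.

```python
# Hypothetical, tiny excerpt of content models; the real ones are far larger.
FLOW = {"p", "div", "ul", "table", "span", "em", "a", "ins", "del"}
PHRASING = {"span", "em", "a", "ins", "del"}
CONTENT_MODEL = {"body": FLOW, "div": FLOW, "p": PHRASING, "span": PHRASING}
TRANSPARENT = {"a", "ins", "del"}


def allowed_children(chain):
    """Child names allowed inside the innermost element of `chain`.

    `chain` is ordered from the outermost ancestor to the element itself.
    Transparent elements are skipped until a non-transparent one is found.
    """
    for name in reversed(chain):
        if name not in TRANSPARENT:
            return CONTENT_MODEL[name]  # only elements in the excerpt are known
    return FLOW  # no non-transparent ancestor: any flow content


def is_valid_child(chain, child):
    """True if `child` may appear directly inside the innermost element."""
    return child in allowed_children(chain)


print(is_valid_child(["body", "div", "a"], "p"))  # True
print(is_valid_child(["body", "p", "a"], "div"))  # False
```

The same `a` element wraps a block-level element in both calls, yet only the first structure passes, because the answer depends on the `div` or `p` sitting further up the chain.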