Full UTF-8 support in WordPress

A few days ago I discovered that WordPress didn’t support full UTF-8 strings, whose characters are 1 to 4 bytes long. Instead it does support all unicodes belonging to the BMP, whose UTF-8 characters are 1 to 3 bytes long.

This WordPress defect is “caused by” MySQL 5, which only supports UTF-8 characters in the BMP. Apparently, MySQL 6 will be full UTF-8 compliant.

This morning, with the help of the UTF-8 class I recently developed, I made up a new WordPress plugin that adds full UTF-8 support to WordPress.

And this is the same sentence by Douglas Crockford, from the RFC4627 I cited in the previous post:

a string containing only the G clef character [𝄞] may be represented as “uD834uDD1E”

Windows users see a rectangle: it’s a Windows feature, but they should see the following thing

Imagen 1

You should note that the G clef above (not the one in the picture 😉 appears in the HTML not as an entity but as a common UTF-8 character, entered as is in the WordPress editor. You can see it for yourself by comparing the source code of this post (1) with that of the previous one (2).

Note that my plugin works for post and page content, title, excerpt, and also for searches, but it doesn’t cover custom fields (since version 2.0.0) any character written to and read from the database. For this reason Anyway, I’ve just opened a ticket about this issue in the WordPress Trac: please drop by and comment 🙂

What follows is the code of my Zend_Utf8 class, which I included in the plugin, after de-Zend-ifying all of it, for safe distribution in the wild.

{[ .Ando_Utf8 | 1.hilite(=php=) ]}

Escaping and unescaping UTF-8 characters in PHP

The Zend_Json_Encoder class implements three clearly different functionalities: encoding PHP values, encoding PHP classes, and escaping UTF-8 characters. And still in the last release-1.11.2 the UTF-8 escaping feature doesn’t take into account all possible UTF-8 characters: in fact it lacks any support for the so called extended unicode characters, with a code point between U+10000 and U+10FFFF. Douglas Crockford gives an example of how to escape extended characters in the RFC4627 by means of surrogate pairs:

a string containing only the G clef character [𝄞] may be represented as “uD834uDD1E”

The encodeUnicodeString() and its ancillary _utf82utf16() come directly from the Solar Framework. I think it’s fair to copy code from one open source project to another, but some insight is necessary for telling apart what is well made from what is not. In this case, encodeUnicodeString() supports up to 6 bytes characters and the _utf82utf16() supports up to 3 bytes characters, but UTF-8 characters are 1 to 4 bytes long!! And I can’t believe that those two functions destroy any character they cannot process!

Encoding PHP values to some other string format, like JSON, could require escaping UTF-8 characters. It respectively goes for decoding and unescaping. I think it’s sufficiently justified the existence of a class for basic UTF-8 support in the Zend Framework, so I made up Zend_Utf8, which I’m going to briefly introduce.

Zend_Utf8 exposes six static functions: two are the main functions for escaping and unescaping strings and four are the ancillary functions for mapping UTF-8 characters to unicode integers and the other way around. Usage of the ancillary functions is well documented by the main functions, so I’ll describe only usage of the latter.

Example

Here is a simple program that shows some simple usage

{[ .example | 1.hilite(=php,ln-1=) ]}

And this is the output

Please note:

  1. in the first case it says “escape DOESN’T WORK like in json_encode” because json_encode use to replace all the JSON supported special control characters with their respective counterparts, this time a tab control character is replaced by t
  2. in the third case it says “is NOT” because I’ve just found out that MySQL doesn’t support extended UTF-8 characters, so WordPress silently breaks down. In the excerpt from the RFC4627 I cited above I’ve been able to display it by means of the HTML entity [𝄞] but in the example I can’t use an entity; for this reason here is a screen capture of the output as it appears in the debug window of Zend Studio
  3. after the WordPress fix I’m going to develop for making it use a G clef seamlessly, I’ll introduce advanced usage, by modifying default options to some other interesting values, like how to escape for HTML. Meanwhile, please refer to the Zend Framework proposal

Class

{[ .zend_utf8 | 1.hilite(=php,ln-1=) ]}

References

UTF-8 and Unicode

Unicode

Detecting recursive dependencies in PHP composite values

A great deal of complexity in the Zend_Json_Encoder class (like having a static method that instantiates the hosting class) is due to the implementation of a recursive dependency check.

Composite values can contain parts that contain the whole. In general it’s a good feature for data, but it allows a program to exhaust processing power, by entering infinite recursion. A developer MUST avoid infinite recursion, and every developer knows it. So the recursive dependency check is a requirement of any trustworthy PHP to JSON encoder, and the Developer of the Zend_Json_Encoder class put a lot of effort into giving a satisfactory solution to the problem.

Issue 1: False positives

The implementation of the recursive dependency check goes like this: if an object is found twice while visiting an object then it’s considered a recursive dependency. Unfortunately that’s a necessary but not sufficient condition. In fact, as a reported bug made clear, it’s very easy (and useful) to craft a composite value with the same object twice, and no recursive dependency involved.

The bug fix was quite wacky. Firstly the Developer added a switch for turning on the check, and made the switch off by default. Secondly they added another switch for turning off the throw of an exception so that when a recursive dependency is detected a standard string is returned instead of encoding the recurring object again. The problem is that this mechanism still suffer the same issue: it cuts recursive dependencies as well as simple repetitions (false positives).

Nobody should ever turn on the recursive dependency check in the Zend_Json_Encoder class because the only reason for doing so is when a developer wants the encoding to be performed no matter what. This can be accomplished by turning the check on and exceptions off. But the presence of false positives makes it a bad choice anyway.

Issue 2: Arrays recur too

The implementation of the recursive dependency check takes into account objects, but arrays can recur too, and no check for arrays is available. This is like generating false expectations: a developer that turned on the check would expect all recurring dependencies being detected, even if some could be false positives, but this is not the case. The check on means only recurring/repeating objects will be found.

Observations

We’ll make now some little experiments, with recurring arrays and objects, and see how they get printed, serialized and json_encoded by PHP itself.

{[ .experiments | 1.hilite(=php,ln-1=) ]}

Comparing the output in the cases with a recursive dependency, we see that PHP by itself detects recursion (we trust PHP, so we knew it MUST avoid recursion):

  • print_r() shows a *RECURSION* label
  • json_encode() shows a Warning
  • serialize() shows a metadata element labeled R: for arrays and r: for objects

Comparing the output in the cases without a recursive dependency, we also see that PHP doesn’t detect recursion. Hmm, almost.

  • print_r() doesn’t show any *RECURSION* label
  • json_encode() doesn’t show any Warning
  • serialize() doesn’t show any metadata element labeled R: for arrays… BUT does show r: for objects

In fact, the last case is a simpler version of the example given for reproducing the bug of the Zend_Json_Encoder class: a composite value with the same object twice.

It seems that the serialize function of PHP is affected by the same bug. Or should we assume that the R/r is for repetition instead of recursion? Anyway, the serialize function cannot be trusted.

Solution

We’re going to exploit the fact that print_r() emits a *RECURSION* label. If we tried to match the label in the string returned by print_r(), and no one existed, then it would mean that there is no recursion. But if we got a match, that one could be a false positive if user data contained the *RECURSION* substring.

So we need a means for getting rid of any of those user data substrings that could pollute our matches. Well, luckily serialize doesn’t use a *RECURSION* label in its metadata, so any that could occur would be user data.

{[ .solution | 1.hilite(=php,ln-1=) ]}

This solution has many advantages:

  1. is trustworthy (PHP does the job)
  2. works for objects as well as arrays
  3. is short and simple to understand
  4. can be applied before walking a value

And very few disadvantages:

  1. doesn’t allow to spot where the recursion occurs
  2. could be slow for complex structures