Escaping and unescaping UTF-8 characters in PHP

The Zend_Json_Encoder class implements three clearly distinct features: encoding PHP values, encoding PHP classes, and escaping UTF-8 characters. Yet even in the latest release, 1.11.2, the UTF-8 escaping feature doesn’t cover all possible UTF-8 characters: it lacks any support for the so-called extended Unicode characters, those with a code point between U+10000 and U+10FFFF. In RFC 4627, Douglas Crockford gives an example of how to escape extended characters by means of surrogate pairs:

a string containing only the G clef character [𝄞] may be represented as “\uD834\uDD1E”
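To make the arithmetic behind that pair concrete, here is a minimal sketch of how a code point above U+FFFF can be split into two \uXXXX escapes. The function name surrogate_pair_escape() is made up for this illustration; it is not part of Zend_Json_Encoder or Zend_Utf8.

```php
<?php
// Hypothetical helper, for illustration only.
function surrogate_pair_escape($codePoint)
{
    if ($codePoint < 0x10000 || $codePoint > 0x10FFFF) {
        throw new InvalidArgumentException('Not an extended code point');
    }
    $offset = $codePoint - 0x10000;      // 20 significant bits remain
    $high = 0xD800 | ($offset >> 10);    // top 10 bits go into the high surrogate
    $low  = 0xDC00 | ($offset & 0x3FF);  // bottom 10 bits go into the low surrogate
    return sprintf('\\u%04X\\u%04X', $high, $low);
}

echo surrogate_pair_escape(0x1D11E), "\n"; // \uD834\uDD1E
```

For the G clef, U+1D11E minus 0x10000 leaves 0xD11E, whose top 10 bits are 0x34 and bottom 10 bits are 0x11E, hence D834 and DD1E.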

The encodeUnicodeString() method and its ancillary _utf82utf16() come directly from the Solar Framework. I think it’s fair to copy code from one open source project to another, but some insight is needed to tell what is well made from what is not. In this case, encodeUnicodeString() supports characters of up to 6 bytes while _utf82utf16() supports characters of up to 3 bytes, but UTF-8 characters are 1 to 4 bytes long! And I can’t believe that those two functions destroy any character they cannot process!

Encoding PHP values to some other string format, like JSON, may require escaping UTF-8 characters. The same goes for decoding and unescaping. I think that justifies a class for basic UTF-8 support in the Zend Framework, so I wrote Zend_Utf8, which I’m going to introduce briefly.

Zend_Utf8 exposes six static functions: two are the main functions for escaping and unescaping strings, and four are ancillary functions for mapping UTF-8 characters to Unicode integers and the other way around. The main functions show how the ancillary ones are used, so I’ll describe only the main functions.
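To give an idea of what mapping a UTF-8 character to a Unicode integer and back involves, here is a minimal sketch. The names codepoint_to_utf8() and utf8_to_codepoint() are assumptions made for this example, not the Zend_Utf8 API, and the sketch does no validation beyond the byte count.

```php
<?php
// Hypothetical helpers; they assume well-formed input and skip validation.
function codepoint_to_utf8($cp)
{
    if ($cp < 0x80) {
        return chr($cp);
    }
    if ($cp < 0x800) {
        return chr(0xC0 | ($cp >> 6)) . chr(0x80 | ($cp & 0x3F));
    }
    if ($cp < 0x10000) {
        return chr(0xE0 | ($cp >> 12))
             . chr(0x80 | (($cp >> 6) & 0x3F))
             . chr(0x80 | ($cp & 0x3F));
    }
    return chr(0xF0 | ($cp >> 18))
         . chr(0x80 | (($cp >> 12) & 0x3F))
         . chr(0x80 | (($cp >> 6) & 0x3F))
         . chr(0x80 | ($cp & 0x3F));
}

function utf8_to_codepoint($char)
{
    $b = array_values(unpack('C*', $char)); // bytes as integers, 0-indexed
    switch (count($b)) {
        case 1:
            return $b[0];
        case 2:
            return (($b[0] & 0x1F) << 6) | ($b[1] & 0x3F);
        case 3:
            return (($b[0] & 0x0F) << 12) | (($b[1] & 0x3F) << 6) | ($b[2] & 0x3F);
        case 4:
            return (($b[0] & 0x07) << 18) | (($b[1] & 0x3F) << 12)
                 | (($b[2] & 0x3F) << 6) | ($b[3] & 0x3F);
    }
    throw new InvalidArgumentException('A UTF-8 character is 1 to 4 bytes long');
}

echo bin2hex(codepoint_to_utf8(0x1D11E)), "\n";           // f09d849e
printf("U+%04X\n", utf8_to_codepoint("\xE6\xB0\xB4"));    // U+6C34
```

The four branches mirror the 1 to 4 byte lengths of UTF-8; anything longer is rejected instead of being silently dropped.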

Example

Here is a simple program that shows basic usage:

{[ .example | 1.hilite(=php,ln-1=) ]}

And this is the output

The white space inside brackets [	] is a common tab.
The white space inside brackets [\u0009] is a common tab.
-- escaped and unescaped as expected
-- escape DOESN'T WORK like in json_encode
-- unescape works like in json_decode

The kanji inside brackets [水] is read mizu and means water in Japanese.
The kanji inside brackets [\u6c34] is read mizu and means water in Japanese.
-- escaped and unescaped as expected
-- escape works like in json_encode
-- unescape works like in json_decode

The symbol inside brackets [] is NOT a G clef.
The symbol inside brackets [] is NOT a G clef.
-- escaped and unescaped as expected
-- escape works like in json_encode
-- unescape works like in json_decode

Please note:

  1. in the first case it says “escape DOESN’T WORK like in json_encode” because json_encode replaces every control character that has a JSON short form with that short form; here a tab is replaced by \t rather than \u0009
  2. in the third case it says “is NOT” because I’ve just found out that MySQL doesn’t support extended UTF-8 characters, so WordPress silently breaks. In the excerpt from RFC 4627 cited above I was able to display the G clef by means of the HTML entity [𝄞], but in the example I can’t use an entity; for this reason here is a screen capture of the output as it appears in the debug window of Zend Studio
  3. once I’ve developed the WordPress fix that lets it handle a G clef seamlessly, I’ll introduce advanced usage, changing the default options to other interesting values, such as escaping for HTML. Meanwhile, please refer to the Zend Framework proposal
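The json_encode and json_decode behavior referred to in the first note can be checked with PHP’s built-in functions alone. This is only a quick sketch of that comparison; it doesn’t use Zend_Utf8 at all.

```php
<?php
// json_encode() prefers the short form \t for a tab...
var_dump(json_encode("[\t]") === '"[\\t]"');                   // bool(true)

// ...while json_decode() accepts both spellings of the same character.
var_dump(json_decode('"[\\u0009]"') === "[\t]");               // bool(true)
var_dump(json_decode('"[\\t]"') === "[\t]");                   // bool(true)

// By default, non-ASCII characters are escaped as lowercase \uXXXX.
var_dump(json_encode("\xE6\xB0\xB4") === '"\\u6c34"');         // bool(true), U+6C34

// Extended characters come out as a surrogate pair.
var_dump(json_encode("\xF0\x9D\x84\x9E") === '"\\ud834\\udd1e"'); // bool(true), U+1D11E
```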

Class

{[ .zend_utf8 | 1.hilite(=php,ln-1=) ]}


Detecting recursive dependencies in PHP composite values

A great deal of complexity in the Zend_Json_Encoder class (like having a static method that instantiates the hosting class) is due to the implementation of a recursive dependency check.

Composite values can contain parts that contain the whole. In general that’s a useful feature for data, but it lets a program exhaust its resources by entering infinite recursion. A developer MUST avoid infinite recursion, and every developer knows it. So a recursive dependency check is a requirement of any trustworthy PHP-to-JSON encoder, and the developer of the Zend_Json_Encoder class put a lot of effort into solving the problem.

Issue 1: False positives

The implementation of the recursive dependency check goes like this: if an object is found twice while visiting an object then it’s considered a recursive dependency. Unfortunately that’s a necessary but not sufficient condition. In fact, as a reported bug made clear, it’s very easy (and useful) to craft a composite value with the same object twice, and no recursive dependency involved.
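The difference between “found twice” and “actually recursive” can be sketched as follows. Both functions are hypothetical and written only for this comparison; neither is the Zend_Json_Encoder code. They track objects only, so a recursive array would make them loop forever.

```php
<?php
// Hypothetical check in the "found twice" style: $seen is shared by reference,
// so an object met in one branch is remembered in every other branch.
function seen_twice($value, array &$seen = array())
{
    if (is_object($value)) {
        $id = spl_object_hash($value);
        if (isset($seen[$id])) {
            return true;
        }
        $seen[$id] = true;
        $value = get_object_vars($value);
    }
    if (is_array($value)) {
        foreach ($value as $item) {
            if (seen_twice($item, $seen)) {
                return true;
            }
        }
    }
    return false;
}

// Hypothetical stricter check: $ancestors is passed by value, so each branch
// only knows the objects on its own path from the root.
function on_own_path($value, array $ancestors = array())
{
    if (is_object($value)) {
        $id = spl_object_hash($value);
        if (isset($ancestors[$id])) {
            return true;
        }
        $ancestors[$id] = true;
        $value = get_object_vars($value);
    }
    if (is_array($value)) {
        foreach ($value as $item) {
            if (on_own_path($item, $ancestors)) {
                return true;
            }
        }
    }
    return false;
}

$shared = new stdClass();
$composite = new stdClass();
$composite->member1 = $shared;   // same object twice,
$composite->member2 = $shared;   // but no cycle

var_dump(seen_twice($composite));  // bool(true): a false positive
var_dump(on_own_path($composite)); // bool(false)
```

An object that contains itself is reported by both functions; only the repeated, non-recursive case tells them apart.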

The bug fix was quite wacky. First, the developer added a switch for turning the check on, and made it off by default. Second, they added another switch for suppressing the exception, so that when a recursive dependency is detected a standard string is returned instead of encoding the recurring object again. The problem is that this mechanism still suffers from the same issue: it cuts off simple repetitions (false positives) as well as recursive dependencies.

Nobody should ever turn on the recursive dependency check in the Zend_Json_Encoder class. The only reason for doing so is wanting the encoding to be performed no matter what, which can be accomplished by turning the check on and exceptions off. But the false positives make even that a bad choice.

Issue 2: Arrays recur too

The implementation of the recursive dependency check takes objects into account, but arrays can recur too, and no check for arrays is available. This creates false expectations: a developer who turned the check on would expect all recursive dependencies to be detected, even if some were false positives, but this is not the case. With the check on, only recurring or repeated objects are found.
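For readers who have never built one, this is one way an array can end up containing itself, and what PHP’s own functions make of it. It is a small sketch; the json_encode() result shown assumes PHP 5.5 or later, where the function returns false on error (older versions emit a warning and partial output instead).

```php
<?php
$array = array();
$array[0] = &$array;   // element 0 is a reference back to the array itself

// print_r() marks the point where it stops descending.
$printed = print_r($array, true);
var_dump(strpos($printed, '*RECURSION*') !== false); // bool(true)

// json_encode() refuses to encode the value (the @ hides the warning of old versions).
$json = @json_encode($array);
var_dump($json); // bool(false)
```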

Observations

Let’s now run a few small experiments with recurring arrays and objects, and see how PHP itself prints, serializes and json_encodes them.

{[ .experiments | 1.hilite(=php,ln-1=) ]}

-------------------- $array1 with recursive dependency --------------------

Array
(
    [0] => Array
        (
            [0] => Array
                (
                    [0] => Array
 *RECURSION*
                )

        )

)

serialized -> a:1:{i:0;a:1:{i:0;a:1:{i:0;R:2;}}}
json_encoded -> 
Warning: json_encode(): recursion detected in /Users/aercolino/...
[[[null]]]
---

-------------------- $object1 with recursive dependency --------------------

stdClass Object
(
    [member1] => stdClass Object
        (
            [member2] => stdClass Object
 *RECURSION*
        )

)

serialized -> O:8:"stdClass":1:{s:7:"member1";O:8:"stdClass":1:{s:7:"member2";r:1;}}
json_encoded -> 
Warning: json_encode(): recursion detected in /Users/aercolino/...
{"member1":{"member2":{"member1":null}}}
---

-------------------- $array1 without recursive dependency --------------------

Array
(
    [0] => Array
        (
        )

)

serialized -> a:1:{i:0;a:0:{}}
json_encoded -> [[]]
---

-------------------- $object1 without recursive dependency --------------------

stdClass Object
(
    [member1] => stdClass Object
        (
        )

    [member2] => stdClass Object
        (
        )

)

serialized -> O:8:"stdClass":2:{s:7:"member1";O:8:"stdClass":0:{}s:7:"member2";r:2;}
json_encoded -> {"member1":{},"member2":{}}
---

Comparing the output in the cases with a recursive dependency, we see that PHP by itself detects recursion (we trust PHP, so we expected it to avoid infinite recursion):

  • print_r() shows a *RECURSION* label
  • json_encode() shows a Warning
  • serialize() shows a metadata element labeled R: for arrays and r: for objects

Comparing the output in the cases without a recursive dependency, we also see that PHP doesn’t detect recursion. Hmm, almost.

  • print_r() doesn’t show any *RECURSION* label
  • json_encode() doesn’t show any Warning
  • serialize() doesn’t show any metadata element labeled R: for arrays… BUT does show r: for objects

In fact, the last case is a simpler version of the example given for reproducing the bug of the Zend_Json_Encoder class: a composite value with the same object twice.

It seems that PHP’s serialize() is affected by the same bug. Or should we assume that the R/r stands for repetition instead of recursion? Either way, serialize() cannot be trusted to tell recursion from repetition.
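The “repetition” reading is easy to probe. The sketch below serializes a value holding the same object twice and then checks what unserialize() makes of the r: marker; the variable names are mine.

```php
<?php
$shared = new stdClass();
$object = new stdClass();
$object->member1 = $shared;
$object->member2 = $shared;   // same instance, no recursion

$serialized = serialize($object);
echo $serialized, "\n";
// O:8:"stdClass":2:{s:7:"member1";O:8:"stdClass":0:{}s:7:"member2";r:2;}

// unserialize() uses r:2 to point member2 at the object already rebuilt for member1.
$restored = unserialize($serialized);
var_dump($restored->member1 === $restored->member2); // bool(true)
```

So r: records that an object already written is being referred to again, which happens for plain repetitions and for cycles alike.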

Solution

We’re going to exploit the fact that print_r() emits a *RECURSION* label. If we searched for the label in the string returned by print_r() and found none, there would be no recursion. But if we got a match, it could be a false positive whenever user data contained the *RECURSION* substring.

So we need a means of getting rid of any user data substrings that could pollute our matches. Luckily serialize() doesn’t use a *RECURSION* label in its metadata, so any occurrence in its output must be user data.
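One way to put the two observations together is sketched below. It is my own reading of the idea, under a hypothetical name, and it assumes the value can be serialized and unserialized without side effects (no closures, no resources, no surprising __wakeup() methods).

```php
<?php
// Hypothetical name; a sketch of the idea, not a definitive implementation.
function has_recursive_dependency($value)
{
    $label = '*RECURSION*';

    // No label at all in the print_r() output: certainly no recursion.
    if (strpos(print_r($value, true), $label) === false) {
        return false;
    }

    // serialize() never emits the label as metadata, so every occurrence in
    // its output is user data. Swap it for a stand-in of the same length, so
    // that the s:N:"..." byte counts stay valid, then rebuild the value.
    $neutral = str_replace($label, '#RECURSION#', serialize($value));
    $clean = unserialize($neutral);

    // Any label still printed now comes from PHP itself.
    return strpos(print_r($clean, true), $label) !== false;
}

$array = array('*RECURSION*');          // user data that looks like the label
var_dump(has_recursive_dependency($array)); // bool(false)

$array[1] = &$array;                    // now the array really contains itself
var_dump(has_recursive_dependency($array)); // bool(true)
```

Rebuilding the value costs a second print_r() pass, but it only happens when the first pass finds the label.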

{[ .solution | 1.hilite(=php,ln-1=) ]}

This solution has many advantages:

  1. is trustworthy (PHP does the job)
  2. works for objects as well as arrays
  3. is short and simple to understand
  4. can be applied before walking a value

And very few disadvantages:

  1. doesn’t show where the recursion occurs
  2. could be slow for complex structures

Translating a string from PHP to JSON

Based on my understanding of this subject, I’ve come up with the following function for translating a string from PHP to JSON, strictly conforming to the RFC4627.
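As a rough illustration of what strict conformance asks for: RFC 4627 requires escaping only the quotation mark, the reverse solidus and the control characters U+0000 through U+001F, and allows every other Unicode character to appear as is. The sketch below does exactly that under a hypothetical name; it assumes the input is already valid UTF-8 and makes no attempt to verify it.

```php
<?php
// Hypothetical name; a sketch of RFC 4627 string escaping.
function json_string_sketch($string)
{
    // Short forms that RFC 4627 defines for some of the mandatory escapes.
    $short = array(
        '"'    => '\\"',
        '\\'   => '\\\\',
        "\x08" => '\\b',
        "\x0C" => '\\f',
        "\n"   => '\\n',
        "\r"   => '\\r',
        "\t"   => '\\t',
    );
    // Match a quotation mark, a backslash, or any byte from 0x00 to 0x1F.
    // None of these bytes can occur inside a multibyte UTF-8 sequence.
    $escaped = preg_replace_callback(
        '/["\\\\\x00-\x1F]/',
        function ($match) use ($short) {
            $char = $match[0];
            return isset($short[$char])
                ? $short[$char]
                : sprintf('\\u%04x', ord($char)); // other controls, e.g. \u0000
        },
        $string
    );
    return '"' . $escaped . '"';
}

echo json_string_sketch("a null: \0; a new line: \n;"), "\n";
// "a null: \u0000; a new line: \n;"
```

Leaving non-ASCII characters untouched keeps the output shorter and readable, at the price of requiring a transport that is safe for UTF-8.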

{[ .json_string | 1.hilite(=php,ln-1=) ]}

A simple test like this
{[ .test | 1.hilite(=php,ln-1=) ]}

yields (in comparison to the _encodeString method of the Zend_Json_Encoder class of Zend Framework)

Zend_Json_Encoder::_encodeString: Array
(
    [0] => "a null: ; a new line: \n; a carriage return: \r;"
    [1] => "a js regex: /(["'])\w+\1/"
    [2] => "a script element: <script type="test/javascript" src="http://example.com/all.js"></script>"
    [3] => "a japanese word: \u307f\u305a"
)
json_string: Array
(
    [0] => "a null: \u0000; a new line: \n; a carriage return: \r;"
    [1] => "a js regex: /(["'])\w+\1/"
    [2] => "a script element: <script type="test/javascript" src="http://example.com/all.js"></script>"
    [3] => "a japanese word: みず"
)