Escaping and unescaping UTF-8 characters in PHP

The Zend_Json_Encoder class implements three clearly different pieces of functionality: encoding PHP values, encoding PHP classes, and escaping UTF-8 characters. Yet even in the latest release, 1.11.2, the UTF-8 escaping feature doesn’t take into account all possible UTF-8 characters: it lacks any support for the so-called extended Unicode characters, those with a code point between U+10000 and U+10FFFF. Douglas Crockford gives an example of how to escape extended characters by means of surrogate pairs in RFC 4627:

a string containing only the G clef character [𝄞] may be represented as “\uD834\uDD1E”
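The arithmetic behind that surrogate pair can be sketched as follows. This is only an illustration: codePointToJsonEscape() is a hypothetical helper, not a function of Zend_Json_Encoder or Zend_Utf8, and it assumes its argument is a valid code point no greater than U+10FFFF.

```php
<?php
// Hypothetical helper (not part of any Zend class): turn a Unicode code
// point into its JSON \u escape, using a surrogate pair above U+FFFF.
function codePointToJsonEscape($codePoint)
{
    if ($codePoint < 0x10000) {
        // Characters in the Basic Multilingual Plane need a single escape.
        return sprintf('\\u%04X', $codePoint);
    }
    // Subtract 0x10000, then split the remaining 20 bits into two halves.
    $offset = $codePoint - 0x10000;
    $high   = 0xD800 | ($offset >> 10);   // top 10 bits
    $low    = 0xDC00 | ($offset & 0x3FF); // bottom 10 bits
    return sprintf('\\u%04X\\u%04X', $high, $low);
}

echo codePointToJsonEscape(0x1D11E), "\n"; // prints \uD834\uDD1E
```

For the G clef, U+1D11E minus 0x10000 leaves 0xD11E; its top ten bits are 0x034 and its bottom ten bits are 0x11E, which gives D834 and DD1E.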

The encodeUnicodeString() method and its ancillary _utf82utf16() come directly from the Solar Framework. I think it’s fair to copy code from one open source project to another, but some insight is needed to tell what is well made from what is not. In this case, encodeUnicodeString() supports characters of up to 6 bytes and _utf82utf16() supports characters of up to 3 bytes, but UTF-8 characters are 1 to 4 bytes long! And I can’t believe that those two functions destroy any character they cannot process!
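To make the 1-to-4-byte rule concrete, here is a minimal sketch of how the length of a UTF-8 sequence can be read from its first byte. utf8SequenceLength() is a name I made up for the illustration; it is not taken from Solar or from the Zend Framework, and it only looks at the lead byte without validating the continuation bytes.

```php
<?php
// Hypothetical helper: expected byte length of the UTF-8 sequence that
// starts with the first byte of $bytes, or 0 if that byte cannot start one.
function utf8SequenceLength($bytes)
{
    $lead = ord($bytes); // ord() only considers the first byte
    if ($lead < 0x80) {
        return 1; // ASCII
    }
    if ($lead >= 0xC2 && $lead <= 0xDF) {
        return 2;
    }
    if ($lead >= 0xE0 && $lead <= 0xEF) {
        return 3;
    }
    if ($lead >= 0xF0 && $lead <= 0xF4) {
        return 4; // extended characters, U+10000 to U+10FFFF
    }
    return 0; // continuation byte or byte never used as a lead byte
}

echo utf8SequenceLength("\xE6\xB0\xB4"), "\n";     // kanji mizu: prints 3
echo utf8SequenceLength("\xF0\x9D\x84\x9E"), "\n"; // G clef: prints 4
```

A caller that gets 0 back can decide what to do with the offending byte (copy it through, replace it, raise an error) instead of silently dropping it.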

Encoding PHP values to some other string format, like JSON, may require escaping UTF-8 characters. The same goes for decoding and unescaping. I think this sufficiently justifies a class for basic UTF-8 support in the Zend Framework, so I wrote Zend_Utf8, which I’m going to introduce briefly.

Zend_Utf8 exposes six static functions: two are the main functions for escaping and unescaping strings, and four are the ancillary functions for mapping UTF-8 characters to Unicode integers and the other way around. The main functions themselves show how the ancillary ones are used, so I’ll describe only the usage of the main functions.
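To give an idea of what mapping a UTF-8 character to a Unicode integer and back involves, here is a rough sketch. The two function names are hypothetical and are not the Zend_Utf8 method names; the code assumes it receives exactly one well-formed character (or one valid code point) and performs no validation.

```php
<?php
// Hypothetical sketch: decode a single, well-formed UTF-8 character
// into its Unicode code point. Returns false for unsupported lengths.
function utf8CharToCodePoint($char)
{
    $bytes = array_values(unpack('C*', $char));
    $count = count($bytes);
    if ($count === 1) {
        return $bytes[0];
    }
    // Payload bits carried by the lead byte of 2-, 3- and 4-byte sequences.
    $masks = array(2 => 0x1F, 3 => 0x0F, 4 => 0x07);
    if (!isset($masks[$count])) {
        return false;
    }
    $codePoint = $bytes[0] & $masks[$count];
    for ($i = 1; $i < $count; $i++) {
        // Each continuation byte contributes its low 6 bits.
        $codePoint = ($codePoint << 6) | ($bytes[$i] & 0x3F);
    }
    return $codePoint;
}

// Hypothetical sketch: encode a code point (0 to 0x10FFFF) as UTF-8.
function codePointToUtf8Char($cp)
{
    if ($cp < 0x80) {
        return chr($cp);
    }
    if ($cp < 0x800) {
        return chr(0xC0 | ($cp >> 6)) . chr(0x80 | ($cp & 0x3F));
    }
    if ($cp < 0x10000) {
        return chr(0xE0 | ($cp >> 12))
             . chr(0x80 | (($cp >> 6) & 0x3F))
             . chr(0x80 | ($cp & 0x3F));
    }
    return chr(0xF0 | ($cp >> 18))
         . chr(0x80 | (($cp >> 12) & 0x3F))
         . chr(0x80 | (($cp >> 6) & 0x3F))
         . chr(0x80 | ($cp & 0x3F));
}

printf("%X\n", utf8CharToCodePoint("\xE6\xB0\xB4")); // prints 6C34
echo bin2hex(codePointToUtf8Char(0x1D11E)), "\n";    // prints f09d849e
```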

Example

Here is a simple program that shows some basic usage:

{[ .example | 1.hilite(=php,ln-1=) ]}

And this is the output

The white space inside brackets [	] is a common tab.
The white space inside brackets [\u0009] is a common tab.
-- escaped and unescaped as expected
-- escape DOESN'T WORK like in json_encode
-- unescape works like in json_decode

The kanji inside brackets [水] is read mizu and means water in Japanese.
The kanji inside brackets [\u6c34] is read mizu and means water in Japanese.
-- escaped and unescaped as expected
-- escape works like in json_encode
-- unescape works like in json_decode

The symbol inside brackets [] is NOT a G clef.
The symbol inside brackets [] is NOT a G clef.
-- escaped and unescaped as expected
-- escape works like in json_encode
-- unescape works like in json_decode

Please note:

  1. in the first case it says “escape DOESN’T WORK like in json_encode” because json_encode replaces each control character that has a JSON short escape with that short form; here, a tab control character is replaced by \t
  2. in the third case it says “is NOT” because I’ve just found out that MySQL doesn’t support extended UTF-8 characters, so WordPress silently breaks down. In the excerpt from RFC 4627 cited above I was able to display the character by means of the HTML entity [𝄞], but in the example I can’t use an entity; for this reason here is a screen capture of the output as it appears in the debug window of Zend Studio
  3. once I have developed the WordPress fix that lets it handle a G clef seamlessly, I’ll introduce advanced usage: changing the default options to other interesting values, for instance to escape for HTML. Meanwhile, please refer to the Zend Framework proposal
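The json_encode and json_decode behaviour that the output is compared against can be checked with PHP’s built-in functions alone. This small script does not use Zend_Utf8 at all; the byte sequences are written as hex escapes so that the file does not depend on its own encoding.

```php
<?php
$tab   = "[\t]";                // a tab control character inside brackets
$kanji = "[\xE6\xB0\xB4]";      // U+6C34, the kanji mizu
$clef  = "[\xF0\x9D\x84\x9E]";  // U+1D11E, the G clef

// json_encode uses the short escape for the tab, not \u0009.
echo json_encode($tab), "\n";   // prints "[\t]"

// Characters outside ASCII become \u escapes with lowercase hex digits.
echo json_encode($kanji), "\n"; // prints "[\u6c34]"

// An extended character becomes a surrogate pair.
echo json_encode($clef), "\n";  // prints "[\ud834\udd1e]"

// json_decode accepts the long form \u0009 and returns a real tab.
var_dump(json_decode('"[\u0009]"') === $tab); // prints bool(true)
```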

Class

{[ .zend_utf8 | 1.hilite(=php,ln-1=) ]}

References

UTF-8 and Unicode

Unicode

3 Replies to “Escaping and unescaping UTF-8 characters in PHP”

  1. I am using PHP5.2 and json_encode is converting all Chinese characters into utf8 codes. The built-in PHP utf8_decode function didn’t work at all. None of the solutions I found on Stackoverflow or PHP manual worked. So frustrating, argh!

    I’m so glad in the end I found this page. This one works like a charm!
