Weapon of Choice

2011-03-22

Addictive Hacker News Horse Race Bookmarklet

I like to read Hacker News at Y Combinator. And I like to see how comments and points build up over time.

Some days ago I got this very stupid idea: what about making news items run like horses in a race? That way I could instantly spot the leading stories.

So I wrote this bookmarklet and now I’m hooked. I can’t give up on it. Many times I just hit reload to see who’s leading the race now. I’m totally addicted.

BEFORE

AFTER

The bookmarklet

Drag and drop this link to your bookmarks bar, and rename it to HN Horse Race.

{[ .horserace | 1.hilite(=javascript=) ]}

2011-01-14

Full UTF-8 support in WordPress

A few days ago I discovered that WordPress doesn’t support full UTF-8 strings, whose characters are 1 to 4 bytes long. It only supports Unicode characters belonging to the Basic Multilingual Plane (BMP), whose UTF-8 encodings are 1 to 3 bytes long.

This WordPress defect is “caused by” MySQL 5, which only supports UTF-8 characters in the BMP. Apparently, MySQL 6 will be fully UTF-8 compliant.
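To make the byte counts concrete, here is a minimal PHP sketch. It assumes PHP 7 or later (for the `\u{...}` string syntax), and the helper name `hasNonBmpCharacters` is hypothetical: it is not part of the plugin, just one way to spot the characters that get lost.

```php
<?php
// Hypothetical helper, not part of the plugin: reports whether a UTF-8
// string contains any character outside the BMP (U+10000 to U+10FFFF),
// i.e. a character whose UTF-8 encoding is 4 bytes long.
function hasNonBmpCharacters(string $text): bool
{
    // The /u modifier makes PCRE treat pattern and subject as UTF-8.
    return preg_match('/[\x{10000}-\x{10FFFF}]/u', $text) === 1;
}

$water = "\u{6C34}";   // the kanji for water, inside the BMP
$gClef = "\u{1D11E}";  // the G clef, outside the BMP

echo strlen($water), "\n"; // strlen() counts bytes: 3
echo strlen($gClef), "\n"; // 4
var_dump(hasNonBmpCharacters($water)); // bool(false)
var_dump(hasNonBmpCharacters($gClef)); // bool(true)
```

A check like this could be run on a string before saving it, to know whether a BMP-only database column would mangle it.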

This morning, with the help of the UTF-8 class I recently developed, I put together a new WordPress plugin that adds full UTF-8 support to WordPress.

And this is the same sentence by Douglas Crockford, from RFC 4627, which I cited in the previous post:

a string containing only the G clef character [𝄞] may be represented as “\uD834\uDD1E”

Windows users may see a rectangle instead (it’s a Windows “feature”), but they should see the following:

[Image 1: the G clef as it should appear]

You should note that the G clef above (not the one in the picture 😉) appears in the HTML not as an entity but as a plain UTF-8 character, entered as-is in the WordPress editor. You can see it for yourself by comparing the source code of this post (1) with that of the previous one (2).

<blockquote><p>a string containing only the G clef character [<a href="http://www.fileformat.info/info/unicode/char/1d11e/index.htm" target="_blank"><span style="font-size: 2em;">𝄞</span></a>] may be represented as “\uD834\uDD1E”</p></blockquote>

<blockquote><p>a string containing only the G clef character [<a href="http://www.fileformat.info/info/unicode/char/1d11e/index.htm" target="_blank"><span style="font-size: 2em;">&#119070;</span></a>] may be represented as &#8220;\uD834\uDD1E&#8221;</p></blockquote>

Note that my plugin works for ~~post and page content, title, excerpt, and also for searches, but it doesn’t cover custom fields~~ (since version 2.0.0) any character written to and read from the database. ~~For this reason~~ Anyway, I~~’ve just~~ opened a ticket about this issue in the WordPress Trac: please drop by and comment 🙂

What follows is the code of my Zend_Utf8 class, which I included in the plugin, after de-Zend-ifying all of it, for safe distribution in the wild.

{[ .Ando_Utf8 | 1.hilite(=php=) ]}

2011-01-10

Escaping and unescaping UTF-8 characters in PHP

The Zend_Json_Encoder class implements three clearly distinct functionalities: encoding PHP values, encoding PHP classes, and escaping UTF-8 characters. Yet even in the latest release, 1.11.2, the UTF-8 escaping feature doesn’t handle all possible UTF-8 characters: it lacks any support for the so-called extended Unicode characters, whose code points lie between U+10000 and U+10FFFF. Douglas Crockford gives an example of how to escape extended characters by means of surrogate pairs in RFC 4627:

a string containing only the G clef character [𝄞] may be represented as “\uD834\uDD1E”
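The arithmetic behind that surrogate pair is short enough to sketch. The function below is only an illustration under my own naming (`escapeCodePoint` is hypothetical and is not a Zend_Utf8 or Zend_Json method): it subtracts 0x10000 from the code point and splits the remaining 20 bits into two halves of 10 bits each.

```php
<?php
// Hypothetical helper, not the Zend_Utf8 API: escapes one Unicode code
// point as JSON-style \uXXXX, using a surrogate pair above U+FFFF.
function escapeCodePoint(int $codePoint): string
{
    if ($codePoint < 0 || $codePoint > 0x10FFFF) {
        throw new InvalidArgumentException('Not a Unicode code point');
    }
    if ($codePoint < 0x10000) {
        // BMP characters need a single \uXXXX escape.
        return sprintf('\u%04X', $codePoint);
    }
    $offset = $codePoint - 0x10000;      // a 20-bit value
    $high = 0xD800 | ($offset >> 10);    // top 10 bits: high surrogate
    $low  = 0xDC00 | ($offset & 0x3FF);  // bottom 10 bits: low surrogate
    return sprintf('\u%04X\u%04X', $high, $low);
}

echo escapeCodePoint(0x1D11E), "\n"; // \uD834\uDD1E
echo escapeCodePoint(0x6C34), "\n";  // \u6C34
```

For the G clef, 0x1D11E minus 0x10000 is 0xD11E; its top 10 bits are 0x34 and its bottom 10 bits are 0x11E, which gives exactly the pair quoted from the RFC.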

The encodeUnicodeString() method and its ancillary _utf82utf16() come directly from the Solar Framework. I think it’s fair to copy code from one open source project to another, but it takes some insight to tell what is well made from what is not. In this case, encodeUnicodeString() handles characters of up to 6 bytes while _utf82utf16() handles characters of up to 3 bytes, but UTF-8 characters are 1 to 4 bytes long! And I can’t believe that those two functions destroy any character they cannot process!

Encoding PHP values to some other string format, like JSON, may require escaping UTF-8 characters, and the same goes for decoding and unescaping. I think this sufficiently justifies a class for basic UTF-8 support in the Zend Framework, so I wrote Zend_Utf8, which I’m going to briefly introduce.

Zend_Utf8 exposes six static functions: two are the main functions for escaping and unescaping strings, and four are ancillary functions for mapping UTF-8 characters to Unicode code points (integers) and the other way around. Usage of the ancillary functions is well documented by the main functions, so I’ll describe only usage of the latter.
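To give an idea of what mapping a UTF-8 character to an integer and back involves, here is a rough sketch. The two function names are hypothetical and are not the actual Zend_Utf8 methods; the decoder assumes it receives exactly one well-formed UTF-8 character and does only minimal validation.

```php
<?php
// Hypothetical helpers, not the actual Zend_Utf8 methods.

// Decodes one well-formed UTF-8 character (1 to 4 bytes) to its code point.
function utf8CharToCodePoint(string $char): int
{
    $length = strlen($char);
    if ($length < 1 || $length > 4) {
        throw new InvalidArgumentException('Expected 1 to 4 bytes');
    }
    $first = ord($char[0]);
    if ($length === 1) {
        return $first; // ASCII: the byte is the code point
    }
    // Payload bits of the lead byte for 2-, 3- and 4-byte sequences.
    $leadMask = [2 => 0x1F, 3 => 0x0F, 4 => 0x07][$length];
    $codePoint = $first & $leadMask;
    for ($i = 1; $i < $length; $i++) {
        // Each continuation byte (10xxxxxx) contributes 6 bits.
        $codePoint = ($codePoint << 6) | (ord($char[$i]) & 0x3F);
    }
    return $codePoint;
}

// Encodes a code point (0 to 0x10FFFF) as a UTF-8 character.
function codePointToUtf8Char(int $cp): string
{
    if ($cp < 0 || $cp > 0x10FFFF) {
        throw new InvalidArgumentException('Not a Unicode code point');
    }
    if ($cp < 0x80) {
        return chr($cp);
    }
    if ($cp < 0x800) {
        return chr(0xC0 | ($cp >> 6)) . chr(0x80 | ($cp & 0x3F));
    }
    if ($cp < 0x10000) {
        return chr(0xE0 | ($cp >> 12))
            . chr(0x80 | (($cp >> 6) & 0x3F))
            . chr(0x80 | ($cp & 0x3F));
    }
    return chr(0xF0 | ($cp >> 18))
        . chr(0x80 | (($cp >> 12) & 0x3F))
        . chr(0x80 | (($cp >> 6) & 0x3F))
        . chr(0x80 | ($cp & 0x3F));
}

echo dechex(utf8CharToCodePoint("\u{6C34}")), "\n"; // 6c34
echo bin2hex(codePointToUtf8Char(0x1D11E)), "\n";   // f09d849e
```

Note the fourth branch in the encoder: it is precisely the 4-byte case that code limited to 3 bytes cannot handle.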

Example

Here is a short program that shows basic usage:

{[ .example | 1.hilite(=php,ln-1=) ]}

And this is the output:

The white space inside brackets [	] is a common tab.
The white space inside brackets [\u0009] is a common tab.
-- escaped and unescaped as expected
-- escape DOESN'T WORK like in json_encode
-- unescape works like in json_decode

The kanji inside brackets [水] is read mizu and means water in Japanese.
The kanji inside brackets [\u6c34] is read mizu and means water in Japanese.
-- escaped and unescaped as expected
-- escape works like in json_encode
-- unescape works like in json_decode

The symbol inside brackets [] is NOT a G clef.
The symbol inside brackets [] is NOT a G clef.
-- escaped and unescaped as expected
-- escape works like in json_encode
-- unescape works like in json_decode

Please note:

in the first case it says “escape DOESN’T WORK like in json_encode” because json_encode replaces every control character that has a short JSON escape with that escape; here the tab control character becomes \t rather than \u0009
in the third case it says “is NOT” because I’ve just found out that MySQL doesn’t support extended UTF-8 characters, so WordPress silently breaks down. In the excerpt from RFC 4627 quoted above I was able to display the G clef by means of the HTML entity [𝄞], but in the example I can’t use an entity; for this reason here is a screen capture of the output as it appears in the debug window of Zend Studio
once I’ve developed the WordPress fix that lets it handle a G clef seamlessly, I’ll introduce advanced usage, changing the default options to other interesting values, for example to escape for HTML. Meanwhile, please refer to the Zend Framework proposal
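The json_encode behaviour mentioned in the first note is easy to check with PHP’s built-in functions alone. This small sketch assumes PHP 7 or later (for the `\u{...}` string syntax) and default json_encode options; it does not use Zend_Utf8 at all.

```php
<?php
// json_encode() uses a short escape such as \t for the control characters
// that have one, and \uXXXX (lowercase hex) for non-ASCII characters.
$tab   = json_encode("\t");
$water = json_encode("\u{6C34}");
$gClef = json_encode("\u{1D11E}");

echo $tab, "\n";   // "\t"
echo $water, "\n"; // "\u6c34"
echo $gClef, "\n"; // "\ud834\udd1e"

// json_decode() accepts the long form \u0009 as well as surrogate pairs.
var_dump(json_decode('"\u0009"') === "\t");              // bool(true)
var_dump(json_decode('"\uD834\uDD1E"') === "\u{1D11E}"); // bool(true)
```

So an escaper that always emits \uXXXX differs from json_encode only for the handful of control characters with short escapes, while unescaping can follow json_decode in every case.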

Class

{[ .zend_utf8 | 1.hilite(=php,ln-1=) ]}

References

UTF-8 and Unicode

Unicode