We have known that WordPress cannot handle 4-byte UTF-8 characters since the bug report Inserting a 4-byte UTF-8 character truncates data (#13590) was filed on May 28, 2010. As of today, four years later, #13590 is still not fixed.
I first contributed a patch to fix the bug on January 19, 2011, and then published a plugin to fix WP sites on May 25, 2011. Both work in a very simple way:
- immediately before writing to the database, each 4-byte UTF-8 character is converted to a safe code
- immediately after reading from the database, each safe code is converted back to a 4-byte UTF-8 character
Escaping and unescaping works because each side only sees what it can handle: PHP, which can manage 4-byte UTF-8 characters, works with the real characters, while MySQL, which cannot manage them (*), stores a safe code instead.
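To make the round trip concrete, here is a minimal sketch in Python rather than the PHP of the actual patch and plugin. The function names and the `&#xHEX;` form of the safe code are assumptions made for illustration. The real plugin may use a different encoding.

```python
import re

# Any code point outside the Basic Multilingual Plane needs 4 bytes in UTF-8.
_ASTRAL = re.compile(r"[\U00010000-\U0010FFFF]")
# Hypothetical safe code: '&#x' + 5 or 6 hex digits (U+10000..U+10FFFF) + ';'.
_SAFE_CODE = re.compile(r"&#x(10[0-9A-F]{4}|[1-9A-F][0-9A-F]{4});")


def escape_4byte(text: str) -> str:
    """Call right before writing: replace each 4-byte character with a safe code."""
    return _ASTRAL.sub(lambda m: "&#x{:X};".format(ord(m.group())), text)


def unescape_4byte(text: str) -> str:
    """Call right after reading: turn each safe code back into its character.

    Limitation of this sketch: text that already contained such a code
    literally will also be converted.
    """
    return _SAFE_CODE.sub(lambda m: chr(int(m.group(1), 16)), text)


original = "tetragram \U0001D306 here"
stored = escape_4byte(original)      # only 1-byte characters remain
restored = unescape_4byte(stored)    # identical to the original
```

After escaping, every character in `stored` fits in at most 3 bytes of UTF-8, which is what a MySQL `utf8` column can hold.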
There are a couple of issues with this method, though. If a string contains 4-byte UTF-8 characters, then
- the length of the string in PHP will be shorter than in MySQL
- the sort order of the string in PHP will differ from the one in MySQL
For me, these are minor problems compared to the peace of mind of being able to paste or edit whatever UTF-8 text I might come up with.
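Both side effects are easy to reproduce. The sketch below is in Python, not the plugin's PHP. It assumes a hypothetical `&#xHEX;` safe code and plain code-point comparison on both sides. Real MySQL collations sort differently, so treat the ordering part as an illustration only.

```python
import re

_ASTRAL = re.compile(r"[\U00010000-\U0010FFFF]")


def escape_4byte(text: str) -> str:
    """Hypothetical escaping: replace each 4-byte character with '&#xHEX;'."""
    return _ASTRAL.sub(lambda m: "&#x{:X};".format(ord(m.group())), text)


# Length: the application sees 1 + 4 bytes, the database stores 1 + 9 bytes.
title = "a\U0001D306"
app_bytes = len(title.encode("utf-8"))
db_bytes = len(escape_4byte(title).encode("utf-8"))

# Order: the real character sorts after 'z', but its safe code starts
# with '&', which sorts before any letter.
words = ["zebra", "\U0001D306"]
app_order = sorted(words)
db_order = sorted(words, key=escape_4byte)
```

Here `app_bytes` is 5 while `db_bytes` is 10, and the two words come out in opposite order depending on which side does the sorting.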
As I said, #13590 is still not fixed after four years. Meanwhile, this is what happened:
- 2010-05-28 06:53 – #13590 opened – sardisson filed Inserting a 4-byte UTF-8 character truncates data
- 2010-10-27 12:28 – #13590 – Component changed from General to Charset – Milestone changed from Awaiting Review to Future Release
- 2010-12-31 13:21 – #13590 closed – Resolution set to invalid – Status changed from new to closed
- 2011-01-13 02:55 – #13590 – Milestone Future Release deleted
- 2011-01-19 09:32 – #13590 reopened – Component changed from Charset to Database – Keywords utf8 added – Resolution invalid deleted – Status changed from closed to reopened – Summary changed from Inserting a tetragram (SMP/Plane 1) character truncates post fields to Inserting a 4-byte UTF-8 character truncates data – Version changed from 2.9.2 to 3.0.4
- 2011-01-19 09:57 – #13590 – Milestone set to Awaiting Review
- 2011-01-19 21:40 – #13590 – Attachment wp-db-utf8-patch.diff added (my patch)
- 2011-05-25 – I published my plugin
- 2011-05-25 21:43 – #13590 – Keywords has-patch added – Type changed from defect (bug) to enhancement
- 2012-07-11 04:26 – #21212 opened – Gary Pendergast (pento) filed MySQL tables should use utf8mb4 character set, basically suggesting to take advantage of utf8mb4, which is how MySQL developers fixed their bug. (*)
- 2012-08-08 06:49 – #13590 – Milestone changed from Awaiting Review to 3.5
- 2012-08-29 01:54 – #13590 closed – Keywords utf8 removed – Milestone 3.5 deleted – Resolution set to maybelater – Status changed from reopened to closed – Version changed from 3.0.4 to 2.9.2
- 2013-02-18 21:36 – #13590 – Duplicated by #23495
- 2012-07-11 04:28 – #21212 – Attachment 21212-utf8mb4.diff added
- 2012-07-11 04:29 – #21212 – Keywords has-patch added
- 2012-07-11 11:22 – #21212 – Related to #13590
- 2012-07-18 06:52 – #21212 – Attachment 21212-utf8mb4.2.diff added
- 2012-07-30 – Mathias Bynens published How to support full Unicode in MySQL databases
- 2012-08-07 04:13 – #21212 – Attachment 21212-utf8mb4.3.diff added
- 2012-08-07 05:06 – #21212 – Milestone changed from Awaiting Review to 3.5
- 2012-08-07 06:39 – #21212 – Keywords commit added
- 2012-08-29 01:54 – #21212 closed – Milestone 3.5 deleted – Resolution set to maybelater – Status changed from new to closed
- 2012-10-09 05:24 – #21212 reopened – Keywords has-patch commit removed – Resolution maybelater deleted – Status changed from closed to reopened
- 2012-10-09 06:56 – #21212 – Milestone set to Awaiting Review
- 2014-04-07 – Gary Pendergast (pento) published WordPress and UTF-8
- 2014-04-22 02:08 – #21212 – Duplicated by #27961
- 2014-06-11 – Andrew Nacin (nacin) and core WP developers started thinking about a fix.
- 2014-08-22 15:47 – #21212 – Duplicated by #29322
- 2014-09-28 13:20 – #21212 – Duplicated by #29773
- 2014-10-04 18:49 – #21212 – Duplicated by #29857
FWIW, ever since I started using my plugin, I have not had to think about WordPress and UTF-8.