WordPress and UTF-8

We have known that WordPress cannot handle 4-byte UTF-8 characters since the bug report “Inserting a 4-byte UTF-8 character truncates data” (#13590) was filed on May 28, 2010. As of today, four years later, #13590 is still not fixed.

I first contributed a patch for the bug on January 19, 2011, and then a plugin for fixing WP sites on May 25, 2011. Both work in a very simple way:

  1. immediately before writing to the database, each 4-byte UTF-8 character is converted to a safe code
  2. immediately after reading from the database, each safe code is converted back to the 4-byte UTF-8 character
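
The two steps above can be sketched in a few lines. This is only an illustration in Python, not the code of the patch or the plugin (which are written for WordPress): the safe-code format ´&#x…;´ and the function names ´escape_4byte´ and ´unescape_4byte´ are assumptions made up for this example.

```python
import re

# Characters outside the Basic Multilingual Plane (U+10000 and above)
# are exactly the ones that need 4 bytes in UTF-8.
_FOUR_BYTE = re.compile(r"[\U00010000-\U0010FFFF]")

# Hypothetical safe code: "&#x" + hexadecimal code point + ";".
# 5 or 6 hex digits are enough to cover U+10000..U+10FFFF.
_SAFE_CODE = re.compile(r"&#x([0-9A-Fa-f]{5,6});")


def escape_4byte(text: str) -> str:
    """Replace every 4-byte UTF-8 character with an ASCII-only safe code."""
    return _FOUR_BYTE.sub(lambda m: "&#x{:X};".format(ord(m.group())), text)


def unescape_4byte(text: str) -> str:
    """Turn safe codes back into the characters they stand for."""

    def restore(match):
        code_point = int(match.group(1), 16)
        if 0x10000 <= code_point <= 0x10FFFF:
            return chr(code_point)
        # Not a 4-byte character: leave the text untouched.
        return match.group(0)

    return _SAFE_CODE.sub(restore, text)


# Step 1: escape right before writing to the database.
to_store = escape_4byte("tetragram: \U0001D306")
# Step 2: unescape right after reading from the database.
read_back = unescape_4byte(to_store)
```

One limitation of a sketch like this: text that already contains a literal sequence in the safe-code format would be turned into a real character when read back, so a real implementation has to pick its safe code with that in mind.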

Escaping and unescaping is very useful because PHP, which can handle 4-byte UTF-8 characters, gets to work with the real characters, while MySQL, which cannot handle them (*), only ever sees a safe code.

There are only a couple of issues with this method, though. If a string contains 4-byte UTF-8 characters, then

  1. the length of the string in PHP will appear shorter than in MySQL
  2. the sort order of the string in PHP will differ from its sort order in MySQL
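
Both effects are easy to reproduce. The following sketch uses Python instead of PHP and MySQL, assumes a hypothetical safe code of the form ´&#x…;´, and compares strings by plain code point, which is simpler than a real MySQL collation; it only shows why the escaped and unescaped views can disagree.

```python
def escape_4byte(text):
    # Hypothetical safe code "&#x...;" for every character above U+FFFF,
    # used here only to illustrate the two side effects.
    return "".join(
        "&#x{:X};".format(ord(ch)) if ord(ch) > 0xFFFF else ch for ch in text
    )


# What the application sees (real characters) ...
original = ["z", "\U0001D306"]
# ... and what ends up stored in the database (safe codes).
stored = [escape_4byte(s) for s in original]

# 1. Length: one character in the application, nine in the database.
lengths_in_app = [len(s) for s in original]
lengths_in_db = [len(s) for s in stored]

# 2. Order: U+1D306 sorts after "z", but its safe code starts with "&",
#    which sorts before "z".
order_in_app = sorted(original)
order_in_db = sorted(stored)
```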

For me, these are minor problems compared to the peace of mind of being able to paste or edit whatever UTF-8 text I might come up with.

As I said, #13590 is still not fixed after four years. Meanwhile, this is what happened:

  1. 2010-05-28 06:53 – #13590 opened – sardisson filed Inserting a 4-byte UTF-8 character truncates data
  2. 2010-10-27 12:28 – #13590 – Component changed from General to Charset – Milestone changed from Awaiting Review to Future Release
  3. 2010-12-31 13:21 – #13590 closed – Resolution set to invalid – Status changed from new to closed
  4. 2011-01-13 02:55 – #13590 – Milestone Future Release deleted
  5. 2011-01-19 09:32 – #13590 reopened – Component changed from Charset to Database – Keywords utf8 added – Resolution invalid deleted – Status changed from closed to reopened – Summary changed from Inserting a tetragram (SMP/Plane 1) character truncates post fields to Inserting a 4-byte UTF-8 character truncates data – Version changed from 2.9.2 to 3.0.4
  6. 2011-01-19 09:57 – #13590 – Milestone set to Awaiting Review
  7. 2011-01-19 21:40 – #13590 – Attachment wp-db-utf8-patch.diff added (my patch)
  8. 2011-05-25 – I published my plugin
  9. 2011-05-25 21:43 – #13590 – Keywords has-patch added – Type changed from defect (bug) to enhancement
  10. 2012-07-11 04:26 – #21212 opened – Gary Pendergast (pento) filed MySQL tables should use utf8mb4 character set, basically suggesting that WordPress take advantage of utf8mb4, which is how the MySQL developers fixed their bug. (*)
  11. 2012-08-08 06:49 – #13590 – Milestone changed from Awaiting Review to 3.5
  12. 2012-08-29 01:54 – #13590 closed – Keywords utf8 removed – Milestone 3.5 deleted – Resolution set to maybelater – Status changed from reopened to closed – Version changed from 3.0.4 to 2.9.2
  13. 2013-02-18 21:36 – #13590 – Duplicated by #23495
  14. 2012-07-11 04:28 – #21212 – Attachment 21212-utf8mb4.diff added
  15. 2012-07-11 04:29 – #21212 – Keywords has-patch added
  16. 2012-07-11 11:22 – #21212 – Related to #13590
  17. 2012-07-18 06:52 – #21212 – Attachment 21212-utf8mb4.2.diff added
  18. 2012-07-30 – Mathias Bynens published How to support full Unicode in MySQL databases
  19. 2012-08-07 04:13 – #21212 – Attachment 21212-utf8mb4.3.diff added
  20. 2012-08-07 05:06 – #21212 – Milestone changed from Awaiting Review to 3.5
  21. 2012-08-07 06:39 – #21212 – Keywords commit added
  22. 2012-08-29 01:54 – #21212 closed – Milestone 3.5 deleted – Resolution set to maybelater – Status changed from new to closed
  23. 2012-10-09 05:24 – #21212 reopened – Keywords has-patch commit removed – Resolution maybelater deleted – Status changed from closed to reopened
  24. 2012-10-09 06:56 – #21212 – Milestone set to Awaiting Review
  25. 2014-04-07 – Gary Pendergast (pento) published WordPress and UTF-8
  26. 2014-04-22 02:08 – #21212 – Duplicated by #27961
  27. 2014-06-11 – Andrew Nacin (nacin) and core WP developers started thinking about a fix.
  28. 2014-08-22 15:47 – #21212 – Duplicated by #29322
  29. 2014-09-28 13:20 – #21212 – Duplicated by #29773
  30. 2014-10-04 18:49 – #21212 – Duplicated by #29857

FWIW, since I started using my plugin, I have forgotten about WordPress and UTF-8.

How to install a WordPress development environment

WordPress 4.0 has forced me to update the plugins I created a long time ago to explicitly state that they are still compatible. I understand the rationale, so, eventually and reluctantly, I “decided” to comply.

To feel modern myself, I’ve been playing with Varying Vagrant Vagrants. It’s fantastic: it downloads and installs tons of stuff all by itself. It was a real pleasure to see it work… when it eventually did.

First issue: (host) brew was outdated

In fact, the installation wasn’t completely effortless for me, even though effortlessness is what VVV aims for. To be able to complete “4. Install the vagrant-hostsupdater”, I needed to force brew to update. I did it like this:

That allowed me to complete the steps from “1. Start” all the way up to “7. …change into the new directory…” without a glitch.

Second issue: (guest) SSH was not working

But as soon as I tried to ´vagrant up´

I got lots of retries. Eventually they ended, but the new virtual machine was broken, BADLY broken. In fact, I could not connect to it using ´vagrant ssh´ even though, according to ´vagrant status´, it was peacefully running. ´vagrant destroy´ did not work either, of course.

My only option was to kill the virtual machine from VirtualBox and look into the Vagrantfile and see if I could tweak it somehow here and there. I tried really hard but the result never changed.

So I tried one last thing: get rid of the virtual machine and repeat ´vagrant up´ once more. On OS X, VirtualBox virtual machines are stored in folders like ´vvv_default_1412029772978_15676´, usually under ´~/VirtualBox VMs´.

If there are many such subfolders, you can identify the right one by first finding out the id of the virtual machine at hand and then looking for that id in the list of installed virtual machines.
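
As a rough sketch of that lookup: with the VirtualBox provider, Vagrant keeps the machine id in a small file under the project's ´.vagrant´ directory, and ´VBoxManage list vms´ prints one ´"name" {id}´ line per registered machine. The Python helper names below are made up for this example, and the machine name ´default´ is an assumption that matches the folder name above.

```python
import re
from pathlib import Path
from typing import Optional


def read_vagrant_machine_id(project_dir: str, machine: str = "default") -> str:
    """Read the id that Vagrant stored for a VirtualBox-backed machine."""
    id_file = (
        Path(project_dir) / ".vagrant" / "machines" / machine / "virtualbox" / "id"
    )
    return id_file.read_text().strip()


def find_vm_name(list_vms_output: str, machine_id: str) -> Optional[str]:
    """Find the VM name for an id in the output of `VBoxManage list vms`.

    Each line of that output looks like: "some_vm_name" {some-uuid}
    """
    for line in list_vms_output.splitlines():
        match = re.match(r'^"(.*)" \{([0-9A-Fa-f-]+)\}$', line.strip())
        if match and match.group(2).lower() == machine_id.lower():
            return match.group(1)
    return None
```

The text passed as ´list_vms_output´ would be whatever ´VBoxManage list vms´ prints on the host; the returned name is also the name of the subfolder to look for.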

After unregistering the virtual machine, removing the folder and issuing ´vagrant up´ again… magic… SSH was working and the install script could complete. Why not before? No idea. But here was the pleasure I referred to earlier. It went on and on for many minutes, printing thousands of green lines until it gracefully ended.

Third issue: (outgoing) mail was not delivered

Then I could work on everything I wanted. Connect to the databases with Sequel Pro. Open all the WordPress instances in Chrome. Install my plugins. Everything perfect. And the synced folders… awesome, they were allowing me to comfortably develop in my beloved IDE on my Mac.

I knew there was something wrong, though. I still hadn’t received any welcome emails from the WordPress instances. Well, I thought, that’s understandable. The install script never asked for my email… there must be some bogus one configured. I checked and I was right.

After entering my real email address in a WordPress instance, I triggered an email notification and waited. And waited. And waited. Nothing. Ever. Delivered. Then I checked the log.

Don’t be fooled by the message “You have new mail.” Those are usually internal notifications, like bounces.

That must be the previous problem: ´Host or domain name not found. Name service error for name=local.dev type=AAAA: Host not found´.

That must be the current problem: ´connect to gmail.com[]:25: Connection timed out´.
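
A timeout like this one can be double-checked without going through the mail system at all, by simply trying to open a TCP connection to the same host and port from inside the virtual machine. The following is a minimal sketch in Python; the function name ´can_connect´ is made up for this example.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port can be opened in time."""
    try:
        # create_connection resolves the host name and connects;
        # the connection is closed again as soon as the block ends.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections and name resolution errors.
        return False
```

If a call such as ´can_connect("gmail.com", 25)´ returns False while the same check on another port of a known-good host succeeds, that points to outgoing port 25 being blocked rather than to a mail configuration problem.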

As far as I could understand, port 25 might be blocked somewhere along the way, which is basically what the answer given by kasperd here says. Then I found this article from 2014 (which is really this article from 2008) and tried it out with sendmail.

That again didn’t deliver anything, but it added these lines to the mail log.

Then the problem became: ´SASL authentication failed´. This one was pretty hard. I tried many different configurations. In the end I had to give up and read what Google advised right in the error message. That in turn made me follow (very reluctantly) the steps for Allowing less secure apps to access your account.

Notice that you are supposed to Allow less secure apps from the specific Gmail account that you want to use for authenticating (through the server) when trying to relay, i.e. the one you set up in ´/etc/postfix/sasl_passwd´. For me that meant I had to open the link from the Chrome user associated with that account.

And as soon as I completed that LAST step, I started receiving all the messages that Ubuntu had been queueing after I had sent them. Then we both went to sleep.