Recently I’ve been studying the code of JSON encoders for PHP strings, and I’ve discovered the solidus issue.
As a side note, this was the first time I saw a slash called a solidus, and a backslash called a reverse solidus: I always learn something new 😉
So the solidus issue is: am I required to escape every slash in a JSON string?
Let’s see what Douglas Crockford specifies in RFC 4627:
2.5. Strings
The representation of strings is similar to conventions used in the C family of programming languages. A string begins and ends with quotation marks. All Unicode characters may be placed within the quotation marks except for the characters that must be escaped: quotation mark, reverse solidus, and the control characters (U+0000 through U+001F).
Any character may be escaped. If the character is in the Basic Multilingual Plane (U+0000 through U+FFFF), then it may be represented as a six-character sequence: a reverse solidus, followed by the lowercase letter u, followed by four hexadecimal digits that encode the character's code point. The hexadecimal letters A though F can be upper or lowercase. So, for example, a string containing only a single reverse solidus character may be represented as "\u005C".
Alternatively, there are two-character sequence escape representations of some popular characters. So, for example, a string containing only a single reverse solidus character may be represented more compactly as "\\".
To escape an extended character that is not in the Basic Multilingual Plane, the character is represented as a twelve-character sequence, encoding the UTF-16 surrogate pair. So, for example, a string containing only the G clef character (U+1D11E) may be represented as "\uD834\uDD1E".
    string = quotation-mark *char quotation-mark
    char = unescaped /
           escape (
               %x22 /          ; "    quotation mark  U+0022
               %x5C /          ; \    reverse solidus U+005C
               %x2F /          ; /    solidus         U+002F
               %x62 /          ; b    backspace       U+0008
               %x66 /          ; f    form feed       U+000C
               %x6E /          ; n    line feed       U+000A
               %x72 /          ; r    carriage return U+000D
               %x74 /          ; t    tab             U+0009
               %x75 4HEXDIG )  ; uXXXX                U+XXXX
    escape = %x5C              ; \
    quotation-mark = %x22      ; "
    unescaped = %x20-21 / %x23-5B / %x5D-10FFFF
I must say that the above string grammar is perfect. It tells everything one needs to know about valid JSON strings.
By contrast, the introductory notes are a bit confusing. I think the whole Strings section could be rewritten like this:
2.5 Strings
The representation of strings is similar to conventions used in the C family of programming languages.
A string is a sequence of characters wrapped in double quotes. A backslash is always related to the following character. Only a few characters can follow a backslash: some retain their literal meaning, some do not.
All the valid sequences of a backslash followed by a character (except Unicode escapes) are:
\" which means the same as \u0022 (double quote)
\\ which means the same as \u005C (backslash)
\/ which means the same as \u002F (slash)
\b which means the same as \u0008 (backspace)
\f which means the same as \u000C (form feed)
\n which means the same as \u000A (line feed)
\r which means the same as \u000D (carriage return)
\t which means the same as \u0009 (tab)
Any character inside the Unicode Basic Multilingual Plane (U+0000 through U+FFFF) may also appear as a sequence of six characters: a backslash, followed by the lowercase letter u, followed by four hexadecimal digits (upper or lowercase) for the character’s code point. So, for example, a string containing only a single backslash may appear as “\u005C”.
Any character outside the Unicode Basic Multilingual Plane may also appear as a sequence of twelve characters, encoding the UTF-16 surrogate pair. So, for example, a string containing only the G clef character (U+1D11E) may appear as “\uD834\uDD1E”.
In the following grammar, assume that %x introduces a Unicode character whose hexadecimal code point follows %x.
    string = " *char "
    " = %x22
    char = escaped | standard | unicode
    escaped = \same | \special
    \ = %x5C
    same = " | \ | /
    / = %x2F
    special = b | f | n | r | t
    b = %x62
    f = %x66
    n = %x6E
    r = %x72
    t = %x74
    standard = %x20 | %x21 | %x23 .. %x5B | %x5D .. %x10FFFF
    unicode = \u0000 .. \uFFFF
    u = %x75
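The string rules above can be checked mechanically. What follows is only a minimal Python sketch, not part of the RFC: the names JSON_STRING, is_json_string and utf16_escape are made up for illustration, and the regular expression is my own translation of the grammar (unescaped characters, the eight two-character escapes, and the u escape with four hexadecimal digits).

```python
import re

# One JSON string: a quote, any number of chars, a quote.
# A char is either unescaped (anything except ", \ and U+0000..U+001F)
# or a backslash followed by one of " \ / b f n r t, or by u and 4 hex digits.
JSON_STRING = re.compile(
    r'"(?:[^"\\\x00-\x1f]|\\(?:["\\/bfnrt]|u[0-9A-Fa-f]{4}))*"'
)


def is_json_string(text: str) -> bool:
    """Return True if text is exactly one grammatically valid JSON string."""
    return JSON_STRING.fullmatch(text) is not None


def utf16_escape(ch: str) -> str:
    """Escape one character as \\uXXXX, or as a surrogate pair outside the BMP."""
    cp = ord(ch)
    if cp <= 0xFFFF:
        return f"\\u{cp:04X}"
    cp -= 0x10000                    # 20 bits remain
    high = 0xD800 + (cp >> 10)       # top 10 bits go into the high surrogate
    low = 0xDC00 + (cp & 0x3FF)      # bottom 10 bits go into the low surrogate
    return f"\\u{high:04X}\\u{low:04X}"


if __name__ == "__main__":
    print(is_json_string('"http://www.example.com/"'))      # slash left alone
    print(is_json_string(r'"http:\/\/www.example.com\/"'))  # slash escaped
    print(is_json_string(r'"\x"'))                          # not a valid escape
    print(utf16_escape("\U0001D11E"))                       # the G clef
```

Both the escaped and the unescaped slash pass the check, which is the whole point of the solidus issue, while a backslash followed by any other character does not.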
Now it should be clear that no backslash is required before a slash in a JSON string, but if a backslash is provided it’s still a valid string. This is very clear if we look at the example that Douglas Crockford gives in the same RFC, where no slash is escaped in the given Url value:
8. Examples
This is a JSON object:
    {
        "Image": {
            "Width":  800,
            "Height": 600,
            "Title":  "View from 15th Floor",
            "Thumbnail": {
                "Url":    "http://www.example.com/image/481989943",
                "Height": 125,
                "Width":  "100"
            },
            "IDs": [116, 943, 234, 38793]
        }
    }
Its Image member is an object whose Thumbnail member is an object and whose IDs member is an array of numbers.
This is a JSON array containing two objects:
    [
        {
            "precision": "zip",
            "Latitude":  37.7668,
            "Longitude": -122.3959,
            "Address":   "",
            "City":      "SAN FRANCISCO",
            "State":     "CA",
            "Zip":       "94107",
            "Country":   "US"
        },
        {
            "precision": "zip",
            "Latitude":  37.371991,
            "Longitude": -122.026020,
            "Address":   "",
            "City":      "SUNNYVALE",
            "State":     "CA",
            "Zip":       "94085",
            "Country":   "US"
        }
    ]
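To see that the escaped and unescaped forms really are interchangeable, here is a small Python sketch using the standard json module (I use Python only for illustration; same_document is a made-up helper name). It parses the Url value from the RFC example both ways and compares the results. As far as I know, PHP's json_encode escapes slashes by default and accepts the JSON_UNESCAPED_SLASHES flag to turn that off, while Python's json.dumps leaves slashes alone.

```python
import json

# The same member written with and without escaped slashes.
ESCAPED = r'{"Url": "http:\/\/www.example.com\/image\/481989943"}'
PLAIN = '{"Url": "http://www.example.com/image/481989943"}'


def same_document(a: str, b: str) -> bool:
    """Return True if two JSON texts decode to equal values."""
    return json.loads(a) == json.loads(b)


if __name__ == "__main__":
    print(same_document(ESCAPED, PLAIN))
    # Python's encoder does not add a backslash before a slash.
    print(json.dumps(json.loads(ESCAPED)))
```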
The slash is allowed to be escaped so that JSON containing the substring “</script>” can be embedded safely in HTML. By writing “<\/script>” one can be sure that the browser won’t mistake it for the closing script tag of the current embedded script.
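As a rough illustration of that trick, the Python sketch below escapes the slash only where it matters, that is right after a “<”. The function name json_for_script is hypothetical, and the sketch covers only the closing-tag case described here, not every way a script block can be broken.

```python
import json


def json_for_script(value) -> str:
    """Serialize value as JSON that cannot close an enclosing <script> element.

    In JSON text a "<" can only occur inside a string, so replacing "</"
    with "<\\/" changes the spelling of the string but not its value.
    """
    return json.dumps(value).replace("</", "<\\/")


if __name__ == "__main__":
    payload = {"html": "</script><p>hello</p>"}
    text = json_for_script(payload)
    print(text)                         # no "</" left in the output
    print(json.loads(text) == payload)  # the value round-trips unchanged
```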
Sigh. 🙂
I stumbled upon your blog from a Stack Exchange question while looking into the issue. I am looking forward to reading it, but first I would like to point out that your quote from RFC 4627 is mangled.
This seems to be a recurring curse in all kinds of systems using text that can contain in-line encodings of characters. The RFC says for example:
Alternatively, there are two-character sequence escape
representations of some popular characters. So, for example, a
string containing only a single reverse solidus character may be
represented more compactly as “\\”.
There should be two backslashes in double quotes before the last period. Now I am most curious whether my comment will be mangled too. 🙂
/Lasse
Good catch, Lasse. Always struggling against WordPress mangling…
Actually, dear Lasse, my problem was MUCH bigger (sigh). I discovered this morning that every run of escaping backslashes had unexpectedly lost one backslash, site-wide. I don’t know how or when it happened, but luckily I found an old backup with the correct text, and I’m now restoring it by comparing MySQL dumps, a real challenge. 🙁
OK, I’ve been able to restore the old backslashes from my old backup, which is 1.5 years old, so quite old.
As for the rest of the backslashes, I’ll review each post I have published since then and see what I can do.
Sometimes I Google my name to see what kind of marks I have left around, and your blog came up – I am glad to see that my comment was informational or maybe even useful to you. Even now, five years later, this problem is probably still haunting the world, with no end in sight.
Hi,
a small typo fix:
in the article, the line
represented as “u005C”
should be
represented as “\u005C”
with a backslash before the u