Recently I’ve been studying the code of JSON encoders for PHP strings, and I’ve discovered the solidus issue.
As a side note, this was the first time I saw a slash called a solidus, and a backslash called a reverse solidus: I always learn something new 😉
So the solidus issue is: Am I required to escape every slash in a JSON string?
Let’s see what Douglas Crockford specifies in RFC 4627:
2.5.  Strings

   The representation of strings is similar to conventions used in the C
   family of programming languages.  A string begins and ends with
   quotation marks.  All Unicode characters may be placed within the
   quotation marks except for the characters that must be escaped:
   quotation mark, reverse solidus, and the control characters (U+0000
   through U+001F).

   Any character may be escaped.  If the character is in the Basic
   Multilingual Plane (U+0000 through U+FFFF), then it may be
   represented as a six-character sequence: a reverse solidus, followed
   by the lowercase letter u, followed by four hexadecimal digits that
   encode the character's code point.  The hexadecimal letters A though
   F can be upper or lowercase.  So, for example, a string containing
   only a single reverse solidus character may be represented as
   "\u005C".

   Alternatively, there are two-character sequence escape
   representations of some popular characters.  So, for example, a
   string containing only a single reverse solidus character may be
   represented more compactly as "\\".

   To escape an extended character that is not in the Basic Multilingual
   Plane, the character is represented as a twelve-character sequence,
   encoding the UTF-16 surrogate pair.  So, for example, a string
   containing only the G clef character (U+1D11E) may be represented as
   "\uD834\uDD1E".

Crockford                    Informational                      [Page 4]
RFC 4627                          JSON                         July 2006

      string = quotation-mark *char quotation-mark

      char = unescaped /
             escape (
                 %x22 /          ; "    quotation mark  U+0022
                 %x5C /          ; \    reverse solidus U+005C
                 %x2F /          ; /    solidus         U+002F
                 %x62 /          ; b    backspace       U+0008
                 %x66 /          ; f    form feed       U+000C
                 %x6E /          ; n    line feed       U+000A
                 %x72 /          ; r    carriage return U+000D
                 %x74 /          ; t    tab             U+0009
                 %x75 4HEXDIG )  ; uXXXX                U+XXXX

      escape = %x5C              ; \

      quotation-mark = %x22      ; "

      unescaped = %x20-21 / %x23-5B / %x5D-10FFFF
I must say that the above string grammar is perfect. It tells everything one needs to know about valid JSON strings.
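To give an idea of how much that grammar pins down, it can be transcribed almost mechanically into a regular expression. What follows is only a sketch in Python: the names UNESCAPED, ESCAPED, JSON_STRING and is_json_string are made up for this illustration and do not come from the RFC or from any library.

```python
import re

# Transcription of the RFC 4627 string grammar (all names are illustrative):
#   unescaped = %x20-21 / %x23-5B / %x5D-10FFFF
#   char      = unescaped / escape ( " \ / b f n r t / u 4HEXDIG )
UNESCAPED = r'[\x20-\x21\x23-\x5B\x5D-\U0010FFFF]'
ESCAPED = r'\\(?:["\\/bfnrt]|u[0-9A-Fa-f]{4})'
JSON_STRING = re.compile(r'"(?:' + UNESCAPED + '|' + ESCAPED + r')*"')


def is_json_string(text: str) -> bool:
    """Return True if text is a complete JSON string literal, quotes included."""
    return JSON_STRING.fullmatch(text) is not None


# A slash is accepted both bare and after a backslash.
assert is_json_string('"http://www.example.com/"')
assert is_json_string(r'"http:\/\/www.example.com\/"')
# A backslash followed by an unlisted character is rejected.
assert not is_json_string(r'"\q"')
# A raw control character (here a real line feed) must be escaped.
assert not is_json_string('"a\nb"')
```

The slash falls inside the %x23-5B range of unescaped and is also listed after escape, which is why both spellings match.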
By contrast, the introductory notes are a bit confusing. I think the whole Strings section could be rewritten like this:
2.5 Strings
The representation of strings is similar to conventions used in the C family of programming languages.
A string is a sequence of characters wrapped in double quotes. A backslash is always related to the following character. Only a few characters can follow a backslash: some retain their literal meaning, some do not.
All the valid sequences of a backslash followed by a character (except Unicode escapes) are:
\" which means the same as \u0022 (double quote)
\\ which means the same as \u005C (backslash)
\/ which means the same as \u002F (slash)
\b which means the same as \u0008 (backspace)
\f which means the same as \u000C (form feed)
\n which means the same as \u000A (line feed)
\r which means the same as \u000D (carriage return)
\t which means the same as \u0009 (tab)
Any character inside the Unicode Basic Multilingual Plane (U+0000 through U+FFFF) may also appear as a sequence of six characters: a backslash, followed by the lowercase letter u, followed by four hexadecimal digits (upper or lowercase) for the character’s code point. So, for example, a string containing only a single backslash may appear as “\u005C”.
Any character outside the Unicode Basic Multilingual Plane may also appear as a sequence of twelve characters, encoding the UTF-16 surrogate pair. So, for example, a string containing only the G clef character (U+1D11E) may appear as “\uD834\uDD1E”.
In the following grammar, assume that %x introduces the Unicode character whose hexadecimal code point follows %x.
string = "*char"
" = %x22
char = escaped | standard | unicode
escaped = \same | \special
\ = %x5C
same = " | \ | /
/ = %x2F
special = b | f | n | r | t
b = %x62
f = %x66
n = %x6E
r = %x72
t = %x74
standard = %x20 | %x21 | %x23 .. %x5B | %x5D .. %x10FFFF
unicode = \u0000 .. \uFFFF
u = %x75
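The equivalences described in this section can be checked against a real decoder. Here is a small sketch using Python's standard json module; the helper surrogate_pair is a name invented for this example, and it simply applies the usual UTF-16 arithmetic to show where the twelve-character form of the G clef comes from.

```python
import json

# Each pair of JSON texts should decode to the same one-character string.
pairs = [
    (r'"\\"', r'"\u005C"'),  # backslash
    (r'"\/"', r'"\u002F"'),  # slash
    (r'"\""', r'"\u0022"'),  # double quote
    (r'"\n"', r'"\u000A"'),  # line feed
]
for short_form, long_form in pairs:
    assert json.loads(short_form) == json.loads(long_form)

# The four hexadecimal digits may be upper or lowercase.
assert json.loads(r'"\u005c"') == json.loads(r'"\u005C"')


def surrogate_pair(code_point):
    """Split a code point above U+FFFF into its UTF-16 surrogate pair."""
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
    return high, low


# G clef, U+1D11E, becomes D834 DD1E.
assert surrogate_pair(0x1D11E) == (0xD834, 0xDD1E)

# The decoder joins the two escapes back into a single character.
g_clef = json.loads(r'"\uD834\uDD1E"')
assert g_clef == "\U0001D11E" and len(g_clef) == 1
```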
Now it should be clear that no backslash is required before a slash in a JSON string, but if a backslash is provided it’s still a valid string. This is very clear if we look at the example that Douglas Crockford gives in the same RFC, where no slash is escaped in the given Url value:
8.  Examples

   This is a JSON object:

   {
      "Image": {
          "Width":  800,
          "Height": 600,
          "Title":  "View from 15th Floor",
          "Thumbnail": {
              "Url":    "http://www.example.com/image/481989943",
              "Height": 125,
              "Width":  "100"
          },
          "IDs": [116, 943, 234, 38793]
        }
   }

   Its Image member is an object whose Thumbnail member is an object and
   whose IDs member is an array of numbers.

   This is a JSON array containing two objects:

   [
      {
         "precision": "zip",
         "Latitude":  37.7668,
         "Longitude": -122.3959,
         "Address":   "",
         "City":      "SAN FRANCISCO",
         "State":     "CA",
         "Zip":       "94107",
         "Country":   "US"
      },
      {
         "precision": "zip",
         "Latitude":  37.371991,
         "Longitude": -122.026020,
         "Address":   "",
         "City":      "SUNNYVALE",
         "State":     "CA",
         "Zip":       "94085",
         "Country":   "US"
      }
   ]
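The same point can be verified with a decoder: the Url value from the RFC example parses to the identical string whether or not its slashes are escaped. This is a minimal check with Python's standard json module; the variable names are illustrative only.

```python
import json

# The Url member from the RFC example, with and without escaped slashes.
plain = '{"Url": "http://www.example.com/image/481989943"}'
escaped = r'{"Url": "http:\/\/www.example.com\/image\/481989943"}'

# Both texts are valid JSON and carry exactly the same value.
assert json.loads(plain) == json.loads(escaped)

# Python's encoder, like the RFC example, writes the slash unescaped.
encoded = json.dumps({"Url": "http://www.example.com/image/481989943"})
assert encoded == plain
```

So escaping the slash is a choice made by the encoder, and a conforming decoder accepts either form.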
The reason for allowing the slash to be escaped is to make it safe to embed the JSON substring “</script>” in HTML. By writing “<\/script>” one can be sure that the browser won’t mistake it for the closing script tag of the current embedded script.
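An encoder that leaves slashes alone therefore needs one extra step before its output goes inside a script element. The sketch below assumes Python's standard json module, which does not escape slashes on its own; json_for_script is a hypothetical helper written for this post, and it only handles the closing-tag case discussed here, not every HTML embedding concern.

```python
import json


def json_for_script(value):
    """Serialize value as JSON with every "</" rewritten as "<\\/"."""
    # In JSON text "<" can only occur inside a string, so this replacement
    # never touches structural characters such as braces or commas.
    return json.dumps(value).replace("</", "<\\/")


payload = {"html": "</script><b>hi</b>"}
text = json_for_script(payload)

# The browser no longer sees a closing script tag in the serialized text...
assert "</script>" not in text
# ...and a JSON decoder still recovers the original value.
assert json.loads(text) == payload
```

Because “\/” and “/” decode to the same character, the consumer of the data never notices the difference.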