Recently I’ve been studying the code of JSON encoders for PHP strings, and I came across the solidus issue.
As a side note, this was the first time I saw a slash called a solidus, and a backslash called a reverse solidus: I always learn something new 😉
So the solidus issue is: am I required to escape every slash in a JSON string?
Let’s see what Douglas Crockford specifies in RFC 4627:
The representation of strings is similar to conventions used in the C
family of programming languages. A string begins and ends with
quotation marks. All Unicode characters may be placed within the
quotation marks except for the characters that must be escaped:
quotation mark, reverse solidus, and the control characters (U+0000
through U+001F).
Any character may be escaped. If the character is in the Basic
Multilingual Plane (U+0000 through U+FFFF), then it may be
represented as a six-character sequence: a reverse solidus, followed
by the lowercase letter u, followed by four hexadecimal digits that
encode the character's code point. The hexadecimal letters A though
F can be upper or lowercase. So, for example, a string containing
only a single reverse solidus character may be represented as
"\u005C".
Alternatively, there are two-character sequence escape
representations of some popular characters. So, for example, a
string containing only a single reverse solidus character may be
represented more compactly as "\\".
To escape an extended character that is not in the Basic Multilingual
Plane, the character is represented as a twelve-character sequence,
encoding the UTF-16 surrogate pair. So, for example, a string
containing only the G clef character (U+1D11E) may be represented as
"\uD834\uDD1E".
Crockford Informational [Page 4]
RFC 4627 JSON July 2006
string = quotation-mark *char quotation-mark
char = unescaped /
       escape (
           %x22 /          ; "    quotation mark  U+0022
           %x5C /          ; \    reverse solidus U+005C
           %x2F /          ; /    solidus         U+002F
           %x62 /          ; b    backspace       U+0008
           %x66 /          ; f    form feed       U+000C
           %x6E /          ; n    line feed       U+000A
           %x72 /          ; r    carriage return U+000D
           %x74 /          ; t    tab             U+0009
           %x75 4HEXDIG )  ; uXXXX                U+XXXX
escape = %x5C              ; \
quotation-mark = %x22      ; "
unescaped = %x20-21 / %x23-5B / %x5D-10FFFF
I must say that the string grammar above is perfect: it tells you everything you need to know about valid JSON strings.
The introductory notes, by contrast, are a bit confusing. I think the whole Strings chapter could be rewritten like this:
The representation of strings is similar to conventions used in the C family of programming languages.
A string is a sequence of characters wrapped in double quotes. A backslash is always related to the following character. Only a few characters can follow a backslash: some retain their literal meaning, some do not.
All the valid sequences of a backslash followed by a single character (Unicode escapes aside) are:
\" which means the same as \u0022 (double quote)
\\ which means the same as \u005C (backslash)
\/ which means the same as \u002F (slash)
\b which means the same as \u0008 (backspace)
\f which means the same as \u000C (form feed)
\n which means the same as \u000A (line feed)
\r which means the same as \u000D (carriage return)
\t which means the same as \u0009 (tab)
Any character inside the Unicode Basic Multilingual Plane (U+0000 through U+FFFF) may also appear as a sequence of six characters: a backslash, followed by the lowercase letter u, followed by four hexadecimal digits (upper or lowercase) for the character’s code point. So, for example, a string containing only a single backslash may appear as “\u005C”.
Any character outside the Unicode Basic Multilingual Plane may also appear as a sequence of twelve characters, encoding the UTF-16 surrogate pair. So, for example, a string containing only the G clef character (U+1D11E) may appear as “\uD834\uDD1E”.
In the following grammar, assume that %x introduces a single character whose hexadecimal Unicode code point follows %x.
string = "*char"
" = %x22
char = escaped | standard | unicode
escaped = \same | \special
same = " | \ | /
\ = %x5C
/ = %x2F
special = b | f | n | r | t
b = %x62
f = %x66
n = %x6E
r = %x72
t = %x74
standard = %x20 | %x21 | %x23 .. %x5B | %x5D .. %x10FFFF
unicode = \u0000 .. \uFFFF
u = %x75
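The grammar above translates almost directly into a regular expression. What follows is only a minimal Python sketch to make the rules concrete, not a reference validator: the names JSON_STRING and is_json_string are made up for this illustration, and the standard json module is used only to cross-check that an escaped slash decodes to the same character as a plain one.

```python
import json
import re

# One JSON string: a quote, any number of chars, a quote.
# A char is either "standard" (anything except ", \ and the control
# characters below U+0020), or a backslash followed by one of
# " \ / b f n r t, or a backslash followed by u and four hex digits.
JSON_STRING = re.compile(
    r'"(?:[^"\\\x00-\x1f]|\\(?:["\\/bfnrt]|u[0-9A-Fa-f]{4}))*"'
)


def is_json_string(text: str) -> bool:
    """Return True if text is exactly one JSON string literal."""
    return JSON_STRING.fullmatch(text) is not None


# A slash is valid both unescaped and escaped.
print(is_json_string('"http://www.example.com/"'))       # True
print(is_json_string(r'"http:\/\/www.example.com\/"'))   # True

# A backslash before any other character is not valid.
print(is_json_string(r'"C:\path"'))                      # False

# Both spellings decode to the same character.
print(json.loads(r'"\/"') == json.loads('"/"') == "/")   # True

# The surrogate pair decodes to the single G clef character.
print(json.loads(r'"\uD834\uDD1E"') == "\U0001D11E")     # True
```

The pattern only checks the shape of a string literal; it does not check that \u escapes form well-matched surrogate pairs.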
Now it should be clear that no backslash is required before a slash in a JSON string, but if a backslash is provided it’s still a valid string. This is very clear if we look at the example that Douglas Crockford gives in the same RFC, where no slash is escaped in the given Url value:
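This is a JSON object:
{
   "Image": {
       "Width":  800,
       "Height": 600,
       "Title":  "View from 15th Floor",
       "Thumbnail": {
           "Url":    "http://www.example.com/image/481989943",
           "Height": 125,
           "Width":  "100"
       },
       "IDs": [116, 943, 234, 38793]
     }
}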
Its Image member is an object whose Thumbnail member is an object
and whose IDs member is an array of numbers.
This is a JSON array containing two objects:
"City": "SAN FRANCISCO",
The slash may be escaped so that JSON containing the substring “</script>” can be embedded safely in HTML. By writing “<\/script>”, one can be sure that the browser won’t mistake it for the closing tag of the enclosing script element.
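To make this concrete, here is a small Python sketch. It assumes an encoder that leaves slashes unescaped, as Python's json.dumps does, and then escapes only the slash that follows a "<". The helper name json_for_script_tag is hypothetical, and the sketch addresses only the closing-tag case described above, not HTML embedding in general.

```python
import json


def json_for_script_tag(value) -> str:
    """Serialize value as JSON that cannot close an enclosing <script> element."""
    # json.dumps leaves "/" unescaped, so "</script>" would pass through verbatim.
    encoded = json.dumps(value)
    # Rewrite "</" as "<\/": a JSON parser reads "\/" back as a plain slash.
    return encoded.replace("</", "<\\/")


payload = {"html": "</script><b>hi</b>"}
safe = json_for_script_tag(payload)

print(safe)                         # {"html": "<\/script><b>hi<\/b>"}
print(json.loads(safe) == payload)  # True
```

Because "\/" and "/" mean the same thing inside a JSON string, the decoded value is unchanged; only the serialized text differs.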