Provide utility functions for JSON strings.
Detailed Description
- Purpose:
- Provide utility functions for JSON strings.
-
- Classes:
-
-
- Description:
- This component defines a utility struct, bdljsn::StringUtil, that is a namespace for functions that convert arbitrary UTF-8 codepoint sequences to JSON strings and vice versa. The rules for these conversions are outlined below in JSON Strings and detailed in RFC8259, section 7: https://www.rfc-editor.org/rfc/rfc8259#section-7
- This utility provides two key functions:
- writeString: Given an arbitrary UTF-8 codepoint sequence, generate a JSON string representing the same codepoints.
- readString: Given a JSON string (e.g., the output of writeString), generate the equivalent sequence of UTF-8 codepoints.
- When using these functions, a UTF-8 codepoint sequence is always preserved on the round trip to JSON string and back. The converse is not guaranteed: because the same content has several equally valid JSON string representations, a JSON string that is read and then written again may come back in a different (but equivalent) form.
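- The asymmetry can be seen with a small standalone decoder. The function below, decodeJsonString, is a hypothetical sketch and not the library's readString: it assumes a well-formed, quoted input and handles only the 2-byte escapes and 6-byte escapes below U+0080.

```cpp
#include <cassert>
#include <string>

// Return the value of hex digit 'c', or -1 if 'c' is not a hex digit.
static int hexValue(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

// Hypothetical decoder (NOT 'bdljsn::StringUtil::readString').  Load into
// '*out' the content of the quoted 'json' string; return 'true' on success.
bool decodeJsonString(std::string *out, const std::string& json)
{
    assert(out);
    const std::size_t n = json.size();
    if (n < 2 || json[0] != '"' || json[n - 1] != '"') return false;
    out->clear();
    for (std::size_t i = 1; i < n - 1; ++i) {
        if (json[i] != '\\') { out->push_back(json[i]); continue; }
        if (++i >= n - 1) return false;           // dangling backslash
        switch (json[i]) {
          case '"': case '\\': case '/': out->push_back(json[i]); break;
          case 'b': out->push_back('\b'); break;
          case 'f': out->push_back('\f'); break;
          case 'n': out->push_back('\n'); break;
          case 'r': out->push_back('\r'); break;
          case 't': out->push_back('\t'); break;
          case 'u': {
            if (i + 4 >= n - 1) return false;     // need four hex digits
            unsigned value = 0;
            for (int k = 1; k <= 4; ++k) {
                const int d = hexValue(json[i + k]);
                if (d < 0) return false;
                value = value * 16 + static_cast<unsigned>(d);
            }
            if (value >= 0x80) return false;      // beyond this sketch's scope
            out->push_back(static_cast<char>(value));
            i += 4;
          } break;
          default: return false;                  // unknown escape
        }
    }
    return true;
}
```

Under these assumptions, the three distinct JSON strings "\u002F", "\/", and "/" all decode to the same one-character content, so no encoder can reproduce all three from that content.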
-
- JSON Strings:
- JSON strings consist of UTF-8 codepoints surrounded by double quotes (i.e., "). Within those double quotes certain characters must be escaped (i.e., replaced with some alternative, multi-byte representation). Those characters are:
- quotation marks
- backslashes (a.k.a. "reverse solidus")
- the "control characters" in the range U+0000 to U+001F (inclusive).
- Each of the above characters can be escaped by replacing it with the six-byte sequence consisting of:
- a backslash,
- a lower-case u, and
- the Unicode value expressed as four hexadecimal digits.
- For example, the character that rings the console bell is represented as \u0007. Note that the hexadecimal digits can use upper- or lower-case letters but the lead u character must be lower case. See Strictness.
- Eight characters can alternatively be represented by special 2-byte sequences. Seven of them are characters that must be escaped; the slash may be escaped but need not be:
+---------+-----------------+---------------+---------------+
| Unicode | Description | 6-byte escape | 2-byte escape |
+---------+-----------------+---------------+---------------+
| U+0022 | quotation mark | \u0022 | \" |
| U+005C | backslash | \u005c | \\ |
| U+002F | slash | \u002f | \/ |
| U+0008 | backspace | \u0008 | \b |
| U+000C | form feed | \u000C | \f |
| U+000A | line feed | \u000A | \n |
| U+000D | carriage return | \u000D | \r |
| U+0009 | tab | \u0009 | \t |
+---------+-----------------+---------------+---------------+
Note that the above set is similar, but not identical, to the set of two-character escape sequences supported in C++ char literals. For example, \0 (null) and \a (bell) are not included above.
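- The escaping rules above can be sketched in standalone C++. The helper below, escapeForJson, is a hypothetical illustration and not part of bdljsn. It assumes its input is already valid UTF-8, prefers a 2-byte escape where one exists, and (as a choice made for this sketch) emits upper-case hexadecimal digits in the 6-byte form. The slash is left unescaped because escaping it is optional.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Hypothetical helper (not part of 'bdljsn'): return the JSON string for
// the specified 'utf8' text, escaping only the characters that must be
// escaped.  The input is assumed to be valid UTF-8.
std::string escapeForJson(const std::string& utf8)
{
    std::string out = "\"";                        // opening delimiter
    for (std::size_t i = 0; i < utf8.size(); ++i) {
        const unsigned char c = static_cast<unsigned char>(utf8[i]);
        switch (c) {
          case '"':  out += "\\\""; break;
          case '\\': out += "\\\\"; break;
          case '\b': out += "\\b";  break;
          case '\f': out += "\\f";  break;
          case '\n': out += "\\n";  break;
          case '\r': out += "\\r";  break;
          case '\t': out += "\\t";  break;
          default:
            if (c < 0x20) {
                // Control character with no 2-byte escape: use '\uXXXX'.
                char buf[7];
                std::snprintf(buf, sizeof buf, "\\u%04X",
                              static_cast<unsigned>(c));
                out += buf;
            }
            else {
                // Everything else, including '/' and the bytes of
                // multi-byte UTF-8 sequences, is copied unchanged.
                out += static_cast<char>(c);
            }
        }
    }
    out += '"';                                    // closing delimiter
    return out;
}
```

Applied to C++ text containing a double quote, a newline, and a bell, this sketch produces \", \n, and \u0007 respectively, since JSON (unlike C++) has no 2-byte escape for the bell.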
-
- Guarantees: Arbitrary UTF-8 to JSON String:
- No UTF-8 characters in the Basic Multilingual Plane are escaped unless they are in the set that must be escaped.
- When a character must be escaped, the 6-byte (hexadecimal) representation is used only if no 2-byte escape exists.
- When a 6-byte (hexadecimal) representation is used, hexadecimal letters are in upper case.
- All UTF-8 characters outside of the Basic Multilingual Plane are represented by two adjacent 6-byte hexadecimal escape sequences. For details, see: https://en.wikipedia.org/wiki/UTF-16#U+D800_to_U+DFFF
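- The two adjacent escapes for a codepoint outside the Basic Multilingual Plane form a UTF-16 surrogate pair. The arithmetic can be sketched as follows; surrogatePairEscape is a hypothetical name used only for illustration, and the upper-case digits follow the guarantee stated above.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Hypothetical helper (not part of 'bdljsn'): return the two adjacent
// 6-byte escapes for the specified code point 'cp'.  The behavior is
// undefined unless 'cp' is in the range U+10000 to U+10FFFF.
std::string surrogatePairEscape(unsigned long cp)
{
    assert(cp >= 0x10000UL && cp <= 0x10FFFFUL);

    const unsigned long offset = cp - 0x10000UL;            // 20-bit value
    const unsigned long high   = 0xD800UL + (offset >> 10); // top 10 bits
    const unsigned long low    = 0xDC00UL + (offset & 0x3FFUL);
                                                            // low 10 bits
    char buf[13];  // 12 characters plus the terminating null
    std::snprintf(buf, sizeof buf, "\\u%04lX\\u%04lX", high, low);
    return buf;
}
```

For example, U+1F600 has offset 0xF600, whose top ten bits are 0x03D and low ten bits are 0x200, giving \uD83D\uDE00.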
-
- Strictness:
- By default, the bdljsn::StringUtil read and write methods strictly follow the RFC8259 standard. Variances from those rules are expressed using bdljsn::StringUtil::Flags, an enum of flag values that can be set in the optional flags parameter of the decoding methods. Multiple flags can be combined bitwise in flags; however, currently, just one variance flag is defined.
-
- Example Variance:
- RFC8259 specifies that the 6-byte Unicode escape sequence start with a backslash, \, and lower-case u. However, if the bdljsn::StringUtil::e_ACCEPT_CAPITAL_UNICODE_ESCAPE flag is set, an upper-case U is accepted as well. Thus, both \u0007 and \U0007 would be interpreted as the BELL character.
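- The effect of such a flag on a decoder can be sketched in standalone C++. Everything below is hypothetical: SketchFlags and parseUnicodeEscape are illustrative stand-ins and do not reproduce the library's enum or its decoding methods.

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Hypothetical stand-in for a bit-flag enum of accepted variances.
enum SketchFlags {
    k_NONE             = 0,
    k_ACCEPT_CAPITAL_U = 1 << 0
};

// Parse the 6-byte escape at the start of the specified 'text'.  On
// success, load the 16-bit value into '*result' and return 0; otherwise
// return a non-zero value and leave '*result' unchanged.
int parseUnicodeEscape(unsigned           *result,
                       const std::string&  text,
                       int                 flags = k_NONE)
{
    assert(result);
    if (text.size() < 6 || text[0] != '\\') {
        return -1;
    }
    // Strict mode accepts only 'u'; the variance flag also admits 'U'.
    const bool leadOk = text[1] == 'u'
                     || (text[1] == 'U' && (flags & k_ACCEPT_CAPITAL_U));
    if (!leadOk) {
        return -1;
    }
    unsigned value = 0;
    for (int i = 2; i < 6; ++i) {
        const unsigned char c = static_cast<unsigned char>(text[i]);
        if (!std::isxdigit(c)) {
            return -1;
        }
        // Hex digits may be upper or lower case in either mode.
        const int digit = std::isdigit(c) ? c - '0'
                                          : std::toupper(c) - 'A' + 10;
        value = value * 16 + static_cast<unsigned>(digit);
    }
    *result = value;
    return 0;
}
```

With no flags, this sketch accepts \u0007 and rejects \U0007; with k_ACCEPT_CAPITAL_U set, both yield the value 7.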
-
- Usage:
- This section illustrates intended use of this component.
-
- Example 1: Encoding and Decoding a JSON String:
- First, we initialize a string with a valid sequence of UTF-8 codepoints:
bsl::string initial("Does the name \"Ivan Pavlov\" ring a bell\a?\n");
Notice that, as required by C++ syntax, several characters are represented by their two-character escape sequence: double quote (twice), bell, and newline.
- Then, we examine the string as output:
bsl::cout << initial << bsl::endl;
and observe: Does the name "Ivan Pavlov" ring a bell?
Notice that the backslash characters (having served their purpose of giving special meaning to the subsequent character) are not shown. The BELL and NEWLINE characters are output but are not visible.
- Now, we use writeString to generate the JSON string equivalent of the initial string, jsonCompatibleString, and observe how the initial string is represented for JSON: "Does the name \"Ivan Pavlov\" ring a bell\u0007?\n"
Notice that:
- The entire string is delimited by double quotes.
- The interior double quotes and newline are represented by two-character escape sequences (as they were in the C++ string literal).
- Since JSON does not have a two-character escape sequence for the BELL character, the 6-byte Unicode representation, \u0007, is used.
- Finally, we convert the jsonCompatibleString back to its original content:
bsl::string fromJsonString;
const int rcDecode = bdljsn::StringUtil::readString(
&fromJsonString,
jsonCompatibleString);
assert(0 == rcDecode);
assert(initial == fromJsonString);
bsl::cout << fromJsonString << bsl::endl;
and observe (again): Does the name "Ivan Pavlov" ring a bell?