BDE 4.14.0 Production release
Provide basic utilities for UTF-8 encodings.
This component provides, within the bdlde::Utf8Util struct, a suite of static functions supporting UTF-8 encoded strings. Two interfaces are provided for each function: one where the length of the string (in bytes) is passed as a separate argument, and one where the string is passed as a null-terminated C-style string.
A string is deemed to contain valid UTF-8 if it is compliant with RFC 3629, meaning that only 1-, 2-, 3-, and 4-byte sequences are allowed. Values above U+10ffff are also not allowed.
Seven types of functions are provided:
isValid, which checks for validity, per RFC 3629, of a (candidate) UTF-8 string. "Overlong values", that is, values encoded in more bytes than necessary, are not tolerated; nor are "surrogate values", which are values in the range [U+d800 .. U+dfff].
advanceIfValid and advanceRaw, which advance some number of Unicode code points, each of which may be encoded in multiple bytes in a UTF-8 string. advanceRaw assumes the string is valid UTF-8, while advanceIfValid checks the input for validity and stops advancing if a sequence is encountered that is not valid UTF-8.
numCodePointsIfValid and numCodePointsRaw, which return the number of Unicode code points in a UTF-8 string. Note that numCodePointsIfValid both validates a (candidate) UTF-8 string and counts the number of Unicode code points that it contains.
numBytesIfValid, which returns the number of bytes a specified number of Unicode code points occupy in a UTF-8 string.
getByteSize, which returns the length of a single UTF-8 encoded character.
codePointValue, which returns the integral value of a single UTF-8 encoded character.
appendUtf8Character, which appends a single Unicode code point to a UTF-8 string.
Embedded null bytes are allowed in strings that are accompanied by an explicit length argument. Naturally, null-terminated C-style strings cannot contain embedded null code points.
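The validity rules above (only 1- to 4-byte sequences, no overlong values, no surrogates, nothing above U+10ffff) can be illustrated with a small, library-independent sketch. This is not the bdlde::Utf8Util implementation; the names decodeOneSketch and isValidSketch are hypothetical and were chosen only for this illustration. Two overloads mirror the two interface styles described above, one taking an explicit length and one taking a null-terminated string.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical helper (not part of BDE): decode the sequence starting at 'p',
// of which 'avail' bytes are readable.  Return its length in bytes (1 to 4)
// and load the value into '*codePoint', or return 0 if the sequence is not
// valid under RFC 3629.
std::size_t decodeOneSketch(unsigned int *codePoint,
                            const char   *p,
                            std::size_t   avail)
{
    assert(codePoint);
    if (0 == avail) {
        return 0;
    }
    const unsigned char lead = static_cast<unsigned char>(p[0]);

    std::size_t  len;
    unsigned int value;
    unsigned int minValue;  // smallest value that genuinely needs 'len' bytes
    if (lead < 0x80) {
        len = 1; value = lead;        minValue = 0;
    }
    else if ((lead & 0xe0) == 0xc0) {
        len = 2; value = lead & 0x1f; minValue = 0x80;
    }
    else if ((lead & 0xf0) == 0xe0) {
        len = 3; value = lead & 0x0f; minValue = 0x800;
    }
    else if ((lead & 0xf8) == 0xf0) {
        len = 4; value = lead & 0x07; minValue = 0x10000;
    }
    else {
        return 0;  // stray continuation byte, or lead of a 5+ byte form
    }
    if (avail < len) {
        return 0;  // truncated sequence
    }
    for (std::size_t i = 1; i < len; ++i) {
        const unsigned char c = static_cast<unsigned char>(p[i]);
        if ((c & 0xc0) != 0x80) {
            return 0;  // not a continuation byte
        }
        value = (value << 6) | (c & 0x3f);
    }
    if (value < minValue)                    return 0;  // overlong
    if (value >= 0xd800 && value <= 0xdfff)  return 0;  // surrogate
    if (value > 0x10ffff)                    return 0;  // out of range
    *codePoint = value;
    return len;
}

// Length-based interface: embedded null bytes are accepted, and
// '(0, 0)' is treated as a valid empty string.
bool isValidSketch(const char *s, std::size_t length)
{
    std::size_t pos = 0;
    while (pos < length) {
        unsigned int      codePoint;
        const std::size_t n = decodeOneSketch(&codePoint, s + pos, length - pos);
        if (0 == n) {
            return false;
        }
        pos += n;
    }
    return true;
}

// Null-terminated interface: 's' must not be null.
bool isValidSketch(const char *s)
{
    assert(s);
    return isValidSketch(s, std::strlen(s));
}
```

With this sketch, a lone surrogate such as "\xed\xa2\xab" and an overlong ':' such as "\xc0\xba" are both rejected, while a 3-byte buffer containing an embedded null is accepted through the length-based overload.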
The UTF-8 format is described in the RFC 3629 document and in the Wikipedia article on UTF-8.
The utility functions provided by this component consider the empty string to be valid UTF-8. For those functions that take input as a (pointer, length) pair, if 0 == pointer and 0 == length, then the input is interpreted as a valid, empty string. However, if 0 == pointer and 0 != length, the behavior is undefined. All such functions have a counterpart that takes a lone pointer to a null-terminated (C-style) string. The behavior is always undefined if 0 is supplied for that lone pointer.
This section illustrates intended use of this component.
In this usage example, we will encode some Unicode code points in UTF-8 strings and demonstrate those that are valid and those that are not.
First, we build an unquestionably valid UTF-8 string:
Then, we check its validity and measure its length:
Next, we encode a lone surrogate value, 0xd8ab, writing it as the raw 3-byte sequence "\xed\xa2\xab" to avoid validation:
Then, we cannot use numCodePointsRaw to count the code points in stringWithSurrogate, since the behavior of that method is undefined unless the string is valid. Instead, the numCodePointsIfValid method can be used on strings whose validity we are uncertain of:
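To see why a raw count cannot be trusted on unvalidated input, consider the following minimal sketch of a non-validating counter. It is a library-independent illustration, not the bdlde::Utf8Util code, and the name numCodePointsRawSketch is hypothetical. It simply counts every byte that is not a continuation byte (10xxxxxx), which is only meaningful when the input is already known to be valid.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical raw counter (not part of BDE): return the number of bytes in
// the specified 's' of the specified 'length' that are not continuation
// bytes.  The result equals the number of code points only if 's' is valid
// UTF-8; no validation is performed here.
std::size_t numCodePointsRawSketch(const char *s, std::size_t length)
{
    assert(s || 0 == length);

    std::size_t count = 0;
    for (std::size_t i = 0; i < length; ++i) {
        const unsigned char c = static_cast<unsigned char>(s[i]);
        if ((c & 0xc0) != 0x80) {
            ++count;  // lead byte or single-byte code point
        }
    }
    return count;
}
```

In this sketch, the two code points in "\xc3\xa9\xe2\x82\xac" are counted correctly, but the surrogate sequence "\xed\xa2\xab" is silently counted as one code point rather than reported as invalid. A validating counter has to decode and check each sequence instead.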
Now, we encode 0, which is allowed. However, note that we cannot use any interfaces that take a null-terminated string for this case:
Finally, we encode 0x3a (':') as an overlong value using 2 bytes, which is not valid UTF-8 (since ':' can be "encoded" in 1 byte):
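The arithmetic behind the overlong case can be sketched as follows. Packing 0x3a into the 2-byte layout 110xxxxx 10xxxxxx yields the bytes 0xc0 0xba; a decoder that only strips the marker bits recovers 0x3a, so the check that matters is that a 2-byte sequence must carry a value of at least 0x80. The function names below are hypothetical and are not part of bdlde::Utf8Util.

```cpp
#include <cassert>

// Hypothetical helper (not part of BDE): combine the payload bits of a
// 2-byte sequence without checking for overlong encodings.  'lead' must have
// the form 110xxxxx.
unsigned int decodeTwoBytesNaive(unsigned char lead, unsigned char trail)
{
    assert((lead & 0xe0) == 0xc0);
    return ((lead & 0x1fu) << 6) | (trail & 0x3fu);
}

// Return 'true' if the 2-byte sequence encodes a value that would have fit
// in a single byte, that is, a value below 0x80.
bool isOverlongTwoByteSketch(unsigned char lead, unsigned char trail)
{
    return decodeTwoBytesNaive(lead, trail) < 0x80;
}
```

Here the pair (0xc0, 0xba) decodes to 0x3a and is flagged as overlong, whereas (0xc3, 0xa9) decodes to 0xe9 and is a legitimate 2-byte encoding.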
In this example, we will use the various advance functions to advance through a UTF-8 string.
First, build the string using appendUtf8CodePoint, keeping track of how many bytes are in each Unicode code point:
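How an append operation can report the number of bytes used for each code point is sketched below with std::string. This is a stand-alone illustration under the assumption that the caller wants the shortest encoding; appendCodePointSketch is a hypothetical name and its return convention (byte count, or 0 for a value that cannot be encoded) is chosen for this example, not taken from bdlde::Utf8Util.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical encoder (not part of BDE): append the shortest UTF-8 encoding
// of the specified 'codePoint' to '*output'.  Return the number of bytes
// appended, or 0 (leaving '*output' unchanged) if 'codePoint' is a surrogate
// or exceeds 0x10ffff.
std::size_t appendCodePointSketch(std::string *output, unsigned int codePoint)
{
    assert(output);

    if (codePoint < 0x80) {
        output->push_back(static_cast<char>(codePoint));
        return 1;
    }
    if (codePoint < 0x800) {
        output->push_back(static_cast<char>(0xc0 | (codePoint >> 6)));
        output->push_back(static_cast<char>(0x80 | (codePoint & 0x3f)));
        return 2;
    }
    if (codePoint < 0x10000) {
        if (codePoint >= 0xd800 && codePoint <= 0xdfff) {
            return 0;  // surrogates are not encodable
        }
        output->push_back(static_cast<char>(0xe0 | (codePoint >> 12)));
        output->push_back(static_cast<char>(0x80 | ((codePoint >> 6) & 0x3f)));
        output->push_back(static_cast<char>(0x80 | (codePoint & 0x3f)));
        return 3;
    }
    if (codePoint <= 0x10ffff) {
        output->push_back(static_cast<char>(0xf0 | (codePoint >> 18)));
        output->push_back(static_cast<char>(0x80 | ((codePoint >> 12) & 0x3f)));
        output->push_back(static_cast<char>(0x80 | ((codePoint >> 6) & 0x3f)));
        output->push_back(static_cast<char>(0x80 | (codePoint & 0x3f)));
        return 4;
    }
    return 0;  // above U+10ffff
}
```

Appending 0x3a, 0xe9, 0x20ac, and 0x1f600 in turn returns 1, 2, 3, and 4, so a caller can record the byte offset at which each code point starts.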
Then, declare a few variables we'll need:
Next, try advancing 2 code points, then 3, then 4, observing that the value returned is the number of Unicode code points advanced. Note that since we're only advancing over valid UTF-8, we can use either advanceRaw or advanceIfValid:
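A non-validating advance can be sketched by reading only the lead byte of each sequence to learn its length. The code below is an illustration that assumes its input is valid UTF-8, as advanceRaw does; the name advanceRawSketch and its parameter order are hypothetical and may differ from the bdlde::Utf8Util signature.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical raw advance (not part of BDE): move over at most the
// specified 'numCodePoints' code points in the specified 's' of the
// specified 'length', load the resulting position into '*result', and return
// the number of code points actually advanced.  The input is assumed to be
// valid UTF-8; only lead bytes are examined.
std::size_t advanceRawSketch(const char  **result,
                             const char   *s,
                             std::size_t   length,
                             std::size_t   numCodePoints)
{
    assert(result);
    assert(s || 0 == length);

    std::size_t pos      = 0;
    std::size_t advanced = 0;
    while (advanced < numCodePoints && pos < length) {
        const unsigned char lead = static_cast<unsigned char>(s[pos]);
        const std::size_t   step = lead < 0x80           ? 1
                                 : (lead & 0xe0) == 0xc0 ? 2
                                 : (lead & 0xf0) == 0xe0 ? 3
                                 :                         4;
        pos += step;
        ++advanced;
    }
    *result = s + pos;
    return advanced;
}
```

For a 10-byte string holding code points of 1, 2, 3, and 4 bytes, advancing by 2 stops at byte offset 3, advancing by 3 stops at offset 6, and asking for more code points than exist stops at the end and returns 4.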
Then, try advancing by more code points than are present using advanceIfValid, and wind up stopping when we encounter invalid input. The behavior of advanceRaw is undefined if it is used on invalid input, so we cannot use it here. Also note that we will stop at the beginning of the invalid Unicode code point, and not at the first incorrect byte, which is two bytes later:
Now, doctor the string to replace the invalid code point with a valid one, so the string is entirely correct UTF-8:
Finally, advance using both functions by more code points than are in the string and in both cases wind up at the end of the string. Note that advanceIfValid does not return an error (non-zero) value to status when it encounters the end of the string:
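The two behaviors described in this example, stopping at the start of an invalid code point and treating end of input as a normal stop, can be sketched as follows. This is a stand-alone illustration rather than the bdlde::Utf8Util implementation; validSequenceLengthSketch and advanceIfValidSketch are hypothetical names, and the choice of -1 as the error status is an assumption made for this sketch.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not part of BDE): return the length (1 to 4) of the
// sequence at 'p' if it is valid per RFC 3629, and 0 otherwise.
std::size_t validSequenceLengthSketch(const char *p, std::size_t avail)
{
    const unsigned char *u = reinterpret_cast<const unsigned char *>(p);
    if (0 == avail)   return 0;
    if (u[0] < 0x80)  return 1;

    const std::size_t len = (u[0] & 0xe0) == 0xc0 ? 2
                          : (u[0] & 0xf0) == 0xe0 ? 3
                          : (u[0] & 0xf8) == 0xf0 ? 4
                          :                         0;
    if (0 == len || avail < len) {
        return 0;  // bad lead byte or truncated sequence
    }
    unsigned int value = u[0] & (0xffu >> (len + 1));  // lead payload bits
    for (std::size_t i = 1; i < len; ++i) {
        if ((u[i] & 0xc0) != 0x80) {
            return 0;  // not a continuation byte
        }
        value = (value << 6) | (u[i] & 0x3fu);
    }
    static const unsigned int k_MIN_VALUE[] = { 0, 0, 0x80, 0x800, 0x10000 };
    const bool overlong  = value < k_MIN_VALUE[len];
    const bool surrogate = value >= 0xd800 && value <= 0xdfff;
    return (overlong || surrogate || value > 0x10ffff) ? 0 : len;
}

// Hypothetical validating advance (not part of BDE): advance over at most
// 'numCodePoints' code points.  Load 0 into '*status' if only valid input
// was seen (including when the end of input is reached) and -1 if an invalid
// sequence stopped the advance.  '*result' is left at the start of the
// offending sequence, or at the position reached.
std::size_t advanceIfValidSketch(int          *status,
                                 const char  **result,
                                 const char   *s,
                                 std::size_t   length,
                                 std::size_t   numCodePoints)
{
    assert(status);
    assert(result);

    *status = 0;
    std::size_t pos      = 0;
    std::size_t advanced = 0;
    while (advanced < numCodePoints && pos < length) {
        const std::size_t n = validSequenceLengthSketch(s + pos, length - pos);
        if (0 == n) {
            *status = -1;
            break;
        }
        pos += n;
        ++advanced;
    }
    *result = s + pos;
    return advanced;
}
```

Given the 5-byte input "ab\xe2\x82" followed by 'X', the sketch advances 2 code points and stops at offset 2, the start of the broken 3-byte sequence, even though the first incorrect byte is at offset 4.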
In this usage example, we will demonstrate reading and validating UTF-8 from a stream.
We write a function to read valid UTF-8 to a bsl::string. We don't know how long the input will be, so we don't know how long to make the string before we start. We will grow the string in small, 32-byte increments.
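The growth strategy alone (extend the string by 32 bytes, read into the new space, then trim to what was actually read) can be sketched with the standard library. This sketch uses std::string and std::istream in place of the BDE types and deliberately omits validation; readInChunksSketch and readAllFromStringSketch are hypothetical names. Note that a multi-byte code point may straddle two chunks, so any validation added to such a loop has to account for a sequence that is incomplete at the end of a chunk.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical helper (not part of BDE): read all remaining bytes from the
// specified 'input' into '*output', growing '*output' by 32 bytes at a time.
// No UTF-8 validation is performed in this sketch.
void readInChunksSketch(std::string *output, std::istream& input)
{
    assert(output);

    const std::size_t k_CHUNK = 32;
    output->clear();
    for (;;) {
        const std::size_t oldSize = output->size();
        output->resize(oldSize + k_CHUNK);         // make room for one chunk
        input.read(&(*output)[oldSize],
                   static_cast<std::streamsize>(k_CHUNK));
        const std::size_t got = static_cast<std::size_t>(input.gcount());
        output->resize(oldSize + got);             // trim unused space
        if (got < k_CHUNK) {
            break;                                 // end of input or error
        }
    }
}

// Convenience wrapper for trying the sketch on an in-memory buffer.
std::string readAllFromStringSketch(const std::string& bytes)
{
    std::istringstream input(bytes);
    std::string        result;
    readInChunksSketch(&result, input);
    return result;
}
```

The loop ends on the first short read, so an input whose size is an exact multiple of 32 costs one extra read that returns no bytes.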