BDE 4.14.0 Production release
Provide fast, safe conversion between UTF-8 encoding and UTF-32.
This component provides a struct, bdlde::CharConvertUtf32, that supplies a suite of static functions supporting fast conversion of UTF-8 data to UTF-32, and vice versa. UTF-8 input can take the form of null-terminated "C" strings or bsl::string_views, while UTF-32 input can only take the form of null-terminated buffers of unsigned int. Output can be to STL vectors, bsl::strings (in the case of UTF-8), and fixed-length buffers. Invalid byte sequences and code points forbidden by either encoding are removed and (optionally) replaced by an error byte or word provided by the caller. The byte order of the UTF-32 input or output can be specified via the optional byteOrder argument, which is assumed to be host byte order if not specified. The byte or word count and code point count that are optionally returned through pointer arguments include the terminating null byte or word.
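To make the byte-order remark concrete, the following is a minimal standalone sketch of what reading or writing a UTF-32 word in non-host byte order amounts to. The name swapBytes is hypothetical and is not part of bdlde::CharConvertUtf32; how the component performs the swap internally is not described here.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not part of BDE): return 'word' with its four bytes
// reversed, which converts a 32-bit value between big- and little-endian.
inline std::uint32_t swapBytes(std::uint32_t word)
{
    return  (word >> 24)
         | ((word >>  8) & 0x0000ff00u)
         | ((word <<  8) & 0x00ff0000u)
         |  (word << 24);
}

// Usage: swapping twice restores the original word.
inline void swapBytesUsage()
{
    const std::uint32_t codePoint = 0x00010348u;
    assert(swapBytes(swapBytes(codePoint)) == codePoint);
}
```

A word written in the opposite byte order generally does not look like a valid code point until it is swapped back, which is why the caller has to state the byte order explicitly when it differs from the host's.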
UTF-8 is a Unicode encoding that allows 32-bit Unicode to be represented using null-terminated (8-bit) byte strings, while allowing "standard ASCII" strings to be used "as-is". Note that UTF-8 is described in detail in RFC 3629 (http://www.ietf.org/rfc/rfc3629.txt).
UTF-32 is simply a name for storing raw Unicode values as sequential unsigned int values in memory.
Valid Unicode values are in the ranges [ 1 .. 0xd7ff ] and [ 0xe000 .. 0x10ffff ]. The value 0 is used to terminate sequences.
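The two ranges above can be expressed as a simple predicate. This is a minimal sketch; isValidCodePoint is a hypothetical name, not a function offered by this component.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not part of BDE): return 'true' if 'cp' may appear
// inside a sequence, i.e., it lies in [1 .. 0xd7ff] or [0xe000 .. 0x10ffff].
// The value 0 is excluded because it terminates sequences.
inline bool isValidCodePoint(std::uint32_t cp)
{
    return (cp >= 1      && cp <= 0xd7ff)
        || (cp >= 0xe000 && cp <= 0x10ffff);
}

// Usage: the gap between the two ranges is rejected.
inline void isValidCodePointUsage()
{
    assert( isValidCodePoint(0xd7ff));
    assert(!isValidCodePoint(0xd800));
    assert(!isValidCodePoint(0xdfff));
    assert( isValidCodePoint(0xe000));
}
```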
The functions here that translate to fixed buffers make a single pass through the data. The functions that translate to bsl::strings and bsl::vectors, however, like the glib conversion routines, make two passes: a size estimation pass, after which the output container is sized appropriately, and then the translation pass.
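The two-pass strategy can be sketched in standalone C++ as follows. This is not the BDE implementation: utf8Length and encodeTwoPass are hypothetical names, std::string stands in for bsl::string, and the input is assumed to contain only valid code points in host byte order.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Number of UTF-8 octets needed to encode the (assumed valid) 'cp'.
inline std::size_t utf8Length(std::uint32_t cp)
{
    return cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
}

// Hypothetical two-pass encoder: 'src' is a null-terminated UTF-32 sequence.
inline std::string encodeTwoPass(const std::uint32_t *src)
{
    assert(src);

    // Pass 1: size estimation.
    std::size_t numBytes = 0;
    for (const std::uint32_t *p = src; *p; ++p) {
        numBytes += utf8Length(*p);
    }

    // Size the output once, then pass 2: translation.
    std::string out(numBytes, '\0');
    std::size_t i = 0;
    for (const std::uint32_t *p = src; *p; ++p) {
        const std::uint32_t cp = *p;
        switch (utf8Length(cp)) {
          case 1:
            out[i++] = static_cast<char>(cp);
            break;
          case 2:
            out[i++] = static_cast<char>(0xc0 |  (cp >> 6));
            out[i++] = static_cast<char>(0x80 |  (cp        & 0x3f));
            break;
          case 3:
            out[i++] = static_cast<char>(0xe0 |  (cp >> 12));
            out[i++] = static_cast<char>(0x80 | ((cp >>  6) & 0x3f));
            out[i++] = static_cast<char>(0x80 |  (cp        & 0x3f));
            break;
          default:
            out[i++] = static_cast<char>(0xf0 |  (cp >> 18));
            out[i++] = static_cast<char>(0x80 | ((cp >> 12) & 0x3f));
            out[i++] = static_cast<char>(0x80 | ((cp >>  6) & 0x3f));
            out[i++] = static_cast<char>(0x80 |  (cp        & 0x3f));
            break;
        }
    }
    assert(i == numBytes);  // the estimate was exact
    return out;
}
```

The trade-off is one extra read of the input in exchange for a single allocation of the output, rather than repeated growth while translating.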
The methods that output to a vector or string will all grow the output object as necessary to fit the data, and in the end will resize the object to exactly the length of the output (including the terminating 0 for vector, not including it for string). The resizing will not affect the capacity.
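The distinction between size and capacity can be illustrated with a standard vector. This sketch uses std::vector rather than bsl::vector, and trimToOutput is a hypothetical name, not part of this component.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical illustration (not BDE code): trim '*v' so that it holds
// exactly 'numCodePoints' words followed by a terminating 0.
inline void trimToOutput(std::vector<std::uint32_t> *v,
                         std::size_t                 numCodePoints)
{
    assert(v && numCodePoints < v->size());
    v->resize(numCodePoints + 1);  // changes the size only
    v->back() = 0;                 // terminating word counts toward size
}

// Usage: the allocation made before trimming is still held afterwards.
inline void trimToOutputUsage()
{
    std::vector<std::uint32_t> v(64, 1u);
    const std::size_t capacityBefore = v.capacity();
    trimToOutput(&v, 3);
    assert(v.size() == 4);
    assert(v.capacity() == capacityBefore);
}
```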
Non-minimal UTF-8 encodings of code points are reported as errors. Octets and post-conversion code points in the forbidden ranges are treated as errors and removed if 0 is specified as errorWord, or replaced with errorWord otherwise.
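The error rules above can be sketched as a small standalone decoder. This is an illustration under stated assumptions, not the BDE implementation: decodeUtf8 is a hypothetical name, std::vector stands in for bsl::vector, and the choice of how many octets a single error consumes is made here for simplicity and may differ from the component's behavior.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical decoder (not part of BDE): translate the null-terminated
// UTF-8 'src' into '*out', ending with a terminating 0.  Each erroneous
// sequence is dropped if 'errorWord' is 0 and replaced by 'errorWord'
// otherwise.  Return the number of errors encountered.
inline int decodeUtf8(std::vector<std::uint32_t> *out,
                      const char                 *src,
                      std::uint32_t               errorWord)
{
    assert(out && src);
    out->clear();
    int numErrors = 0;
    const unsigned char *p = reinterpret_cast<const unsigned char *>(src);

    while (*p) {
        const unsigned char lead = *p++;
        int           extra    = 0;     // continuation octets expected
        std::uint32_t cp       = lead;
        std::uint32_t minValue = 0;     // smallest value needing this length
        bool          ok       = true;

        if      (lead < 0x80)           { }
        else if ((lead & 0xe0) == 0xc0) { extra = 1; cp = lead & 0x1f;
                                          minValue = 0x80;    }
        else if ((lead & 0xf0) == 0xe0) { extra = 2; cp = lead & 0x0f;
                                          minValue = 0x800;   }
        else if ((lead & 0xf8) == 0xf0) { extra = 3; cp = lead & 0x07;
                                          minValue = 0x10000; }
        else                            { ok = false; }  // not a lead octet

        for (int i = 0; ok && i < extra; ++i) {
            if ((*p & 0xc0) == 0x80) { cp = (cp << 6) | (*p++ & 0x3fu); }
            else                     { ok = false; }  // truncated sequence
        }

        ok = ok && cp >= minValue                      // non-minimal encoding
                && cp <= 0x10ffff                      // beyond Unicode
                && !(cp >= 0xd800 && cp <= 0xdfff);    // forbidden range

        if (ok) {
            out->push_back(cp);
        }
        else {
            ++numErrors;
            if (errorWord) {
                out->push_back(errorWord);
            }
        }
    }
    out->push_back(0);
    return numErrors;
}
```

For example, the two-octet sequence 0xc0 0xaf decodes to 0x2f, which fits in one octet, so it is non-minimal and is rejected.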
This section illustrates intended use of this component.
The following snippets of code illustrate a typical use of the bdlde::CharConvertUtf32 struct's utility functions, first converting from UTF-8 to UTF-32, and then converting back to make sure the round trip returns the same value.
First, we declare a string of UTF-8 containing single-, double-, triple-, and quadruple-octet code points:
Then, we declare an enum summarizing the counts of code points in the string and verify that the counts add up to the length of the string:
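The declarations described in the two steps above might look like the following sketch. The name utf8MultiLang is the one used later in this example; the specific byte values, the enumerator names, and the verifyCounts wrapper are illustrative assumptions, and std::strlen stands in for bsl::strlen.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Sample data (assumed for illustration): code points of 1 to 4 octets.
const char utf8MultiLang[] =
    "Hello"                                 // ASCII:   1 octet each
    "\xce\x97" "\xce\x95" "\xce\xbb"        // Greek:   2 octets each
    "\xe4\xb8\xad" "\xe5\x8d\x8e"           // Chinese: 3 octets each
    "\xe0\xa4\xad" "\xe0\xa4\xbe"           // Hindi:   3 octets each
    "\xf2\x94\xb4\xa5" "\xf3\xb8\xac\x83";  // 4 octets each

// Hypothetical enumerator names summarizing the counts above.
enum { NUM_ASCII_CODE_POINTS   = 5,
       NUM_GREEK_CODE_POINTS   = 3,
       NUM_CHINESE_CODE_POINTS = 2,
       NUM_HINDI_CODE_POINTS   = 2,
       NUM_QUAD_CODE_POINTS    = 2 };

// Verify that the per-category octet counts add up to the string length.
inline void verifyCounts()
{
    assert(static_cast<std::size_t>(1 * NUM_ASCII_CODE_POINTS +
                                    2 * NUM_GREEK_CODE_POINTS +
                                    3 * NUM_CHINESE_CODE_POINTS +
                                    3 * NUM_HINDI_CODE_POINTS +
                                    4 * NUM_QUAD_CODE_POINTS)
                                            == std::strlen(utf8MultiLang));
}
```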
Next, we declare the vector where our UTF-32 output will go, and a variable into which the number of code points written will be stored. It is not necessary to create a utf32CodePointsWritten variable, since the number of code points will be the size of the vector when we are done.
Note that calling v32.reserve(sizeof(utf8MultiLang)) beforehand is a waste of time; it is entirely redundant, because v32 will automatically be grown to the correct size. Also note that v32 need not be empty; any existing contents will be discarded.
Then, we do the translation to UTF-32:
Next, we verify that the number of code points that was returned is correct. Note that in UTF-32, the number of Unicode code points written is the same as the number of 32-bit words written:
Next, we calculate and confirm the difference between the number of UTF-32 words output and the number of bytes input. Each ASCII byte becomes one 32-bit word; the Greek code points are double octets that each become a single unsigned int value; the Chinese and Hindi code points are triple octets that each become a single 32-bit word; and the quad code points are quadruple octets that each become a single unsigned int word:
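The arithmetic behind this step can be sketched independently of the conversion itself: an n-octet code point contributes n - 1 to the difference between octets in and words out. The helpers countCodePoints and octetsMinusWords below are hypothetical names, not part of this component, and they assume valid UTF-8 input and exclude the terminating null from both counts.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical helper: number of code points in the valid, null-terminated
// 'utf8', found by counting every octet that is not a continuation octet.
inline std::size_t countCodePoints(const char *utf8)
{
    assert(utf8);
    std::size_t count = 0;
    for (; *utf8; ++utf8) {
        if ((static_cast<unsigned char>(*utf8) & 0xc0) != 0x80) {
            ++count;
        }
    }
    return count;
}

// Octets in minus 32-bit words out, both excluding the terminator.
inline std::size_t octetsMinusWords(const char *utf8)
{
    return std::strlen(utf8) - countCodePoints(utf8);
}
```

For a string holding one single-octet, one double-octet, and one triple-octet code point, the difference is 0 + 1 + 2 = 3.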
Then, we go on to do the reverse utf32ToUtf8 transform to turn it back into UTF-8, and we should get a result identical to our original input. Declare a bsl::string for our output, and a variable to count the number of code points translated:
Again, note that it would be a waste of time for the caller to resize or reserve the output string; it will be automatically resized by the translator to the right length.
Now, we do the reverse transform:
Finally, we verify that a successful status was returned, that the output of the reverse transform was identical to the original input, and that the number of code points translated was as expected: