BDE 4.14.0 Production release
Provide fast, safe conversion between UTF-8 and UTF-16 encodings.
This component provides a suite of static functions supporting the fast conversion of valid UTF-8 encoded strings to valid UTF-16 16-bit word arrays, wstrings, and vectors, and conversion of valid UTF-16 encoded word sequences to valid UTF-8 byte arrays, strings, and byte vectors. Invalid byte sequences and code points forbidden by either encoding are removed and (optionally) replaced by a single word or byte provided by the caller. In UTF-16 -> UTF-8 conversion, the replacement must be a single non-zero byte; in the other direction, it must be a single non-zero word. The byte or word counts and code point counts optionally returned through pointer arguments include the terminating null code point. The byte order of the UTF-16 input or output can be specified via the optional 'byteOrder' argument, which is assumed to be host byte order if not specified. Functions taking UTF-8 input accept either a 'bslstl::StringRef' or a null-terminated 'const char *'. Functions taking UTF-16 input accept either a 'bslstl::StringRefWide' or a pointer to a null-terminated array of 'unsigned short' or 'wchar_t'.
UTF-8 is an encoding that allows 32-bit character sets like Unicode to be represented using (8-bit) byte strings, while allowing "standard ASCII" strings to be used "as-is". Note that UTF-8 is described in detail in RFC 3629 (http://www.ietf.org/rfc/rfc3629.txt).
UTF-16 is a 16-bit encoding that allows Unicode code points up to 0x10ffff to be encoded using one or two 16-bit values. Note that UTF-16 is described in detail in RFC 2781 (http://www.ietf.org/rfc/rfc2781.txt).
The functions here that translate to fixed buffers make a single pass through the data. The functions that translate to 'bsl::string's and STL containers, however, like the 'glib' conversion routines, make two passes: a size-estimation pass, after which the output container is sized appropriately, and then the translation pass.
The methods that output to a 'vector', 'string', or 'wstring' will all grow the output object as necessary to fit the data, and in the end will exactly resize the object to the output (including the terminating 0 for 'vector', which is not included for 'string' or 'wstring'). Note that in the case of 'string' or 'wstring', the terminating 0 code point is still included in the code point count.
Non-minimal UTF-8 encodings of code points are reported as errors. Octets and post-conversion code points in the forbidden ranges are treated as errors and removed (or replaced, if a replacement word is provided).
UTF-16 (or UTF-8, for that matter) can be stored in 'wstring's, but note that the size of a 'wstring::value_type', also known as a 'wchar_t' word, varies across platforms: it is 4 bytes on Solaris, Linux, and Darwin, and 2 bytes on AIX and Windows. So a file of 'wchar_t' words written by one platform may not be readable by another. Byte order is also a consideration; a non-host byte order can be handled via the optional 'byteOrder' argument of these functions. Another factor is that, since UTF-16 words all fit in 2 bytes, using 'wchar_t' to store UTF-16 wastes half of each word on platforms where 'wchar_t' is 4 bytes.
This section illustrates intended use of this component.
In this example, we will translate a string containing a non-ASCII code point from UTF-16 to UTF-8 and back using fixed-length buffers.
First, we create a UTF-16 string spelling 'ecole' in French, which begins with '0xc9', a non-ASCII 'e' with an accent over it:
Then, we create a byte buffer to store the UTF-8 result of the translation in, and variables to monitor counts of code points and bytes translated:
Next, we call 'utf16ToUtf8' to do the translation:
Then, we observe that no errors or warnings occurred, and that the numbers of code points and bytes were as expected. Note that both 'numCodePoints' and 'numBytes' include the terminating 0:
Next, we examine the length of the translated string:
Then, we examine the individual bytes of the translated UTF-8:
Next, in preparation for translation back to UTF-16, we create a buffer of 'short' values and the variable 'numWords' to track the number of UTF-16 words occupied by the result:
Then, we do the reverse translation:
Next, we observe that no errors or warnings were reported, and that the numbers of code points and words were as expected. Note that 'numCodePoints' and 'numWords' both include the terminating 0:
Now, we observe that our output is identical to the original UTF-16 string:
Finally, we examine the individual words of the reverse translation:
The following snippets of code illustrate a typical use of the 'bdlde::CharConvertUtf16' struct's utility functions, first converting from UTF-8 to UTF-16, and then converting back to make sure the round trip returns the same value, translating to STL containers in both directions.
First, we declare a string of UTF-8 containing single-, double-, triple-, and quadruple-octet code points:
Then, we declare an 'enum' summarizing the counts of code points in the string and verify that the counts add up to the length of the string:
Next, we declare the vector where our UTF-16 output will go, and a variable into which the number of code points (not bytes or words) written will be stored. It is not necessary to initialize 'utf16CodePointsWritten':
Note that for performance, we should 'v16.reserve(sizeof(utf8MultiLang))', but it is not strictly necessary; the vector will automatically be grown to the correct size. Also note that if 'v16' were not empty, that would not be a problem; any contents will be discarded.
Then, we do the translation to UTF-16:
Next, we verify that the number of code points (not bytes or words) that was returned is correct:
Then, we verify that the number of 16-bit words written was correct. The quad-octet code points each require 2 'short' words of output:
Next, we calculate and confirm the difference between the number of UTF-16 words output and the number of bytes input. The ASCII code points will take 1 16-bit word apiece, the Greek code points are double octets that will become single 'short' values, the Chinese code points are encoded as UTF-8 triple octets that will turn into single 16-bit words, the same for the Hindi code points, and the quad code points are quadruple octets that will turn into double 'short' values:
Then, we go on to do the reverse 'utf16ToUtf8' transform to turn it back into UTF-8, and we should get a result identical to our original input. We declare a 'bsl::string' for our output, and a variable to count the number of code points (not bytes or words) translated:
Again, note that for performance, we should ideally 's.reserve(3 * v16.size())', but it is not strictly necessary.
Now, we do the reverse transform:
Finally, we verify that a successful status was returned, that the output of the reverse transform was identical to the original input, and that the number of code points translated was as expected: