BDE 4.14.0 Production release
bdlde_charconvertutf32

Detailed Description

Purpose

Provide fast, safe conversion between UTF-8 encoding and UTF-32.

Classes

bdlde::CharConvertUtf32: namespace for conversions between UTF-8 and UTF-32

Description

This component provides a struct, bdlde::CharConvertUtf32, with a suite of static functions supporting the fast conversion of UTF-8 data to UTF-32, and vice versa. UTF-8 input can take the form of null-terminated "C" strings or bsl::string_views, while UTF-32 input can only take the form of null-terminated buffers of unsigned int. Output can be to STL vectors, bsl::strings (in the case of UTF-8), and fixed-length buffers. Invalid byte sequences and code points forbidden by either encoding are removed and (optionally) replaced by an error byte or word provided by the caller. The byte order of the UTF-32 input or output can be specified via the optional byteOrder argument; host byte order is assumed if it is not specified. The byte or word count and code point count that are optionally returned through pointer arguments include the terminating null byte or word.
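
To make the byteOrder option concrete, the following minimal sketch shows what a non-host byte order means for a single UTF-32 word. It is plain standard C++ and does not use this component; swapBytes32 is a hypothetical helper name, and the sketch assumes a 32-bit word type. How the component performs the swap internally is not described here.

```cpp
#include <cstdint>

// Hypothetical helper (not part of BDE): return the 32-bit word 'w' with
// its four bytes in reverse order.  A word stored in the byte order opposite
// to the host's must be swapped like this before its value is interpreted.
inline std::uint32_t swapBytes32(std::uint32_t w)
{
    return  (w >> 24)
         | ((w >>  8) & 0x0000ff00u)
         | ((w <<  8) & 0x00ff0000u)
         |  (w << 24);
}
```

Swapping twice returns the original word, so the same helper serves both directions of a conversion.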

History and Motivation

UTF-8 is a Unicode encoding that allows 32-bit Unicode to be represented using null-terminated (8-bit) byte strings, while allowing "standard ASCII" strings to be used "as-is". Note that UTF-8 is described in detail in RFC 3629 (http://www.ietf.org/rfc/rfc3629.txt).

UTF-32 is simply a name for storing raw Unicode values as sequential unsigned int values in memory.
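
The relationship between the two encodings can be sketched by encoding one UTF-32 value as UTF-8 octets, following the octet layouts in RFC 3629. This is an illustrative sketch in standard C++, not the implementation used by this component; appendUtf8 is a hypothetical name, and the sketch assumes its argument is already a valid Unicode value.

```cpp
#include <string>

// Hypothetical helper (not part of BDE): append the UTF-8 encoding of the
// code point 'cp' to 'out', using the octet layouts of RFC 3629.
// ASSUMPTION: 'cp' is a valid Unicode value; no validation is performed.
inline void appendUtf8(std::string *out, unsigned int cp)
{
    if (cp < 0x80) {                         // 0xxxxxxx
        *out += static_cast<char>(cp);
    }
    else if (cp < 0x800) {                   // 110xxxxx 10xxxxxx
        *out += static_cast<char>(0xc0 | (cp >> 6));
        *out += static_cast<char>(0x80 | (cp & 0x3f));
    }
    else if (cp < 0x10000) {                 // 1110xxxx 10xxxxxx 10xxxxxx
        *out += static_cast<char>(0xe0 | (cp >> 12));
        *out += static_cast<char>(0x80 | ((cp >> 6) & 0x3f));
        *out += static_cast<char>(0x80 | (cp & 0x3f));
    }
    else {                                   // 11110xxx + 3 continuations
        *out += static_cast<char>(0xf0 | (cp >> 18));
        *out += static_cast<char>(0x80 | ((cp >> 12) & 0x3f));
        *out += static_cast<char>(0x80 | ((cp >> 6) & 0x3f));
        *out += static_cast<char>(0x80 | (cp & 0x3f));
    }
}
```

ASCII values take the first branch and are copied unchanged, which is why "standard ASCII" strings are already valid UTF-8.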

Valid Unicode values are in the ranges [ 1 .. 0xd7ff ] and [ 0xe000 .. 0x10ffff ]. The value 0 is used to terminate sequences.
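
The two ranges above can be written as a small predicate. This is a minimal sketch in standard C++; isValidCodePoint is a hypothetical helper name used for illustration and is not part of this component.

```cpp
// Hypothetical helper (not part of BDE): return 'true' if 'codePoint' may
// appear inside a sequence, i.e., it is neither 0 (the terminator), nor a
// value in the excluded range [ 0xd800 .. 0xdfff ], nor above 0x10ffff.
inline bool isValidCodePoint(unsigned int codePoint)
{
    return (1      <= codePoint && codePoint <= 0xd7ff)
        || (0xe000 <= codePoint && codePoint <= 0x10ffff);
}
```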

The functions here that translate to fixed buffers make a single pass through the data. The functions that translate to bsl::strings and bsl::vectors, however, like the glib conversion routines, make two passes: a size estimation pass, after which the output container is sized appropriately, and then the translation pass.
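
The two-pass pattern described above can be sketched as follows. This is not the component's implementation: it uses std::vector in place of bsl::vector, the names countCodePoints and twoPassDecode are hypothetical, and the sketch assumes well-formed UTF-8 input, so it does no error handling.

```cpp
#include <cstddef>
#include <vector>

// Pass 1: count the code points in the null-terminated, well-formed UTF-8
// string 'src' by counting every byte that is not a continuation byte
// (continuation bytes have the bit pattern 10xxxxxx).
inline std::size_t countCodePoints(const char *src)
{
    std::size_t n = 0;
    for (; *src; ++src) {
        if ((static_cast<unsigned char>(*src) & 0xc0) != 0x80) {
            ++n;
        }
    }
    return n;
}

// Size 'dst' once from the count, then translate in a second pass.
// ASSUMPTION: 'src' is well-formed UTF-8; no error handling is done here.
inline void twoPassDecode(std::vector<unsigned int> *dst, const char *src)
{
    dst->resize(countCodePoints(src) + 1);      // '+ 1' for terminating 0

    std::size_t out = 0;
    while (*src) {
        const unsigned char lead = static_cast<unsigned char>(*src++);
        unsigned int        cp;
        int                 numContinuation;
        if      (lead < 0x80) { cp = lead;        numContinuation = 0; }
        else if (lead < 0xe0) { cp = lead & 0x1f; numContinuation = 1; }
        else if (lead < 0xf0) { cp = lead & 0x0f; numContinuation = 2; }
        else                  { cp = lead & 0x07; numContinuation = 3; }
        for (int i = 0; i < numContinuation; ++i) {
            cp = (cp << 6) | (static_cast<unsigned char>(*src++) & 0x3f);
        }
        (*dst)[out++] = cp;
    }
    (*dst)[out] = 0;                            // terminating word
}
```

Sizing the container once up front avoids repeated reallocation during the translation pass, at the cost of reading the input twice.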

The methods that output to a vector or string will all grow the output object as necessary to fit the data, and in the end will exactly resize the object to the output (including the terminating 0 for vector, not including it for string). The resizing will not affect the capacity.
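
The distinction between size and capacity can be checked with a short sketch. It uses std::vector as a stand-in for bsl::vector (an assumption of this sketch), and resizeKeepsCapacity is a hypothetical name used only for illustration.

```cpp
#include <cstddef>
#include <vector>

// Illustrative check using 'std::vector' as a stand-in for 'bsl::vector':
// reserve 'reserved' elements, resize to 'finalSize' (which must not exceed
// 'reserved'), and return 'true' if the resize changed the size but left
// the capacity untouched.
inline bool resizeKeepsCapacity(std::size_t reserved, std::size_t finalSize)
{
    std::vector<unsigned int> v;
    v.reserve(reserved);
    const std::size_t capacityBefore = v.capacity();
    v.resize(finalSize);                    // changes 'size()' only
    return v.capacity() == capacityBefore && v.size() == finalSize;
}
```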

Non-minimal UTF-8 encodings of code points are reported as errors. Octets and post-conversion code points in the forbidden ranges are also treated as errors: each is removed if 0 is specified as the error value (errorWord when producing UTF-32, errorByte when producing UTF-8), or replaced with that value otherwise.
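
The error policy above can be illustrated with a simplified decoder. This is a hypothetical sketch in standard C++, not the component's implementation: decodeWithErrors and its error-count return value are assumptions made for this illustration, and how a run of bad octets is grouped into individual errors may differ from the component's behavior.

```cpp
#include <vector>

// Hypothetical sketch (not the BDE implementation): decode the
// null-terminated UTF-8 string 'src' into 'dst', followed by a terminating
// 0.  Each invalid sequence is dropped if 'errorWord' is 0 and replaced by
// 'errorWord' otherwise.  Return the number of errors encountered.
inline int decodeWithErrors(std::vector<unsigned int> *dst,
                            const char                *src,
                            unsigned int               errorWord)
{
    // Smallest code point that legitimately needs 1, 2, 3, or 4 octets.
    static const unsigned int k_MIN[] = { 0, 0x80, 0x800, 0x10000 };

    int numErrors = 0;
    dst->clear();
    const unsigned char *p = reinterpret_cast<const unsigned char *>(src);
    while (*p) {
        unsigned int cp;
        int          extra;     // number of continuation octets expected
        if      (*p < 0x80)           { cp = *p;        extra = 0;  }
        else if ((*p & 0xe0) == 0xc0) { cp = *p & 0x1f; extra = 1;  }
        else if ((*p & 0xf8) == 0xf0) { cp = *p & 0x07; extra = 3;  }
        else if ((*p & 0xf0) == 0xe0) { cp = *p & 0x0f; extra = 2;  }
        else                          { cp = 0;         extra = -1; }
        ++p;

        bool ok = extra >= 0;               // 'false' for a bad lead octet
        for (int i = 0; ok && i < extra; ++i) {
            if ((*p & 0xc0) != 0x80) {
                ok = false;                 // truncated sequence
            }
            else {
                cp = (cp << 6) | (*p++ & 0x3f);
            }
        }
        ok = ok && cp >= k_MIN[extra]                   // non-minimal
                && !(0xd800 <= cp && cp <= 0xdfff)      // forbidden range
                && cp <= 0x10ffff;                      // too large

        if (ok) {
            dst->push_back(cp);
        }
        else {
            ++numErrors;
            if (errorWord) {
                dst->push_back(errorWord);
            }
        }
    }
    dst->push_back(0);
    return numErrors;
}
```

For example, the octets "\xc0\xaf" are a non-minimal two-octet encoding of '/', so this sketch reports one error and either drops the sequence or emits errorWord in its place.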

Usage

This section illustrates intended use of this component.

Example 1: Round-Trip Multi-Lingual Conversion

The following snippets of code illustrate a typical use of the bdlde::CharConvertUtf32 struct's utility functions, first converting from UTF-8 to UTF-32, and then converting back to make sure the round trip returns the same value.

First, we declare a string of UTF-8 containing single-, double-, triple-, and quadruple-octet code points:

const char utf8MultiLang[] = {
    "Hello"                                     // -- ASCII
    "\xce\x97" "\xce\x95" "\xce\xbb"            // -- Greek
    "\xe4\xb8\xad" "\xe5\x8d\x8e"               // -- Chinese
    "\xe0\xa4\xad" "\xe0\xa4\xbe"               // -- Hindi
    "\xf2\x94\xb4\xa5" "\xf3\xb8\xac\x83" };    // -- Quad octets

Then, we declare an enum summarizing the counts of code points in the string and verify that the counts add up to the length of the string:

enum { NUM_ASCII_CODE_POINTS   = 5,
       NUM_GREEK_CODE_POINTS   = 3,
       NUM_CHINESE_CODE_POINTS = 2,
       NUM_HINDI_CODE_POINTS   = 2,
       NUM_QUAD_CODE_POINTS    = 2 };
assert(1 * NUM_ASCII_CODE_POINTS +
       2 * NUM_GREEK_CODE_POINTS +
       3 * NUM_CHINESE_CODE_POINTS +
       3 * NUM_HINDI_CODE_POINTS +
       4 * NUM_QUAD_CODE_POINTS == bsl::strlen(utf8MultiLang));

Next, we declare the vector where our UTF-32 output will go. It is not necessary to create a separate utf32CodePointsWritten variable, since the number of code points written (including the terminating 0) will be the size of the vector when we are done.

bsl::vector<unsigned int> v32;

Note that calling v32.reserve(sizeof(utf8MultiLang)) beforehand is entirely redundant: v32 will automatically be grown to the correct size. Also note that v32 need not be empty: any existing contents will be discarded.

Then, we do the translation to UTF-32:

int retVal = bdlde::CharConvertUtf32::utf8ToUtf32(&v32,
                                                  utf8MultiLang);
assert(0 == retVal); // verify success
assert(0 == v32.back()); // verify null terminated

Next, we verify that the number of code points that was returned is correct. Note that in UTF-32, the number of Unicode code points written is the same as the number of 32-bit words written:

enum { EXPECTED_CODE_POINTS_WRITTEN =
           NUM_ASCII_CODE_POINTS +
           NUM_GREEK_CODE_POINTS +
           NUM_CHINESE_CODE_POINTS +
           NUM_HINDI_CODE_POINTS +
           NUM_QUAD_CODE_POINTS + 1 };
assert(EXPECTED_CODE_POINTS_WRITTEN == v32.size());

Next, we calculate and confirm the difference between the number of UTF-32 words output and the number of bytes input. The ASCII bytes will take 1 32-bit word apiece, the Greek code points are double octets that will become single unsigned int values, the Chinese code points are encoded as UTF-8 triple octets that will turn into single 32-bit words, the same for the Hindi code points, and the quad code points are quadruple octets that will turn into single unsigned int words:

enum { SHRINKAGE =
           NUM_ASCII_CODE_POINTS   * (1-1) +
           NUM_GREEK_CODE_POINTS   * (2-1) +
           NUM_CHINESE_CODE_POINTS * (3-1) +
           NUM_HINDI_CODE_POINTS   * (3-1) +
           NUM_QUAD_CODE_POINTS    * (4-1) };
assert(v32.size() == sizeof(utf8MultiLang) - SHRINKAGE);

Then, we go on to do the reverse utf32ToUtf8 transform to turn it back into UTF-8, and we should get a result identical to our original input. Declare a bsl::string for our output, and a variable to count the number of code points translated:

bsl::string s;
bsl::size_t codePointsWritten;

Again, note that it would be a waste of time for the caller to resize or reserve s; it will be automatically resized by the translator to the right length.

Now, we do the reverse transform:

retVal = bdlde::CharConvertUtf32::utf32ToUtf8(&s,
                                              v32.begin(),
                                              &codePointsWritten);

Finally, we verify that a successful status was returned, that the output of the reverse transform was identical to the original input, and that the number of code points translated was as expected:

assert(0 == retVal);
assert(utf8MultiLang == s);
assert(s.length() + 1 == sizeof(utf8MultiLang));
assert(EXPECTED_CODE_POINTS_WRITTEN == codePointsWritten);
assert(v32.size() == codePointsWritten);