Component bdlde_charconvertutf32
[Package bdlde]

Provide fast, safe conversion between UTF-8 encoding and UTF-32.

Namespaces

namespace  bdlde

Detailed Description

Outline
Purpose:
Provide fast, safe conversion between UTF-8 encoding and UTF-32.
Classes:
bdlde::CharConvertUtf32 namespace for conversion between UTF-8 and UTF-32
Description:
This component provides a struct, bdlde::CharConvertUtf32, containing a suite of static functions that support fast conversion of UTF-8 data to UTF-32, and vice versa. UTF-8 input can take the form of null-terminated "C" strings or bsl::string_views, while UTF-32 input can only take the form of null-terminated buffers of unsigned int. Output can be to STL vectors, bsl::strings (in the case of UTF-8), and fixed-length buffers. Invalid byte sequences and code points forbidden by either encoding are removed and (optionally) replaced by an error byte or word provided by the caller. The byte order of the UTF-32 input or output can be specified via the optional byteOrder argument; host byte order is assumed if it is not specified. The byte or word count and code point count that are optionally returned through pointer arguments include the terminating null byte or word.
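To make the byte-order remark concrete, here is a minimal standalone sketch of what reversing the byte order of one UTF-32 word involves. The helper name swapBytes32Sketch is hypothetical and is not part of this component; how the component itself applies byteOrder is not shown here.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (an assumption for illustration only): reverse the
// four bytes of a 32-bit word, as is needed when a UTF-32 word is stored in
// the byte order opposite to that of the host.
inline std::uint32_t swapBytes32Sketch(std::uint32_t word)
{
    return ((word & 0x000000ffu) << 24)
         | ((word & 0x0000ff00u) <<  8)
         | ((word & 0x00ff0000u) >>  8)
         | ((word & 0xff000000u) >> 24);
}
```

Applying the swap twice returns the original word, so the same helper serves for both reading and writing in the non-host byte order.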
History and Motivation:
UTF-8 is a Unicode encoding that allows 32-bit Unicode to be represented using null-terminated (8-bit) byte strings, while allowing "standard ASCII" strings to be used "as-is". Note that UTF-8 is described in detail in RFC 3629 (http://www.ietf.org/rfc/rfc3629.txt).
UTF-32 is simply a name for storing raw Unicode values as sequential unsigned int values in memory.
Valid Unicode values are in the ranges [ 1 .. 0xd7ff ] and [ 0xe000 .. 0x10ffff ]. The value 0 is used to terminate sequences.
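The two ranges above can be expressed as a simple predicate. This is a minimal sketch; isValidCodePointSketch is a hypothetical name and not a function provided by this component.

```cpp
#include <cassert>

// Hypothetical helper (an assumption for illustration only): return 'true'
// if 'codePoint' lies in [ 1 .. 0xd7ff ] or [ 0xe000 .. 0x10ffff ].  The
// value 0 is excluded because it terminates sequences, and the range
// [ 0xd800 .. 0xdfff ] is excluded because it falls between the two ranges.
inline bool isValidCodePointSketch(unsigned int codePoint)
{
    return (codePoint >= 0x1u    && codePoint <= 0xd7ffu)
        || (codePoint >= 0xe000u && codePoint <= 0x10ffffu);
}
```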
The functions here that translate to fixed buffers make a single pass through the data. The functions that translate to bsl::strings and bsl::vectors, however, like the glib conversion routines, make two passes: a size estimation pass, after which the output container is sized appropriately, and then the translation pass.
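The idea of a sizing pass followed by a translation pass can be pictured with a small standalone sketch. It assumes well-formed UTF-8 input, and the names countCodePointsSketch and makeSizedOutputSketch are hypothetical; the component's actual estimation pass may work differently.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical first pass: count the code points in well-formed,
// null-terminated UTF-8 by counting every octet that is not a continuation
// octet (that is, not of the form 10xxxxxx).
std::size_t countCodePointsSketch(const char *utf8)
{
    assert(utf8);
    std::size_t count = 0;
    for (const unsigned char *p =
                         reinterpret_cast<const unsigned char *>(utf8);
         *p;
         ++p) {
        if ((*p & 0xc0) != 0x80) {
            ++count;
        }
    }
    return count;
}

// Size the output once, leaving room for the terminating 0 word.  A second
// pass (not shown here) would then write the translated code points.
std::vector<unsigned int> makeSizedOutputSketch(const char *utf8)
{
    return std::vector<unsigned int>(countCodePointsSketch(utf8) + 1, 0u);
}
```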
The methods that output to a vector or string will all grow the output object as necessary to fit the data, and in the end will exactly resize the object to the output (including the terminating 0 for vector, not including it for string). The resizing will not affect the capacity.
Non-minimal UTF-8 encodings of code points are reported as errors. Octets and post-conversion code points in the forbidden ranges are treated as errors and removed if 0 is specified as errorWord, or replaced with errorWord otherwise.
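The error rules above can be illustrated with a simplified standalone decoder. This is a sketch under stated assumptions, not the component's implementation: decodeUtf8Sketch and its sawError flag are hypothetical, and the way it resynchronizes after a bad octet is this sketch's own policy.

```cpp
#include <cassert>
#include <vector>

// Hypothetical decoder: translate null-terminated UTF-8 to UTF-32 words with
// a terminating 0.  Invalid sequences are dropped if 'errorWord' is 0 and
// replaced by 'errorWord' otherwise; '*sawError' reports whether any occurred.
std::vector<unsigned int> decodeUtf8Sketch(const char   *src,
                                           unsigned int  errorWord,
                                           bool         *sawError)
{
    assert(src && sawError);
    // Smallest code point that legitimately needs 'len' octets.
    static const unsigned int minValue[] = { 0, 0, 0x80, 0x800, 0x10000 };
    std::vector<unsigned int> out;
    *sawError = false;
    const unsigned char *p = reinterpret_cast<const unsigned char *>(src);
    while (*p) {
        unsigned int cp  = 0;
        int          len = 0;                  // 0 means bad lead octet
        if      (*p < 0x80)           { cp = *p;        len = 1; }
        else if ((*p & 0xe0) == 0xc0) { cp = *p & 0x1f; len = 2; }
        else if ((*p & 0xf0) == 0xe0) { cp = *p & 0x0f; len = 3; }
        else if ((*p & 0xf8) == 0xf0) { cp = *p & 0x07; len = 4; }
        ++p;
        bool ok = len > 0;
        for (int i = 1; ok && i < len; ++i) {
            if ((*p & 0xc0) != 0x80) {
                ok = false;                    // truncated sequence
            } else {
                cp = (cp << 6) | (*p & 0x3f);
                ++p;
            }
        }
        if (ok && (cp < minValue[len]                  // non-minimal encoding
                || (cp >= 0xd800 && cp <= 0xdfff)      // forbidden range
                || cp > 0x10ffff)) {                   // beyond Unicode
            ok = false;
        }
        if (ok) {
            out.push_back(cp);
        } else {
            *sawError = true;
            if (errorWord) {
                out.push_back(errorWord);
            }
        }
    }
    out.push_back(0);
    return out;
}
```

For example, the two octets 0xc1 0x81 are a non-minimal encoding of 'A'; the sketch drops them when errorWord is 0 and emits a single errorWord otherwise.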
Usage:
In this section we show intended use of this component.
Example: Round-Trip Multi-Lingual Conversion:
The following snippets of code illustrate a typical use of the bdlde::CharConvertUtf32 struct's utility functions, first converting from UTF-8 to UTF-32, and then converting back to make sure the round trip returns the same value.
First, we declare a string of UTF-8 containing single-, double-, triple-, and quadruple-octet code points:
  const char utf8MultiLang[] = {
      "Hello"                                         // -- ASCII
      "\xce\x97"         "\xce\x95"       "\xce\xbb"  // -- Greek
      "\xe4\xb8\xad"     "\xe5\x8d\x8e"               // -- Chinese
      "\xe0\xa4\xad"     "\xe0\xa4\xbe"               // -- Hindi
      "\xf2\x94\xb4\xa5" "\xf3\xb8\xac\x83" };        // -- Quad octets
Then, we declare an enum summarizing the counts of code points in the string and verify that the counts, each weighted by the number of octets per code point, add up to the length of the string in bytes:
  enum { NUM_ASCII_CODE_POINTS   = 5,
         NUM_GREEK_CODE_POINTS   = 3,
         NUM_CHINESE_CODE_POINTS = 2,
         NUM_HINDI_CODE_POINTS   = 2,
         NUM_QUAD_CODE_POINTS    = 2 };

  assert(1 * NUM_ASCII_CODE_POINTS +
         2 * NUM_GREEK_CODE_POINTS +
         3 * NUM_CHINESE_CODE_POINTS +
         3 * NUM_HINDI_CODE_POINTS +
         4 * NUM_QUAD_CODE_POINTS == bsl::strlen(utf8MultiLang));
Next, we declare the vector where our UTF-32 output will go: v32, a bsl::vector<unsigned int>. It is not necessary to create a separate utf32CodePointsWritten variable, since the number of code points (including the terminating 0) will be the size of the vector when we are done. Note that calling v32.reserve(sizeof(utf8MultiLang)) beforehand would be a waste of time; v32 will automatically be grown to the correct size. Also note that if v32 were not empty, that would not be a problem: any contents will be discarded.
Then, we do the translation to UTF-32:
  int retVal = bdlde::CharConvertUtf32::utf8ToUtf32(&v32,
                                                    utf8MultiLang);

  assert(0 == retVal);        // verify success
  assert(0 == v32.back());    // verify null terminated
Next, we verify that the number of code points written, which is the size of the vector and includes the terminating 0, is correct. Note that in UTF-32, the number of Unicode code points written is the same as the number of 32-bit words written:
  enum { EXPECTED_CODE_POINTS_WRITTEN =
                  NUM_ASCII_CODE_POINTS +
                  NUM_GREEK_CODE_POINTS +
                  NUM_CHINESE_CODE_POINTS +
                  NUM_HINDI_CODE_POINTS +
                  NUM_QUAD_CODE_POINTS  + 1 };
  assert(EXPECTED_CODE_POINTS_WRITTEN == v32.size());
Next, we calculate and confirm the difference between the number of bytes input and the number of UTF-32 words output. Each ASCII code point is a single octet that becomes one 32-bit word; each Greek code point is a double octet that becomes one word; each Chinese or Hindi code point is a triple octet that becomes one word; and each quad code point is a quadruple octet that becomes one word:
  enum { SHRINKAGE =
                    NUM_ASCII_CODE_POINTS   * (1-1) +
                    NUM_GREEK_CODE_POINTS   * (2-1) +
                    NUM_CHINESE_CODE_POINTS * (3-1) +
                    NUM_HINDI_CODE_POINTS   * (3-1) +
                    NUM_QUAD_CODE_POINTS    * (4-1) };

  assert(v32.size() == sizeof(utf8MultiLang) - SHRINKAGE);
Then, we do the reverse utf32ToUtf8 transform to turn the data back into UTF-8, which should give a result identical to our original input. We declare a bsl::string for our output, and a variable to receive the number of code points translated:
  bsl::string s;
  bsl::size_t codePointsWritten;
Again, note that it would be a waste of time for the caller to resize or reserve s; it will be automatically resized by the translator to the right length.
Now, we do the reverse transform:
  retVal = bdlde::CharConvertUtf32::utf32ToUtf8(&s,
                                                v32.data(),
                                                &codePointsWritten);
Finally, we verify that a successful status was returned, that the output of the reverse transform was identical to the original input, and that the number of code points translated was as expected:
  assert(0 == retVal);
  assert(utf8MultiLang  == s);
  assert(s.length() + 1 == sizeof(utf8MultiLang));

  assert(EXPECTED_CODE_POINTS_WRITTEN == codePointsWritten);
  assert(v32.size()                   == codePointsWritten);
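For readers who want to see what the reverse transform does to each code point, here is a minimal standalone sketch of a UTF-32 to UTF-8 encoder. It assumes that every input word is a valid code point and performs no error handling; encodeUtf8Sketch is a hypothetical name and is not the component's utf32ToUtf8.

```cpp
#include <cassert>
#include <string>

// Hypothetical encoder: translate a null-terminated buffer of valid UTF-32
// code points to UTF-8.  The returned string does not include a terminating
// null character in its length.
std::string encodeUtf8Sketch(const unsigned int *utf32)
{
    assert(utf32);
    std::string out;
    for (; *utf32; ++utf32) {
        const unsigned int cp = *utf32;
        if (cp < 0x80) {                       // single octet (ASCII as-is)
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {               // double octet
            out += static_cast<char>(0xc0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3f));
        } else if (cp < 0x10000) {             // triple octet
            out += static_cast<char>(0xe0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3f));
            out += static_cast<char>(0x80 | (cp & 0x3f));
        } else {                               // quadruple octet
            out += static_cast<char>(0xf0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3f));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3f));
            out += static_cast<char>(0x80 | (cp & 0x3f));
        }
    }
    return out;
}
```

Feeding it the code points 'H', 0x397, 0x4e2d, and 0x94d25 yields octet sequences of length 1, 2, 3, and 4 respectively, matching the octet patterns used in the multi-lingual string of the example.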