Component bdlde_charconvertutf16
[Package bdlde]

Provide fast, safe conversion between UTF-8 and UTF-16 encodings.

Namespaces

namespace  bdlde

Detailed Description

Purpose:
Provide fast, safe conversion between UTF-8 and UTF-16 encodings.
Classes:
bdlde::CharConvertUtf16: namespace for conversions between UTF-8 and UTF-16
Description:
This component provides a suite of static functions supporting the fast conversion of valid UTF-8 encoded strings to valid UTF-16 16-bit word arrays, wstrings, and vectors, and conversion of valid UTF-16 encoded word sequences to valid UTF-8 byte arrays, strings, and byte vectors. Invalid byte sequences and code points forbidden by either encoding are removed and (optionally) replaced by a single word or byte provided by the caller. In UTF-16 to UTF-8 conversion, the replacement must be a single non-zero byte; in the other direction, it must be a single non-zero word. The byte or word count and the code point count, which are optionally returned through pointer arguments, both include the terminating null code point. The byte order of the UTF-16 input or output can be specified via the optional byteOrder argument; host byte order is assumed if it is not specified. In functions taking UTF-8, input is in the form of a bslstl::StringRef or a null-terminated const char *. In functions taking UTF-16, input is either in the form of a bslstl::StringRefWide or a pointer to a null-terminated array of unsigned short or wchar_t.
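The removal and replacement policy described above can be illustrated with a minimal sketch that uses only the standard library. It is not the bdlde implementation: utf16ToUtf8Sketch is a hypothetical function that takes a vector of UTF-16 words without a terminating 0 (rather than a null-terminated array), returns a std::string, and ignores byte order. Treating an errorByte of 0 as "remove without replacing" is an assumption of this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical sketch (not the bdlde implementation): convert the UTF-16
// words in 'src' (no terminating 0) to UTF-8.  Each unpaired surrogate is
// removed and, if 'errorByte' is non-zero, replaced by 'errorByte'.
inline std::string utf16ToUtf8Sketch(const std::vector<unsigned short>& src,
                                     char                               errorByte)
{
    std::string out;
    for (std::size_t i = 0; i < src.size(); ++i) {
        unsigned long cp     = src[i];
        const bool    isHigh = cp >= 0xd800 && cp <= 0xdbff;
        const bool    isLow  = cp >= 0xdc00 && cp <= 0xdfff;

        if (isHigh && i + 1 < src.size()
                   && src[i + 1] >= 0xdc00 && src[i + 1] <= 0xdfff) {
            // A high surrogate followed by a low surrogate encodes one code
            // point above 0xffff.
            cp = 0x10000 + ((cp - 0xd800) << 10) + (src[i + 1] - 0xdc00);
            ++i;
        }
        else if (isHigh || isLow) {
            // Unpaired surrogate: invalid UTF-16, so remove or replace it.
            if (errorByte) {
                out += errorByte;
            }
            continue;
        }

        if (cp < 0x80) {                       // 1 byte (ASCII)
            out += static_cast<char>(cp);
        }
        else if (cp < 0x800) {                 // 2 bytes
            out += static_cast<char>(0xc0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3f));
        }
        else if (cp < 0x10000) {               // 3 bytes
            out += static_cast<char>(0xe0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3f));
            out += static_cast<char>(0x80 | (cp & 0x3f));
        }
        else {                                 // 4 bytes
            out += static_cast<char>(0xf0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3f));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3f));
            out += static_cast<char>(0x80 | (cp & 0x3f));
        }
    }
    return out;
}
```

For the input { 'a', 0xd800, 'b' }, this sketch yields "a?b" when errorByte is '?' and "ab" when errorByte is 0.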
History and Motivation:
UTF-8 is an encoding that allows 32-bit character sets like Unicode to be represented using (8-bit) byte strings, while allowing "standard ASCII" strings to be used "as-is". Note that UTF-8 is described in detail in RFC 3629 (http://www.ietf.org/rfc/rfc3629.txt).
UTF-16 is a 16-bit encoding that allows Unicode code points up to 0x10ffff to be encoded using one or two 16-bit values. Note that UTF-16 is described in detail in RFC 2781 (http://www.ietf.org/rfc/rfc2781.txt).
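As an illustration of the "one or two 16-bit values" rule, the following sketch encodes a single code point as UTF-16 following the surrogate scheme of RFC 2781. encodeUtf16 is a hypothetical helper, not a function of this component, and it assumes that its argument is a valid code point (at most 0x10ffff and not itself a surrogate).

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper (not part of bdlde): return the UTF-16 words that
// encode 'codePoint', which is assumed to be a valid Unicode code point.
inline std::vector<unsigned short> encodeUtf16(unsigned long codePoint)
{
    std::vector<unsigned short> words;
    if (codePoint < 0x10000) {
        // Code points up to 0xffff fit in a single word.
        words.push_back(static_cast<unsigned short>(codePoint));
    }
    else {
        // Larger code points are split into a 20-bit offset: the top 10 bits
        // go into a high surrogate, the bottom 10 bits into a low surrogate.
        const unsigned long offset = codePoint - 0x10000;
        words.push_back(
                   static_cast<unsigned short>(0xd800 | (offset >> 10)));
        words.push_back(
                   static_cast<unsigned short>(0xdc00 | (offset & 0x3ff)));
    }
    return words;
}
```

Under these assumptions, 0xc9 is encoded as the single word 0xc9, while 0x10000 becomes the pair 0xd800, 0xdc00.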
The functions here that translate to fixed buffers make a single pass through the data. The functions that translate to bsl::strings and STL containers, however, make two passes, as the glib conversion routines do: a size estimation pass, after which the output container is sized appropriately, and then the translation pass.
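One way to picture a size estimation pass is shown below. This is only a sketch under stated assumptions, not the component's actual algorithm: utf16WordsNeeded is a hypothetical function, and it assumes that its input is valid, null-terminated UTF-8, so that each lead byte determines how many 16-bit words its code point will occupy.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not part of bdlde): return the number of UTF-16
// words, including the terminating 0 word, needed to hold the translation
// of the valid, null-terminated UTF-8 string 'utf8'.
inline std::size_t utf16WordsNeeded(const char *utf8)
{
    std::size_t numWords = 1;    // the terminating 0 word
    for (; *utf8; ++utf8) {
        const unsigned char byte = static_cast<unsigned char>(*utf8);
        if ((byte & 0xc0) == 0x80) {
            continue;            // continuation byte: already accounted for
        }
        // A lead byte of a 4-byte sequence (0xf0 and up) becomes a surrogate
        // pair; every other lead byte becomes a single word.
        numWords += (byte >= 0xf0) ? 2 : 1;
    }
    return numWords;
}
```

A caller could then size the output container once, using the returned count, before running the translation pass.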
The methods that output to a vector, string, or wstring will all grow the output object as necessary to fit the data, and in the end will resize the object to exactly fit the output. For a vector, that size includes the terminating 0; for a string or wstring, it does not. Note that in the case of string or wstring, the terminating 0 code point is still included in the code point count.
Non-minimal UTF-8 encodings of code points are reported as errors. Octets and post-conversion code points in the forbidden ranges are treated as errors and removed (or replaced, if a replacement word is provided).
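The two kinds of error mentioned above can be sketched as follows. Both functions are hypothetical helpers rather than part of this component. isNonMinimalUtf8 assumes that it is given one structurally well-formed sequence (a lead byte matching length, followed by continuation bytes), and isForbiddenCodePoint assumes that the forbidden ranges are those of RFC 3629: the surrogates 0xd800 through 0xdfff and anything above 0x10ffff.

```cpp
#include <cassert>

// Hypothetical helper: return the smallest number of UTF-8 bytes that can
// encode 'codePoint'.
inline int minimalUtf8Length(unsigned long codePoint)
{
    return codePoint < 0x80 ? 1 : codePoint < 0x800 ? 2
                                : codePoint < 0x10000 ? 3 : 4;
}

// Hypothetical helper: return 'true' if the 'length'-byte UTF-8 sequence at
// 'sequence' uses more bytes than its code point requires.  The behavior is
// undefined unless '1 <= length <= 4' and the sequence consists of a lead
// byte for that length followed by 'length - 1' continuation bytes.
inline bool isNonMinimalUtf8(const char *sequence, int length)
{
    static const unsigned char leadMask[] = { 0, 0x7f, 0x1f, 0x0f, 0x07 };

    unsigned long codePoint =
                  static_cast<unsigned char>(sequence[0]) & leadMask[length];
    for (int i = 1; i < length; ++i) {
        // Each continuation byte contributes its low 6 bits.
        codePoint = (codePoint << 6) |
                    (static_cast<unsigned char>(sequence[i]) & 0x3f);
    }
    return length > minimalUtf8Length(codePoint);
}

// Hypothetical helper: return 'true' if 'codePoint' is a surrogate or lies
// beyond the last Unicode code point.
inline bool isForbiddenCodePoint(unsigned long codePoint)
{
    return (codePoint >= 0xd800 && codePoint <= 0xdfff)
        || codePoint > 0x10ffff;
}
```

For example, the two-byte sequence 0xc0 0x80 decodes to code point 0, which needs only one byte, so the sketch classifies it as non-minimal.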
WSTRINGS and UTF-16:
UTF-16 (or UTF-8, for that matter) can be stored in wstrings, but note that the size of a wstring::value_type, also known as a wchar_t word, varies across different platforms -- it is 4 bytes on Solaris, Linux, and Darwin, and 2 bytes on AIX and Windows. So a file of wchar_t words written by one platform may not be readable by another. Byte order is also a consideration, and a non-host byte order can be handled by using the optional byteOrder argument of these functions. Another factor is that, since UTF-16 words all fit in 2 bytes, using wchar_t to store UTF-16 is very wasteful of space on many platforms.
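The byte order concern can be made concrete with a small sketch. These helpers are hypothetical and are not how the byteOrder argument is implemented; they only show what converting a 16-bit word between host and non-host byte order involves, assuming 8-bit bytes.

```cpp
#include <cassert>

// Hypothetical helper (not part of bdlde): return 'word' with its two low
// bytes exchanged, as needed when UTF-16 data is not in host byte order.
inline unsigned short swapBytes16(unsigned short word)
{
    return static_cast<unsigned short>(((word & 0xff) << 8) |
                                       ((word >> 8) & 0xff));
}

// Hypothetical helper: return 'true' if the least significant byte of a
// multi-byte integer is stored first on this platform.
inline bool hostIsLittleEndian()
{
    const unsigned short probe = 1;
    return 1 == *reinterpret_cast<const unsigned char *>(&probe);
}
```

A word such as 0x00c9 read from a file written in the opposite byte order would appear as 0xc900 until it is swapped back.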
Usage:
In this section we show intended use of this component.
Example 1: Translation to Fixed-Length Buffers:
In this example, we will translate a string containing a non-ASCII code point from UTF-16 to UTF-8 and back using fixed-length buffers.
First, we create a UTF-16 string spelling École in French, which begins with 0xc9 (É), a non-ASCII capital E with an acute accent over it:
  unsigned short utf16String[] = { 0xc9, 'c', 'o', 'l', 'e', 0 };
Then, we create a byte buffer to store the UTF-8 result of the translation in, and variables to monitor counts of code points and bytes translated:
  char utf8String[7];
  bsl::size_t numCodePoints, numBytes;
  numCodePoints = numBytes = -1;    // garbage
Next, we call utf16ToUtf8 to do the translation:
  int rc = bdlde::CharConvertUtf16::utf16ToUtf8(utf8String,
                                                sizeof(utf8String),
                                                utf16String,
                                                &numCodePoints,
                                                &numBytes);
Then, we observe that no errors or warnings occurred, and that the numbers of code points and bytes were as expected. Note that both numCodePoints and numBytes include the terminating 0:
  assert(0 == rc);
  assert(6 == numCodePoints);
  assert(7 == numBytes);
Next, we examine the length of the translated string:
  assert(numBytes - 1 == bsl::strlen(utf8String));
Then, we examine the individual bytes of the translated UTF-8:
  assert((char)0xc3 == utf8String[0]);
  assert((char)0x89 == utf8String[1]);
  assert('c' ==        utf8String[2]);
  assert('o' ==        utf8String[3]);
  assert('l' ==        utf8String[4]);
  assert('e' ==        utf8String[5]);
  assert(0   ==        utf8String[6]);
Next, in preparation for translation back to UTF-16, we create a buffer of short values and the variable numWords to track the number of UTF-16 words occupied by the result:
  unsigned short secondUtf16String[6];
  bsl::size_t numWords;
  numCodePoints = numWords = -1;    // garbage
Then, we do the reverse translation:
  rc = bdlde::CharConvertUtf16::utf8ToUtf16(secondUtf16String,
                                            6,
                                            utf8String,
                                            &numCodePoints,
                                            &numWords);
Next, we observe that no errors or warnings were reported, and that the number of code points and words were as expected. Note that numCodePoints and numWords both include the terminating 0:
  assert(0 == rc);
  assert(6 == numCodePoints);
  assert(6 == numWords);
Now, we observe that our output is identical to the original UTF-16 string:
  assert(0 == bsl::memcmp(utf16String,
                          secondUtf16String,
                          sizeof(utf16String)));
Finally, we examine the individual words of the reverse translation:
  assert(0xc9 == secondUtf16String[0]);
  assert('c'  == secondUtf16String[1]);
  assert('o'  == secondUtf16String[2]);
  assert('l'  == secondUtf16String[3]);
  assert('e'  == secondUtf16String[4]);
  assert(0    == secondUtf16String[5]);
Example 2: Translation to STL Containers:
The following snippets of code illustrate a typical use of the bdlde::CharConvertUtf16 struct's utility functions, first converting from UTF-8 to UTF-16, and then converting back to make sure the round trip returns the same value, translating to STL containers in both directions.
First, we declare a string of UTF-8 containing single-, double-, triple-, and quadruple-octet code points:
  const char utf8MultiLang[] = {
      "Hello"                                         // -- ASCII
      "\xce\x97"         "\xce\x95"       "\xce\xbb"  // -- Greek
      "\xe4\xb8\xad"     "\xe5\x8d\x8e"               // -- Chinese
      "\xe0\xa4\xad"     "\xe0\xa4\xbe"               // -- Hindi
      "\xf2\x94\xb4\xa5" "\xf3\xb8\xac\x83" };        // -- Quad octets
Then, we declare an enum summarizing the counts of code points in the string and verify that the counts add up to the length of the string:
  enum { NUM_ASCII_CODE_POINTS   = 5,
         NUM_GREEK_CODE_POINTS   = 3,
         NUM_CHINESE_CODE_POINTS = 2,
         NUM_HINDI_CODE_POINTS   = 2,
         NUM_QUAD_CODE_POINTS    = 2 };

  assert(1 * NUM_ASCII_CODE_POINTS +
         2 * NUM_GREEK_CODE_POINTS +
         3 * NUM_CHINESE_CODE_POINTS +
         3 * NUM_HINDI_CODE_POINTS +
         4 * NUM_QUAD_CODE_POINTS == bsl::strlen(utf8MultiLang));
Next, we declare the vector where our UTF-16 output will go, and a variable into which the number of code points (not bytes or words) written will be stored. It is not necessary to initialize utf16CodePointsWritten:
  bsl::vector<unsigned short> v16;
  bsl::size_t utf16CodePointsWritten;
Note that, for performance, we could first call v16.reserve(sizeof(utf8MultiLang)), but it's not strictly necessary -- the vector will automatically be grown to the correct size. Also note that if v16 were not empty, that wouldn't be a problem -- any contents will be discarded.
Then, we do the translation to UTF-16:
  int retVal = bdlde::CharConvertUtf16::utf8ToUtf16(&v16,
                                                    utf8MultiLang,
                                                    &utf16CodePointsWritten);

  assert(0 == retVal);        // verify success
  assert(0 == v16.back());    // verify null terminated
Next, we verify that the number of code points (not bytes or words) that was returned is correct:
  enum { EXPECTED_CODE_POINTS_WRITTEN =
                      NUM_ASCII_CODE_POINTS + NUM_GREEK_CODE_POINTS +
                      NUM_CHINESE_CODE_POINTS + NUM_HINDI_CODE_POINTS +
                      NUM_QUAD_CODE_POINTS  + 1 };

  assert(EXPECTED_CODE_POINTS_WRITTEN == utf16CodePointsWritten);
Then, we verify that the number of 16-bit words written was correct. The quad octet code points each require 2 short words of output:
  enum { EXPECTED_UTF16_WORDS_WRITTEN =
                      NUM_ASCII_CODE_POINTS + NUM_GREEK_CODE_POINTS +
                      NUM_CHINESE_CODE_POINTS + NUM_HINDI_CODE_POINTS +
                      NUM_QUAD_CODE_POINTS * 2 + 1 };

  assert(EXPECTED_UTF16_WORDS_WRITTEN == v16.size());
Next, we calculate and confirm the difference between the number of UTF-16 words output and the number of bytes input. The ASCII code points will take 1 16-bit word apiece, the Greek code points are double octets that will become single short values, the Chinese code points are encoded as UTF-8 triple octets that will turn into single 16-bit words, the same for the Hindi code points, and the quad code points are quadruple octets that will turn into double short values:
  enum { SHRINKAGE = NUM_ASCII_CODE_POINTS   * (1-1) +
                     NUM_GREEK_CODE_POINTS   * (2-1) +
                     NUM_CHINESE_CODE_POINTS * (3-1) +
                     NUM_HINDI_CODE_POINTS   * (3-1) +
                     NUM_QUAD_CODE_POINTS    * (4-2) };

  assert(v16.size() == sizeof(utf8MultiLang) - SHRINKAGE);
Then, we go on to do the reverse utf16ToUtf8 transform to turn it back into UTF-8, and we should get a result identical to our original input. We declare a bsl::string for our output, and a variable to count the number of code points (not bytes or words) translated:
  bsl::string s;
  bsl::size_t utf8CodePointsWritten;
Again, note that, for performance, we could first call s.reserve(3 * v16.size()), but it's not really necessary.
Now, we do the reverse transform:
  retVal = bdlde::CharConvertUtf16::utf16ToUtf8(&s,
                                                v16.begin(),
                                                &utf8CodePointsWritten);
Finally, we verify that a successful status was returned, that the output of the reverse transform was identical to the original input, and that the number of code points translated was as expected:
  assert(0 == retVal);
  assert(utf8MultiLang == s);
  assert(s.length() + 1               == sizeof(utf8MultiLang));

  assert(EXPECTED_CODE_POINTS_WRITTEN == utf8CodePointsWritten);
  assert(utf16CodePointsWritten       == utf8CodePointsWritten);