Component bdlde_utf8util
[Package bdlde]

Provide basic utilities for UTF-8 encodings.

Namespaces

namespace  bdlde

Detailed Description

Purpose:
Provide basic utilities for UTF-8 encodings.
Classes:
bdlde::Utf8Util namespace for utilities for UTF-8 encodings
Description:
This component provides, within the bdlde::Utf8Util struct, a suite of static functions supporting UTF-8 encoded strings. Two interfaces are provided for each function, one where the length of the string (in bytes) is passed as a separate argument, and one where the string is passed as a null-terminated C-style string.
A string is deemed to contain valid UTF-8 if it complies with RFC 3629, which allows only 1-, 2-, 3-, and 4-byte sequences. Code points above U+10ffff are not allowed either.
Six types of functions are provided:
  • isValid, which checks for validity, per RFC 3629, of a (candidate) UTF-8 string. "Overlong values", that is, values encoded in more bytes than necessary, are not tolerated; nor are "surrogate values", which are values in the range [U+d800 .. U+dfff].
  • advanceIfValid and advanceRaw, which advance some number of Unicode code points, each of which may be encoded in multiple bytes in a UTF-8 string. advanceRaw assumes the string is valid UTF-8, while advanceIfValid checks the input for validity and stops advancing if a sequence is encountered that is not valid UTF-8.
  • numCodePointsIfValid and numCodePointsRaw, which return the number of Unicode code points in a UTF-8 string. Note that numCodePointsIfValid both validates a (candidate) UTF-8 string and counts the number of Unicode code points that it contains.
  • numBytesIfValid, which returns the number of bytes a specified number of Unicode code points occupy in a UTF-8 string.
  • getByteSize, which returns the length, in bytes, of a single UTF-8 encoded character.
  • appendUtf8Character, which appends a single Unicode code point to a UTF-8 string.
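The validity rules listed above (no overlong values, no surrogates, nothing above U+10ffff, and well-formed continuation bytes) can be illustrated with a minimal standalone sketch. The helper sketchIsValidUtf8 is hypothetical and uses only the C++ standard library; it is not this component's isValid and makes no claim about how that function is implemented.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not part of this component): return 'true' if the
// 'length' bytes at 'data' form valid UTF-8 per RFC 3629.
bool sketchIsValidUtf8(const char *data, std::size_t length)
{
    assert(data || 0 == length);   // a null pointer is allowed only if empty

    const unsigned char *p   = reinterpret_cast<const unsigned char *>(data);
    const unsigned char *end = p + length;

    while (p < end) {
        unsigned long value;              // code point being decoded
        int           numContinuations;   // continuation bytes expected
        unsigned long minValue;           // smallest value needing this size

        if (*p < 0x80) {                         // 0xxxxxxx: 1 byte
            ++p;
            continue;
        }
        else if (0xc0 == (*p & 0xe0)) {          // 110xxxxx: 2 bytes
            value = *p & 0x1f;  numContinuations = 1;  minValue = 0x80;
        }
        else if (0xe0 == (*p & 0xf0)) {          // 1110xxxx: 3 bytes
            value = *p & 0x0f;  numContinuations = 2;  minValue = 0x800;
        }
        else if (0xf0 == (*p & 0xf8)) {          // 11110xxx: 4 bytes
            value = *p & 0x07;  numContinuations = 3;  minValue = 0x10000;
        }
        else {
            return false;   // stray continuation byte or 5+ byte lead byte
        }
        ++p;

        for (int i = 0; i < numContinuations; ++i, ++p) {
            if (p == end || 0x80 != (*p & 0xc0)) {
                return false;   // truncated or invalid continuation byte
            }
            value = (value << 6) | (*p & 0x3f);
        }

        if (value < minValue)                   return false;  // overlong
        if (0xd800 <= value && value <= 0xdfff) return false;  // surrogate
        if (value > 0x10ffff)                   return false;  // too large
    }
    return true;
}
```

The sketch reports only success or failure; the functions of this component additionally report the kind of error and where it occurred.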
Embedded null bytes are allowed in strings that are accompanied by an explicit length argument. Naturally, null-terminated C-style strings cannot contain embedded null code points.
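The difference between the two interface styles can be seen with the standard library alone. In the following sketch, the hypothetical helper sketchLengthAsCString shows how many bytes a null-terminated interface would see; it is an illustration only and does not call this component.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical helper: return the number of bytes visible through a
// null-terminated interface handed 's.c_str()'.  Anything after the first
// null byte is invisible to such an interface.
std::size_t sketchLengthAsCString(const std::string& s)
{
    const std::size_t visible = std::strlen(s.c_str());
    assert(visible <= s.length());   // never more than the explicit length
    return visible;
}
```

For a string holding "ab", a null byte, and "cd", the explicit length is 5 while the helper returns 2, which is why strings with embedded nulls should be passed with their length.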
The UTF-8 format is described in the RFC 3629 document at:
  http://tools.ietf.org/html/rfc3629
and in Wikipedia at:
  http://en.wikipedia.org/wiki/Utf-8
Empty Input Strings:
The utility functions provided by this component consider the empty string to be valid UTF-8. For those functions that take input as a (pointer, length) pair, if 0 == pointer and 0 == length, then the input is interpreted as a valid, empty string. However, if 0 == pointer and 0 != length, the behavior is undefined. All such functions have a counterpart that takes a lone pointer to a null-terminated (C-style) string. The behavior is always undefined if 0 is supplied for that lone pointer.
Usage:
In this section we show intended use of this component.
Example 1: Validating Strings and Counting Unicode Code Points:
In this usage example, we will encode some Unicode code points in UTF-8 strings and demonstrate those that are valid and those that are not.
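As background, the bit layout that RFC 3629 prescribes for encoding a code point can be sketched as follows. The helper sketchAppendUtf8 is hypothetical and standalone; it is not this component's appendUtf8CodePoint, and it simply asserts on input that is not a Unicode scalar value.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (not this component's 'appendUtf8CodePoint'): append
// the UTF-8 encoding of 'codePoint' to '*output'.  The caller must supply a
// value that is at most 0x10ffff and is not a surrogate.
void sketchAppendUtf8(std::string *output, unsigned long codePoint)
{
    assert(output);
    assert(codePoint <= 0x10ffff);
    assert(codePoint < 0xd800 || 0xdfff < codePoint);

    if (codePoint < 0x80) {                  // 1 byte:  0xxxxxxx
        *output += static_cast<char>(codePoint);
    }
    else if (codePoint < 0x800) {            // 2 bytes: 110xxxxx 10xxxxxx
        *output += static_cast<char>(0xc0 |  (codePoint >>  6));
        *output += static_cast<char>(0x80 |  (codePoint         & 0x3f));
    }
    else if (codePoint < 0x10000) {          // 3 bytes: 1110xxxx + 2 more
        *output += static_cast<char>(0xe0 |  (codePoint >> 12));
        *output += static_cast<char>(0x80 | ((codePoint >>  6) & 0x3f));
        *output += static_cast<char>(0x80 |  (codePoint         & 0x3f));
    }
    else {                                   // 4 bytes: 11110xxx + 3 more
        *output += static_cast<char>(0xf0 |  (codePoint >> 18));
        *output += static_cast<char>(0x80 | ((codePoint >> 12) & 0x3f));
        *output += static_cast<char>(0x80 | ((codePoint >>  6) & 0x3f));
        *output += static_cast<char>(0x80 |  (codePoint         & 0x3f));
    }
}
```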
First, we build an unquestionably valid UTF-8 string, string, containing 9 Unicode code points.
Then, we check its validity and measure its length:
  assert(true == bdlde::Utf8Util::isValid(string.data(), string.length()));
  assert(true == bdlde::Utf8Util::isValid(string.c_str()));

  assert(   9 == bdlde::Utf8Util::numCodePointsRaw(string.data(),
                                                   string.length()));
  assert(   9 == bdlde::Utf8Util::numCodePointsRaw(string.c_str()));
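For input already known to be valid, counting code points reduces to counting the bytes that are not continuation bytes (those not of the form 10xxxxxx). The following standalone sketch illustrates that idea; sketchNumCodePointsRaw is a hypothetical helper, not this component's numCodePointsRaw.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: return the number of code points in the 'length'
// bytes at 'data', which are assumed (not checked) to be valid UTF-8.
std::ptrdiff_t sketchNumCodePointsRaw(const char *data, std::size_t length)
{
    assert(data || 0 == length);

    std::ptrdiff_t count = 0;
    for (std::size_t i = 0; i < length; ++i) {
        // Every code point contributes exactly one non-continuation byte.
        if (0x80 != (static_cast<unsigned char>(data[i]) & 0xc0)) {
            ++count;
        }
    }
    return count;
}
```

Because the length is explicit, an embedded null byte is counted as one more code point.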
Next, we append a lone surrogate value, 0xd8ab, written as the raw 3-byte sequence "\xed\xa2\xab" so that it bypasses validation:
  bsl::string stringWithSurrogate = string + "\xed\xa2\xab";

  assert(false == bdlde::Utf8Util::isValid(stringWithSurrogate.data(),
                                           stringWithSurrogate.length()));
  assert(false == bdlde::Utf8Util::isValid(stringWithSurrogate.c_str()));
We cannot use numCodePointsRaw to count the code points in stringWithSurrogate, since the behavior of that method is undefined unless the string is valid. Instead, we use numCodePointsIfValid, which is intended for strings whose validity is uncertain:
  const char *invalidPosition = 0;

  bsls::Types::IntPtr rc;
  rc = bdlde::Utf8Util::numCodePointsIfValid(&invalidPosition,
                                             stringWithSurrogate.data(),
                                             stringWithSurrogate.length());
  assert(rc < 0);
  assert(bdlde::Utf8Util::k_SURROGATE == rc);
  assert(invalidPosition == stringWithSurrogate.data() + string.length());

  invalidPosition = 0;  // reset

  rc = bdlde::Utf8Util::numCodePointsIfValid(&invalidPosition,
                                             stringWithSurrogate.c_str());
  assert(rc < 0);
  assert(bdlde::Utf8Util::k_SURROGATE == rc);
  assert(invalidPosition == stringWithSurrogate.data() + string.length());
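To see why "\xed\xa2\xab" is rejected, the three bytes can be decoded by hand. The sketch below uses two hypothetical standalone helpers, sketchDecode3 and sketchIsSurrogate, that are not part of this component.

```cpp
#include <cassert>

// Hypothetical helper: decode the 3-byte sequence 'b0 b1 b2', of the form
// 1110xxxx 10xxxxxx 10xxxxxx, into its value without validating the result.
unsigned long sketchDecode3(unsigned char b0,
                            unsigned char b1,
                            unsigned char b2)
{
    assert(0xe0 == (b0 & 0xf0));   // lead byte of a 3-byte sequence
    assert(0x80 == (b1 & 0xc0));   // continuation byte
    assert(0x80 == (b2 & 0xc0));   // continuation byte

    return ((b0 & 0x0fUL) << 12) | ((b1 & 0x3fUL) << 6) | (b2 & 0x3fUL);
}

// Hypothetical helper: return 'true' if 'value' lies in the surrogate range
// [U+d800 .. U+dfff].
bool sketchIsSurrogate(unsigned long value)
{
    return 0xd800 <= value && value <= 0xdfff;
}
```

Here 0xed contributes 0xd, 0xa2 contributes 0x22, and 0xab contributes 0x2b, giving (0xd << 12) | (0x22 << 6) | 0x2b, which is 0xd8ab and falls in the surrogate range.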
Now, we append the code point 0 (a null byte), which is allowed. However, note that we cannot use any of the interfaces that take a null-terminated string in this case:
  bsl::string stringWithNull = string;
  stringWithNull += '\0';

  assert(true == bdlde::Utf8Util::isValid(stringWithNull.data(),
                                          stringWithNull.length()));

  assert(  10 == bdlde::Utf8Util::numCodePointsRaw(stringWithNull.data(),
                                                   stringWithNull.length()));
Finally, we encode 0x3a (:) as an overlong value using 2 bytes, which is not valid UTF-8 (since : can be "encoded" in 1 byte):
  bsl::string stringWithOverlong = string;
  stringWithOverlong += static_cast<char>(0xc0);        // start of 2-byte
                                                        // sequence
  stringWithOverlong += static_cast<char>(0x80 | ':');  // continuation byte

  assert(false == bdlde::Utf8Util::isValid(stringWithOverlong.data(),
                                           stringWithOverlong.length()));
  assert(false == bdlde::Utf8Util::isValid(stringWithOverlong.c_str()));

  rc = bdlde::Utf8Util::numCodePointsIfValid(&invalidPosition,
                                             stringWithOverlong.data(),
                                             stringWithOverlong.length());
  assert(rc < 0);
  assert(bdlde::Utf8Util::k_OVERLONG_ENCODING == rc);
  assert(invalidPosition == stringWithOverlong.data() + string.length());

  rc = bdlde::Utf8Util::numCodePointsIfValid(&invalidPosition,
                                             stringWithOverlong.c_str());
  assert(rc < 0);
  assert(bdlde::Utf8Util::k_OVERLONG_ENCODING == rc);
  assert(invalidPosition == stringWithOverlong.data() + string.length());
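The overlong check for the 2-byte case can be sketched in the same spirit: decode the two bytes and compare the result with 0x80, the smallest value that actually needs two bytes. The helpers sketchDecode2 and sketchIsOverlong2 are hypothetical and standalone.

```cpp
#include <cassert>

// Hypothetical helper: decode the 2-byte sequence 'b0 b1', of the form
// 110xxxxx 10xxxxxx, into its value without validating the result.
unsigned long sketchDecode2(unsigned char b0, unsigned char b1)
{
    assert(0xc0 == (b0 & 0xe0));   // lead byte of a 2-byte sequence
    assert(0x80 == (b1 & 0xc0));   // continuation byte

    return ((b0 & 0x1fUL) << 6) | (b1 & 0x3fUL);
}

// Hypothetical helper: return 'true' if the 2-byte sequence 'b0 b1' encodes
// a value that would have fit in a single byte.
bool sketchIsOverlong2(unsigned char b0, unsigned char b1)
{
    return sketchDecode2(b0, b1) < 0x80;
}
```

For the bytes used above, 0xc0 and 0x80 | ':' (that is, 0xba), the decoded value is 0x3a, which is below 0x80, so the sequence is overlong.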
Example 2: Advancing Over a Given Number of Code Points:
In this example, we will use the various advance functions to advance through a UTF-8 string.
First, build the string using appendUtf8CodePoint, keeping track of how many bytes are in each Unicode code point:
  bsl::string string;
  bdlde::Utf8Util::appendUtf8CodePoint(&string, 0xff00);        // 3 bytes
  bdlde::Utf8Util::appendUtf8CodePoint(&string, 0x1ff);         // 2 bytes
  bdlde::Utf8Util::appendUtf8CodePoint(&string, 'a');           // 1 byte
  bdlde::Utf8Util::appendUtf8CodePoint(&string, 0x1008aa);      // 4 bytes
  bdlde::Utf8Util::appendUtf8CodePoint(&string, 0x1abcd);       // 4 bytes
  string += "\xe3\x8f\xfe";           // 3 bytes (invalid 3-byte sequence,
                                      // the first 2 bytes are valid but the
                                      // last continuation byte is invalid)
  bdlde::Utf8Util::appendUtf8CodePoint(&string, 'w');           // 1 byte
  bdlde::Utf8Util::appendUtf8CodePoint(&string, '\n');          // 1 byte
Then, declare a few variables we'll need:
  bsls::Types::IntPtr  rc;
  int                  status;
  const char          *result;
  const char *const start = string.c_str();
Next, try advancing 2 code points, then 3, then 4, observing that the value returned is the number of Unicode code points advanced. Note that since we're only advancing over valid UTF-8, we can use either advanceRaw or advanceIfValid:
  rc = bdlde::Utf8Util::advanceRaw(             &result, start, 2);
  assert(2 == rc);
  assert(3 + 2 == result - start);

  rc = bdlde::Utf8Util::advanceIfValid(&status, &result, start, 2);
  assert(0 == status);
  assert(2 == rc);
  assert(3 + 2 == result - start);

  rc = bdlde::Utf8Util::advanceRaw(             &result, start, 3);
  assert(3 == rc);
  assert(3 + 2 + 1 == result - start);

  rc = bdlde::Utf8Util::advanceIfValid(&status, &result, start, 3);
  assert(0 == status);
  assert(3 == rc);
  assert(3 + 2 + 1 == result - start);

  rc = bdlde::Utf8Util::advanceRaw(             &result, start, 4);
  assert(4 == rc);
  assert(3 + 2 + 1 + 4 == result - start);

  rc = bdlde::Utf8Util::advanceIfValid(&status, &result, start, 4);
  assert(0 == status);
  assert(4 == rc);
  assert(3 + 2 + 1 + 4 == result - start);
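The byte offsets asserted above follow from the lead byte of each sequence, which determines how many bytes the code point occupies. A minimal standalone sketch of advancing over valid, null-terminated input is shown below; sketchSequenceLength and sketchAdvanceRaw are hypothetical helpers, not this component's functions.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: return the length, in bytes, of the sequence that
// begins with 'lead', which is assumed to be a valid lead byte.
int sketchSequenceLength(unsigned char lead)
{
    if (lead < 0x80)            return 1;   // 0xxxxxxx
    if (0xc0 == (lead & 0xe0))  return 2;   // 110xxxxx
    if (0xe0 == (lead & 0xf0))  return 3;   // 1110xxxx
    assert(0xf0 == (lead & 0xf8));          // 11110xxx
    return 4;
}

// Hypothetical helper: advance over up to 'numCodePoints' code points of
// the null-terminated 'string', which is assumed (not checked) to be valid
// UTF-8.  Load the resulting position into '*result' and return the number
// of code points actually advanced over.
std::ptrdiff_t sketchAdvanceRaw(const char     **result,
                                const char      *string,
                                std::ptrdiff_t   numCodePoints)
{
    assert(result);
    assert(string);
    assert(0 <= numCodePoints);

    std::ptrdiff_t count = 0;
    while (count < numCodePoints && *string) {
        string += sketchSequenceLength(static_cast<unsigned char>(*string));
        ++count;
    }
    *result = string;
    return count;
}
```

Because nothing is validated, a sketch like this is only meaningful on input already known to be valid, which mirrors the precondition stated for advanceRaw.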
Then, use advanceIfValid to try advancing by more code points than are present; it stops when it encounters invalid input. The behavior of advanceRaw is undefined if it is used on invalid input, so we cannot use it here. Also note that we stop at the beginning of the invalid Unicode code point, not at the first incorrect byte, which is two bytes later:
  rc = bdlde::Utf8Util::advanceIfValid(&status, &result, start, INT_MAX);
  assert(0 != status);
  assert(5 == rc);
  assert(3 + 2 + 1 + 4 + 4                 == result - start);
  assert(static_cast<int>(string.length()) >  result - start);
Now, doctor the string to replace the invalid code point with a valid one, so the string is entirely correct UTF-8:
  string[3 + 2 + 1 + 4 + 4 + 2] = static_cast<char>(0x8a);
Finally, advance using both functions by more code points than are in the string; in both cases we wind up at the end of the string. Note that advanceIfValid does not load an error (non-zero) value into status when it encounters the end of the string:
  rc = bdlde::Utf8Util::advanceRaw(             &result, start, INT_MAX);
  assert(8 == rc);
  assert(3 + 2 + 1 + 4 + 4 + 3 + 1 + 1     == result - start);
  assert(static_cast<int>(string.length()) == result - start);

  rc = bdlde::Utf8Util::advanceIfValid(&status, &result, start, INT_MAX);
  assert(0 == status);
  assert(8 == rc);
  assert(3 + 2 + 1 + 4 + 4 + 3 + 1 + 1     == result - start);
  assert(static_cast<int>(string.length()) == result - start);
Example 3: Validating UTF-8 Read from a bsl::streambuf:
In this usage example, we will demonstrate reading and validating UTF-8 from a stream.
We write a function to read valid UTF-8 to a bsl::string. We don't know how long the input will be, so we don't know how long to make the string before we start. We will grow the string in small, 32-byte increments.
  int utf8StreambufToString(bsl::string    *output,
                            bsl::streambuf *sb)
      // Read valid UTF-8 from the specified streambuf 'sb' to the specified
      // 'output'.  Return 0 if the input was exhausted without encountering
      // any invalid UTF-8, and a non-zero value otherwise.  If invalid UTF-8
      // is encountered, log a message describing the problem after loading
      // all the valid UTF-8 preceding it into 'output'.  Note that after the
      // call, in no case will 'output' contain any invalid UTF-8.
  {
      enum { k_READ_LENGTH = 32 };

      output->clear();
      while (true) {
          bsl::size_t len = output->length();
          output->resize(len + k_READ_LENGTH);
          int status;
          bsls::Types::IntPtr numBytes = bdlde::Utf8Util::readIfValid(
                                                             &status,
                                                             &(*output)[len],
                                                             k_READ_LENGTH,
                                                             sb);
          BSLS_ASSERT(0 <= numBytes);
          BSLS_ASSERT(numBytes <= k_READ_LENGTH);

          output->resize(len + numBytes);
          if (0 < status) {
              // Buffer was full before the end of input was encountered.
              // Note that 'numBytes' may be up to 3 bytes less than
              // 'k_READ_LENGTH'.

              BSLS_ASSERT(k_READ_LENGTH - 4 < numBytes);

              // Go on to grow the string and get more input.

              continue;
          }
          else if (0 == status) {
              // Success!  We've reached the end of input without
              // encountering any invalid UTF-8.

              return 0;                                             // RETURN
          }
          else {
              // Invalid UTF-8 encountered; the value of 'status' indicates
              // the exact nature of the problem.  'numBytes' returned from
              // the above call indicated the number of valid UTF-8 bytes
              // read before encountering the invalid UTF-8.

              BSLS_LOG_ERROR("Invalid UTF-8 error %s at position %u.\n",
                             bdlde::Utf8Util::toAscii(status),
                             static_cast<unsigned>(output->length()));

              return -1;                                            // RETURN
          }
      }
  }
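The comment in the function above notes that numBytes may be up to 3 bytes less than k_READ_LENGTH. One way to picture why: a multi-byte sequence may straddle the end of the buffer, and its leading bytes cannot be kept without the rest. The following standalone sketch computes the longest prefix of a buffer that ends on a code point boundary. The helper sketchCompletePrefixLength is hypothetical, assumes the bytes are valid UTF-8 apart from a possibly truncated final sequence, and makes no claim about how readIfValid is implemented.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: return the length of the longest prefix of the
// 'length' bytes at 'data' that does not end in the middle of a multi-byte
// sequence.  The bytes are assumed to be valid UTF-8 except that the final
// sequence may be truncated.
std::size_t sketchCompletePrefixLength(const char *data, std::size_t length)
{
    assert(data || 0 == length);

    // Look back over at most 4 bytes for the last non-continuation byte.
    for (std::size_t back = 1; back <= 4 && back <= length; ++back) {
        const unsigned char c =
                         static_cast<unsigned char>(data[length - back]);
        if (0x80 == (c & 0xc0)) {
            continue;   // continuation byte; keep looking further back
        }

        // 'c' starts the final sequence; 'need' is its full length.
        const std::size_t need = c < 0x80            ? 1
                               : 0xc0 == (c & 0xe0)  ? 2
                               : 0xe0 == (c & 0xf0)  ? 3
                               :                       4;

        // Keep the final sequence only if all of its bytes are present.
        return need <= back ? length : length - back;
    }
    return length;   // empty input, or no lead byte found (invalid input)
}
```

Since a sequence is at most 4 bytes long, at most 3 trailing bytes are ever dropped, which matches the bound asserted in the function above.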