BDE 4.14.0 Production release
bdlde_utf8util

Detailed Description

Purpose

Provide basic utilities for UTF-8 encodings.

Classes

bdlde::Utf8Util: struct of static utility functions for UTF-8 encoded strings

Description

This component provides, within the bdlde::Utf8Util struct, a suite of static functions supporting UTF-8 encoded strings. Two interfaces are provided for each function, one where the length of the string (in bytes) is passed as a separate argument, and one where the string is passed as a null-terminated C-style string.

A string is deemed to contain valid UTF-8 if it is compliant with RFC 3629, meaning that only 1-, 2-, 3-, and 4-byte sequences are allowed. Values above U+10ffff are not allowed, nor are surrogate values or overlong encodings (both demonstrated in Example 1).
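The validity rules above can be made concrete with a small standalone sketch. This is not the implementation used by bdlde::Utf8Util; the function name isValidUtf8Sketch is hypothetical, and the code only illustrates the checks that RFC 3629 implies (sequence length, continuation bytes, shortest-form encoding, no surrogates, no values above U+10ffff):

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper, not part of BDE: return 'true' if the 'length' bytes
// at 'data' are well-formed UTF-8 under the RFC 3629 rules.
bool isValidUtf8Sketch(const char *data, std::size_t length)
{
    const unsigned char *p   = reinterpret_cast<const unsigned char *>(data);
    const unsigned char *end = p + length;

    while (p < end) {
        unsigned int codePoint;
        unsigned int minValue;         // smallest value needing this length
        int          numContinuation;  // continuation bytes still expected

        if (*p < 0x80) {                        // 0xxxxxxx: 1 byte
            ++p;
            continue;
        }
        else if ((*p & 0xe0) == 0xc0) {         // 110xxxxx: 2 bytes
            codePoint = *p & 0x1f; minValue = 0x80;    numContinuation = 1;
        }
        else if ((*p & 0xf0) == 0xe0) {         // 1110xxxx: 3 bytes
            codePoint = *p & 0x0f; minValue = 0x800;   numContinuation = 2;
        }
        else if ((*p & 0xf8) == 0xf0) {         // 11110xxx: 4 bytes
            codePoint = *p & 0x07; minValue = 0x10000; numContinuation = 3;
        }
        else {
            return false;  // stray continuation byte, or lead byte >= 0xf8
        }
        ++p;
        for (int i = 0; i < numContinuation; ++i, ++p) {
            if (p == end || (*p & 0xc0) != 0x80) {
                return false;  // truncated, or not a 10xxxxxx byte
            }
            codePoint = (codePoint << 6) | (*p & 0x3f);
        }
        if (codePoint < minValue) {
            return false;  // overlong encoding
        }
        if (0xd800 <= codePoint && codePoint <= 0xdfff) {
            return false;  // surrogate value
        }
        if (codePoint > 0x10ffff) {
            return false;  // beyond the Unicode range
        }
    }
    return true;
}
```

The sketch takes a (pointer, length) pair so that embedded null bytes are handled like any other 1-byte code point.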

Seven types of functions are provided:

Embedded null bytes are allowed in strings that are accompanied by an explicit length argument. Naturally, null-terminated C-style strings cannot contain embedded null code points.
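The difference between the two interfaces can be seen with standard library types alone. The following sketch uses std::string rather than bsl::string, and both helper names are hypothetical; it shows that a null-terminated view of a string stops at the first embedded null byte, while the explicit length does not:

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical helper, not part of BDE: build the 5-byte string consisting
// of "ab", one null byte, and "cd".
std::string makeStringWithEmbeddedNull()
{
    std::string s("ab");
    s += '\0';
    s += "cd";
    return s;
}

// Hypothetical helper: number of bytes that an interface taking only a
// null-terminated pointer would be able to see in 's'.
std::size_t bytesVisibleToCStringInterface(const std::string& s)
{
    return std::strlen(s.c_str());
}
```

For the string built above, the explicit length is 5 but only the first 2 bytes are visible through the null-terminated pointer.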

The UTF-8 format is described in the RFC 3629 document at:

http://tools.ietf.org/html/rfc3629

and in Wikipedia at:

http://en.wikipedia.org/wiki/Utf-8

Empty Input Strings

The utility functions provided by this component consider the empty string to be valid UTF-8. For those functions that take input as a (pointer, length) pair, if 0 == pointer and 0 == length, then the input is interpreted as a valid, empty string. However, if 0 == pointer and 0 != length, the behavior is undefined. All such functions have a counterpart that takes a lone pointer to a null-terminated (C-style) string. The behavior is always undefined if 0 is supplied for that lone pointer.
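As an illustration of the (pointer, length) contract, here is a minimal standalone sketch of a counting function that accepts (0, 0) as the empty string. The name countCodePointsSketch is hypothetical and this is not how bdlde::Utf8Util is implemented; like numCodePointsRaw, it assumes the input is already valid UTF-8:

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper, not part of BDE: count the code points in the
// 'length' bytes at 'data', assumed to be valid UTF-8.  In valid UTF-8 every
// code point has exactly one byte that is not a continuation byte
// (10xxxxxx), so counting those bytes counts code points.
std::ptrdiff_t countCodePointsSketch(const char *data, std::size_t length)
{
    // '(0, 0)' denotes the empty string; a null pointer with a non-zero
    // length is a contract violation.
    assert(data || 0 == length);

    std::ptrdiff_t count = 0;
    for (std::size_t i = 0; i < length; ++i) {
        if ((static_cast<unsigned char>(data[i]) & 0xc0) != 0x80) {
            ++count;
        }
    }
    return count;
}
```

Because the loop body never runs when length is 0, the null pointer is never dereferenced for the empty case.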

Usage

This section illustrates intended use of this component.

Example 1: Validating Strings and Counting Unicode Code Points

In this usage example, we will encode some Unicode code points in UTF-8 strings and show which of the resulting strings are valid and which are not.

First, we build an unquestionably valid UTF-8 string:

bsl::string string;

Then, we check its validity and measure its length:

assert(true == bdlde::Utf8Util::isValid(string.data(), string.length()));
assert(true == bdlde::Utf8Util::isValid(string.c_str()));
assert( 9 == bdlde::Utf8Util::numCodePointsRaw(string.data(),
string.length()));
assert( 9 == bdlde::Utf8Util::numCodePointsRaw(string.c_str()));

Next, we append a lone surrogate value, 0xd8ab, which we write directly as the raw 3-byte sequence "\xed\xa2\xab" to avoid validation:

bsl::string stringWithSurrogate = string + "\xed\xa2\xab";
assert(false == bdlde::Utf8Util::isValid(stringWithSurrogate.data(),
stringWithSurrogate.length()));
assert(false == bdlde::Utf8Util::isValid(stringWithSurrogate.c_str()));
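To see where the raw bytes for the surrogate come from, the following standalone sketch applies the 3-byte UTF-8 bit layout to a value without any validity checks. Both helper names are hypothetical and are not part of BDE:

```cpp
#include <cassert>
#include <string>

// Hypothetical helper, not part of BDE: apply the 3-byte UTF-8 bit layout
// (1110xxxx 10xxxxxx 10xxxxxx) to 'value' with no validity checks at all.
std::string encodeThreeBytesUnchecked(unsigned int value)
{
    std::string out;
    out += static_cast<char>(0xe0 | ((value >> 12) & 0x0f));  // top 4 bits
    out += static_cast<char>(0x80 | ((value >>  6) & 0x3f));  // middle 6 bits
    out += static_cast<char>(0x80 | ( value        & 0x3f));  // low 6 bits
    return out;
}

// Hypothetical helper: 'true' if 'value' lies in the surrogate range that
// RFC 3629 excludes from UTF-8.
bool isSurrogateSketch(unsigned int value)
{
    return 0xd800 <= value && value <= 0xdfff;
}
```

Applied to 0xd8ab, the bit layout yields exactly the bytes 0xed 0xa2 0xab used above; the sequence is structurally well formed, and it is only the range check that makes it invalid.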

Note that we cannot use numCodePointsRaw to count the code points in stringWithSurrogate, since the behavior of that method is undefined unless the string is valid. Instead, the numCodePointsIfValid method can be used on strings whose validity is uncertain:

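const char *invalidPosition = 0;
bdlde::Utf8Util::IntPtr rc;
rc = bdlde::Utf8Util::numCodePointsIfValid(&invalidPosition,
                                           stringWithSurrogate.data(),
                                           stringWithSurrogate.length());
assert(rc < 0);
assert(bdlde::Utf8Util::k_SURROGATE == rc);
assert(invalidPosition == stringWithSurrogate.data() + string.length());
invalidPosition = 0; // reset
rc = bdlde::Utf8Util::numCodePointsIfValid(&invalidPosition,
                                           stringWithSurrogate.c_str());
assert(rc < 0);
assert(bdlde::Utf8Util::k_SURROGATE == rc);
assert(invalidPosition == stringWithSurrogate.data() + string.length());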

Now, we encode 0, which is allowed. However, note that we cannot use any interfaces that take a null-terminated string for this case:

bsl::string stringWithNull = string;
stringWithNull += '\0';
assert(true == bdlde::Utf8Util::isValid(stringWithNull.data(),
stringWithNull.length()));
assert( 10 == bdlde::Utf8Util::numCodePointsRaw(stringWithNull.data(),
stringWithNull.length()));

Finally, we encode 0x3a (:) as an overlong value using 2 bytes, which is not valid UTF-8 (since : can be "encoded" in 1 byte):

bsl::string stringWithOverlong = string;
stringWithOverlong += static_cast<char>(0xc0); // start of 2-byte
// sequence
stringWithOverlong += static_cast<char>(0x80 | ':'); // continuation byte
assert(false == bdlde::Utf8Util::isValid(stringWithOverlong.data(),
stringWithOverlong.length()));
assert(false == bdlde::Utf8Util::isValid(stringWithOverlong.c_str()));
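rc = bdlde::Utf8Util::numCodePointsIfValid(&invalidPosition,
                                           stringWithOverlong.data(),
                                           stringWithOverlong.length());
assert(rc < 0);
assert(bdlde::Utf8Util::k_OVERLONG_ENCODING == rc);
assert(invalidPosition == stringWithOverlong.data() + string.length());
rc = bdlde::Utf8Util::numCodePointsIfValid(&invalidPosition,
                                           stringWithOverlong.c_str());
assert(rc < 0);
assert(bdlde::Utf8Util::k_OVERLONG_ENCODING == rc);
assert(invalidPosition == stringWithOverlong.data() + string.length());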
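The arithmetic behind the overlong case can be checked in isolation. The sketch below is standalone and both helper names are hypothetical; it decodes a 2-byte sequence purely by bit extraction and then applies the shortest-form rule:

```cpp
#include <cassert>

// Hypothetical helper, not part of BDE: decode a 2-byte sequence
// (110xxxxx 10xxxxxx) by bit extraction only, without checking whether the
// encoding is the shortest possible one.
unsigned int decodeTwoBytesNaive(unsigned char lead,
                                 unsigned char continuation)
{
    return ((lead & 0x1fu) << 6) | (continuation & 0x3fu);
}

// Hypothetical helper: a 2-byte sequence is overlong when the decoded value
// is below 0x80, that is, when it would have fit in a single byte.
bool isOverlongTwoByte(unsigned char lead, unsigned char continuation)
{
    return decodeTwoBytesNaive(lead, continuation) < 0x80;
}
```

The bytes appended above, 0xc0 followed by 0x80 | ':' (0xba), decode to 0x3a, which is why a naive decoder would accept them as ':' while a conforming validator must reject them.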

Example 2: Advancing Over a Given Number of Code Points

In this example, we will use the various advance functions to advance through a UTF-8 string.

First, build the string using appendUtf8CodePoint, keeping track of how many bytes are in each Unicode code point:

bsl::string string;
bdlde::Utf8Util::appendUtf8CodePoint(&string, 0xff00); // 3 bytes
bdlde::Utf8Util::appendUtf8CodePoint(&string, 0x1ff); // 2 bytes
bdlde::Utf8Util::appendUtf8CodePoint(&string, 'a'); // 1 byte
bdlde::Utf8Util::appendUtf8CodePoint(&string, 0x1008aa); // 4 bytes
bdlde::Utf8Util::appendUtf8CodePoint(&string, 0x1abcd); // 4 bytes
string += "\xe3\x8f\xfe"; // 3 bytes (invalid 3-byte sequence,
// the first 2 bytes are valid but the
// last continuation byte is invalid)
bdlde::Utf8Util::appendUtf8CodePoint(&string, 'w'); // 1 byte
bdlde::Utf8Util::appendUtf8CodePoint(&string, '\n'); // 1 byte
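The byte counts in the comments above follow directly from the value ranges that each UTF-8 sequence length covers. A minimal standalone sketch, with the hypothetical name utf8EncodedLength (not a BDE function), is:

```cpp
#include <cassert>

// Hypothetical helper, not part of BDE: return the number of bytes UTF-8
// needs to encode 'codePoint', or 0 if the value cannot be encoded (a
// surrogate, or a value above U+10ffff).
int utf8EncodedLength(unsigned int codePoint)
{
    if (codePoint < 0x80) {
        return 1;                                   // 7 bits
    }
    if (codePoint < 0x800) {
        return 2;                                   // 11 bits
    }
    if (codePoint < 0x10000) {                      // 16 bits
        return (0xd800 <= codePoint && codePoint <= 0xdfff) ? 0 : 3;
    }
    return codePoint <= 0x10ffff ? 4 : 0;           // 21 bits, capped
}
```

With this helper, the five valid code points at the start of the string account for 3 + 2 + 1 + 4 + 4 = 14 bytes, which is the offset of the invalid sequence.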

Then, declare a few variables we'll need:

bdlde::Utf8Util::IntPtr rc;
int status;
const char *result;
const char *const start = string.c_str();

Next, try advancing 2 code points, then 3, then 4, observing that the value returned is the number of Unicode code points advanced. Note that since we're only advancing over valid UTF-8, we can use either advanceRaw or advanceIfValid:

rc = bdlde::Utf8Util::advanceRaw( &result, start, 2);
assert(2 == rc);
assert(3 + 2 == result - start);
rc = bdlde::Utf8Util::advanceIfValid(&status, &result, start, 2);
assert(0 == status);
assert(2 == rc);
assert(3 + 2 == result - start);
rc = bdlde::Utf8Util::advanceRaw( &result, start, 3);
assert(3 == rc);
assert(3 + 2 + 1 == result - start);
rc = bdlde::Utf8Util::advanceIfValid(&status, &result, start, 3);
assert(0 == status);
assert(3 == rc);
assert(3 + 2 + 1 == result - start);
rc = bdlde::Utf8Util::advanceRaw( &result, start, 4);
assert(4 == rc);
assert(3 + 2 + 1 + 4 == result - start);
rc = bdlde::Utf8Util::advanceIfValid(&status, &result, start, 4);
assert(0 == status);
assert(4 == rc);
assert(3 + 2 + 1 + 4 == result - start);

Then, try advancing by more code points than are present using advanceIfValid, and wind up stopping when we encounter invalid input. The behavior of advanceRaw is undefined if it is used on invalid input, so we cannot use it here. Also note that we will stop at the beginning of the invalid Unicode code point, and not at the first incorrect byte, which is two bytes later:

rc = bdlde::Utf8Util::advanceIfValid(&status, &result, start, INT_MAX);
assert(0 != status);
assert(5 == rc);
assert(3 + 2 + 1 + 4 + 4 == result - start);
assert(static_cast<int>(string.length()) > result - start);
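The stopping behavior described here, halting at the start of the offending code point rather than at the offending byte, can be reproduced with a small standalone sketch. The names sequenceLengthSketch and advanceSketch are hypothetical, the interface is simplified relative to advanceIfValid, and only structural checks are made (overlong, surrogate, and range checks are omitted for brevity):

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper, not part of BDE: return the length of the sequence
// starting at 'p', or 0 if the lead byte is bad, the sequence runs past
// 'end', or a continuation byte is malformed.
std::size_t sequenceLengthSketch(const unsigned char *p,
                                 const unsigned char *end)
{
    std::size_t len = *p < 0x80           ? 1
                    : (*p & 0xe0) == 0xc0 ? 2
                    : (*p & 0xf0) == 0xe0 ? 3
                    : (*p & 0xf8) == 0xf0 ? 4
                    :                       0;
    if (0 == len || static_cast<std::size_t>(end - p) < len) {
        return 0;
    }
    for (std::size_t i = 1; i < len; ++i) {
        if ((p[i] & 0xc0) != 0x80) {
            return 0;
        }
    }
    return len;
}

// Hypothetical helper: advance over at most 'maxCodePoints' code points of
// 'input'.  Load into 'offset' the number of bytes passed, and into 'status'
// 0 on success or -1 if invalid input stopped the scan.  Return the number
// of code points passed.  Reaching the end of 'input' is not an error.
std::ptrdiff_t advanceSketch(int                *status,
                             std::size_t        *offset,
                             const std::string&  input,
                             std::ptrdiff_t      maxCodePoints)
{
    const unsigned char *begin =
                     reinterpret_cast<const unsigned char *>(input.data());
    const unsigned char *end = begin + input.size();
    const unsigned char *p   = begin;
    std::ptrdiff_t       count = 0;

    *status = 0;
    while (count < maxCodePoints && p < end) {
        const std::size_t len = sequenceLengthSketch(p, end);
        if (0 == len) {
            *status = -1;  // 'p' still points at the start of the bad one
            break;
        }
        p += len;
        ++count;
    }
    *offset = static_cast<std::size_t>(p - begin);
    return count;
}
```

Because a sequence is checked in full before the position moves, a bad third byte leaves the reported offset at the first byte of that sequence.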

Now, doctor the string to replace the invalid code point with a valid one, so the string is entirely correct UTF-8:

string[3 + 2 + 1 + 4 + 4 + 2] = static_cast<char>(0x8a);

Finally, advance using both functions by more code points than are in the string and in both cases wind up at the end of the string. Note that advanceIfValid does not return an error (non-zero) value to status when it encounters the end of the string:

rc = bdlde::Utf8Util::advanceRaw( &result, start, INT_MAX);
assert(8 == rc);
assert(3 + 2 + 1 + 4 + 4 + 3 + 1 + 1 == result - start);
assert(static_cast<int>(string.length()) == result - start);
rc = bdlde::Utf8Util::advanceIfValid(&status, &result, start, INT_MAX);
assert(0 == status);
assert(8 == rc);
assert(3 + 2 + 1 + 4 + 4 + 3 + 1 + 1 == result - start);
assert(static_cast<int>(string.length()) == result - start);

Example 3: Validating UTF-8 Read from a bsl::streambuf

In this usage example, we will demonstrate reading and validating UTF-8 from a stream.

We write a function to read valid UTF-8 to a bsl::string. We don't know how long the input will be, so we don't know how long to make the string before we start. We will grow the string in small, 32-byte increments.

/// Read valid UTF-8 from the specified streambuf `sb` to the specified
/// `output`. Return 0 if the input was exhausted without encountering
/// any invalid UTF-8, and a non-zero value otherwise. If invalid UTF-8
/// is encountered, log a message describing the problem after loading
/// all the valid UTF-8 preceding it into `output`. Note that after the
/// call, in no case will `output` contain any invalid UTF-8.
int utf8StreambufToString(bsl::string    *output,
                          bsl::streambuf *sb)
{
    enum { k_READ_LENGTH = 32 };
    output->clear();
    while (true) {
        bsl::size_t len = output->length();
        output->resize(len + k_READ_LENGTH);
        int status;
        bdlde::Utf8Util::IntPtr numBytes = bdlde::Utf8Util::readIfValid(
                                                           &status,
                                                           &(*output)[len],
                                                           k_READ_LENGTH,
                                                           sb);
        BSLS_ASSERT(0 <= numBytes);
        BSLS_ASSERT(numBytes <= k_READ_LENGTH);
        output->resize(len + numBytes);
        if (0 < status) {
            // Buffer was full before the end of input was encountered.
            // Note that `numBytes` may be up to 3 bytes less than
            // `k_READ_LENGTH`.
            BSLS_ASSERT(k_READ_LENGTH - 4 < numBytes);
            // Go on to grow the string and get more input.
            continue;
        }
        else if (0 == status) {
            // Success!  We've reached the end of input without
            // encountering any invalid UTF-8.
            return 0;                                                 // RETURN
        }
        else {
            // Invalid UTF-8 encountered; the value of `status` indicates
            // the exact nature of the problem.  `numBytes` returned from
            // the above call indicated the number of valid UTF-8 bytes
            // read before encountering the invalid UTF-8.
            BSLS_LOG_ERROR("Invalid UTF-8 error %s at position %u.\n",
                           bdlde::Utf8Util::toAscii(status),
                           static_cast<unsigned>(output->length()));
            return -1;                                                // RETURN
        }
    }
}
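The comment in the function above notes that a full read may deliver up to 3 bytes fewer than k_READ_LENGTH. One way to see why is that a fixed-size buffer can end in the middle of a multi-byte sequence, and the longest sequence is 4 bytes. The following standalone sketch, using the hypothetical helper incompleteTailLength and making no claim about how readIfValid works internally, computes how many trailing bytes of a chunk belong to a sequence that has not fully arrived:

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper, not part of BDE: return the number of bytes (0 to 3)
// at the end of 'chunk' that start a multi-byte sequence whose remaining
// bytes are not yet in 'chunk'.  The bytes before that tail are assumed to
// be valid UTF-8.
std::size_t incompleteTailLength(const std::string& chunk)
{
    const std::size_t size = chunk.size();

    // Walk back over at most 3 bytes looking for the last lead byte.
    for (std::size_t back = 1; back <= 3 && back <= size; ++back) {
        const unsigned char byte =
                           static_cast<unsigned char>(chunk[size - back]);
        if ((byte & 0xc0) == 0x80) {
            continue;  // continuation byte; the lead byte is further back
        }
        const std::size_t need = (byte & 0xe0) == 0xc0 ? 2
                               : (byte & 0xf0) == 0xe0 ? 3
                               : (byte & 0xf8) == 0xf0 ? 4
                               :                         1;
        return need > back ? back : 0;
    }
    return 0;  // the last 3 bytes are continuations of a complete sequence
}
```

For a 32-byte chunk, the number of bytes forming complete code points is therefore always greater than 32 - 4, which matches the assertion BSLS_ASSERT(k_READ_LENGTH - 4 < numBytes) in the function above.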