BDE 4.14.0 Production release
bdlde_charconvertucs2

Detailed Description

Outline

Purpose

Provide efficient conversions between UTF-8 and UCS-2 encodings.

Classes

Description

This component provides a suite of pure procedures supporting the fast conversion of valid UTF-8 encoded "C" strings to valid UCS-2 16-bit character arrays and vice versa. In order to provide the fastest possible implementation, some error checking is deliberately omitted, and the input strings are required to be null-terminated; however, all C-style functions will honor strlcpy semantics and null-terminate any output buffer having a non-zero length.
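
The "strlcpy semantics" mentioned above can be pictured with a minimal sketch. The helper below, `copyTruncatingUcs2`, is hypothetical and is not part of BDE; it only illustrates the contract that a destination of non-zero capacity is always null-terminated, even when the source does not fit. Unlike this sketch, the real conversion functions also report insufficient space through their return code.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not a BDE function): copy the null-terminated 16-bit
// string 'src' into 'dst', writing at most 'dstCapacity' elements including
// the terminator.  Return the number of elements written.
std::size_t copyTruncatingUcs2(unsigned short       *dst,
                               std::size_t           dstCapacity,
                               const unsigned short *src)
{
    assert(dst && src);

    if (0 == dstCapacity) {
        return 0;               // no room for anything, not even a null
    }

    std::size_t i = 0;
    while (i + 1 < dstCapacity && src[i]) {
        dst[i] = src[i];        // copy while leaving room for the terminator
        ++i;
    }
    dst[i] = 0;                 // a non-empty buffer is always terminated
    return i + 1;               // count includes the terminator
}
```

With a three-element destination and the source "Hello", this sketch writes 'H', 'e', and a terminating zero, and returns 3.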

History and Motivation

UTF-8 is a character encoding that allows 32-bit character sets like Unicode to be represented using null-terminated (8-bit) byte strings (NTBS), while allowing "standard ASCII" strings to be used "as-is". UTF-8 is described in detail in RFC 2279 (http://tools.ietf.org/html/rfc2279), which has since been obsoleted by RFC 3629.
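
To make the byte layout concrete, here is a minimal sketch of how a single code point in the range U+0000 to U+FFFF maps to one, two, or three UTF-8 octets. The function name `encodeBmpAsUtf8` is hypothetical and is not part of this component; the sketch performs no surrogate checking and is not how the component is implemented.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not a BDE function): encode 'codePoint', which must
// be at most 0xFFFF, as UTF-8 into 'out', which must have room for 3 octets.
// Return the number of octets written.  Surrogates are NOT rejected here.
std::size_t encodeBmpAsUtf8(unsigned char *out, unsigned int codePoint)
{
    assert(out && codePoint <= 0xFFFF);

    if (codePoint < 0x80) {                  // 0xxxxxxx (plain ASCII)
        out[0] = static_cast<unsigned char>(codePoint);
        return 1;
    }
    if (codePoint < 0x800) {                 // 110xxxxx 10xxxxxx
        out[0] = static_cast<unsigned char>(0xC0 | (codePoint >> 6));
        out[1] = static_cast<unsigned char>(0x80 | (codePoint & 0x3F));
        return 2;
    }
                                             // 1110xxxx 10xxxxxx 10xxxxxx
    out[0] = static_cast<unsigned char>(0xE0 | (codePoint >> 12));
    out[1] = static_cast<unsigned char>(0x80 | ((codePoint >> 6) & 0x3F));
    out[2] = static_cast<unsigned char>(0x80 | (codePoint & 0x3F));
    return 3;
}
```

For example, U+00C9 encodes as the two octets 0xC3 0x89, which is the "\xc3\x89" prefix used in the usage examples of this component.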

UCS-2 is a 16-bit character encoding with no support for "higher-order" character encodings. UCS-2 is equivalent to UTF-16 in the Basic Multilingual Plane (BMP) of Unicode (the first 65536 code points, excluding the "surrogate code points" U+D800-U+DFFF, which do not map to Unicode characters). If the characters being represented are within the BMP, then UCS-2 can be thought of as "the Windows encoding" for international characters. Historically, UCS-2 was the only "wide char" representation for Windows versions prior to Windows 2000. UTF-16 was adopted instead for Windows 2000, and has been used ever since.
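
The range described above can be expressed as a small predicate. The names below (`isSurrogate`, `fitsInUcs2`, `toUcs2Unit`) are hypothetical and are not part of this component; the sketch only restates the BMP and surrogate boundaries given in the text.

```cpp
#include <cassert>

// Hypothetical helpers (not BDE functions).

// Return 'true' if 'codePoint' lies in the surrogate range U+D800-U+DFFF.
bool isSurrogate(unsigned int codePoint)
{
    return 0xD800 <= codePoint && codePoint <= 0xDFFF;
}

// Return 'true' if 'codePoint' is in the BMP and is not a surrogate, i.e.,
// if a single 16-bit UCS-2 unit can represent it.
bool fitsInUcs2(unsigned int codePoint)
{
    return codePoint <= 0xFFFF && !isSurrogate(codePoint);
}

// Return the UCS-2 unit for 'codePoint'.  The behavior is undefined unless
// 'fitsInUcs2(codePoint)'.
unsigned short toUcs2Unit(unsigned int codePoint)
{
    assert(fitsInUcs2(codePoint));
    return static_cast<unsigned short>(codePoint);
}
```

Under these definitions U+00C9 and U+FFFF fit in UCS-2, while U+D800 and any code point above U+FFFF (for example U+1F600) do not.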

Most conversion routines strive for correctness at the cost of performance. The glib conversion routines are much slower than the functions implemented here because the glib functions first compute the number of output characters required, allocate the memory for them, and then perform the conversion, validating the input characters. The C-style methods of bdlde::CharConvertUcs2, on the other hand, assume that the user-provided output buffer is wide enough, make a "best effort" to convert into it, and return an error code if not enough space was provided. The C++-style methods are more forgiving, since the output bsl::string or bsl::vector<unsigned short> is resized as needed. No attempt is made to validate whether the character codes correspond to valid Unicode code points, nor is validation performed to check for overlong UTF-8 encodings (where characters that could be expressed in one octet are encoded using two octets).
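
As an illustration of the overlong-encoding check that this component deliberately omits, the following sketch flags two- and three-octet sequences that encode a value representable in fewer octets. The function `isOverlongUtf8` is hypothetical, is not provided by this component, and assumes the continuation octets are already well-formed.

```cpp
#include <cassert>

// Hypothetical check (not a BDE function): return 'true' if the 2- or
// 3-octet UTF-8 sequence starting at 'p' is an overlong encoding.  The
// continuation octets are assumed to be well-formed (10xxxxxx).
bool isOverlongUtf8(const unsigned char *p)
{
    assert(p);

    if ((p[0] & 0xE0) == 0xC0) {             // 2-octet lead: 110xxxxx
        // Leads 0xC0 and 0xC1 can only produce values below 0x80, which
        // fit in a single octet.
        return p[0] < 0xC2;
    }
    if ((p[0] & 0xF0) == 0xE0) {             // 3-octet lead: 1110xxxx
        // Lead 0xE0 followed by 0x80..0x9F produces a value below 0x800,
        // which fits in two octets.
        return 0xE0 == p[0] && p[1] < 0xA0;
    }
    return false;                            // 1-octet or other lead
}
```

For instance, the octets 0xC1 0x81 are an overlong encoding of 'A' (U+0041), whereas 0xC3 0x89 is the shortest-form encoding of U+00C9. A caller that cannot trust its input would need a check of this kind before calling the conversion functions.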

Usage

This section illustrates intended use of this component.

Example 1: C-Style Interface

The following snippet of code illustrates a typical use of the bdlde::CharConvertUcs2 struct's C-style utility functions, converting a simple UTF-8 string to UCS-2.

void testCFunction1()
{
    unsigned short buffer[256];  // arbitrary "wide-enough" size
    bsl::size_t    buffSize = sizeof buffer / sizeof *buffer;
    bsl::size_t    charsWritten;

    int retVal =
        BloombergLP::bdlde::CharConvertUcs2::utf8ToUcs2(buffer,
                                                        buffSize,
                                                        "Hello",
                                                        &charsWritten);

    assert( 0  == retVal);
    assert('H' == buffer[0]);
    assert('e' == buffer[1]);
    assert('l' == buffer[2]);
    assert('l' == buffer[3]);
    assert('o' == buffer[4]);
    assert( 0  == buffer[5]);
    assert( 6  == charsWritten);
}

Example 2: C-Style Round-Trip

The following snippet of code illustrates another typical use of the bdlde::CharConvertUcs2 struct's C-style utility functions, converting a simple UTF-8 string to UCS-2, then converting the UCS-2 back and making sure the round-trip conversion results in the input.

void testCFunction2()
{
    unsigned short buffer[256];  // arbitrary "wide-enough" size
    bsl::size_t    buffSize = sizeof buffer / sizeof *buffer;
    bsl::size_t    charsWritten;

    // "&Eacute;cole", the French word for School.  '&Eacute;' is the HTML
    // entity for U+00C9, LATIN CAPITAL LETTER E WITH ACUTE.
    int retVal =
        BloombergLP::bdlde::CharConvertUcs2::utf8ToUcs2(buffer,
                                                        buffSize,
                                                        "\xc3\x89" "cole",
                                                        &charsWritten);

    assert( 0   == retVal);
    assert(0xc9 == buffer[0]);  // U+00C9, LATIN CAPITAL LETTER E WITH ACUTE
    assert('c'  == buffer[1]);
    assert('o'  == buffer[2]);
    assert('l'  == buffer[3]);
    assert('e'  == buffer[4]);
    assert( 0   == buffer[5]);
    assert( 6   == charsWritten);

    char        buffer2[256];  // arbitrary "wide-enough" size
    bsl::size_t buffer2Size  = sizeof buffer2 / sizeof *buffer2;
    bsl::size_t bytesWritten = 0;

    // Reversing the conversion returns the original string:
    retVal =
        BloombergLP::bdlde::CharConvertUcs2::ucs2ToUtf8(buffer2,
                                                        buffer2Size,
                                                        buffer,
                                                        &charsWritten,
                                                        &bytesWritten);

    assert( 0 == retVal);
    assert( 0 == bsl::strcmp(buffer2, "\xc3\x89" "cole"));

    // 6 characters written, but 7 bytes, since the first character takes 2
    // octets.
    assert( 6 == charsWritten);
    assert( 7 == bytesWritten);
}

In this example, a UTF-8 input string is converted then passed to another function, which expects a UCS-2 buffer.

First, we define a utility strlen replacement for UCS-2:

int wideStrlen(const unsigned short *str)
{
    int len = 0;
    while (*str++) {
        ++len;
    }
    return len;
}

Now, some arbitrary function that calls wideStrlen:

void functionRequiringUcs2(const unsigned short *str, bsl::size_t strLen)
{
    // Would probably do something more reasonable here.
    assert(wideStrlen(str) + 1 == static_cast<int>(strLen));
}

Finally, we can take some UTF-8 as an input and call functionRequiringUcs2:

void processUtf8(const char *strU8)
{
    unsigned short buffer[1024];  // some "large enough" size
    bsl::size_t    buffSize     = sizeof buffer / sizeof *buffer;
    bsl::size_t    charsWritten = 0;

    int result =
        BloombergLP::bdlde::CharConvertUcs2::utf8ToUcs2(buffer,
                                                        buffSize,
                                                        strU8,
                                                        &charsWritten);

    if (0 == result) {
        functionRequiringUcs2(buffer, charsWritten);
    }
}

Example 3: C++-Style Interface

The following snippet of code illustrates a typical use of the bdlde::CharConvertUcs2 struct's C++-style utility functions, converting a simple UTF-8 string to UCS-2.

void loadUCS2Hello(bsl::vector<unsigned short> *result)
{
    int retVal =
        BloombergLP::bdlde::CharConvertUcs2::utf8ToUcs2(result,
                                                        "Hello");

    assert( 0  == retVal);
    assert('H' == (*result)[0]);
    assert('e' == (*result)[1]);
    assert('l' == (*result)[2]);
    assert('l' == (*result)[3]);
    assert('o' == (*result)[4]);
    assert( 0  == (*result)[5]);
    assert( 6  == result->size());
}

The following snippet of code illustrates another typical use of the bdlde::CharConvertUcs2 struct's C++-style utility functions, first converting from UTF-8 to UCS-2, and then converting back to make sure the round trip returns the same value.

void checkCppRoundTrip()
{
    bsl::vector<unsigned short> result;

    // "&Eacute;cole", the French word for School.  '&Eacute;' is the HTML
    // entity for U+00C9, LATIN CAPITAL LETTER E WITH ACUTE.
    int retVal =
        BloombergLP::bdlde::CharConvertUcs2::utf8ToUcs2(&result,
                                                        "\xc3\x89" "cole");

    assert( 0   == retVal);
    assert(0xc9 == result[0]);  // U+00C9, LATIN CAPITAL LETTER E WITH ACUTE
    assert('c'  == result[1]);
    assert('o'  == result[2]);
    assert('l'  == result[3]);
    assert('e'  == result[4]);
    assert( 0   == result[5]);
    assert( 6   == result.size());

    bsl::string result2;
    bsl::size_t charsWritten = 0;

    // Reversing the conversion returns the original string:
    retVal =
        BloombergLP::bdlde::CharConvertUcs2::ucs2ToUtf8(&result2,
                                                        &result.front(),
                                                        &charsWritten);

    assert( 0 == retVal);
    assert( result2 == "\xc3\x89" "cole");

    // 6 characters written (including the null-terminator), and 6 bytes,
    // since the first character takes 2 octets and the null-terminator is
    // not counted in "length()".
    assert( 6 == charsWritten);
    assert( 6 == result2.length());
}

In this example, a UTF-8 input string is converted then returned.

bsl::vector<unsigned short> processUtf8(const bsl::string& strU8)
{
    bsl::vector<unsigned short> result;
    BloombergLP::bdlde::CharConvertUcs2::utf8ToUcs2(&result, strU8.c_str());
    return result;
}