BDE 4.14.0 Production release
Provide efficient conversions between UTF-8 and UCS-2 encodings.
This component provides a suite of pure procedures supporting the fast conversion of valid UTF-8 encoded "C" strings to valid UCS-2 16-bit character arrays and vice versa. In order to provide the fastest possible implementation, some error checking is deliberately omitted, and the input strings are required to be null-terminated; however, all C-style functions will honor strlcpy semantics and null-terminate any output buffer having a non-zero length.
UTF-8 is a character encoding that allows 32-bit character sets like Unicode to be represented using null-terminated (8-bit) byte strings (NTBS), while allowing "standard ASCII" strings to be used "as-is". Note that UTF-8 is described in detail in RFC 2279 (http://tools.ietf.org/html/rfc2279), which has since been obsoleted by RFC 3629.
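To make the "ASCII as-is" property concrete, the following is a minimal sketch of UTF-8 encoding for code points in the 16-bit range. The helper name `encodeUtf8Bmp` is hypothetical and is not part of this component; it only illustrates the byte layout.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not part of bdlde::CharConvertUcs2): encode one
// 16-bit code point as UTF-8 into 'out', which must have room for at least
// 3 bytes.  Return the number of bytes written (1, 2, or 3).  No check is
// made for surrogate values.
std::size_t encodeUtf8Bmp(unsigned short codePoint, unsigned char *out)
{
    assert(out);

    if (codePoint < 0x80) {
        // ASCII is stored unchanged in a single octet.
        out[0] = static_cast<unsigned char>(codePoint);
        return 1;
    }
    if (codePoint < 0x800) {
        // Two octets: 110xxxxx 10xxxxxx
        out[0] = static_cast<unsigned char>(0xC0 | (codePoint >> 6));
        out[1] = static_cast<unsigned char>(0x80 | (codePoint & 0x3F));
        return 2;
    }
    // Three octets: 1110xxxx 10xxxxxx 10xxxxxx
    out[0] = static_cast<unsigned char>(0xE0 | (codePoint >> 12));
    out[1] = static_cast<unsigned char>(0x80 | ((codePoint >> 6) & 0x3F));
    out[2] = static_cast<unsigned char>(0x80 | (codePoint & 0x3F));
    return 3;
}
```

With this sketch, 'A' (U+0041) occupies one byte, while the euro sign (U+20AC) occupies three.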
UCS-2 is a 16-bit character encoding with no support for "higher-order" character encodings. UCS-2 is equivalent to UTF-16 in the Basic Multilingual Plane (BMP) of Unicode (the first 65536 code points, excluding the "surrogate code points" U+D800-U+DFFF, which do not map to Unicode characters). If the characters being represented are within the BMP, then UCS-2 can be thought of as "the Windows encoding" for international characters. Historically, UCS-2 was the only "wide char" representation for Windows versions prior to Windows 2000. UTF-16 was adopted instead for Windows 2000, and has been used ever since.
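The range described above can be expressed as a small predicate. This is a sketch only; `isRepresentableInUcs2` is a hypothetical name and is not provided by this component.

```cpp
#include <cstdint>

// Hypothetical helper (not part of bdlde::CharConvertUcs2): return 'true'
// if 'codePoint' fits in a single UCS-2 code unit, i.e., it lies in the BMP
// and is not one of the surrogate code points U+D800-U+DFFF.
bool isRepresentableInUcs2(std::uint32_t codePoint)
{
    const bool inBmp       = codePoint <= 0xFFFF;
    const bool isSurrogate = codePoint >= 0xD800 && codePoint <= 0xDFFF;
    return inBmp && !isSurrogate;
}
```

Code points above U+FFFF need a UTF-16 surrogate pair and therefore have no UCS-2 representation.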
Most conversion routines strive for correctness at the cost of performance. The glib conversion routines are much slower than the functions implemented here because the glib functions first compute the number of output characters required, allocate the memory for them, and then perform the conversion, validating the input characters. The C-style methods of bdlde::CharConvertUcs2, on the other hand, assume that the user-provided output buffer is wide enough, make a "best effort" to convert into it, and return an error code if not enough space was provided. The C++-style methods are more forgiving, since the output bsl::string or bsl::vector<unsigned short> is resized as needed. No attempt is made to validate whether the character codes correspond to valid Unicode code points, nor is validation performed to check for overlong UTF-8 encodings (where characters that could be expressed in one octet are encoded using two octets).
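The "best effort" behavior of a C-style conversion can be sketched as follows. This is a standalone illustration, not the implementation or the signature of bdlde::CharConvertUcs2; the function name `sketchUtf8ToUcs2`, its return values, and the fixed '?' substitute are assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in (NOT the bdlde::CharConvertUcs2 implementation).
// Convert the null-terminated UTF-8 'src' into 'dst', which holds
// 'dstCapacity' 16-bit units.  Return 0 on success and a non-zero value if
// 'dst' was too small.  'dst' is null-terminated whenever 'dstCapacity > 0'.
// Overlong encodings are not rejected, matching the trade-off described
// above.
int sketchUtf8ToUcs2(unsigned short *dst,
                     std::size_t     dstCapacity,
                     const char     *src)
{
    assert(dst || 0 == dstCapacity);
    assert(src);

    if (0 == dstCapacity) {
        return 1;                        // no room even for the terminator
    }

    const unsigned char *s = reinterpret_cast<const unsigned char *>(src);
    std::size_t          n = 0;

    while (*s) {
        if (n + 1 >= dstCapacity) {      // reserve one unit for the null
            dst[n] = 0;
            return 1;
        }
        unsigned short c;
        if (*s < 0x80) {                                   // one octet
            c = *s;
            s += 1;
        } else if ((s[0] & 0xE0) == 0xC0
                && (s[1] & 0xC0) == 0x80) {                // two octets
            c = static_cast<unsigned short>(
                              ((s[0] & 0x1F) << 6) | (s[1] & 0x3F));
            s += 2;
        } else if ((s[0] & 0xF0) == 0xE0
                && (s[1] & 0xC0) == 0x80
                && (s[2] & 0xC0) == 0x80) {                // three octets
            c = static_cast<unsigned short>(((s[0] & 0x0F) << 12)
                                          | ((s[1] & 0x3F) << 6)
                                          |  (s[2] & 0x3F));
            s += 3;
        } else {
            c = '?';                     // not representable: substitute
            s += 1;
        }
        dst[n++] = c;
    }
    dst[n] = 0;
    return 0;
}
```

The caller owns the buffer and learns about truncation only through the return value, which is what makes this style fast but less forgiving than a variant that resizes its output.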
This section illustrates intended use of this component.
The following snippet of code illustrates a typical use of the bdlde::CharConvertUcs2 struct's C-style utility functions, converting a simple UTF-8 string to UCS-2.
The following snippet of code illustrates another typical use of the bdlde::CharConvertUcs2 struct's C-style utility functions, converting a simple UTF-8 string to UCS-2, then converting the UCS-2 back and making sure the round-trip conversion results in the input.
In this example, a UTF-8 input string is converted and then passed to another function, which expects a UCS-2 buffer.
First, we define a utility strlen replacement for UCS-2:
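A minimal sketch of such a helper is shown here. The name `wideStrlen` comes from the text; the exact signature and return type are assumptions.

```cpp
#include <cassert>
#include <cstddef>

// Return the number of 16-bit units in the null-terminated UCS-2 string
// 'str', not counting the terminating null.  The return type is an
// assumption made for this sketch.
std::size_t wideStrlen(const unsigned short *str)
{
    assert(str);

    std::size_t len = 0;
    while (str[len]) {
        ++len;
    }
    return len;
}
```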
Now, some arbitrary function that calls wideStrlen:
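One possible shape for such a function is sketched below. The name `functionRequiringUcs2` comes from the text, but its parameters and its `bool` return value are assumptions; `wideStrlen` is repeated only so that the block compiles on its own.

```cpp
#include <cassert>
#include <cstddef>

// Repeated here so that this sketch is self-contained.
std::size_t wideStrlen(const unsigned short *str)
{
    std::size_t len = 0;
    while (str[len]) {
        ++len;
    }
    return len;
}

// Hypothetical consumer of a UCS-2 buffer: return 'true' if 'strLen', the
// number of units including the terminating null, agrees with the length
// measured by 'wideStrlen'.  A real consumer would do something more useful.
bool functionRequiringUcs2(const unsigned short *str, std::size_t strLen)
{
    assert(str);

    return wideStrlen(str) + 1 == strLen;
}
```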
Finally, we can take some UTF-8 as an input and call functionRequiringUcs2:
The following snippet of code illustrates a typical use of the bdlde::CharConvertUcs2 struct's C++-style utility functions, converting a simple UTF-8 string to UCS-2.
The following snippet of code illustrates another typical use of the bdlde::CharConvertUcs2 struct's C++-style utility functions, first converting from UTF-8 to UCS-2, and then converting back to make sure the round trip returns the same value.
In this example, a UTF-8 input string is converted and then returned.