Component bdlde_utf8checkinginstreambufwrapper
[Package bdlde]

Provide a stream buffer wrapper for validating UTF-8 input.

Namespaces

namespace  bdlde

Detailed Description

Outline
Purpose:
Provide a stream buffer wrapper for validating UTF-8 input.
Classes:
bdlde::Utf8CheckingInStreamBufWrapper wraps input streambuf, checks UTF-8
See also:
bsl_streambuf
Description:
This component provides a mechanism, bdlde::Utf8CheckingInStreamBufWrapper, that inherits from bsl::streambuf, and that holds and wraps another streambuf. It forwards input through the held streambuf, checking for invalid UTF-8. The wrapping object does not support output, only input. All normal input functions are supported. If the held streambuf supports seeking, seeks are supported, though not forward seeks, and pubseekoff(0, bsl::ios_base::cur) is supported whether the wrapped streambuf supports seeking or not.
Input is buffered, and the buffer cannot be changed: pubsetbuf is a no-op.
The recommended way to use this object is to read from it until it behaves as though it has reached the end of input, and then call errorStatus to determine whether a UTF-8 error occurred. If one did, call pubseekoff(0, bsl::ios_base::cur) to find the position of the beginning of the invalid UTF-8 code point.
Positioning at the Start:
When starting to read, the wrapped streambuf must be positioned at the beginning of a UTF-8 code point or at the end of data; otherwise, the wrapper will interpret the first byte read as incorrect UTF-8.
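The positioning requirement can be illustrated with a small standard-library sketch. The helper below is hypothetical and not part of BDE: it only checks that the octet at a given offset is not a UTF-8 continuation octet (bit pattern 10xxxxxx), which is a necessary condition for the offset to be the start of a code point, and is exact only when the surrounding data is itself valid UTF-8.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper, not part of BDE.  Return 'true' if the specified
// 'offset' is the end of the specified 'data' or refers to an octet that is
// not a UTF-8 continuation octet (10xxxxxx), and 'false' otherwise.  The
// behavior is undefined unless 'offset <= data.size()'.
bool isPlausibleCodePointStart(const std::string& data, std::size_t offset)
{
    assert(offset <= data.size());

    if (offset == data.size()) {
        return true;  // the end of data is an acceptable starting position
    }

    const unsigned char octet = static_cast<unsigned char>(data[offset]);
    return (octet & 0xC0) != 0x80;
}
```

A caller that hands a partially consumed streambuf to the wrapper could apply a check like this to its own copy of the data first; an offset in the middle of a multi-byte sequence fails the check.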
Behavior of Reads:
If incorrect UTF-8 exists in the data stream, reads will succeed until reaching the start of the incorrect code point, after which reads will behave as though the end of data were reached. All data returned by reads will be valid UTF-8. Reads of limited length that end before the end of data may return incomplete, truncated portions of valid UTF-8 code points. In that case, following reads will return the remainder of the same valid UTF-8 code point.
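The rule that reads succeed only up to the start of the first incorrect code point can be sketched without BDE. The function below is a simplified stand-in written against the standard library only. Its name, validUtf8PrefixLength, is hypothetical, and the checks it applies (no overlong encodings, no surrogates, nothing above U+10FFFF, no truncated sequences) are assumed to follow the usual definition of valid UTF-8 rather than being taken from this component's implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper, not part of BDE.  Return the number of leading bytes
// of the specified 'data' that form complete, valid UTF-8 code points.  The
// result is the offset at which a checking reader would stop yielding data.
std::size_t validUtf8PrefixLength(const std::string& data)
{
    std::size_t i = 0;
    while (i < data.size()) {
        const unsigned char lead = static_cast<unsigned char>(data[i]);

        std::size_t   extra;     // number of continuation octets expected
        unsigned long minValue;  // smallest value allowed for this length
        unsigned long value;     // code point decoded so far

        if      (lead < 0x80)           { extra = 0; minValue = 0;       value = lead;        }
        else if ((lead & 0xE0) == 0xC0) { extra = 1; minValue = 0x80;    value = lead & 0x1F; }
        else if ((lead & 0xF0) == 0xE0) { extra = 2; minValue = 0x800;   value = lead & 0x0F; }
        else if ((lead & 0xF8) == 0xF0) { extra = 3; minValue = 0x10000; value = lead & 0x07; }
        else { break; }  // unexpected continuation octet or invalid lead

        if (data.size() - i < extra + 1) {
            break;  // sequence truncated by the end of data
        }

        bool ok = true;
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(data[i + k]);
            if ((c & 0xC0) != 0x80) {
                ok = false;  // a continuation octet was expected here
                break;
            }
            value = (value << 6) | (c & 0x3F);
        }

        const bool isSurrogate = 0xD800 <= value && value <= 0xDFFF;
        if (!ok || value < minValue || value > 0x10FFFF || isSurrogate) {
            break;  // malformed, overlong, surrogate, or out of range
        }

        i += extra + 1;
    }
    assert(i <= data.size());
    return i;
}
```

For the 5-byte input "ab\x80" "cd", this sketch returns 2: the two ASCII bytes are accepted, and the stray continuation octet at offset 2 marks the point where a checking reader would behave as though the end of data had been reached.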
errorStatus:
The errorStatus accessor is not a virtual function and is not inherited from streambuf.
If invalid UTF-8 is encountered while reading, input will succeed right up to the beginning of the invalid code point, at which point the object will behave as though it has reached the end of data, with the object positioned to exactly the start of the invalid code point. errorStatus will reflect the nature of the UTF-8 error.
If a seek error occurs, errorStatus will change to k_SEEK_FAIL and subsequent reads and relative seeks will fail, including pubseekoff(0, bsl::ios_base::cur). A reset or an absolute seek to the start of data will reset errorStatus to 0 and the object will recover to being able to perform input and relative seeks.
UTF-8 errors can be recovered from by calling reset or by seeking at least one byte backward. Note that pubseekoff(0, bsl::ios_base::cur) after a UTF-8 error will return the object's position without changing the error state. Note that an absolute seek to the beginning of data will not recover unless it amounts to a seek at least one byte backward.
If input has reached invalid UTF-8, errorStatus() will be negative and equal to one of the values from bdlde::Utf8Util::ErrorStatus.
The class method toAscii can be called to translate any value returned by errorStatus() to a human-readable string.
Seeking:
The wrapped streambuf must either support seeking or always return a negative value when a seek attempt is made.
Forward seeks and seeks relative to the end of data are not supported.
If the wrapped streambuf does not support seeking, pubseekoff(0, bsl::ios_base::cur) will still work on the wrapper and will return the offset relative to the input position when the wrapper was bound to the held streambuf, without changing the error state.
Seeks can fail for a number of reasons (see seekoff), and if that happens, the object will enter a "failed seek state", having no valid position, and will no longer be able to do input or do relative seeks until recovering by either doing an absolute seek to 0 or by having reset called. When the object is in a failed seek state, errorStatus() will equal k_SEEK_FAIL.
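The requirement that the wrapped streambuf either supports seeking or always reports failure can be checked in isolation with standard streambufs. The sketch below is illustrative only and does not use BDE: NonSeekableBuf and reportsPosition are hypothetical names. It relies on the fact that the default std::streambuf::seekoff returns pos_type(off_type(-1)), so a streambuf that does not override it satisfies the "always return a negative value" option.

```cpp
#include <ios>
#include <sstream>
#include <streambuf>

// Hypothetical streambuf with no seek support.  It inherits the default
// 'seekoff' and 'seekpos', both of which return 'pos_type(off_type(-1))'.
class NonSeekableBuf : public std::streambuf {
};

// Hypothetical helper, not part of BDE.  Return 'true' if the specified 'sb'
// can report its current input position, and 'false' if the query fails.
bool reportsPosition(std::streambuf& sb)
{
    const std::streambuf::pos_type failure =
                     std::streambuf::pos_type(std::streambuf::off_type(-1));

    // A zero-offset seek relative to the current position does not move the
    // get pointer; it only queries where it is.
    return sb.pubseekoff(0, std::ios_base::cur, std::ios_base::in) != failure;
}
```

A std::stringbuf passes this check, while NonSeekableBuf fails it. According to the description above, both kinds of streambuf are acceptable to the wrapper, which can still answer pubseekoff(0, bsl::ios_base::cur) itself in the non-seekable case.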
Valid State:
If the object has been bound via reset to a held streambuf and is not in a failed seek state, the object is in a valid state.
Usage:
Example 1: Detecting invalid UTF-8 read from a streambuf:
Suppose one has a streambuf, myStreamBuf, containing UTF-8 that one wants to read while checking that it is valid UTF-8.
First, create a Utf8CheckingInStreamBufWrapper that will wrap myStreamBuf:
  typedef bdlde::Utf8CheckingInStreamBufWrapper Obj;
  Obj wrapper;
  wrapper.reset(&myStreamBuf);
Then, read the data from the wrapper streambuf until it stops yielding data:
  std::string s;
  bsl::streamsize len = 0, bytesRead;
  do {
      enum { k_READ_CHUNK = 10 };

      s.resize(len + k_READ_CHUNK);

      bytesRead = wrapper.sgetn(&s[len], k_READ_CHUNK);

      assert(0 <= bytesRead);
      assert(bytesRead <= k_READ_CHUNK);

      s.resize((len += bytesRead));
  } while (0 < bytesRead);

  assert(wrapper.pubseekoff(0, bsl::ios_base::cur) == Obj::pos_type(len));
Next, use the errorStatus accessor and pubseekoff manipulator to see what, if anything, went wrong and where.
  const int es = wrapper.errorStatus();

  if      (0 == es) {
      cout << "No errors occurred.\n";
  }
  else if (es < 0) {
      cout << "Incorrect UTF-8 encountered " << Obj::toAscii(es) <<
          " at offset " << wrapper.pubseekoff(0, bsl::ios_base::cur) << endl;
  }
  else {
      cout << "Non-UTF-8 error " << Obj::toAscii(es) << endl;
  }
Now, we observe the output:
  Incorrect UTF-8 encountered UNEXPECTED_CONTINUATION_OCTET at offset 79
Finally, we observe that all the data from myStreamBuf up to offset 79 was read into s, and that it's all correct UTF-8.
  assert(len == s.end() - s.begin());
  assert(bdlde::Utf8Util::isValid(&s[0], len));