BDE 4.14.0 Production release
Provide a stream buffer wrapper for validating UTF-8 input.
This component provides a mechanism, bdlde::Utf8CheckingInStreamBufWrapper, that inherits from bsl::streambuf and that holds and wraps another streambuf. It forwards input through the held streambuf, checking for invalid UTF-8. The wrapping object supports only input, not output. All normal input functions are supported. If the held streambuf supports seeking, seeks are supported, except for forward seeks; pubseekoff(0, bsl::ios_base::cur) is supported whether or not the wrapped streambuf supports seeking.
Input is buffered, and the buffer cannot be changed: pubsetbuf is a no-op.
The recommended way to use this object is to read from it until it behaves as though it has reached the end of input, then call errorStatus to see whether a UTF-8 error occurred and, if so, call pubseekoff(0, bsl::ios_base::cur) to find the position of the beginning of the invalid UTF-8 code point.
When starting to read, the wrapped streambuf must be positioned at the beginning of a UTF-8 code point or at the end of data; otherwise, the wrapper will interpret the first byte read as incorrect UTF-8.
If incorrect UTF-8 exists in the data stream, reads will succeed until reaching the start of the incorrect code point, after which reads will behave as though the end of data were reached. All data returned by reads will be valid UTF-8. Reads of limited length that end before the end of data may return incomplete, truncated portions of valid UTF-8 code points. In that case, following reads will return the remainder of the same valid UTF-8 code point.
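The point at which reads stop can be pictured as the length of the longest valid UTF-8 prefix of the data. The following is a minimal sketch of that idea using only the standard library; the function name validUtf8PrefixLength is hypothetical, and the checks shown (bad lead byte, missing continuation byte, truncation, overlong encoding, surrogates, values above U+10FFFF) are assumptions about what counts as invalid, not the BDE implementation.

```cpp
#include <cstddef>
#include <string>

// Return the byte offset of the first invalid UTF-8 code point in 'data',
// or 'data.size()' if all of 'data' is valid.  Illustrative only; this is
// not how the BDE component is implemented.
std::size_t validUtf8PrefixLength(const std::string& data)
{
    std::size_t i = 0;
    while (i < data.size()) {
        const unsigned char lead = static_cast<unsigned char>(data[i]);
        std::size_t   len;         // total bytes in this code point
        unsigned long minValue;    // smallest value allowed for this length
        unsigned long value;       // decoded code point

        if (lead < 0x80) {
            len = 1; minValue = 0;       value = lead;
        }
        else if ((lead & 0xE0) == 0xC0) {
            len = 2; minValue = 0x80;    value = lead & 0x1F;
        }
        else if ((lead & 0xF0) == 0xE0) {
            len = 3; minValue = 0x800;   value = lead & 0x0F;
        }
        else if ((lead & 0xF8) == 0xF0) {
            len = 4; minValue = 0x10000; value = lead & 0x07;
        }
        else {
            return i;    // stray continuation byte or invalid lead byte
        }

        if (data.size() - i < len) {
            return i;    // code point truncated by the end of data
        }
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(data[i + k]);
            if ((c & 0xC0) != 0x80) {
                return i;    // expected a continuation byte
            }
            value = (value << 6) | (c & 0x3F);
        }
        if (value < minValue                          // overlong encoding
         || value > 0x10FFFF                          // beyond Unicode range
         || (value >= 0xD800 && value <= 0xDFFF)) {   // UTF-16 surrogate
            return i;
        }
        i += len;
    }
    return i;
}
```

Under these assumptions, the returned offset corresponds to the position at which the wrapper would begin behaving as though the end of data had been reached.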
The errorStatus accessor is not a virtual function and is not inherited from streambuf.
If invalid UTF-8 is encountered while reading, input will succeed right up to the beginning of the invalid code point, at which point the object will behave as though it has reached the end of data, with the object positioned to exactly the start of the invalid code point. errorStatus will reflect the nature of the UTF-8 error.
If a seek error occurs, errorStatus will change to k_SEEK_FAIL and subsequent reads and relative seeks will fail, including pubseekoff(0, bsl::ios_base::cur). A reset or an absolute seek to the start of data will reset errorStatus to 0, and the object will again be able to perform input and relative seeks.
UTF-8 errors can be recovered from by calling reset or by seeking at least one byte backward. Note that pubseekoff(0, bsl::ios_base::cur) after a UTF-8 error will return the object's position without changing the error state. Note that an absolute seek to the beginning of data will not recover unless it amounts to a seek at least one byte backward.
If input has reached invalid UTF-8, errorStatus() will be negative and equal to one of the values from bdlde::Utf8Util::ErrorStatus.
The class method toAscii can be called to translate any value returned by errorStatus() to a human-readable string.
The wrapped streambuf must either support seeking or always return a negative value when a seek attempt is made.
Forward seeks and seeks relative to the end of data are not supported.
If the wrapped streambuf does not support seeking, pubseekoff(0, bsl::ios_base::cur) will still work on the wrapper and will return the offset relative to the input position when the wrapper was bound to the held streambuf, without changing the error state.
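One way to report a position without relying on the held buffer's seek support is for the wrapper to count the bytes it has pulled through. The sketch below illustrates that idea with the standard library only; CountingInBuf and demoOffsetAfterReading are hypothetical names, the one-byte buffer is a simplification, and none of this is taken from the BDE implementation.

```cpp
#include <cassert>
#include <ios>
#include <sstream>
#include <streambuf>

// Hypothetical stand-in, not the BDE class: forwards input from a held
// buffer one byte at a time and counts the bytes pulled through.
class CountingInBuf : public std::streambuf {
    std::streambuf *d_held_p;     // wrapped buffer (not owned)
    std::streamoff  d_fetched;    // bytes pulled from 'd_held_p' since binding
    char            d_byte;       // one-byte get area

  protected:
    int_type underflow() override
    {
        const int_type c = d_held_p->sbumpc();
        if (traits_type::eq_int_type(c, traits_type::eof())) {
            return traits_type::eof();
        }
        d_byte = traits_type::to_char_type(c);
        setg(&d_byte, &d_byte, &d_byte + 1);
        ++d_fetched;
        return c;
    }

    pos_type seekoff(off_type                offset,
                     std::ios_base::seekdir  dir,
                     std::ios_base::openmode which) override
    {
        // Only "where am I?" is answered, and it is answered from the local
        // counter, so the held buffer is never asked to seek.
        if (0 == offset && std::ios_base::cur == dir
                        && (which & std::ios_base::in)) {
            return pos_type(d_fetched - (egptr() - gptr()));
        }
        return pos_type(off_type(-1));
    }

  public:
    explicit CountingInBuf(std::streambuf *held)
    : d_held_p(held), d_fetched(0), d_byte(0)
    {
        assert(held);
    }
};

// Bytes consumed from the held buffer before binding are not counted.
std::streamoff demoOffsetAfterReading()
{
    std::stringbuf held("0123456789");
    held.sbumpc();                  // consumed before the wrapper is bound
    CountingInBuf wrapper(&held);
    char out[4];
    wrapper.sgetn(out, 4);          // reads "1234" through the wrapper
    return wrapper.pubseekoff(0, std::ios_base::cur);
}
```

In this sketch the reported offset is relative to where the held buffer stood when the wrapper was constructed, which mirrors the behavior described above for a wrapped streambuf that cannot seek.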
Seeks can fail for a number of reasons (see seekoff), and if that happens, the object will enter a "failed seek state", having no valid position, and will no longer be able to perform input or relative seeks until it recovers, either through an absolute seek to 0 or through a call to reset. When the object is in a failed seek state, errorStatus() will equal k_SEEK_FAIL.
If the object has been bound via reset to a held streambuf and is not in a failed seek state, the object is in a valid state.
This section illustrates intended use of this component.
Suppose one has a streambuf, myStreamBuf, containing UTF-8 that one wants to read, checking that it is valid UTF-8.
First, create a Utf8CheckingInStreamBufWrapper that will wrap myStreamBuf:
Then, read the data from the wrapper streambuf until it stops yielding data.
Next, use the errorStatus accessor and pubseekoff manipulator to see what, if anything, went wrong and where.
Now, we observe the output:
Finally, we observe that all the data from myStreamBuf up to offset 79 was read into s, and that it's all correct UTF-8.