BDE 4.14.0 Production release
Provide automata converting to and from Quoted-Printable encodings.
This component provides a class that can be used to encode byte sequences of arbitrary length into the Quoted Printable representation described in Section 6.7 "Quoted-Printable Content Transfer Encoding" of RFC 2045, "Multipurpose Internet Mail Extensions (MIME) Part One: Format of Internet Message Bodies."
Each instance of the encoder retains the state of the conversion from one supplied input to the next, enabling the processing of segmented input – i.e., processing resumes where it left off with the next invocation on new input. Instance methods are provided for the encoder to (1) assert the end of input, (2) determine whether the input so far is currently acceptable, and (3) indicate whether a non-recoverable error has occurred.
This encoding scheme is suitable for encoding arbitrary data consisting primarily of printable text characters. Additionally, this scheme seeks to preserve the integrity of the byte stream during transfer by making it difficult for any intermediate interpreting software in the path of the transfer to disruptively change its content (e.g., because of trailing whitespace and line breaks). For binary data, Base64 encoding may be a more appropriate scheme (see bdlde_base64).
The data stream is processed one byte at a time from left to right as follows:
1. Any 8-bit input character, except a CR or LF, may be represented by an "=" followed by a 2-digit hexadecimal representation of its ASCII value. Only uppercase hexadecimal digits are allowed. For example, the letter n can be encoded into =6E.
2. Characters with decimal values in the range [33..126], with the exception of 61 (=), may be represented literally as they appear before encoding. Hence, in addition to [0-9][a-z][A-Z], the following characters may propagate to the encoded stream unchanged.
3. Space and tab may be represented literally, unless they appear at the end of an encoded line, in which case they must be followed by a = character serving as a soft line break (see rule #5), or they must be encoded according to rule #1. It follows that any trailing whitespace encountered in a Quoted-Printable body must necessarily have been added by intermediate transport agents and must be deleted during decoding.
4. A line break must be represented in the Quoted-Printable encoding as in rule #1, i.e., LF -> =0A; CR -> =0D.
5. Encoded lines are required to be no longer than 76 characters in this encoding scheme. Soft line breaks in the form of an = sign placed at the end of an encoded line are used to break up longer lines, either necessarily when the number of encoded characters, including any = characters but not counting the trailing CRLF, reaches the limit of 76, or at the user's discretion (e.g., during manual encoding). Soft line breaks are to be removed during decoding as they are not part of the original content.
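The rules above can be illustrated with a small, self-contained sketch. This is not the implementation used by this component: the names qpEncode and appendEscaped are hypothetical, the function is one-shot rather than stateful, and it makes the assumption that whitespace is escaped only when it is the very last input byte. It reserves one column for the = of a soft line break, so no encoded line exceeds 76 characters.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not part of BDE): append "=XX" for byte 'c' using
// uppercase hexadecimal digits (rule #1).
static void appendEscaped(std::string& out, unsigned char c)
{
    static const char HEX[] = "0123456789ABCDEF";
    out += '=';
    out += HEX[c >> 4];
    out += HEX[c & 0x0F];
}

// Hypothetical one-shot encoder illustrating rules #1 through #5.
std::string qpEncode(const std::string& input)
{
    const std::size_t MAX_LINE = 76;  // limit, not counting the CRLF
    std::string       out;
    std::size_t       lineLen = 0;

    for (std::size_t i = 0; i < input.size(); ++i) {
        const unsigned char c = static_cast<unsigned char>(input[i]);
        const bool isLast       = (i + 1 == input.size());
        const bool isWhitespace = (c == ' ' || c == '\t');

        // Rule #2: printable characters other than '=' stay literal.
        // Rule #3: whitespace stays literal unless it ends the output.
        const bool literal = (c >= 33 && c <= 126 && c != '=')
                          || (isWhitespace && !isLast);
        const std::size_t width = literal ? 1 : 3;

        // Rule #5: always keep one column free for the '=' of a soft
        // line break, so that no encoded line exceeds 'MAX_LINE'.
        if (lineLen + width > MAX_LINE - 1) {
            out += "=\r\n";
            lineLen = 0;
        }
        if (literal) {
            out += static_cast<char>(c);
        }
        else {
            appendEscaped(out, c);  // rules #1 and #4 (CR, LF, '=', 8-bit)
        }
        lineLen += width;
    }
    return out;
}
```

Under these assumptions, qpEncode("a b ") yields "a b=20", and a run of 80 letters is split into a 75-letter line ending in a soft line break followed by the remaining 5 letters.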
The Quoted-Printable encoding scheme allows one or two forms of encoding depending on the value of the character to be encoded as well as its location with respect to the end of line. When both forms are permissible, the choice is discretionary. For example, the word From is often used as a message separator in the standard UNIX mail folder format. To reduce the chance of a message getting broken, a sentence such as "From point A ..." is often best encoded as "=46rom point A ...", although "From point A ..." is also a valid encoding.
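As a minimal sketch of this discretionary choice, the hypothetical helper below (escapeLeadingFrom is an illustrative name, not part of this component) replaces the leading F of a line that begins with "From " by its Quoted-Printable form =46 and leaves every other line untouched.

```cpp
#include <string>

// Hypothetical helper (not part of BDE): if 'line' begins with "From ",
// replace the leading 'F' (0x46) with its Quoted-Printable form "=46".
std::string escapeLeadingFrom(const std::string& line)
{
    if (line.compare(0, 5, "From ") == 0) {
        return "=46" + line.substr(1);
    }
    return line;  // no mail-folder separator at the start of the line
}
```

Both the input and the result decode to the same text; only the second is safe from being mistaken for a message separator.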
This implementation by default prefers literal encoding to Quoted-Printable encoding. When a space or tab character falls at the end of an encoded line and more input follows, a soft line break is inserted; otherwise, the last line of the encoding is terminated with the Quoted-Printable encoding of the space or tab (a soft line break there would be both redundant and contrived).
In situations where it is desirable to specify certain characters to be encoded to their numeric form, the encoder in this implementation also offers a means to specify these characters through the first parameter to the following constructor
The following examples demonstrate the above rules per the design choices made for this implementation. Note that there is a hard line break at the 77th character position, immediately after "dozing".
Data:
Encoding:
Data:
(The last line of input ends with a whitespace.)
Encoding:
(In this case, a Quoted Printable is preferred to soft line break as there should only be one encoded line.)
The above encoding is acceptable, although it is by no means unique.
(In the following, all rules mentioned refer to those listed in the encoder section above.)
The decoding process for this encoding scheme involves:
= prefix (i.e., concatenating broken sentences) (rule #5).
The standard imposes a maximum of 76 characters exclusive of CRLF; however, the decoder implemented in this component will handle lines of arbitrary length.
The decoder also provides support for two error-reporting modes, configurable at construction: the strict mode and the relaxed mode. A strict-mode decoder stops decoding at the first offending character encountered, while a relaxed-mode decoder continues decoding to the end of the input, allowing straight pass-through of character sets that cannot be interpreted.
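The difference between the two modes can be sketched with a standalone, one-shot decoder. The names qpDecode and hexValue, the boolean strict flag, and the return convention are all assumptions made for illustration; they are not the interface of the decoder in this component. The sketch reverses rules #1 through #5 and treats a = that is followed by neither two uppercase hexadecimal digits nor CRLF as the offending input.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: value of an uppercase hexadecimal digit, or -1.
static int hexValue(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

// Hypothetical one-shot decoder (not the BDE API).  Return 0 on success.
// In strict mode, return -1 at the first malformed '=' sequence; in
// relaxed mode, pass the offending '=' through and keep decoding.
int qpDecode(std::string *out, const std::string& in, bool strict)
{
    out->clear();
    std::string pendingWs;  // literal whitespace not yet known to be kept

    for (std::size_t i = 0; i < in.size(); ++i) {
        const char c = in[i];
        if (c == ' ' || c == '\t') {
            pendingWs += c;      // rule #3: decide once we see what follows
            continue;
        }
        if (c == '\r' || c == '\n') {
            pendingWs.clear();   // rule #3: trailing whitespace is dropped
            *out += c;
            continue;
        }
        *out += pendingWs;       // whitespace was interior, so keep it
        pendingWs.clear();

        if (c != '=') {
            *out += c;           // rule #2: literal character
        }
        else if (in.compare(i + 1, 2, "\r\n") == 0) {
            i += 2;              // rule #5: remove the soft line break
        }
        else if (i + 2 < in.size() && hexValue(in[i + 1]) >= 0
                                   && hexValue(in[i + 2]) >= 0) {
            *out += static_cast<char>(hexValue(in[i + 1]) * 16
                                    + hexValue(in[i + 2]));
            i += 2;              // rules #1 and #4: "=XX" triplet
        }
        else if (strict) {
            return -1;           // strict mode: stop at the offender
        }
        else {
            *out += c;           // relaxed mode: straight pass-through
        }
    }
    return 0;                    // whitespace ending the input is dropped
}
```

With this sketch, the input "a=ZZb" is rejected in strict mode, whereas relaxed mode returns 0 and reproduces "a=ZZb" unchanged.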
This section illustrates intended use of this component.
The following example shows how to use a bdlde::QuotedPrintableEncoder object to implement a function, streamconverter, that reads text from a bsl::istream, encodes that text in Quoted-Printable representation, and writes the encoded text to a bsl::ostream. streamconverter returns 0 on success, and a negative value if the input data could not be successfully encoded or if there is an I/O error.
We will use fixed-sized input and output buffers in the implementation, but, because of the flexibility of bsl::istream and the output-buffer monitoring functionality of QuotedPrintableEncoder, the fixed buffer sizes do not limit the quantity of data that can be read, encoded, or written to the output stream. The implementation file is as follows.
We declare a bdlde::QuotedPrintableEncoder object converter, which will encode the input data. Note that various internal buffers and cursors are used as needed without further comment. We read as much data as is available from the user-supplied input stream is or as much as will fit in inputBuffer before beginning conversion.
With inputBuffer now populated, we'll use converter in an inner while loop to encode the input and write the encoded data to outputBuffer (via the output cursor). Note that if the call to converter.convert fails, our function terminates with a negative status.
If the call to converter.convert returns successfully, we'll see if the output buffer is full, and if so, write its contents to the user-supplied output stream os. Note how we use the values of numOut and numIn generated by convert to update the relevant cursors.
We have now exited both the input and the "encode" loops. converter may still hold encoded output characters, and so we call converter.endConvert to emit any retained output. To guarantee correct behavior, we call this method in an infinite loop, because it is possible that the retained output can fill the output buffer. In that case, we solve the problem by writing the contents of the output buffer to os within the loop. The most likely case, however, is that endConvert will return 0, in which case we exit the loop and write any data remaining in outputBuffer to os. As above, if endConvert fails, we exit the function with a negative return status.
For ease of reading, we repeat the full content of the streamconverter.cpp file without interruption.