BDE 4.14.0 Production release
bdlde_quotedprintableencoder

Detailed Description


Purpose

Provide automata converting to and from Quoted-Printable encodings.

Classes

See also
bdlde_quotedprintabledecoder

Description

This component provides a class that can be used to encode byte sequences of arbitrary length into the Quoted-Printable representation described in Section 6.7 "Quoted-Printable Content Transfer Encoding" of RFC 2045, "Multipurpose Internet Mail Extensions (MIME) Part One: Format of Internet Message Bodies."

Each instance of the encoder retains the state of the conversion from one supplied input to the next, enabling the processing of segmented input – i.e., processing resumes where it left off with the next invocation on new input. Instance methods are provided for the encoder to (1) assert the end of input, (2) determine whether the input so far is currently acceptable, and (3) indicate whether a non-recoverable error has occurred.
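To illustrate why state must be carried from one call to the next, here is a minimal standalone sketch. It is not the BDE implementation: the class `SegmentedEncoderSketch` and its simplified `convert` and `endConvert` methods are hypothetical, and line-length limits are ignored. The only state it keeps is a space or tab that has been held back until it is known whether more input follows.

```cpp
#include <string>

// Hypothetical illustration only; not 'bdlde::QuotedPrintableEncoder'.
class SegmentedEncoderSketch {
    char d_pendingWhitespace;  // 0 if none; otherwise a held-back ' ' or '\t'

  public:
    SegmentedEncoderSketch() : d_pendingWhitespace(0) {}

    // Append the encoding of the specified 'chunk' to the specified 'out'.
    // A trailing space or tab is held back until the next call.
    void convert(std::string *out, const std::string& chunk)
    {
        static const char HEX[] = "0123456789ABCDEF";
        for (std::string::size_type i = 0; i < chunk.size(); ++i) {
            const unsigned char c = static_cast<unsigned char>(chunk[i]);
            if (d_pendingWhitespace) {
                // More input followed, so the whitespace may go out literally.
                *out += d_pendingWhitespace;
                d_pendingWhitespace = 0;
            }
            if (c == ' ' || c == '\t') {
                d_pendingWhitespace = chunk[i];
            }
            else if (c >= 33 && c <= 126 && c != '=') {
                *out += chunk[i];
            }
            else {
                *out += '=';
                *out += HEX[c >> 4];
                *out += HEX[c & 0x0F];
            }
        }
    }

    // Assert the end of input: whitespace still held back is now known to be
    // trailing, so it is emitted in its "=XX" form.
    void endConvert(std::string *out)
    {
        if (d_pendingWhitespace) {
            *out += (d_pendingWhitespace == ' ' ? "=20" : "=09");
            d_pendingWhitespace = 0;
        }
    }
};
```

Feeding "Hello, " and then "world. " as two chunks, followed by `endConvert`, yields the same result as encoding the whole string at once.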

Quoted-Printable Encoding

This encoding scheme is suitable for encoding arbitrary data consisting primarily of printable text characters. Additionally, this scheme seeks to preserve the integrity of the byte stream during transfer by making it difficult for any intermediate interpreting software in the path of the transfer to disruptively change its content (e.g., because of trailing whitespace and line breaks). For binary data, Base64 encoding may be a more appropriate scheme (see bdlde_base64 ).

The data stream is processed one byte at a time from left to right as follows:

1. General 8-Bit Representation

Any 8-bit input character, except a CR or LF, may be represented by an "=" followed by a 2-digit hexadecimal representation of its ASCII value. Only uppercase hexadecimal digits are allowed. For example, the letter n can be encoded into =6E.
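As a minimal sketch of this rule, the hypothetical helper `encodeOctet` below (not part of BDE) builds the three-character form of a single octet using uppercase hexadecimal digits.

```cpp
#include <string>

// Hypothetical helper (not part of BDE): return the three-character "=XX"
// form of the specified 'octet', using uppercase hexadecimal digits only.
std::string encodeOctet(unsigned char octet)
{
    static const char HEX[] = "0123456789ABCDEF";
    std::string result = "=";
    result += HEX[octet >> 4];    // high nibble
    result += HEX[octet & 0x0F];  // low nibble
    return result;
}
```

For instance, `encodeOctet('n')` produces `=6E` and `encodeOctet('=')` produces `=3D`.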

2. Literal Representation

Characters with decimal values in the range [33..126], with the exception of 61 (=), may be represented literally as they appear before encoding. Hence, in addition to [0-9][a-z][A-Z], the following characters may propagate to the encoded stream unchanged.

[!"#$%&'()*+,-./:;<>?@[\]^_`{|}~]
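The literal range can be expressed as a one-line predicate. The name `isLiteralSafe` is hypothetical and chosen for illustration only; space and tab are deliberately excluded because they are subject to a separate rule.

```cpp
// Hypothetical predicate (not part of BDE): return 'true' if the specified
// 'c' may appear unchanged in the encoded stream, i.e., if its value is in
// the range [33..126] and it is not '=' (61).
bool isLiteralSafe(unsigned char c)
{
    return c >= 33 && c <= 126 && c != '=';
}
```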

3. Whitespace

Space and tab may be represented literally, unless they appear at the end of an encoded line, in which case they must be followed by a = character serving as a soft line break (see rule #5), or they must be encoded according to rule #1. It follows that any trailing whitespace encountered in a Quoted-Printable body must have been added by intermediate transport agents and must be deleted during decoding.
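A small sketch of the second option (encoding trailing whitespace numerically) follows. The helper `protectTrailingWhitespace` is hypothetical and assumes that its argument is the final line of an otherwise already-encoded body.

```cpp
#include <string>

// Hypothetical helper (not part of BDE): return a copy of the specified
// 'line' in which a final space or tab, if any, is replaced by "=20" or
// "=09".  Earlier whitespace is left alone because it is no longer at the
// end of the line once the last character has been encoded.
std::string protectTrailingWhitespace(const std::string& line)
{
    if (line.empty()) {
        return line;
    }
    const char last = line[line.size() - 1];
    if (last != ' ' && last != '\t') {
        return line;
    }
    std::string result(line, 0, line.size() - 1);
    result += (last == ' ' ? "=20" : "=09");
    return result;
}
```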

4. Line Breaks

A line break must be represented in the Quoted-Printable encoding as in rule number 1, i.e., LF -> =0A; CR -> =0D.

5. Soft Line Breaks

Encoded lines are required to be no longer than 76 characters in this encoding scheme. Soft line breaks in the form of an = sign placed at the end of an encoded line are used to break up longer lines, either necessarily when the number of encoded characters, including any = characters but not counting the trailing CRLF, reaches the limit of 76, or at the user's discretion – e.g., during manual encoding. Soft line breaks are to be removed during decoding as they are not part of the original content.
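The line-length rule can be sketched as a post-processing pass over an already-encoded stream. The function `insertSoftBreaks` is hypothetical and makes two simplifying assumptions: the input contains no line breaks of its own, and every line (including the last) is conservatively limited to one character less than the maximum so that a trailing = always fits.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper (not part of BDE): split the specified 'encoded' text
// (literals and "=XX" triplets, no line breaks) into lines of at most the
// specified 'maxLineLength' characters, counting the trailing '=' of a soft
// line break but not the CRLF.  A triplet is never split across lines.
std::string insertSoftBreaks(const std::string& encoded,
                             std::size_t        maxLineLength = 76)
{
    assert(maxLineLength >= 4);  // room for one triplet plus the '='

    std::string result;
    std::size_t column = 0;
    for (std::size_t i = 0; i < encoded.size(); ) {
        // A token is either a three-character triplet or a single literal.
        std::size_t tokenLength = (encoded[i] == '=') ? 3 : 1;
        if (tokenLength > encoded.size() - i) {
            tokenLength = encoded.size() - i;  // tolerate a truncated triplet
        }
        if (column + tokenLength > maxLineLength - 1) {
            result += "=\r\n";  // soft line break
            column = 0;
        }
        result.append(encoded, i, tokenLength);
        column += tokenLength;
        i += tokenLength;
    }
    return result;
}
```

With the default limit, 76 literal characters come out as a 75-character line ending in =, followed by a second line holding the remaining character.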

The Quoted-Printable encoding scheme allows one or two forms of encoding depending on the value of the character to be encoded as well as its location with respect to the end of line. When both forms are permissible, the choice is discretionary. For example, the word From is often used as a message separator in the standard UNIX mail folder format. To reduce the chance of a message getting broken, a sentence such as "From point A ..." is often best encoded as "=46rom point A ...", although "From point A ..." is also a valid encoding.
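As an illustration of exercising that discretion, the hypothetical helper `protectFromLine` below (not part of BDE) encodes the leading F only when a line starts with "From " and leaves every other line unchanged.

```cpp
#include <string>

// Hypothetical helper (not part of BDE): if the specified 'line' begins with
// "From ", return it with the leading 'F' replaced by "=46"; otherwise return
// 'line' unchanged.
std::string protectFromLine(const std::string& line)
{
    if (line.compare(0, 5, "From ") == 0) {
        return "=46" + line.substr(1);
    }
    return line;
}
```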

This implementation by default prefers literal encoding to Quoted-Printable encoding. When a space or tab character falls at the end of an encoded line and there is more input to follow, a soft line break is inserted; otherwise, the last line of encoding is terminated with the Quoted-Printable encoding of the space or tab (a soft line break there would be both redundant and contrived).

In situations where it is desirable to specify certain characters to be encoded to their numeric form, the encoder in this implementation also offers a means to specify these characters through the first parameter to the following constructor:

bdlde::QuotedPrintableEncoder(
    const char *extraCharsToEncode,
    bdlde::QuotedPrintableEncoder::LineBreakMode lineBreakMode =
        bdlde::QuotedPrintableEncoder::e_CRLF_MODE,
    int maxLineLength =
        bdlde::QuotedPrintableEncoder::k_DEFAULT_MAX_LINELEN);
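The effect of such a parameter can be sketched without BDE. The function `encodeWithExtras` below is hypothetical and is not how the component is implemented; it ignores line-length limits and trailing whitespace, and shows only how a caller-supplied set of characters can override the preference for literal encoding.

```cpp
#include <cstring>
#include <string>

// Hypothetical sketch (not the BDE implementation): encode the specified
// 'text', emitting characters literally where allowed unless they appear in
// the specified null-terminated 'extraCharsToEncode'.
std::string encodeWithExtras(const std::string&  text,
                             const char         *extraCharsToEncode)
{
    static const char HEX[] = "0123456789ABCDEF";
    std::string result;
    for (std::string::size_type i = 0; i < text.size(); ++i) {
        const unsigned char c = static_cast<unsigned char>(text[i]);
        const bool literal = (c >= 33 && c <= 126 && c != '=')
                          || c == ' ' || c == '\t';
        const bool forced  = std::strchr(extraCharsToEncode, c) != 0;
        if (literal && !forced) {
            result += text[i];
        }
        else {
            result += '=';
            result += HEX[c >> 4];
            result += HEX[c & 0x0F];
        }
    }
    return result;
}
```

Passing "F" as the extra set turns "From A" into "=46rom A", whereas an empty set leaves it unchanged.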

The following examples demonstrate the above rules per the design choices made for this implementation. Note that there is a hard line break at the 77th character position, immediately after "dozing".

Example 1

Data:

From point A to point B, the distance is 1245.56 miles. Driving at a dozing
speed of 15mph, it will take 2 hours to complete the trip.

Encoding:

=46rom point A to point B, the distance is 1245.56 miles. Driving at a doz=
ing=0D=0A speed of 15mph, =
it will take 2 hours=
to complete the trip.

Example 2

Data:

Hello, world.

(The last line of input ends with a whitespace.)

Encoding:

Hello, world.=20

(In this case, a Quoted-Printable encoding of the trailing space is preferred to a soft line break, as there should be only one encoded line.)

The above encoding is acceptable, although it is by no means unique.

Quoted-Printable Decoding

(In the following, all rules mentioned refer to those listed in the encoder section above.)

The decoding process for this encoding scheme involves:

  1. transforming any encoded character triplets back into their original representation (rule #1 and rule #4).
  2. literally writing out characters that have not been changed (rule #2).
  3. deleting any trailing whitespace at the end of an encoded line (rule #3).
  4. removing the soft line breaks including the = prefix (i.e., concatenating broken sentences) (rule #5).
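The four steps above can be sketched as a single pass over CRLF-delimited lines. This is a simplified, hypothetical decoder and not the one provided by bdlde_quotedprintabledecoder. Two behaviors are assumptions made for brevity: malformed triplets are passed through unchanged, and a CRLF that is not preceded by a soft line break is copied to the output.

```cpp
#include <string>

// Return the value of the specified uppercase hexadecimal digit 'c', or -1.
int hexValue(char c)
{
    if (c >= '0' && c <= '9') { return c - '0'; }
    if (c >= 'A' && c <= 'F') { return c - 'A' + 10; }
    return -1;
}

// Hypothetical sketch: decode the specified Quoted-Printable 'encoded' text.
std::string decodeQuotedPrintable(const std::string& encoded)
{
    typedef std::string::size_type Pos;
    std::string result;
    Pos pos = 0;
    while (pos < encoded.size()) {
        // Isolate one encoded line, excluding its CRLF terminator.
        Pos eol = encoded.find("\r\n", pos);
        const bool hasLineBreak = (eol != std::string::npos);
        if (!hasLineBreak) {
            eol = encoded.size();
        }
        Pos end = eol;

        // Step 3: delete trailing whitespace.
        while (end > pos && (encoded[end - 1] == ' '
                          || encoded[end - 1] == '\t')) {
            --end;
        }

        // Step 4: remove a soft line break, including its '=' prefix.
        const bool softBreak = (end > pos && encoded[end - 1] == '=');
        if (softBreak) {
            --end;
        }

        // Steps 1 and 2: decode triplets; copy everything else literally.
        for (Pos i = pos; i < end; ++i) {
            int hi = -1;
            int lo = -1;
            if (encoded[i] == '=' && end - i >= 3) {
                hi = hexValue(encoded[i + 1]);
                lo = hexValue(encoded[i + 2]);
            }
            if (hi >= 0 && lo >= 0) {
                result += static_cast<char>(hi * 16 + lo);
                i += 2;
            }
            else {
                result += encoded[i];
            }
        }

        if (hasLineBreak && !softBreak) {
            result += "\r\n";  // assumption: pass hard line breaks through
        }
        pos = hasLineBreak ? eol + 2 : eol;
    }
    return result;
}
```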

The standard imposes a maximum of 76 characters exclusive of CRLF; however, the decoder implemented in this component will handle lines of arbitrary length.

The decoder also provides support for two error-reporting modes, configurable at construction: the strict mode and the relaxed mode. A strict-mode decoder stops decoding at the first offending character encountered, while a relaxed-mode decoder continues decoding to the end of the input, allowing straight pass-through of character sets that cannot be interpreted.

Usage

This section illustrates intended use of this component.

Example 1: Encoding

The following example shows how to use a bdlde::QuotedPrintableEncoder object to implement a function, streamconverter, that reads text from a bsl::istream, encodes that text in Quoted-Printable representation, and writes the encoded text to a bsl::ostream. streamconverter returns 0 on success, and a negative value if the input data could not be successfully encoded or if there is an I/O error.

// streamconverter.h -*-C++-*-
#include <bsl_iostream.h>

/// Read the entire contents of the specified input stream `is`, convert
/// the input plain text to quoted-printable encoding, and write the
/// encoded text to the specified output stream `os`. Return 0 on
/// success, and a negative value otherwise.
int streamconverter(bsl::ostream& os, bsl::istream& is);

We will use fixed-sized input and output buffers in the implementation, but, because of the flexibility of bsl::istream and the output-buffer monitoring functionality of QuotedPrintableEncoder, the fixed buffer sizes do not limit the quantity of data that can be read, encoded, or written to the output stream. The implementation file is as follows.

// streamconverter.cpp -*-C++-*-
#include <streamconverter.h>
#include <bdlde_quotedprintableencoder.h>
#include <bsl_cassert.h>
#include <bsl_iostream.h>

namespace BloombergLP {

int streamconverter(bsl::ostream& os, bsl::istream& is)
{
    enum {
        SUCCESS      =  0,
        ENCODE_ERROR = -1,
        IO_ERROR     = -2
    };

We declare a bdlde::QuotedPrintableEncoder object converter, which will encode the input data. Note that various internal buffers and cursors are used as needed without further comment. We read as much data as is available from the user-supplied input stream is or as much as will fit in inputBuffer before beginning conversion.

    bdlde::QuotedPrintableEncoder converter;

    const int INBUFFER_SIZE  = 1 << 10;
    const int OUTBUFFER_SIZE = 1 << 10;

    char  inputBuffer[INBUFFER_SIZE];
    char  outputBuffer[OUTBUFFER_SIZE];
    char *output    = outputBuffer;
    char *outputEnd = outputBuffer + sizeof outputBuffer;

    while (is.good()) {  // input stream not exhausted
        is.read(inputBuffer, sizeof inputBuffer);

With inputBuffer now populated, we'll use converter in an inner while loop to encode the input and write the encoded data to outputBuffer (via the output cursor). Note that if the call to converter.convert fails, our function terminates with a negative status.

        const char *input    = inputBuffer;
        const char *inputEnd = input + is.gcount();

        while (input < inputEnd) {  // input encoding not complete
            int numOut;
            int numIn;
            int status = converter.convert(output, &numOut, &numIn,
                                           input, inputEnd,
                                           outputEnd - output);
            if (status < 0) {
                return ENCODE_ERROR;                                  // RETURN
            }

If the call to converter.convert returns successfully, we'll see if the output buffer is full, and if so, write its contents to the user-supplied output stream os. Note how we use the values of numOut and numIn generated by convert to update the relevant cursors.

            output += numOut;
            input  += numIn;

            if (output == outputEnd) {  // output buffer full; write data
                os.write(outputBuffer, sizeof outputBuffer);
                if (os.fail()) {
                    return IO_ERROR;                                  // RETURN
                }
                output = outputBuffer;
            }
        }
    }

We have now exited both the input and the "encode" loops. converter may still hold encoded output characters, and so we call converter.endConvert to emit any retained output. To guarantee correct behavior, we call this method in an infinite loop, because it is possible that the retained output can fill the output buffer. In that case, we solve the problem by writing the contents of the output buffer to os within the loop. The most likely case, however, is that endConvert will return 0, in which case we exit the loop and write any data remaining in outputBuffer to os. As above, if endConvert fails, we exit the function with a negative return status.

    for (;;) {
        int numOut;  // the 'numOut' of the encode loop is out of scope here
        int more =
            converter.endConvert(output, &numOut, outputEnd - output);
        if (more < 0) {
            return ENCODE_ERROR;                                      // RETURN
        }
        output += numOut;
        if (!more) {  // no more output
            break;
        }
        assert(output == outputEnd);  // output buffer is full
        os.write(outputBuffer, sizeof outputBuffer);  // write buffer
        if (os.fail()) {
            return IO_ERROR;                                          // RETURN
        }
        output = outputBuffer;
    }

    if (output > outputBuffer) {  // still data in output buffer; write it all
        os.write(outputBuffer, output - outputBuffer);
    }

    return is.eof() && os.good() ? SUCCESS : IO_ERROR;
}

}  // close namespace BloombergLP

For ease of reading, we repeat the full content of the streamconverter.cpp file without interruption.

// streamconverter.cpp -*-C++-*-
#include <streamconverter.h>
#include <bdlde_quotedprintableencoder.h>
#include <bsl_cassert.h>
#include <bsl_iostream.h>

namespace BloombergLP {

int streamconverter(bsl::ostream& os, bsl::istream& is)
{
    enum {
        SUCCESS      =  0,
        ENCODE_ERROR = -1,
        IO_ERROR     = -2
    };

    bdlde::QuotedPrintableEncoder converter;

    const int INBUFFER_SIZE  = 1 << 10;
    const int OUTBUFFER_SIZE = 1 << 10;

    char  inputBuffer[INBUFFER_SIZE];
    char  outputBuffer[OUTBUFFER_SIZE];
    char *output    = outputBuffer;
    char *outputEnd = outputBuffer + sizeof outputBuffer;

    while (is.good()) {  // input stream not exhausted
        is.read(inputBuffer, sizeof inputBuffer);

        const char *input    = inputBuffer;
        const char *inputEnd = input + is.gcount();

        while (input < inputEnd) {  // input encoding not complete
            int numOut;
            int numIn;
            int status = converter.convert(output, &numOut, &numIn,
                                           input, inputEnd,
                                           outputEnd - output);
            if (status < 0) {
                return ENCODE_ERROR;                                  // RETURN
            }
            output += numOut;
            input  += numIn;

            if (output == outputEnd) {  // output buffer full; write data
                os.write(outputBuffer, sizeof outputBuffer);
                if (os.fail()) {
                    return IO_ERROR;                                  // RETURN
                }
                output = outputBuffer;
            }
        }
    }

    for (;;) {
        int numOut;  // the 'numOut' of the encode loop is out of scope here
        int more =
            converter.endConvert(output, &numOut, outputEnd - output);
        if (more < 0) {
            return ENCODE_ERROR;                                      // RETURN
        }
        output += numOut;
        if (!more) {  // no more output
            break;
        }
        assert(output == outputEnd);  // output buffer is full
        os.write(outputBuffer, sizeof outputBuffer);  // write buffer
        if (os.fail()) {
            return IO_ERROR;                                          // RETURN
        }
        output = outputBuffer;
    }

    if (output > outputBuffer) {  // still data in output buffer; write it all
        os.write(outputBuffer, output - outputBuffer);
    }

    return is.eof() && os.good() ? SUCCESS : IO_ERROR;
}

}  // close namespace BloombergLP