BDE 4.14.0 Production release
bdlb_tokenizer

Detailed Description

Outline

Purpose

Provide access to user-described tokens via string references.

Classes

See also
bslstl_stringref

Description

This component defines a mechanism, bdlb::Tokenizer, that provides non-destructive sequential (read-only) access to tokens in a given input string as characterized by two disjoint sets of user-specified delimiter characters, each of which is supplied at construction via either a const bsl::string_view& or (for efficiency, when only the leading characters of the input string may need to be parsed) a const char *. Note that each character (including '\0') that is not explicitly designated as a delimiter character is assumed to be a token character.

Soft versus Hard Delimiters

The tokenizer recognizes two distinct kinds of delimiter characters, soft and hard.

A soft delimiter is a maximal (non-empty) sequence of soft-delimiter characters. Soft delimiters, typically whitespace characters, are used to separate (rather than terminate) tokens, and thus never result in an empty token.

A hard delimiter is a maximal (non-empty) sequence of delimiter characters consisting of exactly one hard-delimiter character. Hard delimiters, typically printable punctuation characters such as slash (/) or colon (:), are used to terminate (rather than just separate) tokens, and thus a hard delimiter that is not preceded by a token character results in an empty token.

Soft delimiters are used in applications where multiple consecutive delimiter characters are to be treated as just a single delimiter. For example, if we want the input string "Sticks  and stones" (note the two consecutive spaces) to parse into a sequence of three non-empty tokens ["Sticks", "and", "stones"], rather than the four-token sequence ["Sticks", "", "and", "stones"], we would make the space (' ') a soft-delimiter character.

Hard delimiters are used in applications where consecutive delimiter characters are to be treated as separate delimiters, giving rise to the possibility of empty tokens. Making the slash (/) in the standard date format a hard delimiter for the input string "15//9" yields the three-token sequence ["15", "", "9"], rather than the two-token one ["15", "9"] had it been made soft.

All members within each respective character set are considered equivalent with respect to tokenization. For example, making / and : soft delimiter characters on the questionably formatted date "2015/:10:/31" would yield the token sequence ["2015", "10", "31"], whereas making / and : hard delimiter characters would result in the token sequence ["2015", "", "10", "", "31"]. Making either of these two delimiter characters hard and the other soft would, in this example, yield the former (shorter) sequence of tokens. The details of how soft and hard delimiters interact are illustrated in the following section (see also the later section, "Comprehensive Detailed Parsing Specification").
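The soft/hard rules above can be sketched without BDE. The function below is a hypothetical helper, `splitTokens`, written against the standard library only; it is not how bdlb::Tokenizer is implemented, and it copies tokens into `std::string` objects rather than returning string references. It is meant only to make the date examples concrete:

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not part of BDE): return the tokens of 'input',
// treating the characters of 'soft' and 'hard' as the two delimiter sets.
std::vector<std::string> splitTokens(const std::string& input,
                                     const std::string& soft,
                                     const std::string& hard)
{
    auto isSoft = [&](char c) { return soft.find(c) != std::string::npos; };
    auto isHard = [&](char c) { return hard.find(c) != std::string::npos; };

    std::vector<std::string> tokens;
    const std::size_t        n = input.size();
    std::size_t              i = 0;

    while (i < n && isSoft(input[i])) ++i;          // skip the leader

    while (i < n) {
        // A token is a (possibly empty) run of non-delimiter characters.
        const std::size_t begin = i;
        while (i < n && !isSoft(input[i]) && !isHard(input[i])) ++i;
        tokens.push_back(input.substr(begin, i - begin));

        // A delimiter holds at most one hard character, with any number
        // of soft characters on either side of it.
        while (i < n && isSoft(input[i])) ++i;
        if (i < n && isHard(input[i])) {
            ++i;
            while (i < n && isSoft(input[i])) ++i;
        }
    }
    return tokens;
}
```

Under these assumptions, `splitTokens("2015/:10:/31", "/:", "")` gives three tokens, `splitTokens("2015/:10:/31", "", "/:")` gives five (two of them empty), and making one character soft and the other hard gives three again.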

The Input String to be Tokenized

Each input string consists of an optional leading sequence of soft-delimiter characters called the leader, followed by an alternating sequence of tokens and delimiters (the final delimiter being optional):

Input String:
+--------+---------+-------------+---...---+---------+-------------+
| leader | token_1 | delimiter_1 |         | token_N | delimiter_N |
+--------+---------+-------------+---...---+---------+-------------+
(optional)                                             (optional)

The tokenization of a string can also be expressed in pseudo-POSIX regular expression notation:

delimiter = [[:soft:]]+ | [[:soft:]]* [[:hard:]] [[:soft:]]*
token = [^[:soft:][:hard:]]*
string = [[:soft:]]* (token delimiter)* token?

Parsing is from left to right and is greedy (i.e., the longest sequence satisfying the regular expression is the one that matches). For example, let s represent the start of a soft delimiter, h the start of a hard delimiter, ^ the start of a token, and ~ the continuation of that same delimiter or token. Using . as a soft delimiter and / as a hard one, the string

 s~   h~  h~~  h~     s~   hh  s  h~h    h~~~    Delimiters
"..One/.if./.by./land,..two//if.by/./sea!./.."
   ^~~  ^~   ^~  ^~~~~  ^~~ ^^~ ^~  ^^~~~        Tokens
                            |       |
                         (empty) (empty)

yields the tokenization

[One] [if] [by] [land,] [two] [] [if] [by] [] [sea!] Tokens
(..) (/.) (./.) (./) (..) (/)(/) (.) (/.)(/) (./..) Delims

Notice that in the pair of hard delimiters /./ before the token "sea!", the soft-delimiter character between the two hard ones binds to the earlier delimiter.
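To check the worked example by hand, it can help to see the grammar as a small scanner that records the leader and each (token, delimiter) pair. The sketch below uses only the standard library; `Parsed` and `parseWithDelimiters` are hypothetical names introduced here, not BDE types, and the code is an illustration of the grammar rather than the component's implementation:

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical result type (not a BDE type): the leader, plus each token
// paired with the delimiter that follows it.
struct Parsed {
    std::string                                      leader;
    std::vector<std::pair<std::string, std::string>> items;
};

// Scan 'input' according to 'string = soft* (token delimiter)* token?'.
Parsed parseWithDelimiters(const std::string& input,
                           const std::string& soft,
                           const std::string& hard)
{
    auto isSoft = [&](char c) { return soft.find(c) != std::string::npos; };
    auto isHard = [&](char c) { return hard.find(c) != std::string::npos; };

    Parsed            result;
    const std::size_t n = input.size();
    std::size_t       i = 0;

    while (i < n && isSoft(input[i])) ++i;
    result.leader = input.substr(0, i);

    while (i < n) {
        const std::size_t tokenBegin = i;
        while (i < n && !isSoft(input[i]) && !isHard(input[i])) ++i;

        // Greedy delimiter: soft*, then at most one hard, then soft*.
        // Soft characters after a hard one therefore bind to that hard
        // character, not to the next delimiter.
        const std::size_t delimBegin = i;
        while (i < n && isSoft(input[i])) ++i;
        if (i < n && isHard(input[i])) {
            ++i;
            while (i < n && isSoft(input[i])) ++i;
        }
        result.items.emplace_back(
                         input.substr(tokenBegin, delimBegin - tokenBegin),
                         input.substr(delimBegin, i - delimBegin));
    }
    return result;
}
```

Calling `parseWithDelimiters("..One/.if./.by./land,..two//if.by/./sea!./..", ".", "/")` yields the leader ".." and ten pairs, with the two empty tokens at indices 5 and 8, each followed by the single-character delimiter "/".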

Iterating using a TokenizerIterator object (ACCESS TO TOKENS ONLY)

This component provides two separate mechanisms by which a user may iterate over a sequence of tokens. The first mechanism is as a token range, exposed by the TokenizerIterator objects returned by the begin and end methods on a Tokenizer object. A TokenizerIterator supports the concept of a standard input iterator, returning each successive token as a bslstl::StringRef, making it suitable for generic use – e.g., in a range-based for loop:

/// Print, to the specified `output` stream, each whitespace-delimited
/// token in the specified `input` string on a separate line following
/// a vertical bar ('|') and a hard space (' ').
void parse_1(bsl::ostream& output, const char *input)
{
    const char softDelimiters[] = " \t\n";  // whitespace
    for (bslstl::StringRef token : bdlb::Tokenizer(input, softDelimiters)) {
        output << "| " << token << bsl::endl;
    }
}

The parse_1 function above produces each (non-whitespace) token in the supplied input string on a separate line. So, were parse_1 to be given a reference to bsl::cout and the input string

" Times like \tthese\n try \n \t men's\t \tsouls.\n"

we would expect

| Times
| like
| these
| try
| men's
| souls.

to be displayed on bsl::cout. Note that there is no way to access the delimiters from a TokenizerIterator directly; for that, we need to use the tokenizer itself as a non-standard "iterator".

Iterating using a Tokenizer object (ACCESS TO TOKENS AND DELIMITERS)

The second mechanism, not intended for generic use, provides direct access to the previous and current (trailing) delimiters as well as the current token:

/// Print, to the specified `output` stream, the leader of the specified
/// `input` on a single line, followed by subsequent current token and
/// (trailing) delimiter pairs on successive lines, each line beginning
/// with a vertical bar ('|') followed by a tab ('\t') character.
void parse_2(bsl::ostream& output, const char *input)
{
    const char softDelimiters[] = " ";
    const char hardDelimiters[] = ":/";
    bdlb::Tokenizer it(input, softDelimiters, hardDelimiters);
    output << "|\t" << '"' << it.previousDelimiter() << '"' << "\n";
    for (; it.isValid(); ++it) {
        output << "|\t"
               << '"' << it.token() << '"'
               << "\t"
               << '"' << it.trailingDelimiter() << '"'
               << "\n";
    }
}

The parse_2 function above produces the leader on the first line, followed by each token along with its current (trailing) delimiter on successive lines. So, were parse_2 to be given a reference to bsl::cout and the input string

" I've :been: a : :bad:/ boy! / "

we would expect

|       " "
|       "I've"  " :"
|       "been"  ": "
|       "a"     " : "
|       ""      ":"
|       "bad"   ":"
|       ""      "/ "
|       "boy!"  " / "

to be displayed on bsl::cout.

Token and Delimiter Lifetimes

All tokens and delimiters are returned efficiently by value as bslstl::StringRef objects, which naturally remain valid so long as the underlying input string remains unchanged, irrespective of the validity of the tokenizer or any of its dispensed token iterators. Note, however, that all such token iterators are invalidated if the parent tokenizer object is destroyed or reset. Note also that the previous delimiter remains accessible from a tokenizer object even after it has reached the end of its input, and that the leader is accessible via the previousDelimiter method prior to advancing the iteration state of the Tokenizer.

Comprehensive Detailed Parsing Specification

This section provides a comprehensive (length-ordered) enumeration of how the bdlb::Tokenizer performs, according to its three (non-null) character types:

'.' = any *soft* delimiter character
'#' = any *hard* delimiter character
'T' = any token character

Here's how iteration progresses for various input strings. Note that input strings having consecutive characters of the same category that naturally coalesce (i.e., behave as if they were a single character of that category), namely soft-delimiter or token characters, are labeled with (%). For example, consider the input ".." at the top of the [length 2] section below. The table indicates, with a (%) in the first column, that the input acts the same as if it were a single (soft-delimiter) character (i.e., "."). There is only one line in this row of the table because, upon construction, the iterator is immediately invalid (as indicated by the right-most column). Now consider the "##" entry near the bottom of [length 2]. These (hard-delimiter) characters do not coalesce. What's more, the iterator on construction is valid and produces an empty leader and an empty first token. After advancing the tokenizer, the second line of that row shows the current state of iteration, with the previous delimiter being a # as well as the current one. The current token is again shown as empty. After advancing the tokenizer again, we see that the iterator is invalid, yet the previous delimiter (still accessible) is a #.

(%) = repeat   Previous   Current  Current    Iterator
Input String   Delimiter  Token    Delimiter  Status
============   =========  =======  =========  ========  [length 0]
""             ""         na       na         invalid
============   =========  =======  =========  ========  [length 1]
"."            "."        na       na         invalid
------------   ---------  -------  ---------  --------
"#"            ""         ""       "#"        valid
               "#"        na       na         invalid
------------   ---------  -------  ---------  --------
"T"            ""         "T"      ""         valid
               ""         na       na         invalid
============   =========  =======  =========  ========  [length 2]
".."  (%)      ".."       na       na         invalid
------------   ---------  -------  ---------  --------
".#"           "."        ""       "#"        valid
               "#"        na       na         invalid
------------   ---------  -------  ---------  --------
".T"           "."        "T"      ""         valid
               ""         na       na         invalid
------------   ---------  -------  ---------  --------
"#."           ""         ""       "#."       valid
               "#."       na       na         invalid
------------   ---------  -------  ---------  --------
"##"           ""         ""       "#"        valid
               "#"        ""       "#"        valid
               "#"        na       na         invalid
------------   ---------  -------  ---------  --------
"#T"           ""         ""       "#"        valid
               "#"        "T"      ""         valid
               ""         na       na         invalid
------------   ---------  -------  ---------  --------
"T."           ""         "T"      "."        valid
               "."        na       na         invalid
------------   ---------  -------  ---------  --------
"T#"           ""         "T"      "#"        valid
               "#"        na       na         invalid
------------   ---------  -------  ---------  --------
"TT"  (%)      ""         "TT"     ""         valid
               ""         na       na         invalid
============   =========  =======  =========  ========  [length 3]
"..." (%)      "..."      na       na         invalid
------------   ---------  -------  ---------  --------
"..#" (%)      ".."       ""       "#"        valid
               "#"        na       na         invalid
------------   ---------  -------  ---------  --------
"..T" (%)      ".."       "T"      ""         valid
               ""         na       na         invalid
------------   ---------  -------  ---------  --------
".#."          "."        ""       "#."       valid
               "#."       na       na         invalid
------------   ---------  -------  ---------  --------
".##"          "."        ""       "#"        valid
               "#"        ""       "#"        valid
               "#"        na       na         invalid
------------   ---------  -------  ---------  --------
".#T"          "."        ""       "#"        valid
               "#"        "T"      ""         valid
               ""         na       na         invalid
------------   ---------  -------  ---------  --------
".T."          "."        "T"      "."        valid
               "."        na       na         invalid
------------   ---------  -------  ---------  --------
".T#"          "."        "T"      "#"        valid
               "#"        na       na         invalid
------------   ---------  -------  ---------  --------
".TT" (%)      "."        "TT"     ""         valid
               ""         na       na         invalid
------------   ---------  -------  ---------  --------
"#.." (%)      ""         ""       "#.."      valid
               "#.."      na       na         invalid
------------   ---------  -------  ---------  --------
"#.#"          ""         ""       "#."       valid
               "#."       ""       "#"        valid
               "#"        na       na         invalid
------------   ---------  -------  ---------  --------
"#.T"          ""         ""       "#."       valid
               "#."       "T"      ""         valid
               ""         na       na         invalid
------------   ---------  -------  ---------  --------
"##."          ""         ""       "#"        valid
               "#"        ""       "#."       valid
               "#."       na       na         invalid
------------   ---------  -------  ---------  --------
"###"          ""         ""       "#"        valid
               "#"        ""       "#"        valid
               "#"        ""       "#"        valid
               "#"        na       na         invalid
------------   ---------  -------  ---------  --------
"##T"          ""         ""       "#"        valid
               "#"        ""       "#"        valid
               "#"        "T"      ""         valid
               ""         na       na         invalid
------------   ---------  -------  ---------  --------
"#T."          ""         ""       "#"        valid
               "#"        "T"      "."        valid
               "."        na       na         invalid
------------   ---------  -------  ---------  --------
"#T#"          ""         ""       "#"        valid
               "#"        "T"      "#"        valid
               "#"        na       na         invalid
------------   ---------  -------  ---------  --------
"#TT" (%)      ""         ""       "#"        valid
               "#"        "TT"     ""         valid
               ""         na       na         invalid
------------   ---------  -------  ---------  --------
"T.." (%)      ""         "T"      ".."       valid
               ".."       na       na         invalid
------------   ---------  -------  ---------  --------
"T.#"          ""         "T"      ".#"       valid
               ".#"       na       na         invalid
------------   ---------  -------  ---------  --------
"T.T"          ""         "T"      "."        valid
               "."        "T"      ""         valid
               ""         na       na         invalid
------------   ---------  -------  ---------  --------
"T#."          ""         "T"      "#."       valid
               "#."       na       na         invalid
------------   ---------  -------  ---------  --------
"T##"          ""         "T"      "#"        valid
               "#"        ""       "#"        valid
               "#"        na       na         invalid
------------   ---------  -------  ---------  --------
"T#T"          ""         "T"      "#"        valid
               "#"        "T"      ""         valid
               ""         na       na         invalid
------------   ---------  -------  ---------  --------
"TT." (%)      ""         "TT"     "."        valid
               "."        na       na         invalid
------------   ---------  -------  ---------  --------
"TT#" (%)      ""         "TT"     "#"        valid
               "#"        na       na         invalid
------------   ---------  -------  ---------  --------
"TTT" (%)      ""         "TTT"    ""         valid
               ""         na       na         invalid
------------   ---------  -------  ---------  --------
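The four columns of the table can be reproduced with a small stateful object. The class below, `TableCursor`, is a hypothetical standard-library-only stand-in whose accessor names merely echo those used in this component; it copies its input and returns `std::string` objects, so it does not model the lifetime or efficiency properties of bdlb::Tokenizer. It assumes '.' and '#' are passed as the soft and hard sets when checking rows:

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical stand-in (not 'bdlb::Tokenizer'): tracks the previous
// delimiter, current token, current delimiter, and validity for one input.
class TableCursor {
    std::string d_input, d_soft, d_hard;
    std::string d_previous, d_token, d_trailing;
    std::size_t d_pos;
    bool        d_valid;

    bool isSoft(char c) const { return d_soft.find(c) != std::string::npos; }
    bool isHard(char c) const { return d_hard.find(c) != std::string::npos; }
    void skipSoft()
    {
        while (d_pos < d_input.size() && isSoft(d_input[d_pos])) ++d_pos;
    }

  public:
    TableCursor(const std::string& input,
                const std::string& soft,
                const std::string& hard)
    : d_input(input), d_soft(soft), d_hard(hard), d_pos(0), d_valid(true)
    {
        skipSoft();
        d_trailing = d_input.substr(0, d_pos);  // the leader
        advance();                              // load the first table row
    }

    // Move to the next table row.  Call only while 'isValid()' is true.
    void advance()
    {
        d_previous = d_trailing;
        d_token.clear();
        d_trailing.clear();
        d_valid = d_pos < d_input.size();
        if (!d_valid) return;                   // 'd_previous' stays readable

        const std::size_t tokenBegin = d_pos;
        while (d_pos < d_input.size() && !isSoft(d_input[d_pos])
                                      && !isHard(d_input[d_pos])) ++d_pos;
        d_token = d_input.substr(tokenBegin, d_pos - tokenBegin);

        const std::size_t delimBegin = d_pos;   // soft* [hard soft*]
        skipSoft();
        if (d_pos < d_input.size() && isHard(d_input[d_pos])) {
            ++d_pos;
            skipSoft();
        }
        d_trailing = d_input.substr(delimBegin, d_pos - delimBegin);
    }

    bool               isValid()           const { return d_valid; }
    const std::string& previousDelimiter() const { return d_previous; }
    const std::string& token()             const { return d_token; }
    const std::string& trailingDelimiter() const { return d_trailing; }
};
```

For the "##" entry, a `TableCursor("##", ".", "#")` starts valid with an empty previous delimiter and token and a trailing "#"; one `advance()` gives previous "#", empty token, trailing "#"; a second `advance()` leaves it invalid with the previous delimiter still "#".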

Usage

This section illustrates intended use of this component.

Example 1: Iterating Over Tokens Using Just Soft Delimiters

This example illustrates the process of splitting the input string into a sequence of tokens using just soft delimiters.

Suppose we have text in which words are separated by a variable number of spaces, and we want to remove all duplicated spaces.

First, we create an example character array:

const char text1[] = " This is a test.";

Then, we create a Tokenizer that uses " " (space) as a soft delimiter:

bdlb::Tokenizer tokenizer1(text1, " ");

Note that the tokenizer skips the leading soft delimiters upon initialization. Next, we iterate over the input character array and build the string without duplicated spaces:

bsl::string result1;
if (tokenizer1.isValid()) {
    result1 += tokenizer1.token();
    ++tokenizer1;
}
while (tokenizer1.isValid()) {
    result1 += " ";
    result1 += tokenizer1.token();
    ++tokenizer1;
}

Finally, we verify that the resulting string contains the expected result:

const bsl::string EXPECTED1("This is a test.");
assert(EXPECTED1 == result1);

Example 2: Iterating Over Tokens Using Just Hard Delimiters

This example illustrates the process of splitting the input string into a sequence of tokens using just hard delimiters.

Suppose we want to reformat a comma-separated-value file and insert the default value of 0 into missing columns.

First, we create an example CSV line:

const char text2[] = "Col1,Col2,Col3\n111,,133\n,222,\n311,322,\n";

Then, we create a Tokenizer that uses "," (comma) and "\n" (new-line) as hard delimiters:

bdlb::Tokenizer tokenizer2(text2, "", ",\n");

We use the trailingDelimiter accessor to insert the correct delimiter into the output string. Next, we iterate over the input line and insert the default value:

bsl::string result2;
while (tokenizer2.isValid()) {
    if (tokenizer2.token() != "") {
        result2 += tokenizer2.token();
    } else {
        result2 += "0";  // default value for a missing column
    }
    result2 += tokenizer2.trailingDelimiter();
    ++tokenizer2;
}

Finally, we verify that the resulting string contains the expected result:

const bsl::string EXPECTED2("Col1,Col2,Col3\n111,0,133\n0,222,0\n311,322,0\n");
assert(EXPECTED2 == result2);

Example 3: Iterating Over Tokens Using Both Hard and Soft Delimiters

This example illustrates the process of splitting the input string into a sequence of tokens using both soft and hard delimiters.

Suppose we want to extract the tokens from a file in which the fields are separated with a "$" (dollar-sign), but can have leading or trailing spaces.

First, we create an example line:

const char text3[] = " This $is $ a$ test. ";

Then, we create a Tokenizer that uses "$" (dollar-sign) as a hard delimiter and " " (space) as a soft delimiter:

bdlb::Tokenizer tokenizer3(text3, " ", "$");

In this example we are only extracting the tokens, so we can use the iterator provided by the tokenizer.

Next, we create an iterator and iterate over the input, extracting the tokens into the result string:

bsl::string result3;
bdlb::Tokenizer::iterator it3 = tokenizer3.begin();
if (it3 != tokenizer3.end()) {
    result3 += *it3;
    ++it3;  // advance only when there is a token to step past
}
while (it3 != tokenizer3.end()) {
    result3 += " ";
    result3 += *it3;
    ++it3;
}

Finally, we verify that the resulting string contains the expected result:

const bsl::string EXPECTED3("This is a test.");
assert(EXPECTED3 == result3);