BDE 4.14.0 Production release
Provide access to user-described tokens via string references.
This component defines a mechanism, 'bdlb::Tokenizer', that provides non-destructive sequential (read-only) access to tokens in a given input string, as characterized by two disjoint sets of user-specified delimiter characters, each of which is supplied at construction via either a 'const bsl::string_view&' or (for efficiency, when only the leading characters of the input string may need to be parsed) a 'const char *'. Note that each character (including '\0') that is not explicitly designated as a delimiter character is assumed to be a token character.
The tokenizer recognizes two distinct kinds of delimiter characters, soft and hard.
A soft delimiter is a maximal (non-empty) sequence of soft-delimiter characters. Soft delimiters, typically whitespace characters, are used to separate (rather than terminate) tokens, and thus never result in an empty token.
A hard delimiter is a maximal (non-empty) sequence of delimiter characters containing exactly one hard-delimiter character. Hard delimiters, typically printable punctuation characters such as slash ('/') or colon (':'), are used to terminate (rather than just separate) tokens, and thus a hard delimiter that is not preceded by a token character results in an empty token.
Soft delimiters are used in applications where multiple consecutive delimiter characters are to be treated as just a single delimiter. For example, if we want the input string "Sticks  and stones" to parse into a sequence of three non-empty tokens ["Sticks", "and", "stones"], rather than the four-token sequence ["Sticks", "", "and", "stones"], we would make the space (' ') a soft-delimiter character.
Hard delimiters are used in applications where consecutive delimiter characters are to be treated as separate delimiters, giving rise to the possibility of empty tokens. Making the slash ('/') in the standard date format a hard delimiter for the input string "15//9" yields the three-token sequence ["15", "", "9"], rather than the two-token sequence ["15", "9"] that would result had it been made soft.
All members within each respective character set are considered equivalent with respect to tokenization. For example, making '/' and ':' soft-delimiter characters on the questionably formatted date "2015/:10:/31" would yield the token sequence ["2015", "10", "31"], whereas making '/' and ':' hard-delimiter characters would result in the token sequence ["2015", "", "10", "", "31"]. Making either of these two delimiter characters hard and the other soft would, in this example, yield the former (shorter) sequence of tokens. The details of how soft and hard delimiters interact are illustrated in the following section (but also see, later on, the section "Comprehensive Detailed Parsing Specification").
Each input string consists of an optional leading sequence of soft-delimiter characters called the leader, followed by an alternating sequence of tokens and delimiters (the final delimiter being optional):
The tokenization of a string can also be expressed in pseudo-POSIX regular-expression notation:
Parsing is from left to right and is greedy – i.e., the longest sequence satisfying the regular expression is the one that matches. For example, let 's' represent the start of a soft delimiter, 'd' the start of a hard delimiter, '^' the start of a token, and '~' the continuation of that same delimiter or token. Using '.' as a soft delimiter and '/' as a hard one, the string
yields the tokenization
Notice that in the pair of hard delimiters "/./" before the token "sea", the soft-delimiter character between the two hard ones binds to the earlier delimiter.
This component provides two separate mechanisms by which a user may iterate over a sequence of tokens. The first mechanism is a token range, exposed by the 'TokenizerIterator' objects returned by the 'begin' and 'end' methods on a 'Tokenizer' object. A 'TokenizerIterator' supports the concept of a standard input iterator, returning each successive token as a 'bslstl::StringRef', making it suitable for generic use – e.g., in a range-based 'for' loop:
The 'parse_1' function above produces each (non-whitespace) token in the supplied input string on a separate line. So, were 'parse_1' to be given a reference to 'bsl::cout' and the input string
we would expect
to be displayed on 'bsl::cout'. Note that there is no way to access the delimiters from a 'TokenizerIterator' directly; for that, we will need to use the 'Tokenizer' itself as a non-standard "iterator".
The second mechanism, not intended for generic use, provides direct access to the previous and current (trailing) delimiters as well as the current token:
The 'parse_2' function above produces the leader on the first line, followed by each token along with its current (trailing) delimiter on successive lines. So, were 'parse_2' to be given a reference to 'bsl::cout' and the input string
we would expect
to be displayed on 'bsl::cout'.
All tokens and delimiters are returned efficiently by value as 'bslstl::StringRef' objects, which naturally remain valid so long as the underlying input string remains unchanged – irrespective of the validity of the 'Tokenizer' or of any of its dispensed token iterators. Note, however, that all such token iterators are invalidated if the parent tokenizer object is destroyed or reset. Note also that the previous delimiter field remains accessible from a 'Tokenizer' object even after it has reached the end of its input. Also note that the leader is accessible, using the 'previousDelimiter' method, prior to advancing the iteration state of the 'Tokenizer'.
This section provides a comprehensive (length-ordered) enumeration of how the 'bdlb::Tokenizer' performs, according to its three (non-null) character types:
Here's how iteration progresses for various input strings. Note that input strings having consecutive characters of the same category that naturally coalesce (i.e., behave as if they were a single character of that category) – namely soft-delimiter or token characters – are labeled with (%). For example, consider the input ".." at the top of the [length 2] section below. The table indicates, with a (%) in the first column, that the input acts the same as if it were a single (soft-delimiter) character (i.e., "."). There is only one line in this row of the table because, upon construction, the iterator is immediately invalid (as indicated by the right-most column).

Now consider the "##" entry near the bottom of [length 2]. These (hard-delimiter) characters do not coalesce. What's more, the iterator on construction is valid and produces an empty leader and an empty first token. After advancing the tokenizer, the second line of that row shows the current state of iteration, with the previous delimiter being a '#' as well as the current one. The current token is again shown as empty. After advancing the tokenizer again, we see that the iterator is invalid, yet the previous delimiter (still accessible) is a '#'.
This section illustrates intended use of this component.
This example illustrates the process of splitting the input string into a sequence of tokens using just soft delimiters.
Suppose we have some text in which words are separated by a variable number of spaces, and we want to remove all duplicated spaces.
First, we create an example character array:
Then, we create a 'Tokenizer' that uses " " (space) as a soft delimiter:
Note that the tokenizer skips the leading soft delimiters upon initialization. Next, we iterate over the input character array and build the string without duplicated spaces:
Finally, we verify that the resulting string contains the expected result:
This example illustrates the process of splitting the input string into a sequence of tokens using just hard delimiters.
Suppose we want to reformat a comma-separated-value file and insert the default value of '0' into missing columns.
First, we create an example CSV line:
Then, we create a 'Tokenizer' that uses "," (comma) and "\n" (newline) as hard delimiters:
We use the 'trailingDelimiter' accessor to insert the correct delimiter into the output string. Next, we iterate over the input line and insert the default value:
Finally, we verify that the resulting string contains the expected result:
This example illustrates the process of splitting the input string into a sequence of tokens using both soft and hard delimiters.
Suppose we want to extract the tokens from a file in which the fields are separated with a "$" (dollar sign) but can have leading or trailing spaces.
First, we create an example line:
Then, we create a 'Tokenizer' that uses "$" (dollar sign) as a hard delimiter and " " (space) as a soft delimiter:
In this example we are only extracting the tokens, so we can use the iterator provided by the tokenizer.
Next, we create an iterator and iterate over the input, extracting the tokens into the result string:
Finally, we verify that the resulting string contains the expected result: