Component bdlb_tokenizer
[Package bdlb]

Provide access to user-described tokens via string references.

Namespaces

namespace  bdlb

Detailed Description

Outline
Purpose:
Provide access to user-described tokens via string references.
Classes:
bdlb::Tokenizer lexer for tokens defined via hard and/or soft delimiters
bdlb::TokenizerIterator input iterator for delimited tokens in a string
See also:
Component bslstl_stringref
Description:
This component defines a mechanism, bdlb::Tokenizer, that provides non-destructive sequential (read-only) access to tokens in a given input string as characterized by two disjoint sets of user-specified delimiter characters, each of which is supplied at construction via either a const bsl::string_view& or (for efficiency, when only the leading characters of the input string may need to be parsed) a const char *. Note that each character (including \0) that is not explicitly designated as a delimiter character is assumed to be a token character.
Soft versus Hard Delimiters:
The tokenizer recognizes two distinct kinds of delimiter characters, soft and hard.
A soft delimiter is a maximal (non-empty) sequence of soft-delimiter characters. Soft delimiters, typically whitespace characters, are used to separate (rather than terminate) tokens, and thus never result in an empty token.
A hard delimiter is a maximal (non-empty) sequence of delimiter characters consisting of exactly one hard-delimiter character. Hard delimiters, typically printable punctuation characters such as slash (/) or colon (:), are used to terminate (rather than just separate) tokens, and thus a hard delimiter that is not preceded by a token character results in an empty token.
Soft delimiters are used in applications where multiple consecutive delimiter characters are to be treated as just a single delimiter. For example, if we want the input string "Sticks and stones" to parse into a sequence of three non-empty tokens ["Sticks", "and", "stones"], rather than the four-token sequence ["Sticks", "", "and", "stones"], we would make the space (' ') a soft-delimiter character.
Hard delimiters are used in applications where consecutive delimiter characters are to be treated as separate delimiters, giving rise to the possibility of empty tokens. Making the slash (/) in the standard date format a hard delimiter for the input string "15//9" yields the three-token sequence ["15", "", "9"], rather than the two-token one ["15", "9"] had it been made soft.
All members within each respective character set are considered equivalent with respect to tokenization. For example, making / and : soft delimiter characters on the questionably formatted date "2015/:10:/31" would yield the token sequence ["2015", "10", "31"], whereas making / and : hard delimiter characters would result in the token sequence ["2015", "", "10", "", "31"]. Making either of these two delimiter characters hard and the other soft would, in this example, yield the former (shorter) sequence of tokens. How soft and hard delimiters interact is illustrated in more detail in the following section (but also see, later on, the section on "Comprehensive Detailed Parsing Specification").
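As a rough illustration of the soft/hard distinction, the following minimal sketch re-implements the token-splitting rule in standard C++. It is not the bdlb::Tokenizer source: the function name splitTokens and its signature are hypothetical, and it returns copies of the tokens rather than string references.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Illustrative sketch only (NOT the bdlb::Tokenizer implementation): return
// the tokens of 'input', given the 'soft' and 'hard' delimiter characters.
std::vector<std::string> splitTokens(const std::string& input,
                                     const std::string& soft,
                                     const std::string& hard)
{
    auto isSoft = [&](char c) { return soft.find(c) != std::string::npos; };
    auto isHard = [&](char c) { return hard.find(c) != std::string::npos; };

    std::vector<std::string> tokens;
    const std::size_t        n = input.size();
    std::size_t              i = 0;

    while (i < n && isSoft(input[i])) ++i;          // skip the leader

    while (i < n) {
        const std::size_t start = i;                // token (possibly empty)
        while (i < n && !isSoft(input[i]) && !isHard(input[i])) ++i;
        tokens.push_back(input.substr(start, i - start));

        while (i < n && isSoft(input[i])) ++i;      // soft run
        if (i < n && isHard(input[i])) {            // at most one hard char
            ++i;
            while (i < n && isSoft(input[i])) ++i;  // soft run after it
        }
    }
    return tokens;
}
```

Under this sketch, splitTokens("2015/:10:/31", "/:", "") gives the three tokens 2015, 10, and 31, whereas splitTokens("2015/:10:/31", "", "/:") gives five tokens, two of which are empty.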
The Input String to be Tokenized:
Each input string consists of an optional leading sequence of soft-delimiter characters called the leader, followed by an alternating sequence of tokens and delimiters (the final delimiter being optional):
  Input String:
  +--------+---------+-------------+---...---+---------+-------------+
  | leader | token_1 | delimiter_1 |         | token_N | delimiter_N |
  +--------+---------+-------------+---...---+---------+-------------+
  (optional)                                              (optional)
The tokenization of a string can also be expressed in pseudo-POSIX regular expression notation:
   delimiter = [[:soft:]]+ | [[:soft:]]* [[:hard:]] [[:soft:]]*
   token     = [^[:soft:][:hard:]]*
   string    = [[:soft:]]* (token delimiter)* token?
Parsing is from left to right and is greedy -- i.e., the longest sequence satisfying the regular expression is the one that matches. For example, let s represent the start of a soft delimiter, h the start of a hard delimiter, ^ the start of a token, and ~ the continuation of that same delimiter or token. Using "." as a soft delimiter and "/" as a hard one, the string
         s~   h~  h~~  h~     s~   hh  s  h~h    h~~~        Delimiters

        "..One/.if./.by./land,..two//if.by/./sea!./.."

           ^~~  ^~   ^~  ^~~~~  ^~~ ^^~ ^~  ^^~~~            Tokens
                                    |       |
                                 (empty)  (empty)
yields the tokenization
     [One]  [if]   [by]  [land,]  [two] [] [if] [by]  [] [sea!]      Tokens

  (..)   (/.)  (./.)  (./)     (..)   (/)(/)  (.)  (/.)(/)   (./..)  Delims
Notice that in the pair of hard delimiters "/./" before the token "sea!", the soft-delimiter character between the two hard ones binds to the earlier delimiter.
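The grammar and the worked example above can be checked with a small stand-alone parser. The sketch below is an assumption-laden illustration in standard C++, not the library's implementation: the Parsed struct and the parseAll function are hypothetical names, and strings are copied rather than referenced.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical result type: the leader plus (token, delimiter) pairs.
struct Parsed {
    std::string                                      leader;
    std::vector<std::pair<std::string, std::string>> pairs;
};

// Illustrative sketch only (NOT the bdlb::Tokenizer implementation).
Parsed parseAll(const std::string& input,
                const std::string& soft,
                const std::string& hard)
{
    const std::size_t n = input.size();
    std::size_t       i = 0;

    auto isSoft = [&](std::size_t k) {
        return k < n && soft.find(input[k]) != std::string::npos;
    };
    auto isHard = [&](std::size_t k) {
        return k < n && hard.find(input[k]) != std::string::npos;
    };

    Parsed result;

    while (isSoft(i)) ++i;                   // leader = [[:soft:]]*
    result.leader = input.substr(0, i);

    while (i < n) {
        // token = [^[:soft:][:hard:]]*
        const std::size_t tokenBegin = i;
        while (i < n && !isSoft(i) && !isHard(i)) ++i;
        const std::size_t delimBegin = i;

        // delimiter = [[:soft:]]+ | [[:soft:]]* [[:hard:]] [[:soft:]]*
        while (isSoft(i)) ++i;
        if (isHard(i)) {
            ++i;
            while (isSoft(i)) ++i;           // greedy: trailing soft run
        }
        result.pairs.emplace_back(
                          input.substr(tokenBegin, delimBegin - tokenBegin),
                          input.substr(delimBegin, i - delimBegin));
    }
    return result;
}
```

Because the trailing soft run is consumed greedily, the "." in "/./" is attached to the first "/" and the following delimiter is the bare "/".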
Iterating using a TokenizerIterator object (ACCESS TO TOKENS ONLY):
This component provides two separate mechanisms by which a user may iterate over a sequence of tokens. The first mechanism is as a token range, exposed by the TokenizerIterator objects returned by the begin and end methods on a Tokenizer object. A TokenizerIterator supports the concept of a standard input iterator, returning each successive token as a bslstl::StringRef, making it suitable for generic use -- e.g., in a range-based for loop:
  void parse_1(bsl::ostream& output, const char *input)
      // Print, to the specified 'output' stream, each whitespace-delimited
      // token in the specified 'input' string on a separate line following
      // a vertical bar ('|') and a hard space (' ').
  {
      const char softDelimiters[] = " \t\n";  // whitespace

      for (bslstl::StringRef token : bdlb::Tokenizer(input, softDelimiters)) {
          output << "| " << token << bsl::endl;
      }
  }
The parse_1 function above produces each (non-whitespace) token in the supplied input string on a separate line. So, were parse_1 to be given a reference to bsl::cout and the input string
  " Times  like \tthese\n  try \n \t men's\t \tsouls.\n"
we would expect
  | Times
  | like
  | these
  | try
  | men's
  | souls.
to be displayed on bsl::cout. Note that there is no way to access the delimiters from a TokenizerIterator directly; for that, we need to use the tokenizer itself as a non-standard "iterator".
Iterating using a Tokenizer object (ACCESS TO TOKENS AND DELIMITERS):
The second mechanism, not intended for generic use, provides direct access to the previous and current (trailing) delimiters as well as the current token:
  void parse_2(bsl::ostream& output, const char *input)
      // Print, to the specified 'output' stream, the leader of the specified
      // 'input', on a single line, followed by subsequent current token and
      // (trailing) delimiter pairs on successive lines, each line beginning
      // with a vertical bar ('|') followed by a tab ('\t') character.
  {
      const char softDelimiters[] = " ";
      const char hardDelimiters[] = ":/";

      bdlb::Tokenizer it(input, softDelimiters, hardDelimiters);
      output << "|\t" << '"' << it.previousDelimiter() << '"' << "\n";

      for (; it.isValid(); ++it) {
          output << "|\t"
                 << '"' << it.token() << '"'
                 << "\t"
                 << '"' << it.trailingDelimiter() << '"'
                 << "\n";
      }
  }
The parse_2 function above produces the leader on the first line, followed by each token along with its current (trailing) delimiter on successive lines. So, were parse_2 to be given a reference to bsl::cout and the input string
  " I've :been: a : :bad:/ boy! / "
we would expect
  |       " "
  |       "I've"  " :"
  |       "been"  ": "
  |       "a"     " : "
  |       ""      ":"
  |       "bad"   ":"
  |       ""      "/ "
  |       "boy!"  " / "
to be displayed on bsl::cout.
Token and Delimiter Lifetimes:
All tokens and delimiters are returned efficiently by value as bslstl::StringRef objects, which naturally remain valid so long as the underlying input string remains unchanged -- irrespective of the validity of the tokenizer or any of its dispensed token iterators. Note, however, that all such token iterators are invalidated if the parent tokenizer object is destroyed or reset. Note also that the previous delimiter field remains accessible from a tokenizer object even after it has reached the end of its input. Finally, note that the leader is accessible via the previousDelimiter method prior to advancing the iteration state of the Tokenizer.
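The lifetime rule can be illustrated with any non-owning string reference. The sketch below uses std::string_view as a stand-in for bslstl::StringRef (an assumption made so the example needs only the standard library); firstToken and lifetimeDemo are hypothetical helpers, not BDE functions.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper (not a BDE function): return a view of the first
// token in 'input', where 'soft' lists the soft-delimiter characters.
std::string_view firstToken(std::string_view input, std::string_view soft)
{
    const std::size_t begin = input.find_first_not_of(soft);
    if (begin == std::string_view::npos) {
        return std::string_view();           // only a leader, no token
    }
    const std::size_t end = input.find_first_of(soft, begin);
    return input.substr(begin,
                        end == std::string_view::npos ? end : end - begin);
}

// Show that a returned view aliases the input buffer instead of copying it.
void lifetimeDemo()
{
    std::string      buffer = "  alpha beta";
    std::string_view token  = firstToken(buffer, " ");

    assert(token == "alpha");
    assert(token.data() == buffer.data() + 2);   // points into 'buffer'

    std::string owned(token);   // an explicit copy is independent
    buffer[2] = 'A';            // changing the input changes what is viewed

    assert(token == "Alpha");
    assert(owned == "alpha");
}
```

The view needs no parser object to stay usable, but it does reflect any later change to the underlying characters, which is why the input string must remain unchanged while tokens are in use.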
Comprehensive Detailed Parsing Specification:
This section provides a comprehensive (length-ordered) enumeration of how the bdlb::Tokenizer performs, according to its three (non-null) character types:
  '.' = any *soft* delimiter character
  '#' = any *hard* delimiter character
  'T' = any token character
Here's how iteration progresses for various input strings. Note that input strings having consecutive characters of the same category that naturally coalesce (i.e., behave as if they were a single character of that category) -- namely soft-delimiter or token characters -- are labeled with (%). For example, consider the input ".." at the top of the [length 2] section below. The table indicates, with a (%) in the first column, that the input acts the same as if it were a single (soft-delimiter) character (i.e., "."). There is only one line in this row of the table because, upon construction, the iterator is immediately invalid (as indicated by the right-most column). Now consider the "##" entry in [length 2]. These (hard-delimiter) characters do not coalesce. What's more, the iterator on construction is valid and produces an empty leader and an empty first token. After advancing the tokenizer, the second line of that row shows the current state of iteration, with the previous delimiter being a # as well as the current one. The current token is again shown as empty. After advancing the tokenizer again, we see that the iterator is invalid, yet the previous delimiter (still accessible) is a #.
  (%) = repeat   Previous   Current  Current    Iterator
  Input String   Delimiter  Token    Delimiter  Status
  ============   =========  =======  =========  ========  [length 0]
  ""             ""         na       na         invalid

  ============   =========  =======  =========  ========  [length 1]
  "."            "."        na       na         invalid
  ------------   ---------  -------  ---------  --------
  "#"            ""         ""       "#"        valid
                 "#"        na       na         invalid
  ------------   ---------  -------  ---------  --------
  "T"            ""         "T"      ""         valid
                 ""         na       na         invalid

  ============   =========  =======  =========  ========  [length 2]
  ".."     (%)   ".."       na       na         invalid
  ------------   ---------  -------  ---------  --------
  ".#"           "."        ""       "#"        valid
                 "#"        na       na         invalid
  ------------   ---------  -------  ---------  --------
  ".T"           "."        "T"      ""         valid
                 ""         na       na         invalid

  ------------   ---------  -------  ---------  --------
  "#."           ""         ""       "#."       valid
                 "#."       na       na         invalid
  ------------   ---------  -------  ---------  --------
  "##"           ""         ""       "#"        valid
                 "#"        ""       "#"        valid
                 "#"        na       na         invalid
  ------------   ---------  -------  ---------  --------
  "#T"           ""         ""       "#"        valid
                 "#"        "T"      ""         valid
                 ""         na       na         invalid

  ------------   ---------  -------  ---------  --------
  "T."           ""         "T"      "."        valid
                 "."        na       na         invalid
  ------------   ---------  -------  ---------  --------
  "T#"           ""         "T"      "#"        valid
                 "#"        na       na         invalid
  ------------   ---------  -------  ---------  --------
  "TT"     (%)   ""         "TT"     ""         valid
                 ""         na       na         invalid

  ============   =========  =======  =========  ========  [length 3]
  "..."    (%)   "..."      na       na         invalid
  ------------   ---------  -------  ---------  --------
  "..#"    (%)   ".."       ""       "#"        valid
                 "#"        na       na         invalid
  ------------   ---------  -------  ---------  --------
  "..T"    (%)   ".."       "T"      ""         valid
                 ""         na       na         invalid
  ------------   ---------  -------  ---------  --------
  ".#."          "."        ""       "#."       valid
                 "#."       na       na         invalid
  ------------   ---------  -------  ---------  --------
  ".##"          "."        ""       "#"        valid
                 "#"        ""       "#"        valid
                 "#"        na       na         invalid
  ------------   ---------  -------  ---------  --------
  ".#T"          "."        ""       "#"        valid
                 "#"        "T"      ""         valid
                 ""         na       na         invalid
  ------------   ---------  -------  ---------  --------
  ".T."          "."        "T"      "."        valid
                 "."        na       na         invalid
  ------------   ---------  -------  ---------  --------
  ".T#"          "."        "T"      "#"        valid
                 "#"        na       na         invalid
  ------------   ---------  -------  ---------  --------
  ".TT"    (%)   "."        "TT"     ""         valid
                 ""         na       na         invalid

  ------------   ---------  -------  ---------  --------
  "#.."    (%)   ""         ""       "#.."      valid
                 "#.."      na       na         invalid
  ------------   ---------  -------  ---------  --------
  "#.#"          ""         ""       "#."       valid
                 "#."       ""       "#"        valid
                 "#"        na       na         invalid
  ------------   ---------  -------  ---------  --------
  "#.T"          ""         ""       "#."       valid
                 "#."       "T"      ""         valid
                 ""         na       na         invalid
  ------------   ---------  -------  ---------  --------
  "##."          ""         ""       "#"        valid
                 "#"        ""       "#."       valid
                 "#."       na       na         invalid
  ------------   ---------  -------  ---------  --------
  "###"          ""         ""       "#"        valid
                 "#"        ""       "#"        valid
                 "#"        ""       "#"        valid
                 "#"        na       na         invalid
  ------------   ---------  -------  ---------  --------
  "##T"          ""         ""       "#"        valid
                 "#"        ""       "#"        valid
                 "#"        "T"      ""         valid
                 ""         na       na         invalid
  ------------   ---------  -------  ---------  --------
  "#T."          ""         ""       "#"        valid
                 "#"        "T"      "."        valid
                 "."        na       na         invalid
  ------------   ---------  -------  ---------  --------
  "#T#"          ""         ""       "#"        valid
                 "#"        "T"      "#"        valid
                 "#"        na       na         invalid
  ------------   ---------  -------  ---------  --------
  "#TT"    (%)   ""         ""       "#"        valid
                 "#"        "TT"     ""         valid
                 ""         na       na         invalid

  ------------   ---------  -------  ---------  --------
  "T.."    (%)   ""         "T"      ".."       valid
                 ".."       na       na         invalid
  ------------   ---------  -------  ---------  --------
  "T.#"          ""         "T"      ".#"       valid
                 ".#"       na       na         invalid
  ------------   ---------  -------  ---------  --------
  "T.T"          ""         "T"      "."        valid
                 "."        "T"      ""         valid
                 ""         na       na         invalid
  ------------   ---------  -------  ---------  --------
  "T#."          ""         "T"      "#."       valid
                 "#."       na       na         invalid
  ------------   ---------  -------  ---------  --------
  "T##"          ""         "T"      "#"        valid
                 "#"        ""       "#"        valid
                 "#"        na       na         invalid
  ------------   ---------  -------  ---------  --------
  "T#T"          ""         "T"      "#"        valid
                 "#"        "T"      ""         valid
                 ""         na       na         invalid
  ------------   ---------  -------  ---------  --------
  "TT."    (%)   ""         "TT"     "."        valid
                 "."        na       na         invalid
  ------------   ---------  -------  ---------  --------
  "TT#"    (%)   ""         "TT"     "#"        valid
                 "#"        na       na         invalid
  ------------   ---------  -------  ---------  --------
  "TTT"    (%)   ""         "TTT"    ""         valid
                 ""         na       na         invalid
  ------------   ---------  -------  ---------  --------
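Individual rows of the table can be reproduced with a small stateful model of the iteration. The class below is a hypothetical stand-in written in standard C++: it borrows the accessor names used in this document (isValid, previousDelimiter, token, trailingDelimiter) but is not the BDE implementation, it copies strings instead of returning references into the input, and it defaults the delimiter sets to the table's notation.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical model of the iteration state; NOT the BDE implementation.
// By default '.' is soft, '#' is hard, and anything else is a token char.
class MiniTokenizer {
    std::string d_input, d_soft, d_hard;
    std::size_t d_pos = 0;                  // first unparsed position
    std::string d_prev, d_token, d_delim;   // current iteration state
    bool        d_valid = false;

    bool in(const std::string& set, std::size_t k) const
    {
        return k < d_input.size()
            && set.find(d_input[k]) != std::string::npos;
    }

    void skipSoft() { while (in(d_soft, d_pos)) ++d_pos; }

    void advance()  // load the next (token, delimiter) pair, if any
    {
        d_token.clear();
        d_delim.clear();
        d_valid = d_pos < d_input.size();
        if (!d_valid) return;

        const std::size_t tokenBegin = d_pos;
        while (d_pos < d_input.size()
               && !in(d_soft, d_pos) && !in(d_hard, d_pos)) ++d_pos;
        const std::size_t delimBegin = d_pos;

        skipSoft();                          // [[:soft:]]*
        if (in(d_hard, d_pos)) {             // at most one hard character
            ++d_pos;
            skipSoft();                      // [[:soft:]]*
        }
        d_token = d_input.substr(tokenBegin, delimBegin - tokenBegin);
        d_delim = d_input.substr(delimBegin, d_pos - delimBegin);
    }

  public:
    explicit MiniTokenizer(const std::string& input,
                           const std::string& soft = ".",
                           const std::string& hard = "#")
    : d_input(input), d_soft(soft), d_hard(hard)
    {
        skipSoft();                          // the leader is soft-only
        d_prev = d_input.substr(0, d_pos);
        advance();
    }

    MiniTokenizer& operator++()              // precondition: isValid()
    {
        d_prev = d_delim;
        advance();
        return *this;
    }

    bool isValid() const { return d_valid; }
    const std::string& previousDelimiter() const { return d_prev; }
    const std::string& token() const { return d_token; }
    const std::string& trailingDelimiter() const { return d_delim; }
};
```

Constructing MiniTokenizer("T#T") and advancing it twice walks through the three lines of the "T#T" row: a valid state with token "T" and delimiter "#", a valid state with previous delimiter "#" and token "T", and finally an invalid state whose previous delimiter is empty.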
Usage:
This section illustrates intended use of this component.
Example 1: Iterating Over Tokens Using Just Soft Delimiters:
This example illustrates the process of splitting the input string into a sequence of tokens using just soft delimiters.
Suppose we have text in which words are separated by a variable number of spaces, and we want to remove all duplicated spaces.
First, we create an example character array:
  const char text1[] = "   This  is    a test.";
Then, we create a Tokenizer that uses " " (space) as a soft delimiter:
  bdlb::Tokenizer tokenizer1(text1, " ");
Note that the tokenizer skips the leading soft delimiters upon initialization. Next, we iterate over the input character array and build the string without duplicated spaces:
  bsl::string result1;
  if (tokenizer1.isValid()) {
      result1 += tokenizer1.token();
      ++tokenizer1;
  }
  while (tokenizer1.isValid()) {
      result1 += " ";
      result1 += tokenizer1.token();
      ++tokenizer1;
  }
Finally, we verify that the resulting string contains the expected result:
  const bsl::string EXPECTED1("This is a test.");
  assert(EXPECTED1 == result1);
Example 2: Iterating Over Tokens Using Just Hard Delimiters:
This example illustrates the process of splitting the input string into a sequence of tokens using just hard delimiters.
Suppose we want to reformat a comma-separated-value file and insert the default value of 0 into missing columns.
First, we create an example CSV line:
  const char text2[] = "Col1,Col2,Col3\n111,,133\n,222,\n311,322,\n";
Then, we create a Tokenizer that uses "," (comma) and "\n" (new-line) as hard delimiters:
  bdlb::Tokenizer tokenizer2(text2, "", ",\n");
We use the trailingDelimiter accessor to insert the correct delimiter into the output string. Next, we iterate over the input line and insert the default value:
  bsl::string result2;
  while (tokenizer2.isValid()) {
      if (tokenizer2.token() != "") {
          result2 += tokenizer2.token();
      } else {
          result2 += "0";
      }
      result2 += tokenizer2.trailingDelimiter();
      ++tokenizer2;
  }
Finally, we verify that the resulting string contains the expected result:
  const bsl::string EXPECTED2("Col1,Col2,Col3\n111,0,133\n0,222,0\n311,322,0\n");
  assert(EXPECTED2 == result2);
Example 3: Iterating Over Tokens Using Both Hard and Soft Delimiters:
This example illustrates the process of splitting the input string into a sequence of tokens using both soft and hard delimiters.
Suppose we want to extract the tokens from a file in which the fields are separated by a "$" (dollar sign) but can have leading or trailing spaces.
First, we create an example line:
  const char text3[] = " This $is    $   a$ test.      ";
Then, we create a Tokenizer that uses "$" (dollar sign) as a hard delimiter and " " (space) as a soft delimiter:
  bdlb::Tokenizer tokenizer3(text3, " ", "$");
In this example we are only extracting the tokens, so we can use the iterator provided by the tokenizer.
Next, we create an iterator and iterate over the input, extracting the tokens into the result string:
  bsl::string result3;

  bdlb::Tokenizer::iterator it3 = tokenizer3.begin();

  if (it3 != tokenizer3.end()) {
      result3 += *it3;
      ++it3;  // advance only when there is a token to step past
  }

  while (it3 != tokenizer3.end()) {
      result3 += " ";
      result3 += *it3;
      ++it3;
  }
Finally, we verify that the resulting string contains the expected result:
  const bsl::string EXPECTED3("This is a test.");
  assert(EXPECTED3 == result3);