Quick Links:

bal | bbl | bdl | bsl

Namespaces

Component bslstl_boyermoorehorspoolsearcher
[Package bslstl]

Provide an STL-compliant boyer_moore_horspool_searcher class. More...

Namespaces

namespace  bslstl
namespace  bsl

Detailed Description

Outline
Purpose:
Provide an STL-compliant boyer_moore_horspool_searcher class.
Classes:
bsl::boyer_moore_horspool_searcher class template to search via BMH
bslstl::BoyerMooreHorspoolSearcher class template to search via BMH
Canonical Header:
bsl_functional.h
See also:
Component bslstl_defaultsearcher
Description:
This component defines two class templates, bsl::boyer_moore_horspool_searcher and bslstl::BoyerMooreHorspoolSearcher. Both are compliant with section [func.search.bmh] of the C++ Standard (C++17 and later).
bsl::boyer_moore_horspool_searcher is strictly limited to the Standard and is provided for clients for whom standard compliance is a priority. bslslt::BoyerMoreHorspoolSearcher provides several accessors that are not mentioned in the Standard. Moreover, bslslt::BoyerMoreHorspoolSearcher is "plumbed" for BDE allocators and can be used with BDE standard containers whereas the compliant strict class always uses the currently installed default allocator. See Example 4 below.
Except where there is a relevant difference, both are described below as if they were one.
This class has several template parameters:
RNDACC_ITR_NEEDLE:
The type used to specify (on construction) the range of values being sought (the "needle").
HASH:
The functor type used to hash metadata for the unique values of the needle. See Requirements for HASH and EQUAL.
EQUAL:
The functor type used to compare values when storing/accessing needle metadata. See Requirements for HASH and EQUAL.
The class also provides a functor-style interface that accepts two iterators that define the range of values to be searched (the "haystack"). Once constructed, a single searcher object can be re-used to search multiple haystacks (for the same needle value).
The iterators defining the haystack need not be of the same type as those that define the needle. Moreover, the search method of the searcher can be overloaded for an arbitrary number of different haystack iterators (subject to Iterator Requirements).
Algorithm:
The search algorithm used is an implementation of the well-known Boyer, Moore, Horspool (BMH) Algorithm for string matching (see https://en.wikipedia.org/wiki/Boyer%E2%80%93Moore%E2%80%93Horspool_algorithm). In the typical case, this algorithm offers complexity of O(N) for a haystack of length N.
Iterator Requirements:
The two independent iterators types associated with this class -- one defining the needle, the other defining the haystack -- must meet the requirements of RandomAccessIterator.
Additionally:
  • The iterators can be constant.
  • When dereferenced, both iterator types must refer to the same value type.
The operations of either of the iterator types are allowed to throw exceptions.
Iterators defining needles are required to remain valid as long as the searcher object might be used.
Requirements for HASH and EQUAL:
The comparer class, EQUAL, must meet the requirements of BinaryPredicate:
  • The class defines an operator() method that, given a *RandomAccessIterator', iterator, can be invoked as operator()(*iterator, *iterator).
  • The return value must be contextually convertible to bool.
  • The supplied iterators can be constant.
  • The class must be copyable.
Comparer classes are allowed to throw exceptions.
The behavior is undefined unless two values that are deemed equal by the EQUAL functor generate the same value by the HASH functor. That is:
  true == searcher.equal()(a, b);
implies:
  searcher.hash()(a) == searcher.hash()(b);
Optimizations for bslstl::BoyerMooreHorspoolSearcher:
This implementation handles needle metadata using a fixed size array when the value_type is char (either signed or unsigned flavors). For needles of typical size, this choice results in somewhat more memory use than it would have if some dynamically sized container were used; however, the faster access during searches warrants the tradeoff.
Usage:
In this section we show the intended usage of this component.
Example 1: Basic Usage:
The problem of searching a sequence of characters for a particular sub-sequence arises in many applications. For example, one might need to know if some document contains a particular word of interest. For example, suppose we would like to know the first occurrence of the word "United" in the Declaration of Independence (of the United States):
First, we obtain the text of document and word of interest as sequences of char values.
  const char document[] =
  " IN CONGRESS, July 4, 1776.\n"                // 28
  "\n"                                           //  1
  " The unanimous Declaration of the thirteen united States of America,\n"
 //----^----|----^----|----^----|----^----|----  // 44
 //                                              // --
 //                                              // 73rd character
 //
  "\n"
  // ...
  " declare, That these United Colonies are, and of Right ought to be\n"
  // ...
  "Honor.";

  const char *word = "United";
Then, we create a default_searcher object (a functor) using the given word:
  bsl::boyer_moore_horspool_searcher<const char *> searchForUnited(
                                                          word,
                                                          word
                                                        + bsl::strlen(word));
Notice that no equality comparison functor was specified so searchForUnited will use bsl::equal_to<char> by default.
Now, we invoke our functor, specifying the range of the document to be searched:
  bsl::pair<const char *, const char *> result = searchForUnited(
                                                            document,
                                                            document
                                                          + sizeof document);

  bsl::size_t offset = result.first - document;

  assert(120 == offset);
  assert(static_cast<bsl::size_t>(result.second - result.first)
             == bsl::strlen(word));
Finally, we notice that the search correctly ignored the appearance of the word "united" (all lower case) in the second sentence.
bslstl_defaultsearcher|Example 1 shows how the same problem is addressed using bsl::default_searcher.
Example 2: Defining a Comparator and Hash:
As seen in Example 1 above, the default equality comparison functor is case sensitive. If one needs case in-sensitive searches, a non-default equality comparison class and a non-default hash functor must be specified.
First, define (at file scope if using a pre-C++11 compiler) an equality comparison class that provides the required functor interface:
  struct MyCaseInsensitiveCharComparer {
      bool operator()(const char& a, const char& b) const {
          return bsl::tolower(a) == bsl::tolower(b);
      }
  };
Then, define (again at file scope, if pre-C++11), a hash functor so that two values, irrespective of their case, hash to the same value:
  struct MyCaseInsensitiveCharHasher {
      bool operator()(const char& value) const {
          static bsl::hash<char> s_hash;
          return s_hash(static_cast<char>(bsl::tolower(value)));
      }
  };
Now, specify bsl::boyer_moore_horspool_searcher type for and create a searcher object to search for word:
  bsl::boyer_moore_horspool_searcher<const char *,
                                     MyCaseInsensitiveCharHasher,
                                     MyCaseInsensitiveCharComparer>
                                                  searchForUnitedInsensitive(
                                                          word,
                                                          word
                                                        + bsl::strlen(word));
Note that the new searcher object will use defaulted constructed MyCaseInsensitiveCharHasher and MyCaseInsensitiveCharComparer classes. If stateful functors are required such objects can be passed in the optional constructor arguments.
Now, we invoke our searcher functor, specifying that the same document searched in Example 1:
  bsl::pair<const char *, const char *> resultInsensitive =
                                                  searchForUnitedInsensitive(
                                                            document,
                                                            document
                                                          + sizeof document);

  bsl::size_t offsetInsensitive = resultInsensitive.first - document;

  assert( 72 == offsetInsensitive);
  assert(static_cast<bsl::size_t>(resultInsensitive.second
                                - resultInsensitive.first)
             == bsl::strlen(word));
Finally, we find the next occurrence of word by reusing the same searcher object, this time instructing it to begin its search just after the previous occurrence of word was found:
  resultInsensitive = searchForUnitedInsensitive(resultInsensitive.second,
                                                 document + sizeof document);

  offsetInsensitive = resultInsensitive.first - document;

  assert(120 == offsetInsensitive);
  assert(static_cast<bsl::size_t>(resultInsensitive.second
                                - resultInsensitive.first)
             == bsl::strlen(word));
bslstl_defaultsearcher|Example 2 shows how the same problem is addressed using bsl::default_searcher.
Example 3: Non-'char' Searches:
The BMH searcher class template is not constrained to searching for char values. Searches can be done on other types (see Iterator Requirements). Moreover the container of the sequence being sought (the "needle") need not the same as the sequence being searched (the "haystack").
Suppose one has data from an instrument that reports float values and that inserts the sequence { FLT_MAX, FLT_MIN, FLT_MAX } as a marker for the start and end of a test run. We can assume the probably of the instrument reporting this sequence as readings is negligible and that data reported outside of the test runs is random noise. Here is how we can search for the first test run data in the data sequence.
First, we create a representation of the sequence that denotes the limit of a test run.
  const float       markerSequence[]     = { FLT_MAX , FLT_MIN , FLT_MAX };
  const bsl::size_t markerSequenceLength = sizeof  markerSequence
                                         / sizeof *markerSequence;
Next, we obtain the data to be searched. (In this example, we will use simulated data.)
  bsl::vector<float> data;  // Container provides random access iterators.
  doTestRun(&data);
Then, we define and create our searcher object:
  bsl::boyer_moore_horspool_searcher<const float *>
                                       searchForMarker(markerSequence,
                                                       markerSequence
                                                     + markerSequenceLength);
Notice that no equality comparison functor was specified so searchForMarker will use bsl::equal_to<float> by default.
Now, we invoke our searcher on the instrument data.
  typedef bsl::vector<float>::const_iterator DataConstItr;

  const bsl::pair<DataConstItr, DataConstItr> notFound(data.cend(),
                                                       data.cend());

  bsl::pair<DataConstItr, DataConstItr> markerPosition = searchForMarker(
                                                               data.cbegin(),
                                                               data.cend());

  assert(notFound != markerPosition);

  DataConstItr startOfTestRun = markerPosition.second;
Finally, we locate the marker of the end of the first test run and pass the location of the first test run data to some other function for processing.
  markerPosition = searchForMarker(markerPosition.second, data.cend());

  assert(notFound != markerPosition);

  DataConstItr endOfTestRun = markerPosition.first;

  processTestRun(startOfTestRun, endOfTestRun);
bslstl_defaultsearcher|Example 3 shows how the same problem is addressed using bsl::default_searcher. Notice that other example uses data from a container that provides bidirectional iterators (and forward iterators would have sufficed), whereas here random access iterators are required.
Example 4: Caching Searcher Objects:
The construction of bsl::boyer_moore_horspool_searcher objects is small (the needle must be scanned, meta-data calculated, and results saved) but can be non-neglibible when one needs a great number of them. When there is a reasonable chance that one will have to repeat a given search, it can be worthwhile to cache the searcher objects for reuse.
The Problem:
Suppose we have a long list of names, each consisting of a given name (first name) and a surname (last name), and that we wish to identify instances of reduplication of the given name in the surname. That is, we want to identify the cases where the given name is a case-insensitive substring of the surname. Examples include "Durand Durand", "James Jameson", "John St. John", and "Jean Valjean". In this example we will not accept nicknames and other approximate name forms as matches (e.g., "Joe Joseph", "Jack Johnson").
Since we want to perform our task as efficiently as possible, and since we expect many entries to have common given names (e.g., "John"), we decide to create a cache of searcher objects for those first names so they need not be reconstructed for each search of a surname.
Design Choices:
To implement our cache we will use a bsl::unordered_map container. Allocating types must meet certain requirements to work properly with allocator-enabled containers such as bsl::unordered_map. bsl::boyer_moore_horspool_searcher does not, so we will use bslstl::BoyerMooreHorspoolSearcher, which does.
To clarify exposition, our cache will have the simple policy of retaining searcher objects indefinitely and ignore the real-world concern that our cache may grow so large that the search time exceeds construction time. Also, we will forgo techniques that might minimize the number of times data is copied.
Steps:
First, we define our cache class:
                      // ====================================
                      // class MyCaseInsensitiveSearcherCache
                      // ====================================

  class MyCaseInsensitiveSearcherCache {

      // TYPES
    public:
      typedef bslstl::BoyerMooreHorspoolSearcher<
                                               bsl::string::const_iterator,
                                               MyCaseInsensitiveCharHasher,
                                               MyCaseInsensitiveCharComparer>
                                                                    Searcher;
      // PRIVATE TYPES
    private:
      typedef bsl::unordered_map<bsl::string, Searcher> Map;

      // DATA
      Map d_map;

      // PRIVATE MANIPULATORS
      const Searcher& insertSearcher(const bsl::string& key);
          // Insert into this cache a key-value pair where the key is the
          // specified 'key' and the value is a 'Searcher' object created to
          // seek the needle specified by the key part.  Note that this
          // arrangement guarantees that the iterators used by this cached
          // searcher object remain valid for the life of the searcher
          // object.

    public:
      // CREATORS
      explicit MyCaseInsensitiveSearcherCache(bslma::Allocator
                                                        *basicAllocator = 0);
          // Create an empty 'MyCaseInsensitiveSearcherCache' object.
          // Optionally specify a 'basicAllocator' used to supply memory.  If
          // 'basicAllocator' is 0, the currently installed default
          // allocator is used.

      // MANIPULATORS
      const Searcher& getSearcher(const char *needle);
          // Return a 'const'-reference to the cached server that can do a
          // case-insensitive search for the specified 'needle'.  If such a
          // searcher does not exist in the cache on entry, such a searcher
          // is constructed, added to the cache, and returned (by
          // 'const'-reference).

      // ACCESSORS
      bsl::size_t numSearchers() const;
          // Return the number of searcher objects in this cache.
  };
Notice (see the typedef for Searcher) that we reuse the hash functor, MyCaseInsensitiveCharHasher, and equality comparison functor, MyCaseInsensitiveCharComparer, that were defined in Example 2.
Note that MyCaseInsensitiveSearcherCache itself is an allocating type. If we needed to make it compatible with BDE containers (e.g., to allow a cache of caches) a few additional features are needed. As we have no such need, those features are deferred.
Then, we implement the constructor:
                      // ------------------------------------
                      // class MyCaseInsensitiveSearcherCache
                      // ------------------------------------

  // CREATORS
  MyCaseInsensitiveSearcherCache::
  MyCaseInsensitiveSearcherCache(bslma::Allocator *basicAllocator)
  : d_map(basicAllocator)
  {
  }
Notice that basicAllocator is simply forwarded to d_map.
Next, we implement the public methods:
  // MANIPULATORS
  const
  MyCaseInsensitiveSearcherCache::Searcher&
  MyCaseInsensitiveSearcherCache::getSearcher(const char *needle)
  {
      bsl::string   key(needle);
      Map::iterator findResult = d_map.find(key);

      if (d_map.end() == findResult) {
          return insertSearcher(key);                               // RETURN
      } else {
          return findResult->second;                                // RETURN
      }
  }

  // ACCESSORS
  bsl::size_t MyCaseInsensitiveSearcherCache::numSearchers() const
  {
      return d_map.size();
  }
Then, to complete our class, we implement the cache class private method:
  // PRIVATE MANIPULATORS
  const
  MyCaseInsensitiveSearcherCache::Searcher&
  MyCaseInsensitiveSearcherCache::insertSearcher(const bsl::string& key)
  {
      Searcher        dummy(key.begin(), key.begin()); // to be overwritten
      Map::value_type value(key, dummy);

      bsl::pair<Map::iterator, bool> insertResult = d_map.insert(value);
      assert(true == insertResult.second);

      Map::iterator iterator = insertResult.first;

      iterator->second = Searcher(iterator->first.begin(),
                                  iterator->first.end());
      return iterator->second;
  }
Notice creating our element is a two step process. First, we insert the key with an arbitrary "dummy" searcher. Once the key (a string) exists in the map (at an address that is stable for the life of the map) we create a searcher object that refers to that key string for its search sequence, and overwrite the "dummy" part of previously inserted element.
Now, we show how the searcher object cache can be used. In this example, a fixed array represents our source of name entries, in random order:
  struct {
      const char *d_givenName_p;
      const char *d_surname_p;
  } DATA[] = {
      // GIVEN     SURNAME
      // --------  ------------
      { "Donald" , "McDonald"    }
    , { "John"   , "Johnson"     }
    , { "John"   , "Saint Johns" }
    , { "Jon"    , "Literjon"    }
    , { "Jean"   , "Valjean"     }
    , { "James"  , "Jameson"     }
    , { "Will"   , "Freewill"    }
    , { "John"   , "Johns"       }
    , { "John"   , "John"        }
    , { "John"   , "Jones"       }
    , { "J'onn"  , "J'onzz"      }
    , { "Donald" , "Donalds"     }
    , { "Donald" , "Mac Donald"  }
    , { "William", "Williams"    }
    , { "Durand" , "Durand"      }
    , { "John"   , "Johnstown"   }
    , { "Major"  , "Major"       }
    , { "Donald" , "MacDonald"   }
    , { "Patrick", "O'Patrick"   }
    , { "Chris",   "Christie"    }
    , { "Don",     "London"      }
      // ...
    , { "Ivan"   , "Ivanovich"   }
  };

  bsl::size_t NUM_NAMES = sizeof DATA / sizeof *DATA;

  typedef bsl::pair<const char *, const char *> Result;

  MyAllocator                    myAllocator;
  MyCaseInsensitiveSearcherCache searcherCache(&myAllocator);
  bsl::string                    output;

  for (bsl::size_t ti = 0; ti < NUM_NAMES; ++ti) {
      const char * const givenName = DATA[ti].d_givenName_p;
      const char * const surname   = DATA[ti].d_surname_p;

      const MyCaseInsensitiveSearcherCache::Searcher& searcher =
                                        searcherCache.getSearcher(givenName);

      assert(&myAllocator == searcher.allocator());
Notice that each searcher object in the cache (correctly) uses the same allocator as we specified for the cache itself.
The rest of the application:
      const Result result   = searcher(surname,
                                       surname + bsl::strlen(surname));

      const Result notFound = bsl::make_pair(surname + bsl::strlen(surname),
                                       surname + bsl::strlen(surname));

      char buffer[32];

      if (notFound == result) {
          sprintf(buffer, "ng: %-10s %-11s\n", givenName, surname);
      } else {
          sprintf(buffer, "OK: %-10s %-11s\n", givenName, surname);
      }

      output.append(buffer);
  }
Finally, we examine the collected output and confirm that our code is properly identifying the names of interest.
  assert(0 == bsl::strcmp(output.c_str(),
                          "OK: Donald     McDonald   \n"
                          "OK: John       Johnson    \n"
                          "OK: John       Saint Johns\n"
                          "OK: Jon        Literjon   \n"
                          "OK: Jean       Valjean    \n"
                          "OK: James      Jameson    \n"
                          "OK: Will       Freewill   \n"
                          "OK: John       Johns      \n"
                          "OK: John       John       \n"
                          "ng: John       Jones      \n"
                          "ng: J'onn      J'onzz     \n"
                          "OK: Donald     Donalds    \n"
                          "OK: Donald     Mac Donald \n"
                          "OK: William    Williams   \n"
                          "OK: Durand     Durand     \n"
                          "OK: John       Johnstown  \n"
                          "OK: Major      Major      \n"
                          "OK: Donald     MacDonald  \n"
                          "OK: Patrick    O'Patrick  \n"
                          "OK: Chris      Christie   \n"
                          "OK: Don        London     \n"
                          "OK: Ivan       Ivanovich  \n"));

  assert(searcherCache.numSearchers() < NUM_NAMES);