BDE 4.14.0 Production release
Loading...
Searching...
No Matches
bslstl_boyermoorehorspoolsearcher

Detailed Description

Outline

Purpose

Provide an STL-compliant boyer_moore_horspool_searcher class.

Classes

Canonical header: bsl_functional.h

See also
bslstl_defaultsearcher

Description

This component defines two class templates, bsl::boyer_moore_horspool_searcher and bslstl::BoyerMooreHorspoolSearcher. Both are compliant with section [func.search.bmh] of the C++ Standard (C++17 and later).

bsl::boyer_moore_horspool_searcher is strictly limited to the Standard and is provided for clients for whom standard compliance is a priority. bslslt::BoyerMoreHorspoolSearcher provides several accessors that are not mentioned in the Standard. Moreover, bslslt::BoyerMoreHorspoolSearcher is "plumbed" for BDE allocators and can be used with BDE standard containers whereas the compliant strict class always uses the currently installed default allocator. See {Example 4} below.

Except where there is a relevant difference, both are described below as if they were one.

This class has several template parameters:

RNDACC_ITR_NEEDLE: The type used to specify (on construction) the range of values being sought (the "needle").

HASH: The functor type used to hash metadata for the unique values of the needle. See {Requirements for HASH and EQUAL}.

EQUAL: The functor type used to compare values when storing/accessing needle metadata. See {Requirements for HASH and EQUAL}.

The class also provides a functor-style interface that accepts two iterators that define the range of values to be searched (the "haystack"). Once constructed, a single searcher object can be re-used to search multiple haystacks (for the same needle value).

The iterators defining the haystack need not be of the same type as those that define the needle. Moreover, the search method of the searcher can be overloaded for an arbitrary number of different haystack iterators (subject to {Iterator Requirements}).

Algorithm

The search algorithm used is an implementation of the well-known Boyer, Moore, Horspool (BMH) Algorithm for string matching (see https://en.wikipedia.org/wiki/Boyer%E2%80%93Moore%E2%80%93Horspool_algorithm). In the typical case, this algorithm offers complexity of O(N) for a haystack of length N.

Iterator Requirements

The two independent iterators types associated with this class – one defining the needle, the other defining the haystack – must meet the requirements of RandomAccessIterator.

Additionally:

The operations of either of the iterator types are allowed to throw exceptions.

Iterators defining needles are required to remain valid as long as the searcher object might be used.

Requirements for HASH and EQUAL

The comparer class, EQUAL, must meet the requirements of BinaryPredicate:

Comparer classes are allowed to throw exceptions.

The behavior is undefined unless two values that are deemed equal by the EQUAL functor generate the same value by the HASH functor. That is:

true == searcher.equal()(a, b);

implies:

searcher.hash()(a) == searcher.hash()(b);

Optimizations for bslstl::BoyerMooreHorspoolSearcher

This implementation handles needle metadata using a fixed size array when the value_type is char (either signed or unsigned flavors). For needles of typical size, this choice results in somewhat more memory use than it would have if some dynamically sized container were used; however, the faster access during searches warrants the tradeoff.

Usage

In this section we show the intended usage of this component.

Example 1: Basic Usage

The problem of searching a sequence of characters for a particular sub-sequence arises in many applications. For example, one might need to know if some document contains a particular word of interest. For example, suppose we would like to know the first occurrence of the word "United" in the Declaration of Independence (of the United States):

First, we obtain the text of document and word of interest as sequences of char values.

const char document[] =
" IN CONGRESS, July 4, 1776.\n" // 28
"\n" // 1
" The unanimous Declaration of the thirteen united States of America,\n"
//----^----|----^----|----^----|----^----|---- // 44
// // --
// // 73rd character
//
"\n"
// ...
" declare, That these United Colonies are, and of Right ought to be\n"
// ...
"Honor.";
const char *word = "United";

Then, we create a default_searcher object (a functor) using the given word:

bsl::boyer_moore_horspool_searcher<const char *> searchForUnited(
word,
word
+ bsl::strlen(word));

Notice that no equality comparison functor was specified so searchForUnited will use bsl::equal_to<char> by default.

Now, we invoke our functor, specifying the range of the document to be searched:

bsl::pair<const char *, const char *> result = searchForUnited(
document,
document
+ sizeof document);
bsl::size_t offset = result.first - document;
assert(120 == offset);
assert(static_cast<bsl::size_t>(result.second - result.first)
== bsl::strlen(word));
Definition bslstl_pair.h:1210
TYPE first
Definition bslstl_pair.h:605
TYPE second
Definition bslstl_pair.h:908

Finally, we notice that the search correctly ignored the appearance of the word "united" (all lower case) in the second sentence.

{bslstl_defaultsearcher |Example 1} shows how the same problem is addressed using bsl::default_searcher.

Example 2: Defining a Comparator and Hash

As seen in {Example 1} above, the default equality comparison functor is case sensitive. If one needs case in-sensitive searches, a non-default equality comparison class and a non-default hash functor must be specified.

First, define (at file scope if using a pre-C++11 compiler) an equality comparison class that provides the required functor interface:

struct MyCaseInsensitiveCharComparer {
bool operator()(const char& a, const char& b) const {
return bsl::tolower(a) == bsl::tolower(b);
}
};

Then, define (again at file scope, if pre-C++11), a hash functor so that two values, irrespective of their case, hash to the same value:

struct MyCaseInsensitiveCharHasher {
bool operator()(const char& value) const {
static bsl::hash<char> s_hash;
return s_hash(static_cast<char>(bsl::tolower(value)));
}
};
Definition bslstl_hash.h:498

Now, specify bsl::boyer_moore_horspool_searcher type for and create a searcher object to search for word:

bsl::boyer_moore_horspool_searcher<const char *,
MyCaseInsensitiveCharHasher,
MyCaseInsensitiveCharComparer>
searchForUnitedInsensitive(
word,
word
+ bsl::strlen(word));

Note that the new searcher object will use defaulted constructed MyCaseInsensitiveCharHasher and MyCaseInsensitiveCharComparer classes. If stateful functors are required such objects can be passed in the optional constructor arguments.

Now, we invoke our searcher functor, specifying that the same document searched in {Example 1}:

searchForUnitedInsensitive(
document,
document
+ sizeof document);
bsl::size_t offsetInsensitive = resultInsensitive.first - document;
assert( 72 == offsetInsensitive);
assert(static_cast<bsl::size_t>(resultInsensitive.second
- resultInsensitive.first)
== bsl::strlen(word));

Finally, we find the next occurrence of word by reusing the same searcher object, this time instructing it to begin its search just after the previous occurrence of word was found:

resultInsensitive = searchForUnitedInsensitive(resultInsensitive.second,
document + sizeof document);
offsetInsensitive = resultInsensitive.first - document;
assert(120 == offsetInsensitive);
assert(static_cast<bsl::size_t>(resultInsensitive.second
- resultInsensitive.first)
== bsl::strlen(word));

{bslstl_defaultsearcher |Example 2} shows how the same problem is addressed using bsl::default_searcher.

Example 3: Non-char Searches

The BMH searcher class template is not constrained to searching for char values. Searches can be done on other types (see {Iterator Requirements}). Moreover the container of the sequence being sought (the "needle") need not the same as the sequence being searched (the "haystack").

Suppose one has data from an instrument that reports float values and that inserts the sequence { FLT_MAX, FLT_MIN, FLT_MAX } as a marker for the start and end of a test run. We can assume the probably of the instrument reporting this sequence as readings is negligible and that data reported outside of the test runs is random noise. Here is how we can search for the first test run data in the data sequence.

First, we create a representation of the sequence that denotes the limit of a test run.

const float markerSequence[] = { FLT_MAX , FLT_MIN , FLT_MAX };
const bsl::size_t markerSequenceLength = sizeof markerSequence
/ sizeof *markerSequence;

Next, we obtain the data to be searched. (In this example, we will use simulated data.)

bsl::vector<float> data; // Container provides random access iterators.
doTestRun(&data);
Definition bslstl_vector.h:1025

Then, we define and create our searcher object:

bsl::boyer_moore_horspool_searcher<const float *>
searchForMarker(markerSequence,
markerSequence
+ markerSequenceLength);

Notice that no equality comparison functor was specified so searchForMarker will use bsl::equal_to<float> by default.

Now, we invoke our searcher on the instrument data.

const bsl::pair<DataConstItr, DataConstItr> notFound(data.cend(),
data.cend());
bsl::pair<DataConstItr, DataConstItr> markerPosition = searchForMarker(
data.cbegin(),
data.cend());
assert(notFound != markerPosition);
DataConstItr startOfTestRun = markerPosition.second;
VALUE_TYPE const * const_iterator
Definition bslstl_vector.h:1058

Finally, we locate the marker of the end of the first test run and pass the location of the first test run data to some other function for processing.

markerPosition = searchForMarker(markerPosition.second, data.cend());
assert(notFound != markerPosition);
DataConstItr endOfTestRun = markerPosition.first;
processTestRun(startOfTestRun, endOfTestRun);

{bslstl_defaultsearcher |Example 3} shows how the same problem is addressed using bsl::default_searcher. Notice that other example uses data from a container that provides bidirectional iterators (and forward iterators would have sufficed), whereas here random access iterators are required.

Example 4: Caching Searcher Objects

The construction of bsl::boyer_moore_horspool_searcher objects is small (the needle must be scanned, meta-data calculated, and results saved) but can be non-neglibible when one needs a great number of them. When there is a reasonable chance that one will have to repeat a given search, it can be worthwhile to cache the searcher objects for reuse.

The Problem

Suppose we have a long list of names, each consisting of a given name (first name) and a surname (last name), and that we wish to identify instances of reduplication of the given name in the surname. That is, we want to identify the cases where the given name is a case-insensitive substring of the surname. Examples include "Durand Durand", "James Jameson", "John St. John", and "Jean Valjean". In this example we will not accept nicknames and other approximate name forms as matches (e.g., "Joe Joseph", "Jack Johnson").

Since we want to perform our task as efficiently as possible, and since we expect many entries to have common given names (e.g., "John"), we decide to create a cache of searcher objects for those first names so they need not be reconstructed for each search of a surname.

Design Choices

To implement our cache we will use a bsl::unordered_map container. Allocating types must meet certain requirements to work properly with allocator-enabled containers such as bsl::unordered_map. bsl::boyer_moore_horspool_searcher does not, so we will use bslstl::BoyerMooreHorspoolSearcher, which does.

To clarify exposition, our cache will have the simple policy of retaining searcher objects indefinitely and ignore the real-world concern that our cache may grow so large that the search time exceeds construction time. Also, we will forgo techniques that might minimize the number of times data is copied.

Steps

First, we define our cache class:

// ====================================
// class MyCaseInsensitiveSearcherCache
// ====================================
class MyCaseInsensitiveSearcherCache {
// TYPES
public:
MyCaseInsensitiveCharHasher,
MyCaseInsensitiveCharComparer>
Searcher;
// PRIVATE TYPES
private:
// DATA
Map d_map;
// PRIVATE MANIPULATORS
/// Insert into this cache a key-value pair where the key is the
/// specified `key` and the value is a `Searcher` object created to
/// seek the needle specified by the key part. Note that this
/// arrangement guarantees that the iterators used by this cached
/// searcher object remain valid for the life of the searcher
/// object.
const Searcher& insertSearcher(const bsl::string& key);
public:
// CREATORS
/// Create an empty `MyCaseInsensitiveSearcherCache` object.
/// Optionally specify a `basicAllocator` used to supply memory. If
/// `basicAllocator` is 0, the currently installed default
/// allocator is used.
explicit MyCaseInsensitiveSearcherCache(bslma::Allocator
*basicAllocator = 0);
// MANIPULATORS
/// Return a `const`-reference to the cached server that can do a
/// case-insensitive search for the specified `needle`. If such a
/// searcher does not exist in the cache on entry, such a searcher
/// is constructed, added to the cache, and returned (by
/// `const`-reference).
const Searcher& getSearcher(const char *needle);
// ACCESSORS
/// Return the number of searcher objects in this cache.
bsl::size_t numSearchers() const;
};
Definition bslstl_string.h:1281
const CHAR_TYPE * const_iterator
Definition bslstl_string.h:1311
Definition bslstl_unorderedmap.h:1089
Definition bslma_allocator.h:457
Definition bslstl_boyermoorehorspoolsearcher.h:1007

Notice (see the typedef for Searcher) that we reuse the hash functor, MyCaseInsensitiveCharHasher, and equality comparison functor, MyCaseInsensitiveCharComparer, that were defined in {Example 2}.

Note that MyCaseInsensitiveSearcherCache itself is an allocating type. If we needed to make it compatible with BDE containers (e.g., to allow a cache of caches) a few additional features are needed. As we have no such need, those features are deferred.

Then, we implement the constructor:

// ------------------------------------
// class MyCaseInsensitiveSearcherCache
// ------------------------------------
// CREATORS
MyCaseInsensitiveSearcherCache::
MyCaseInsensitiveSearcherCache(bslma::Allocator *basicAllocator)
: d_map(basicAllocator)
{
}

Notice that basicAllocator is simply forwarded to d_map.

Next, we implement the public methods:

// MANIPULATORS
const
MyCaseInsensitiveSearcherCache::Searcher&
MyCaseInsensitiveSearcherCache::getSearcher(const char *needle)
{
bsl::string key(needle);
Map::iterator findResult = d_map.find(key);
if (d_map.end() == findResult) {
return insertSearcher(key); // RETURN
} else {
return findResult->second; // RETURN
}
}
// ACCESSORS
bsl::size_t MyCaseInsensitiveSearcherCache::numSearchers() const
{
return d_map.size();
}

Then, to complete our class, we implement the cache class private method:

// PRIVATE MANIPULATORS
const
MyCaseInsensitiveSearcherCache::Searcher&
MyCaseInsensitiveSearcherCache::insertSearcher(const bsl::string& key)
{
Searcher dummy(key.begin(), key.begin()); // to be overwritten
Map::value_type value(key, dummy);
bsl::pair<Map::iterator, bool> insertResult = d_map.insert(value);
assert(true == insertResult.second);
Map::iterator iterator = insertResult.first;
iterator->second = Searcher(iterator->first.begin(),
iterator->first.end());
return iterator->second;
}
void swap(basic_string &other) BSLS_KEYWORD_NOEXCEPT_SPECIFICATION(AllocatorTraits const_iterator begin() const BSLS_KEYWORD_NOEXCEPT
Definition bslstl_string.h:2533

Notice creating our element is a two step process. First, we insert the key with an arbitrary "dummy" searcher. Once the key (a string) exists in the map (at an address that is stable for the life of the map) we create a searcher object that refers to that key string for its search sequence, and overwrite the "dummy" part of previously inserted element.

Now, we show how the searcher object cache can be used. In this example, a fixed array represents our source of name entries, in random order:

struct {
const char *d_givenName_p;
const char *d_surname_p;
} DATA[] = {
// GIVEN SURNAME
// -------- ------------
{ "Donald" , "McDonald" }
, { "John" , "Johnson" }
, { "John" , "Saint Johns" }
, { "Jon" , "Literjon" }
, { "Jean" , "Valjean" }
, { "James" , "Jameson" }
, { "Will" , "Freewill" }
, { "John" , "Johns" }
, { "John" , "John" }
, { "John" , "Jones" }
, { "J'onn" , "J'onzz" }
, { "Donald" , "Donalds" }
, { "Donald" , "Mac Donald" }
, { "William", "Williams" }
, { "Durand" , "Durand" }
, { "John" , "Johnstown" }
, { "Major" , "Major" }
, { "Donald" , "MacDonald" }
, { "Patrick", "O'Patrick" }
, { "Chris", "Christie" }
, { "Don", "London" }
// ...
, { "Ivan" , "Ivanovich" }
};
bsl::size_t NUM_NAMES = sizeof DATA / sizeof *DATA;
MyAllocator myAllocator;
MyCaseInsensitiveSearcherCache searcherCache(&myAllocator);
bsl::string output;
for (bsl::size_t ti = 0; ti < NUM_NAMES; ++ti) {
const char * const givenName = DATA[ti].d_givenName_p;
const char * const surname = DATA[ti].d_surname_p;
const MyCaseInsensitiveSearcherCache::Searcher& searcher =
searcherCache.getSearcher(givenName);
assert(&myAllocator == searcher.allocator());

Notice that each searcher object in the cache (correctly) uses the same allocator as we specified for the cache itself.

The rest of the application:

const Result result = searcher(surname,
surname + bsl::strlen(surname));
const Result notFound = bsl::make_pair(surname + bsl::strlen(surname),
surname + bsl::strlen(surname));
char buffer[32];
if (notFound == result) {
sprintf(buffer, "ng: %-10s %-11s\n", givenName, surname);
} else {
sprintf(buffer, "OK: %-10s %-11s\n", givenName, surname);
}
output.append(buffer);
}
basic_string & append(const basic_string &suffix)
Definition bslstl_string.h:5574

Finally, we examine the collected output and confirm that our code is properly identifying the names of interest.

assert(0 == bsl::strcmp(output.c_str(),
"OK: Donald McDonald \n"
"OK: John Johnson \n"
"OK: John Saint Johns\n"
"OK: Jon Literjon \n"
"OK: Jean Valjean \n"
"OK: James Jameson \n"
"OK: Will Freewill \n"
"OK: John Johns \n"
"OK: John John \n"
"ng: John Jones \n"
"ng: J'onn J'onzz \n"
"OK: Donald Donalds \n"
"OK: Donald Mac Donald \n"
"OK: William Williams \n"
"OK: Durand Durand \n"
"OK: John Johnstown \n"
"OK: Major Major \n"
"OK: Donald MacDonald \n"
"OK: Patrick O'Patrick \n"
"OK: Chris Christie \n"
"OK: Don London \n"
"OK: Ivan Ivanovich \n"));
assert(searcherCache.numSearchers() < NUM_NAMES);
const CHAR_TYPE * c_str() const BSLS_KEYWORD_NOEXCEPT
Definition bslstl_string.h:6705