BDE 4.14.0 Production release
|
Provide an STL-compliant boyer_moore_horspool_searcher
class.
Canonical header: bsl_functional.h
This component defines two class templates, bsl::boyer_moore_horspool_searcher
and bslstl::BoyerMooreHorspoolSearcher
. Both are compliant with section [func.search.bmh]
of the C++ Standard (C++17 and later).
bsl::boyer_moore_horspool_searcher
is strictly limited to the Standard and is provided for clients for whom standard compliance is a priority. bslslt::BoyerMoreHorspoolSearcher
provides several accessors that are not mentioned in the Standard. Moreover, bslslt::BoyerMoreHorspoolSearcher
is "plumbed" for BDE allocators and can be used with BDE standard containers whereas the compliant strict class always uses the currently installed default allocator. See {Example 4} below.
Except where there is a relevant difference, both are described below as if they were one.
This class has several template parameters:
RNDACC_ITR_NEEDLE
: The type used to specify (on construction) the range of values being sought (the "needle").
HASH
: The functor type used to hash metadata for the unique values of the needle. See {Requirements for HASH
and EQUAL
}.
EQUAL
: The functor type used to compare values when storing/accessing needle metadata. See {Requirements for HASH
and EQUAL
}.
The class also provides a functor-style interface that accepts two iterators that define the range of values to be searched (the "haystack"). Once constructed, a single searcher object can be re-used to search multiple haystacks (for the same needle value).
The iterators defining the haystack need not be of the same type as those that define the needle. Moreover, the search method of the searcher can be overloaded for an arbitrary number of different haystack iterators (subject to {Iterator Requirements}).
The search algorithm used is an implementation of the well-known Boyer, Moore, Horspool (BMH) Algorithm for string matching (see https://en.wikipedia.org/wiki/Boyer%E2%80%93Moore%E2%80%93Horspool_algorithm). In the typical case, this algorithm offers complexity of O(N)
for a haystack of length N
.
The two independent iterators types associated with this class – one defining the needle, the other defining the haystack – must meet the requirements of RandomAccessIterator.
Additionally:
The operations of either of the iterator types are allowed to throw exceptions.
Iterators defining needles are required to remain valid as long as the searcher object might be used.
The comparer class, EQUAL
, must meet the requirements of BinaryPredicate:
operator()
method that, given a *RandomAccessIterator', iterator
, can be invoked as operator()(*iterator, *iterator)
.bool
.Comparer classes are allowed to throw exceptions.
The behavior is undefined unless two values that are deemed equal by the EQUAL
functor generate the same value by the HASH
functor. That is:
implies:
This implementation handles needle metadata using a fixed size array when the value_type
is char
(either signed
or unsigned
flavors). For needles of typical size, this choice results in somewhat more memory use than it would have if some dynamically sized container were used; however, the faster access during searches warrants the tradeoff.
In this section we show the intended usage of this component.
The problem of searching a sequence of characters for a particular sub-sequence arises in many applications. For example, one might need to know if some document contains a particular word of interest. For example, suppose we would like to know the first occurrence of the word "United" in the Declaration of Independence (of the United States):
First, we obtain the text of document and word of interest as sequences of char
values.
Then, we create a default_searcher object (a functor) using the given word
:
Notice that no equality comparison functor was specified so searchForUnited
will use bsl::equal_to<char>
by default.
Now, we invoke our functor, specifying the range of the document to be searched:
Finally, we notice that the search correctly ignored the appearance of the word "united" (all lower case) in the second sentence.
{bslstl_defaultsearcher |Example 1} shows how the same problem is addressed using bsl::default_searcher
.
As seen in {Example 1} above, the default equality comparison functor is case sensitive. If one needs case in-sensitive searches, a non-default equality comparison class and a non-default hash functor must be specified.
First, define (at file scope if using a pre-C++11 compiler) an equality comparison class that provides the required functor interface:
Then, define (again at file scope, if pre-C++11), a hash functor so that two values, irrespective of their case, hash to the same value:
Now, specify bsl::boyer_moore_horspool_searcher
type for and create a searcher object to search for word
:
Note that the new searcher object will use defaulted constructed MyCaseInsensitiveCharHasher
and MyCaseInsensitiveCharComparer
classes. If stateful functors are required such objects can be passed in the optional constructor arguments.
Now, we invoke our searcher functor, specifying that the same document searched in {Example 1}:
Finally, we find the next occurrence of word
by reusing the same searcher object, this time instructing it to begin its search just after the previous occurrence of word
was found:
{bslstl_defaultsearcher |Example 2} shows how the same problem is addressed using bsl::default_searcher
.
The BMH searcher class template is not constrained to searching for char
values. Searches can be done on other types (see {Iterator Requirements}). Moreover the container of the sequence being sought (the "needle") need not the same as the sequence being searched (the "haystack").
Suppose one has data from an instrument that reports float
values and that inserts the sequence { FLT_MAX, FLT_MIN, FLT_MAX }
as a marker for the start and end of a test run. We can assume the probably of the instrument reporting this sequence as readings is negligible and that data reported outside of the test runs is random noise. Here is how we can search for the first test run data in the data sequence.
First, we create a representation of the sequence that denotes the limit of a test run.
Next, we obtain the data to be searched. (In this example, we will use simulated data.)
Then, we define and create our searcher object:
Notice that no equality comparison functor was specified so searchForMarker
will use bsl::equal_to<float>
by default.
Now, we invoke our searcher on the instrument data.
Finally, we locate the marker of the end of the first test run and pass the location of the first test run data to some other function for processing.
{bslstl_defaultsearcher |Example 3} shows how the same problem is addressed using bsl::default_searcher
. Notice that other example uses data
from a container that provides bidirectional iterators (and forward iterators would have sufficed), whereas here random access iterators are required.
The construction of bsl::boyer_moore_horspool_searcher
objects is small (the needle must be scanned, meta-data calculated, and results saved) but can be non-neglibible when one needs a great number of them. When there is a reasonable chance that one will have to repeat a given search, it can be worthwhile to cache the searcher objects for reuse.
Suppose we have a long list of names, each consisting of a given name (first name) and a surname (last name), and that we wish to identify instances of reduplication of the given name in the surname. That is, we want to identify the cases where the given name is a case-insensitive substring of the surname. Examples include "Durand Durand", "James Jameson", "John St. John", and "Jean Valjean". In this example we will not accept nicknames and other approximate name forms as matches (e.g., "Joe Joseph", "Jack Johnson").
Since we want to perform our task as efficiently as possible, and since we expect many entries to have common given names (e.g., "John"), we decide to create a cache of searcher objects for those first names so they need not be reconstructed for each search of a surname.
To implement our cache we will use a bsl::unordered_map
container. Allocating types must meet certain requirements to work properly with allocator-enabled containers such as bsl::unordered_map
. bsl::boyer_moore_horspool_searcher
does not, so we will use bslstl::BoyerMooreHorspoolSearcher
, which does.
To clarify exposition, our cache will have the simple policy of retaining searcher objects indefinitely and ignore the real-world concern that our cache may grow so large that the search time exceeds construction time. Also, we will forgo techniques that might minimize the number of times data is copied.
First, we define our cache class:
Notice (see the typedef
for Searcher
) that we reuse the hash functor, MyCaseInsensitiveCharHasher
, and equality comparison functor, MyCaseInsensitiveCharComparer
, that were defined in {Example 2}.
Note that MyCaseInsensitiveSearcherCache
itself is an allocating type. If we needed to make it compatible with BDE containers (e.g., to allow a cache of caches) a few additional features are needed. As we have no such need, those features are deferred.
Then, we implement the constructor:
Notice that basicAllocator
is simply forwarded to d_map
.
Next, we implement the public methods:
Then, to complete our class, we implement the cache class private method:
Notice creating our element is a two step process. First, we insert the key with an arbitrary "dummy" searcher. Once the key (a string) exists in the map (at an address that is stable for the life of the map) we create a searcher object that refers to that key string for its search sequence, and overwrite the "dummy" part of previously inserted element.
Now, we show how the searcher object cache can be used. In this example, a fixed array represents our source of name entries, in random order:
Notice that each searcher object in the cache (correctly) uses the same allocator as we specified for the cache itself.
The rest of the application:
Finally, we examine the collected output
and confirm that our code is properly identifying the names of interest.