BDE 4.14.0 Production release
Loading...
Searching...
No Matches
bdlc_flathashmap

Detailed Description

Outline

Purpose

Provide an open-addressed unordered map container.

Classes

See also
bdlc_flathashtable, bdlc_flathashset

Description

This component defines a single class template, bdlc::FlatHashMap, that implements an open-addressed unordered map of items with unique keys.

Unordered maps are useful in situations when there is no meaningful way to order key values, when the order of the keys is irrelevant to the problem domain, or (even if there is a meaningful ordering) the value of ordering the keys is outweighed by the higher performance provided by unordered maps (compared to ordered maps). On platforms that support relevant SIMD instructions (e.g., SSE2), bdlc::FlatHashMap generally exhibits better performance than bsl::unordered_map.

An instantiation of bdlc::FlatHashMap is an allocator-aware, value-semantic type whose salient attributes are the collection of KEY-VALUE pairs contained, without regard to order. An instantiation may be provided with custom hash and key-equality functors, but those are not salient attributes. In particular, when comparing element values for equality between two different bdlc::FlatHashMap objects, the elements are compared using operator==.

The implemented data structure is inspired by Google's flat_hash_map CppCon presentations (available on YouTube). The implementation draws from Google's open source raw_hash_set.h file at: https://github.com/abseil/abseil-cpp/blob/master/absl/container/internal.

Performance Caveats

bdlc::FlatHashMap is recommended for Intel platforms only (i.e., Linux and Windows, and pre-ARM Macs); on platforms using other processors (i.e., Sun and AIX), bdlc::FlatHashMap may have slower performance than bsl::unordered_map. However, note that bdlc::FlatHashMap will use significantly less memory than bsl::unordered_map on all platforms. Given the Intel-only performance caveat, it is recommended to benchmark before using bdlc::FlatHashMap – particularly on non-Intel production environments.

Interface Differences with unordered_map

A bdlc::FlatHashMap meets most of the requirements of an unordered associative container with forward iterators in the C++11 Standard [23.2.5]. It does not have the bucket interface, and locations of elements may change when the container is modified (and therefore iterators become invalid too). Allocator use follows BDE style, and the various allocator propagation attributes are not present (e.g., the allocator trait propagate_on_container_copy_assignment). The maximum load factor of the container (the ratio of size to capacity) is maintained by the container itself and is not settable (the maximum load factor is implementation defined and fixed).

Load Factor and Resizing

An invariant of bdlc::FlatHashMap is that 0 <= load_factor() <= max_load_factor() <= 1.0. Any operation that would result in load_factor() > max_load_factor() for a bdlc::FlatHashMap causes the capacity to increase. This resizing allocates new memory, copies or moves all elements to the new memory, and reclaims the original memory. The transfer of elements involves rehashing each element to determine its new location. As such, all iterators, pointers, and references to elements of the bdlc::FlatHashMap are invalidated on a resize.

Requirements on KEY, HASH, and EQUAL

The template parameter type KEY must be copy or move constructible. The template parameter types HASH and EQUAL must be default and copy constructible function objects.

HASH must support a function-call operator compatible with the following statements for an object key of type KEY:

HASH hash;
bsl::size_t result = hash(key);

EQUAL must support a function-call operator compatible with the following statements for objects key1 and key2 of type KEY:

EQUAL equal;
bool result = equal(key1, key2);

where the definition of the called function defines an equivalence relationship on keys that is both reflexive and transitive.

HASH and EQUAL function objects are further constrained: if the comparator determines that two values are equal, the hasher must produce the same hash value for each.

Iterator, Pointer, and Reference Invalidation

Any change in capacity of a bdlc::FlatHashMap invalidates all pointers, references, and iterators. A bdlc::FlatHashMap manipulator that erases an element invalidates all pointers, references, and iterators to the erased element.

Exception Safety

A bdlc::FlatHashMap is exception neutral, and all of the methods of bdlc::FlatHashMap provide the basic exception safety guarantee (see {bsldoc_glossary |Basic Guarantee}).

Move Semantics in C++03

Move-only types are supported by bdlc::FlatHashMap on C++11, and later, platforms only (where BSLMF_MOVABLEREF_USES_RVALUE_REFERENCES is defined), and are not supported on C++03 platforms. Unfortunately, in C++03, there are user-defined types where a bslmf::MovableRef will not safely degrade to an lvalue reference when a move constructor is not available (types providing a constructor template taking any type), so bslmf::MovableRefUtil::move cannot be used directly on a user-supplied template parameter type.

Usage

In this section we show intended use of this component.

Example 1: Gathering Document Statistics

Suppose one wished to gather statistics on the words appearing in a large set of documents on disk or in a database. Gathering those statistics is intrusive (as one is competing for access to the documents with the regular users) and must be done as quickly as possible. Moreover, the set of unique words appearing in those documents may be high. The English language has in excess of a million words (albeit many appear infrequently), and, if the documents contain serial numbers, or Social Security numbers, or chemical formulas, etc., then the O[log(n)] insertion time of ordered maps may well be inadequate. An unordered map, having an O[1] typical insertion cost, is a viable alternative.

This example illustrates the use of bdlc::FlatHashMap to gather one simple statistic (counts of unique words) on a portion of a single document. To avoid irrelevant details of acquiring the data, the data is stored in static arrays:

static char document[] =
" IN CONGRESS, July 4, 1776.\n"
"\n"
" The unanimous Declaration of the thirteen united States of America,\n"
"\n"
" When in the Course of human events, it becomes necessary for one\n"
" people to dissolve the political bands which have connected them with\n"
" another, and to assume among the powers of the earth, the separate\n"
" and equal station to which the Laws of Nature and of Nature's G-d\n"
" entitle them, a decent respect to the opinions of mankind requires\n"
" that they should declare the causes which impel them to the\n"
" separation. We hold these truths to be self-evident, that all men\n"
" are created equal, that they are endowed by their Creator with\n"
" certain unalienable Rights, that among these are Life, Liberty and\n"
" the pursuit of Happiness. That to secure these rights, Governments\n"
" are instituted among Men, deriving their just powers from the consent\n"
" of the governed, That whenever any Form of Government becomes\n";

First, we define an alias to make our code more comprehensible:

Definition bdlc_flathashmap.h:433

Next, we create an (empty) flat hash map to hold our word tallies:

WordTally wordTally;

Then, we define the set of characters that define word boundaries:

const char *delimiters = " \n\t,:;.()[]?!/";

Next, we extract the words from our document. Note that strtok modifies the document array (which was not made const).

For each iteration of the loop, a map entry matching the key value parsed by strtok is obtained. On the first occurrence of a word, the map has no such entry, so one is created with a default value of the mapped value (0, just what we want in this case) and inserted into the map where it is found on any subsequent occurrences of the word. The operator[] method returns a reference providing modifiable access to the mapped value. Here, we apply the ++ operator to that reference to maintain a tally for the word:

for (char *cur = strtok(document, delimiters);
cur;
cur = strtok(NULL, delimiters)) {
++wordTally[bsl::string(cur)];
}
basic_string< char > string
Definition bslstl_string.h:782

Now that the data has been (quickly) gathered, we can indulge in analysis that is more time consuming. For example, we can define a comparison function, copy the data to another container (e.g., bsl::vector), sort the entries, and determine the 10 most commonly used words in the given documents:

typedef bsl::pair<bsl::string, int> WordTallyEntry;
// Assignable equivalent to 'WordTally::value_type'. Note that
// 'bsl::vector' requires assignable types.
struct WordTallyEntryCompare {
static bool lessThan(const WordTallyEntry& a,
const WordTallyEntry& b) {
return a.second < b.second;
}
static bool greaterThan(const WordTallyEntry& a,
const WordTallyEntry& b) {
return a.second > b.second;
}
};
bsl::vector<WordTallyEntry> array(wordTally.cbegin(), wordTally.cend());
assert(10 <= array.size());
bsl::partial_sort(array.begin(),
array.begin() + 10,
array.end(),
WordTallyEntryCompare::greaterThan);
Definition bslstl_pair.h:1210
Definition bslstl_vector.h:1025
TYPE second
Definition bslstl_pair.h:908

Notice that partial_sort suffices here since we seek only the 10 most used words, not a complete distribution of word counts.

Finally, we print the sorted portion of array:

end = cur + 10;
end != cur; ++cur) {
printf("%-10s %4d\n", cur->first.c_str(), cur->second);
}
VALUE_TYPE const * const_iterator
Definition bslstl_vector.h:1058

and standard output shows:

the 13
of 10
to 7
that 4
are 4
and 4
which 3
these 3
them 3
among 3