Provide an open-addressed unordered map container.
More...
Namespaces |
namespace | bdlc |
Detailed Description
- Outline
-
-
- Purpose:
- Provide an open-addressed unordered map container.
-
- Classes:
-
- See also:
- Component bdlc_flathashtable, Component bdlc_flathashset
-
- Description:
- This component defines a single class template,
bdlc::FlatHashMap
, that implements an open-addressed unordered map of items with unique keys.
- Unordered maps are useful in situations when there is no meaningful way to order key values, when the order of the keys is irrelevant to the problem domain, or (even if there is a meaningful ordering) the value of ordering the keys is outweighed by the higher performance provided by unordered maps (compared to ordered maps). On platforms that support relevant SIMD instructions (e.g., SSE2),
bdlc::FlatHashMap
generally exhibits better performance than bsl::unordered_map
.
- An instantiation of
bdlc::FlatHashMap
is an allocator-aware, value-semantic type whose salient attributes are the collection of KEY-VALUE
pairs contained, without regard to order. An instantiation may be provided with custom hash and key-equality functors, but those are not salient attributes. In particular, when comparing element values for equality between two different bdlc::FlatHashMap
objects, the elements are compared using operator==
.
- The implemented data structure is inspired by Google's
flat_hash_map
CppCon presentations (available on YouTube). The implementation draws from Google's open source raw_hash_set.h
file at: https://github.com/abseil/abseil-cpp/blob/master/absl/container/internal.
-
- Performance Caveats:
bdlc::FlatHashMap
is recommended for Intel platforms only (i.e., Linux and Windows, and pre-ARM Macs); on platforms using other processors (i.e., Sun and AIX), bdlc::FlatHashMap
may have slower performance than bsl::unordered_map
. However, note that bdlc::FlatHashMap
will use significantly less memory than bsl::unordered_map
on all platforms. Given the Intel-only performance caveat, it is recommended to benchmark before using bdlc::FlatHashMap
-- particularly on non-Intel production environments.
-
- Interface Differences with unordered_map:
- A
bdlc::FlatHashMap
meets most of the requirements of an unordered associative container with forward iterators in the C++11 Standard [23.2.5]. It does not have the bucket interface, and locations of elements may change when the container is modified (and therefore iterators become invalid too). Allocator use follows BDE style, and the various allocator propagation attributes are not present (e.g., the allocator trait propagate_on_container_copy_assignment
). The maximum load factor of the container (the ratio of size to capacity) is maintained by the container itself and is not settable (the maximum load factor is implementation defined and fixed).
-
- Load Factor and Resizing:
- An invariant of
bdlc::FlatHashMap
is that 0 <= load_factor() <= max_load_factor() <= 1.0
. Any operation that would result in load_factor() > max_load_factor()
for a bdlc::FlatHashMap
causes the capacity to increase. This resizing allocates new memory, copies or moves all elements to the new memory, and reclaims the original memory. The transfer of elements involves rehashing each element to determine its new location. As such, all iterators, pointers, and references to elements of the bdlc::FlatHashMap
are invalidated on a resize.
-
- Requirements on KEY, HASH, and EQUAL:
- The template parameter type
KEY
must be copy or move constructible. The template parameter types HASH
and EQUAL
must be default and copy constructible function objects.
HASH
must support a function-call operator compatible with the following statements for an object key
of type KEY
: HASH hash;
bsl::size_t result = hash(key);
EQUAL
must support a function-call operator compatible with the following statements for objects key1
and key2
of type KEY
: EQUAL equal;
bool result = equal(key1, key2);
where the definition of the called function defines an equivalence relationship on keys that is both reflexive and transitive.
HASH
and EQUAL
function objects are further constrained: if the comparator determines that two values are equal, the hasher must produce the same hash value for each.
-
- Iterator, Pointer, and Reference Invalidation:
- Any change in capacity of a
bdlc::FlatHashMap
invalidates all pointers, references, and iterators. A bdlc::FlatHashMap
manipulator that erases an element invalidates all pointers, references, and iterators to the erased element.
-
- Exception Safety:
- A
bdlc::FlatHashMap
is exception neutral, and all of the methods of bdlc::FlatHashMap
provide the basic exception safety guarantee (see bsldoc_glossary
|Basic Guarantee).
-
- Move Semantics in C++03:
- Move-only types are supported by
bdlc::FlatHashMap
on C++11, and later, platforms only (where BSLMF_MOVABLEREF_USES_RVALUE_REFERENCES
is defined), and are not supported on C++03 platforms. Unfortunately, in C++03, there are user-defined types where a bslmf::MovableRef
will not safely degrade to an lvalue reference when a move constructor is not available (types providing a constructor template taking any type), so bslmf::MovableRefUtil::move
cannot be used directly on a user-supplied template parameter type.
-
- Usage:
- In this section we show intended use of this component.
-
- Example 1: Gathering Document Statistics:
- Suppose one wished to gather statistics on the words appearing in a large set of documents on disk or in a database. Gathering those statistics is intrusive (as one is competing for access to the documents with the regular users) and must be done as quickly as possible. Moreover, the set of unique words appearing in those documents may be high. The English language has in excess of a million words (albeit many appear infrequently), and, if the documents contain serial numbers, or Social Security numbers, or chemical formulas, etc., then the
O[log(n)]
insertion time of ordered maps may well be inadequate. An unordered map, having an O[1]
typical insertion cost, is a viable alternative.
- This example illustrates the use of
bdlc::FlatHashMap
to gather one simple statistic (counts of unique words) on a portion of a single document. To avoid irrelevant details of acquiring the data, the data is stored in static arrays: static char document[] =
" IN CONGRESS, July 4, 1776.\n"
"\n"
" The unanimous Declaration of the thirteen united States of America,\n"
"\n"
" When in the Course of human events, it becomes necessary for one\n"
" people to dissolve the political bands which have connected them with\n"
" another, and to assume among the powers of the earth, the separate\n"
" and equal station to which the Laws of Nature and of Nature's G-d\n"
" entitle them, a decent respect to the opinions of mankind requires\n"
" that they should declare the causes which impel them to the\n"
" separation. We hold these truths to be self-evident, that all men\n"
" are created equal, that they are endowed by their Creator with\n"
" certain unalienable Rights, that among these are Life, Liberty and\n"
" the pursuit of Happiness. That to secure these rights, Governments\n"
" are instituted among Men, deriving their just powers from the consent\n"
" of the governed, That whenever any Form of Government becomes\n";
First, we define an alias to make our code more comprehensible: Next, we create an (empty) flat hash map to hold our word tallies: Then, we define the set of characters that define word boundaries: const char *delimiters = " \n\t,:;.()[]?!/";
Next, we extract the words from our document. Note that strtok
modifies the document array (which was not made const
).
- For each iteration of the loop, a map entry matching the key value parsed by
strtok
is obtained. On the first occurrence of a word, the map has no such entry, so one is created with a default value of the mapped value (0, just what we want in this case) and inserted into the map where it is found on any subsequent occurrences of the word. The operator[]
method returns a reference providing modifiable access to the mapped value. Here, we apply the ++
operator to that reference to maintain a tally for the word: for (char *cur = strtok(document, delimiters);
cur;
cur = strtok(NULL, delimiters)) {
++wordTally[bsl::string(cur)];
}
Now that the data has been (quickly) gathered, we can indulge in analysis that is more time consuming. For example, we can define a comparison function, copy the data to another container (e.g., bsl::vector
), sort the entries, and determine the 10 most commonly used words in the given documents: typedef bsl::pair<bsl::string, int> WordTallyEntry;
struct WordTallyEntryCompare {
static bool lessThan(const WordTallyEntry& a,
const WordTallyEntry& b) {
return a.second < b.second;
}
static bool greaterThan(const WordTallyEntry& a,
const WordTallyEntry& b) {
return a.second > b.second;
}
};
bsl::vector<WordTallyEntry> array(wordTally.cbegin(), wordTally.cend());
assert(10 <= array.size());
bsl::partial_sort(array.begin(),
array.begin() + 10,
array.end(),
WordTallyEntryCompare::greaterThan);
Notice that partial_sort
suffices here since we seek only the 10 most used words, not a complete distribution of word counts.
- Finally, we print the sorted portion of
array
: and standard output shows: the 13
of 10
to 7
that 4
are 4
and 4
which 3
these 3
them 3
among 3