BDE 4.14.0 Production release
Loading...
Searching...
No Matches
bdlc_flathashset

Detailed Description

Outline

Purpose

Provide an open-addressed unordered set container.

Classes

See also
bdlc_flathashtable, bdlc_flathashmap

Description

This component defines a single class template, bdlc::FlatHashSet, that implements an open-addressed unordered set of items with unique values.

Unordered sets are useful in situations when there is no meaningful way to order key values, when the order of the values is irrelevant to the problem domain, or (even if there is a meaningful ordering) the value of ordering the results is outweighed by the higher performance provided by unordered sets (compared to ordered sets). On platforms that support relevant SIMD instructions (e.g., SSE2), bdlc::FlatHashSet generally exhibits better performance than bsl::unordered_set.

An instantiation of bdlc::FlatHashSet is an allocator-aware, value-semantic type whose salient attributes are the set of values contained, without regard to order. An instantiation may be provided with custom hash and equality functors, but those are not salient attributes. In particular, when comparing element values for equality between two different bdlc::FlatHashSet objects, the elements are compared using operator==.

The implemented data structure is inspired by Google's flat_hash_map CppCon presentations (available on YouTube). The implementation draws from Google's open source raw_hash_set.h file at: https://github.com/abseil/abseil-cpp/blob/master/absl/container/internal.

Performance Caveats

bdlc::FlatHashSet is recommended for Intel platforms only (i.e., Linux and Windows, and pre-ARM Macs); on platforms using other processors (i.e., Sun and AIX), bdlc::FlatHashSet may have slower performance than bsl::unordered_set. However, note that bdlc::FlatHashSet will use significantly less memory than bsl::unordered_set on all platforms. Given the Intel-only performance caveat, it is recommended to benchmark before using bdlc::FlatHashSet – particularly on non-Intel production environments.

Interface Differences with bsl::unordered_set

A bdlc::FlatHashSet meets most of the requirements of an unordered associative container with forward iterators in the C++11 Standard [23.2.5]. It does not have the bucket interface, and locations of elements may change when the container is modified (and therefore iterators become invalid too). Allocator use follows BDE style, and the various allocator propagation attributes are not present (e.g., the allocator trait propagate_on_container_copy_assignment). The maximum load factor of the container (the ratio of size to capacity) is maintained by the container itself and is not settable (the maximum load factor is implementation defined and fixed).

Load Factor and Resizing

An invariant of bdlc::FlatHashSet is that 0 <= load_factor() <= max_load_factor() <= 1.0. Any operation that would result in load_factor() > max_load_factor() for a bdlc::FlatHashSet causes the capacity to increase. This resizing allocates new memory, copies or moves all elements to the new memory, and reclaims the original memory. The transfer of elements involves rehashing each element to determine its new location. As such, all iterators, pointers, and references to elements of the bdlc::FlatHashSet are invalidated on a resize.

Requirements on KEY, HASH, and EQUAL

The template parameter type KEY must be copy or move constructible. The template parameter types HASH and EQUAL must be default and copy constructible function objects.

HASH must support a function-call operator compatible with the following statements for an object key of type KEY:

HASH hash;
bsl::size_t result = hash(key);

EQUAL must support a function-call operator compatible with the following statements for objects key1 and key2 of type KEY:

EQUAL equal;
bool result = equal(key1, key2);

where the definition of the called function defines an equivalence relationship on keys that is both reflexive and transitive.

HASH and EQUAL function objects are further constrained: if the comparator determines that two values are equal, the hasher must produce the same hash value for each.

Iterator, Pointer, and Reference Invalidation

Any change in capacity of a bdlc::FlatHashSet invalidates all pointers, references, and iterators. A bdlc::FlatHashSet manipulator that erases an element invalidates all pointers, references, and iterators to the erased element.

Exception Safety

A bdlc::FlatHashSet is exception neutral, and all of the methods of bdlc::FlatHashSet provide the basic exception safety guarantee (see {bsldoc_glossary |Basic Guarantee}).

Move Semantics in C++03

Move-only types are supported by bdlc::FlatHashSet on C++11, and later, platforms only (where BSLMF_MOVABLEREF_USES_RVALUE_REFERENCES is defined), and are not supported on C++03 platforms. Unfortunately, in C++03, there are user-defined types where a bslmf::MovableRef will not safely degrade to an lvalue reference when a move constructor is not available (types providing a constructor template taking any type), so bslmf::MovableRefUtil::move cannot be used directly on a user-supplied template parameter type.

Usage

In this section we show intended use of this component.

Example 1: Categorizing Data

Suppose one is analyzing data on a set of customers, and each customer is categorized by several attributes: customer type, geographic area, and (internal) project code; and that each attribute takes on one of a limited set of values. This data can be handled by creating an enumeration for each of the attributes:

typedef enum {
e_REPEAT
, e_DISCOUNT
, e_IMPULSE
, e_NEED_BASED
, e_BUSINESS
, e_NON_PROFIT
, e_INSTITUTE
// ...
} CustomerCode;
typedef enum {
e_USA_EAST
, e_USA_WEST
, e_CANADA
, e_MEXICO
, e_ENGLAND
, e_SCOTLAND
, e_FRANCE
, e_GERMANY
, e_RUSSIA
// ...
} LocationCode;
typedef enum {
e_TOAST
, e_GREEN
, e_FAST
, e_TIDY
, e_PEARL
, e_SMITH
// ...
} ProjectCode;

The data set (randomly generated for this example) is provided in a statically initialized array:

static const struct CustomerProfile {
CustomerCode d_customer;
LocationCode d_location;
ProjectCode d_project;
} customerProfiles[] = {
{ e_IMPULSE , e_CANADA , e_SMITH },
{ e_NON_PROFIT, e_USA_EAST, e_GREEN },
{ e_INSTITUTE , e_USA_EAST, e_TOAST },
{ e_NON_PROFIT, e_CANADA , e_PEARL },
{ e_NEED_BASED, e_CANADA , e_FAST },
{ e_BUSINESS , e_ENGLAND , e_PEARL },
{ e_REPEAT , e_SCOTLAND, e_TIDY },
{ e_INSTITUTE , e_MEXICO , e_PEARL },
{ e_DISCOUNT , e_USA_EAST, e_GREEN },
{ e_BUSINESS , e_USA_EAST, e_GREEN },
{ e_IMPULSE , e_MEXICO , e_TOAST },
{ e_DISCOUNT , e_GERMANY , e_FAST },
{ e_INSTITUTE , e_FRANCE , e_FAST },
{ e_NON_PROFIT, e_ENGLAND , e_PEARL },
{ e_BUSINESS , e_ENGLAND , e_TIDY },
{ e_BUSINESS , e_CANADA , e_GREEN },
{ e_INSTITUTE , e_FRANCE , e_FAST },
{ e_IMPULSE , e_RUSSIA , e_TOAST },
{ e_REPEAT , e_USA_WEST, e_TOAST },
{ e_IMPULSE , e_CANADA , e_TIDY },
{ e_NON_PROFIT, e_GERMANY , e_GREEN },
{ e_INSTITUTE , e_USA_EAST, e_TOAST },
{ e_INSTITUTE , e_FRANCE , e_FAST },
{ e_IMPULSE , e_SCOTLAND, e_SMITH },
{ e_INSTITUTE , e_USA_EAST, e_PEARL },
{ e_INSTITUTE , e_USA_EAST, e_TOAST },
{ e_NON_PROFIT, e_ENGLAND , e_PEARL },
{ e_IMPULSE , e_GERMANY , e_FAST },
{ e_REPEAT , e_GERMANY , e_FAST },
{ e_REPEAT , e_MEXICO , e_PEARL },
{ e_IMPULSE , e_GERMANY , e_TIDY },
{ e_IMPULSE , e_MEXICO , e_TOAST },
{ e_NON_PROFIT, e_SCOTLAND, e_SMITH },
{ e_NEED_BASED, e_MEXICO , e_TOAST },
{ e_NON_PROFIT, e_FRANCE , e_SMITH },
{ e_INSTITUTE , e_MEXICO , e_TIDY },
{ e_NON_PROFIT, e_FRANCE , e_TIDY },
{ e_IMPULSE , e_FRANCE , e_FAST },
{ e_DISCOUNT , e_RUSSIA , e_TIDY },
{ e_IMPULSE , e_USA_EAST, e_TIDY },
{ e_IMPULSE , e_USA_WEST, e_FAST },
{ e_NON_PROFIT, e_FRANCE , e_TIDY },
{ e_BUSINESS , e_ENGLAND , e_GREEN },
{ e_REPEAT , e_FRANCE , e_TOAST },
{ e_REPEAT , e_RUSSIA , e_SMITH },
{ e_REPEAT , e_RUSSIA , e_GREEN },
{ e_IMPULSE , e_CANADA , e_FAST },
{ e_NON_PROFIT, e_USA_EAST, e_FAST },
{ e_NEED_BASED, e_USA_WEST, e_TOAST },
{ e_NON_PROFIT, e_GERMANY , e_TIDY },
{ e_NON_PROFIT, e_ENGLAND , e_GREEN },
{ e_REPEAT , e_GERMANY , e_PEARL },
{ e_NEED_BASED, e_USA_EAST, e_PEARL },
{ e_NON_PROFIT, e_RUSSIA , e_PEARL },
{ e_NEED_BASED, e_ENGLAND , e_SMITH },
{ e_INSTITUTE , e_CANADA , e_SMITH },
{ e_NEED_BASED, e_ENGLAND , e_TOAST },
{ e_NON_PROFIT, e_MEXICO , e_TIDY },
{ e_BUSINESS , e_GERMANY , e_FAST },
{ e_NEED_BASED, e_SCOTLAND, e_PEARL },
{ e_NON_PROFIT, e_USA_WEST, e_TIDY },
{ e_NON_PROFIT, e_USA_WEST, e_TOAST },
{ e_IMPULSE , e_FRANCE , e_PEARL },
{ e_IMPULSE , e_ENGLAND , e_FAST },
{ e_IMPULSE , e_USA_WEST, e_GREEN },
{ e_DISCOUNT , e_MEXICO , e_SMITH },
{ e_INSTITUTE , e_GERMANY , e_TOAST },
{ e_NEED_BASED, e_CANADA , e_PEARL },
{ e_NON_PROFIT, e_USA_WEST, e_FAST },
{ e_DISCOUNT , e_RUSSIA , e_SMITH },
{ e_INSTITUTE , e_USA_WEST, e_GREEN },
{ e_INSTITUTE , e_RUSSIA , e_TOAST },
{ e_INSTITUTE , e_FRANCE , e_SMITH },
{ e_INSTITUTE , e_SCOTLAND, e_SMITH },
{ e_NON_PROFIT, e_ENGLAND , e_PEARL },
{ e_NON_PROFIT, e_CANADA , e_SMITH },
{ e_NON_PROFIT, e_USA_EAST, e_TOAST },
{ e_REPEAT , e_FRANCE , e_TOAST },
{ e_NEED_BASED, e_FRANCE , e_FAST },
{ e_DISCOUNT , e_MEXICO , e_TOAST },
{ e_DISCOUNT , e_FRANCE , e_GREEN },
{ e_IMPULSE , e_USA_EAST, e_FAST },
{ e_REPEAT , e_USA_EAST, e_GREEN },
{ e_NON_PROFIT, e_GERMANY , e_GREEN },
{ e_INSTITUTE , e_CANADA , e_SMITH },
{ e_NEED_BASED, e_SCOTLAND, e_TOAST },
{ e_NEED_BASED, e_GERMANY , e_FAST },
{ e_NON_PROFIT, e_RUSSIA , e_TOAST },
{ e_BUSINESS , e_ENGLAND , e_PEARL },
{ e_NEED_BASED, e_USA_EAST, e_TOAST },
{ e_INSTITUTE , e_USA_EAST, e_SMITH },
{ e_DISCOUNT , e_USA_EAST, e_PEARL },
{ e_REPEAT , e_SCOTLAND, e_FAST },
{ e_IMPULSE , e_GERMANY , e_TIDY },
{ e_DISCOUNT , e_CANADA , e_TIDY },
{ e_IMPULSE , e_USA_EAST, e_TIDY },
{ e_IMPULSE , e_GERMANY , e_TIDY },
{ e_NON_PROFIT, e_ENGLAND , e_FAST },
{ e_NON_PROFIT, e_USA_WEST, e_TIDY },
{ e_REPEAT , e_MEXICO , e_TOAST },
};
const bsl::size_t numCustomerProfiles = sizeof customerProfiles
/ sizeof *customerProfiles;

Suppose, as the first step in our analysis, we wish to determine the number of unique combinations of customer attributes that exist in our data set. We can do that by inserting each data item into a flat hash set: the first insert of a combination will succeed, the others will fail, but at the end of the process, the set will contain one entry for every unique combination in our data.

First, as there are no standard methods for hashing or comparing our user-defined types, we define CustomerProfileHash and CustomerProfileEqual classes, each a stateless functor. Note that there is no meaningful ordering of the attribute values, they are merely arbitrary code numbers; nothing is lost by using an unordered set instead of an ordered set:

class CustomerProfileHash {
public:
// CREATORS
//! CustomerProfileHash() = default;
// Create a 'CustomerProfileHash' object.
//! CustomerProfileHash(const CustomerProfileHash& original) = default;
// Create a 'CustomerProfileHash' object. Note that as
// 'CustomerProfileHash' is an empty (stateless) type, this
// operation has no observable effect.
//! ~CustomerProfileHash() = default;
// Destroy this object.
// ACCESSORS
bsl::size_t operator()(const CustomerProfile& x) const;
// Return a hash value for the specified 'x'.
};

The hash function combines the several enumerated values from the class (each a small int value) into a single, unique int value, and then applies the default hash function for int.

// ACCESSORS
bsl::size_t CustomerProfileHash::operator()(const CustomerProfile& x) const
{
return bsl::hash<int>()( x.d_location * 100 * 100
+ x.d_customer * 100
+ x.d_project);
}
class CustomerProfileEqual {
public:
// CREATORS
//! CustomerProfileEqual() = default;
// Create a 'CustomerProfileEqual' object.
//! CustomerProfileEqual(const CustomerProfileEqual& original)
//! = default;
// Create a 'CustomerProfileEqual' object. Note that as
// 'CustomerProfileEqual' is an empty (stateless) type, this
// operation has no observable effect.
//! ~CustomerProfileEqual() = default;
// Destroy this object.
// ACCESSORS
bool operator()(const CustomerProfile& lhs,
const CustomerProfile& rhs) const;
// Return 'true' if the specified 'lhs' has the same value as the
// specified 'rhs', and 'false' otherwise.
};
// ACCESSORS
bool CustomerProfileEqual::operator()(const CustomerProfile& lhs,
const CustomerProfile& rhs) const
{
return lhs.d_location == rhs.d_location
&& lhs.d_customer == rhs.d_customer
&& lhs.d_project == rhs.d_project;
}
Definition bslstl_hash.h:498

Notice that many of the required methods of the hash and comparator types are compiler generated. (The declarations of those methods are commented out and suffixed by an = default comment.)

Then, we define the type of the flat hash set:

typedef bdlc::FlatHashSet<CustomerProfile,
CustomerProfileHash,
CustomerProfileEqual> ProfileCategories;
Definition bdlc_flathashset.h:564

Next, we create a flat hash set and insert each item of customerProfiles:

bslma::TestAllocator oa("object", veryVeryVeryVerbose);
ProfileCategories profileCategories(&oa);
for (bsl::size_t idx = 0; idx < numCustomerProfiles; ++idx) {
profileCategories.insert(customerProfiles[idx]);
}
assert(numCustomerProfiles >= profileCategories.size());
Definition bslma_testallocator.h:384

Notice that we ignore the status returned by the insert method. We fully expect some operations to fail.

Finally, the size of profileCategories matches the number of unique customer profiles in this data set:

if (verbose) {
bsl::cout << numCustomerProfiles << ' ' << profileCategories.size()
<< bsl::endl;
}

Standard output shows:

100 84