BDE 4.14.0 Production release
Loading...
Searching...
No Matches
Package bslh

Modules

 bslh_defaulthashalgorithm
 Provide a reasonable hashing algorithm for default use.
 
 bslh_defaultseededhashalgorithm
 Provide a reasonable seeded hashing algorithm for default use.
 
 bslh_fibonaccibadhashwrapper
 Provide a wrapper to improve "bad" hash algorithms.
 
 bslh_filesystem
 Provide hash for std::filesystem::path.
 
 bslh_hash
 Provide a struct to run bslh hash algorithms on supported types.
 
 bslh_hashoptional
 Provide hashAppend for std::optional.
 
 bslh_hashpair
 Provide hashAppend for std::pair.
 
 bslh_hashtuple
 Provide hashAppend for std::tuple.
 
 bslh_hashvariant
 Provide hashAppend for std::variant.
 
 bslh_seededhash
 Provide a struct to run seeded bslh hash algorithms on types.
 
 bslh_seedgenerator
 Provide a class to generate arbitrary length seeds for algorithms.
 
 bslh_siphashalgorithm
 Provide an implementation of the SipHash algorithm.
 
 bslh_spookyhashalgorithm
 Provide an implementation of the SpookyHash algorithm.
 
 bslh_spookyhashalgorithmimp
 Provide BDE style encapsulation of 3rd party SpookyHash code.
 
 bslh_wyhashincrementalalgorithm
 Provide an implementation of the WyHash algorithm final v3.
 

Detailed Description

Purpose

Provide a framework for hashing types using swappable algorithms.

Mnemonic

Basic Standard Library Hashing (bslh)

Description

The 'bslh' package provides components for a more modular hashing implementation than is found in the standard. This implementation is based on ISO C++ Proposal N3980. An internal proposal for this is available at {TEAM BDE:MODULAR HASHING<GO>}. This package provides hashing algorithms as well as 'Hash' and 'SeededHash' structs which allow different algorithms to be applied to any type that has a 'hashAppend' function. This document will explain the overall benefits of the system, what type implementers need to do to make their types hashable in this system, what type users need to do to hash different types, how to extend different pieces of this system, and will provide a summary of all the components in this package. All sections are independent and can be read and used without having read the other sections.

Table of Contents

Terminology

Avalanche

Changing one bit in the input to the hashing algorithm results in an "avalanche", which causes each output bit to have a 50% probability of changing. The avalanche property means that two very similar values will produce completely dissimilar hashes.

Denial of Service (DoS)

Within the context of hash tables, Denial of Service (DoS) attacks refer to an attacker causing many hash table keys to collide to the same bucket. These collisions cause look-ups in the hash table to become very time consuming linear searches.

Funneling

An undesirable property of hashing algorithms that results in a large number of collisions when the inputs differ by only a few bits. Funneling is related to the avalanche property of a hashing algorithm and is essentially the opposite of avalanche. More information can be found at 'http://burtleburtle.net/bob/hash/evahash.html'.

Salient to Hashing

A property of attributes (or fields) of a type. An attribute is considered salient to hashing if it should be incorporated into the bytes that are used to produce a hash of a given object. As a general rule is that every attribute used in 'operator==' is usually salient to hashing. This term is explored more in depth later.

Why Use This System

There are numerous benefits to both type creators and users in this new system:

Better and More Predictable Performance

This modular hashing system allows users to use known good hashing algorithms to avoid performance issues. The graph below depicts 'unordered_map' (expected case O(1)) taking longer to find elements that map. This is a real-world benchmark. Internal users can read more about these tests at {TEAM BDE:FLAT MAP<GO>}. The anomalous behavior arose from the use of a poorly written, non-standard, hashing algorithm, which caused similar strings to hash to the same value resulting in many collisions:

Time to Find a pair<string, string> in map vs unordered_map
18 |
| O = map X = unordered_map
16 |
T | X
i 14 | |
m | /
e 12 | /-O
| /
i 10 | - /
n | - /
8 | O /
S | -- /
e 6 | - /
c | -- /
o 4 | ----O X
n | ----O--- --
d 2 | ---O----- ----
s | XO-------XO-------X--------X--------X
|____________________________________________________________________
| | | | | | |
1 10 100 1000 10000 100000 1000000
Elements in Collection

Less Code Duplication

In the standard C++03 hashing system, hashing algorithms were often copied into the 'bsl::hash' template specialization on each type. This resulted in lots of code duplication. In the new system, algorithms are not implemented directly on the type, so no algorithm duplication occurs. The little bit of code that is written on the type is specific to that type and wouldn't be copied elsewhere.

Easier to Make Types Hashable

In the standard C++03 hashing system, type implementers had to worry about writing a hashing algorithm for their type. This required figuring out what constituted a good hash value and how to generate one with a given set of data members. Thinking about what makes a good hash is no longer the job of the type implementer. They simply have to declare what data members they want to contribute the hash and then they are done.

Hash Combining

In the standard C++03 hashing system, if a type needed to use more than one data member in its hash calculation, there was no proper way for them to combine the hashes from multiple data members into one final hash. Poor methods of hash combining such as XORing could result in huge numbers of collisions. The new, modular hashing system offers 'hashAppend' to combine an unlimited number of data members into one, good, hash.

Swappable Algorithms

In the standard C++03 hashing system, anyone who wanted to hash a type had to use the algorithm written by the type implementer, or they had to write their own algorithm in a functor (without access to private data members of course). Unfortunately, hashing algorithms are not one size fits all. In some cases, a fast identity hash is fine. In other cases a slower hash that is secure against Denial of Service (DoS) attacks is required. The new, modular hashing system offers a suite of hashing algorithms that have been vetted and are know to have good characteristics for different situations. Swapping out one algorithm for another only requires a type's users to change one line of code, as shown here:

// Uses implicitly 'DefaultHashAlgorithm'.
// Only a single line change is required to use 'SpookyHashAlgorithm'
int,
Definition bslstl_unorderedmap.h:1089
Definition bslh_hash.h:580

Type Implementers

It is the type implementer's job to implement 'hashAppend' (explained below) on their type, much like they have to implement 'swap' on their type. It is important to note that implementing 'hashAppend' on 'MyType' is fully backward compatible with 'bsl::hash<MyType>'. That means that when you implement 'hashAppend' on 'MyNewType', 'bsl::hash<MyNewType>' will automatically pick it up without you ever having to write your own template specialization. For existing types that already have a 'bsl::hash<MyExistingType>' specialization, it is recommended that type implementers delete 'bsl::hash<MyExistingType>' once they have implemented 'hashAppend'. Deleting the 'bsl::hash' template specialization for 'MyExistingType' will not break any code because 'bsl::hash<TYPE>' automatically redirects to 'bslh::Hash<>'. Note that deleting the 'bsl::hash' template specialization for 'MyExistingType' will cause the hash value returned by calls to 'bsl::hash<MyExistingType>' to change, but this is explicitly allowed in the 'bsl::hash' contract.

hashAppend

The fundamental piece of this system, at least for type implementers, is the 'hashAppend' free function. What this free function does is pass the data members of a class which need to be hashed, into a hashing algorithm. It removes the need for the type implementer to actually write the hashing algorithm. An example implementation can be seen below:

namespace BloombergLP {
namespace NamespaceForBoxes {
class Box {
// A value semantic type that represents a box drawn on to a Cartesian
// plane.
private:
Point d_position;
int d_length;
int d_width;
public:
Box(Point position, int length, int width);
// Create a box with the specified 'length' and 'width', with its
// upper left corner at the specified 'position'
friend bool operator==(const Box &left, const Box &right);
template <class HASH_ALGORITHM>
friend
void hashAppend(HASH_ALGORITHM &hashAlg, const Box &box);
// Apply the specified 'hashAlg' to the specified 'box'
};
Box::Box(Point position, int length, int width)
: d_position(position)
, d_length(length)
, d_width(width)
{
}
bool operator==(const Box &left, const Box &right)
{
return (left.d_position == right.d_position)
&& (left.d_length == right.d_length)
&& (left.d_width == right.d_width);
}
template <class HASH_ALGORITHM>
void hashAppend(HASH_ALGORITHM &hashAlg, const Box &box)
{
hashAppend(hashAlg, box.d_position);
hashAppend(hashAlg, box.d_length);
hashAppend(hashAlg, box.d_width);
}
} // close package namespace
} // close enterprise namespace
bool operator==(const FileCleanerConfiguration &lhs, const FileCleanerConfiguration &rhs)
void hashAppend(HASH_ALGORITHM &hashAlg, const baljsn::EncoderTestAddress &object)
Definition baljsn_encoder_testtypes.h:9236
bsl::enable_if<(bsl::is_integral< TYPE >::value||bsl::is_pointer< TYPE >::value||bsl::is_enum< TYPE >::value)&&!bsl::is_same< TYPE, bool >::value >::type hashAppend(HASH_ALGORITHM &hashAlg, TYPE input)
Definition bslh_hash.h:638

A few key features of the 'hashAppend' fuction:

  1. 'hashAppend' is a free function that can be picked up through argument dependent look-up (ADL).
    1. In order to ensure 'hashAppend' can be found through ADL, the function must begin with 'using bslh::hashAppend'. Note that most of the time this isn't required because most 'HASH_ALGORITHM's will be implemented in 'bslh' and thus ADL will already be looking in 'bslh'. The using statement is still required, however, to support algorithms that are implemented outside of 'blsh'.
  2. 'hashAppend' is defined the header file of the type for which it is implemented.
  3. If 'hashAppend' requires access to private data, it is declared as a friend to the type which it is hashing.
  4. 'hashAppend' accepts a reference to a templated hashing algorithm functor, and a 'const' reference to the type which it is hashing.
  5. 'hashAppend' recursively calls 'hashAppend' on each of the salient-attributes of the type that it is hashing, propagating the hash algorithm functor with each invocation.

This is the extent of the work for a type implementer. Once a type implementer knows what members need to contribute to a hash value, the type implementer simply calls 'hashAppend' on members as shown above. Those members will then be passed into and used by whatever algorithm a consumer of the type wants to apply.

Determining what to Hash

Type implementers should be familiar with the rules for hash functions. There are two main rules:

  1. If 'x == y', then both 'x' and 'y' shall produce the same hash.
  2. If 'x != y', then 'x' and 'y' should not produce the same hash.

For example, if the first rule is violated, 'unordered_map' will encounter runtime errors. If the second rule is violated, collisions will occur and the performance of 'unordered_map' will suffer.

Since the rule about hashing are predicated on 'operator==', the easiest way to determine what to hash is to look at 'operator=='. For example, lets reexamine 'operator==' from our earlier code sample:

bool operator==(const Box &left, const Box &right)
{
return (left.d_position == right.d_position)
&& (left.d_length == right.d_length)
&& (left.d_width == right.d_width);
}

In order to ensure that the first rule of hashing if followed, we must only include data members in our hash if one of the following is true:

If we do not meet any of the above criteria, we open ourselves to the possibility that two objects that compare equal will hash to different values. If two equal objects hash to different values, hash-based data structures ('unordered_map' being the primary concern) will break. And example of something not to include in 'hashAppend' would be the 'capacity()' of a vector.

In order to make the second rule of hashing remain true, we should include everything that appears in 'operator==' in our hash. The more data we can put into the hashing algorithm, the higher the chances of us getting unique outputs. By following these suggestions, we produced the 'hashAppend' function from our earlier code example:

template <class HASH_ALGORITHM>
void hashAppend(HASH_ALGORITHM &hashAlg, const Box &box)
{
hashAppend(hashAlg, box.d_position);
hashAppend(hashAlg, box.d_length);
hashAppend(hashAlg, box.d_width);
}

It is worth noting that sometimes even data members that don't actually contribute entropy can still help us produce a more unique unique outputs. For example consider 'vector<vector<int> >'. If we just hashed the elements in the vector, then an empty 'vector<vector<int> >' would generate the same hash value as a 'vector<vector<int> >' containing one (or more) empty 'vector<int>' objects. This violates the second rule of hashing (above). By also passing 'vector.length()' (which is used in equality) into the algorithm, the two vectors will hash to different values.

Hashing Other User Defined Types

As we can see in code sample above showing 'hashAppend', 'hashAppend' is called on a data member of the user-defined type, 'Point'. This code will not compile if 'Point' is a type for which 'hashAppend' has not been implemented. The best route to take is to have the creator of 'Point' go back and add a 'hashAppend' function. If the creator, or somebody knowledgeable about the type, is not available to add the 'hashAppend' function, there is a work around. Assuming 'Point' supports the old system and has a 'bsl::hash<Point>' specialization, you can just hash the 'Point' and then pass the resulting 'size_t' into 'hashAppend' just like you would with any other integer data member. The following code sample shows how we can modify 'Box's 'hashAppend' function to handle 'Point' without a 'hashAppend' free function for 'Point':

template <class HASH_ALGORITHM>
void hashAppend(HASH_ALGORITHM &hashAlg, const Box &box)
{
hashAppend(hashAlg, bsl::hash<Point>()(box.d_position));
hashAppend(hashAlg, box.d_length);
hashAppend(hashAlg, box.d_width);
}
Definition bslstl_hash.h:498

In the above code sample, the first call to 'hashAppend' is now effectively calling 'hashAppend' on an integer type (the resulting hash from 'bsl::hash'), rather than on 'Point'. This trick allows new development to use this modular hashing system without having to wait for existing code to be upgraded.

Hashing Pointers (especially const char )

Pointers, particularly C-Strings (in the 'const char *' form), are an unfortunate exception in hashing. Because the standard mandates that we must be able to hash pointers, there is no way to distinguish a null terminated 'const char *' (C-String) from a regular pointer to a 'char'. Because of this, hashing C-Strings has slightly different semantics. Instead of calling 'hashAppend' on the pointer, we must pass our C-String directly into the hashing algorithm functor. All of the hashing algorithm functors in 'bslh' have the function signature shown below:

void operator()(const char *data, size_t length);
// Incorporates the specified 'data' of 'length' bytes into the internal
// state of the hashing algorithm.

As we can see, the hashing algorithm functor takes a pointer to the start of the data and a length in bytes. To hash a C-String, we call the hashing algorithm functor with our pointer and the length of the C-String which we have stored or pre-calculated. An example of this is the 'hashAppend' function for string, shown here:

template <class HASHALG,
class CHAR_TYPE,
class CHAR_TRAITS,
class ALLOCATOR>
void hashAppend(HASHALG& hashAlg,
const basic_string<CHAR_TYPE,
CHAR_TRAITS,
ALLOCATOR>& input)
{
hashAlg(input.data(), sizeof(CHAR_TYPE)*input.size());
}

This technique can be applied to any case where you want to hash contiguous data. It is especially important that the your data is completely contiguous, because padding bits and the like could result in the same data hashing to different values, which violates the first rule of hashing.

Converting Existing Types

The same process as above should be followed when implementing 'hashAppend' on user defined types which already specialize 'bsl::hash'. One extra step that is required is the 'bsl::hash' template specialization must be removed once 'hashAppend' has been implemented. 'unordered_map's will still function after the 'bsl::hash' specialization has been removed, since 'bsl::hash<TYPE>' automatically calls 'bslh::Hash<>' when no specialization exists. Note that deleting the 'bsl::hash' template specialization will cause the hash value returned by 'bsl::hash<YourType>' to change, however, this is explicitly allowed by the contract of 'bsl::hash'.

Type Users

The primary use case for 'bslh::Hash' is producing hash values for hash tables ('unordered_map's), so we will be focusing on that for the next example, but this section does apply generally to any use of the modular hashing system. The basic background knowledge required for this section is:

Beyond this basic summary, some knowledge of how 'bslh::Hash' and 'bsl::hash' interact is required for type users to get the most out of this system.

bsl::hash

Unfortunately, 'bsl::hash' is impossible to completely escape, even with the modular hashing system. The standard mandates that 'unordered_map's with a key of 'SomeType' will call 'bsl::hash<SomeType>' by default. We have done our best to make 'bsl::hash' and 'bslh::Hash' interact nicely in most scenarios, and the section below shows what behavior can be expected from calling 'bsl::hash<SomeType>'.

Note that when 'bsl::hash<SomeType>' redirects to 'bslh::Hash<>', 'bslh::Hash<>' is always using the defualt hashing algorithm. This is fine for most use cases, but if we need to use a special algorithm, such as a secure one to prevent Denial of Service (DoS) attacks in a hash table, we must directly use 'bslh::Hash' in order to swap out the algorithm template parameter.

bslh::Hash

There are various algorithms that can be swapped into 'bslh::Hash' as template parameters. Algorithms such as SipHash and SpookyHash ('bslh::SipHashAlgorithm' and 'bslh::SpookyHashAlgorithm' respectively) are implemented and can be swapped into 'bslh::Hash'. There are also a number of wrapper classes such as 'bslh::DefaultHashAlgorithm' and 'balh::DefaultSeededHashAlgorithm' which are named to allow you to pick them based on what you need, meaning you don't need an in depth knowledge of the individual hashing algorithms. The usage of these algorithms can be seen below:

// Implicitly uses 'bsl::hash<MyType>', may or may not redirect to
// 'bslh::Hash<>'.
// Implicitly uses 'bslh::DefaultHashAlgorithm', which redirects to the
// current best default algorithm for hashing.
// Explicitly uses 'bslh::DefaultHashAlgorithm', which redirects to the
// current best default algorithm for hashing.
unorderedMap;
// Explicitly uses 'bslh::SpookyHashAlgorithm', one of the algorithms
// available in 'bslh'.
unorderedMap;

bslh::SeededHash, bslh::SeedGenerator, and Secure Hashing

Some hashing algorithms, such as 'bslh::SipHashAlgorithm', require seeds to function. 'bslh::SipHashAlgorithm' was designed by its creators to provide protection against hash table DoS attacks. In order to get the most protection, 'bslh::SipHashAlgorithm' requires a cryptographically secure random number in order to produce hashes that will be secure against an attacker. 'bslh::SeededHash' and 'bslh::SeedGenerator' exist to facilitate passing seeds into hashing algorithms.