BDE 4.14.0 Production release
|
Modules | |
bslh_defaulthashalgorithm | |
Provide a reasonable hashing algorithm for default use. | |
bslh_defaultseededhashalgorithm | |
Provide a reasonable seeded hashing algorithm for default use. | |
bslh_fibonaccibadhashwrapper | |
Provide a wrapper to improve "bad" hash algorithms. | |
bslh_filesystem | |
Provide hash for std::filesystem::path . | |
bslh_hash | |
Provide a struct to run bslh hash algorithms on supported types. | |
bslh_hashoptional | |
Provide hashAppend for std::optional . | |
bslh_hashpair | |
Provide hashAppend for std::pair . | |
bslh_hashtuple | |
Provide hashAppend for std::tuple . | |
bslh_hashvariant | |
Provide hashAppend for std::variant . | |
bslh_seededhash | |
Provide a struct to run seeded bslh hash algorithms on types. | |
bslh_seedgenerator | |
Provide a class to generate arbitrary length seeds for algorithms. | |
bslh_siphashalgorithm | |
Provide an implementation of the SipHash algorithm. | |
bslh_spookyhashalgorithm | |
Provide an implementation of the SpookyHash algorithm. | |
bslh_spookyhashalgorithmimp | |
Provide BDE style encapsulation of 3rd party SpookyHash code. | |
bslh_wyhashincrementalalgorithm | |
Provide an implementation of the WyHash algorithm final v3. | |
Provide a framework for hashing types using swappable algorithms.
Basic Standard Library Hashing (bslh)
The 'bslh' package provides components for a more modular hashing implementation than is found in the standard. This implementation is based on ISO C++ Proposal N3980. An internal proposal for this is available at {TEAM BDE:MODULAR HASHING<GO>}. This package provides hashing algorithms as well as 'Hash' and 'SeededHash' structs which allow different algorithms to be applied to any type that has a 'hashAppend' function. This document will explain the overall benefits of the system, what type implementers need to do to make their types hashable in this system, what type users need to do to hash different types, how to extend different pieces of this system, and will provide a summary of all the components in this package. All sections are independent and can be read and used without having read the other sections.
Changing one bit in the input to the hashing algorithm results in an "avalanche", which causes each output bit to have a 50% probability of changing. The avalanche property means that two very similar values will produce completely dissimilar hashes.
Within the context of hash tables, Denial of Service (DoS) attacks refer to an attacker causing many hash table keys to collide to the same bucket. These collisions cause look-ups in the hash table to become very time consuming linear searches.
An undesirable property of hashing algorithms that results in a large number of collisions when the inputs differ by only a few bits. Funneling is related to the avalanche property of a hashing algorithm and is essentially the opposite of avalanche. More information can be found at 'http://burtleburtle.net/bob/hash/evahash.html'.
A property of attributes (or fields) of a type. An attribute is considered salient to hashing if it should be incorporated into the bytes that are used to produce a hash of a given object. As a general rule is that every attribute used in 'operator==' is usually salient to hashing. This term is explored more in depth later.
There are numerous benefits to both type creators and users in this new system:
This modular hashing system allows users to use known good hashing algorithms to avoid performance issues. The graph below depicts 'unordered_map' (expected case O(1)) taking longer to find elements that map. This is a real-world benchmark. Internal users can read more about these tests at {TEAM BDE:FLAT MAP<GO>}. The anomalous behavior arose from the use of a poorly written, non-standard, hashing algorithm, which caused similar strings to hash to the same value resulting in many collisions:
In the standard C++03 hashing system, hashing algorithms were often copied into the 'bsl::hash' template specialization on each type. This resulted in lots of code duplication. In the new system, algorithms are not implemented directly on the type, so no algorithm duplication occurs. The little bit of code that is written on the type is specific to that type and wouldn't be copied elsewhere.
In the standard C++03 hashing system, type implementers had to worry about writing a hashing algorithm for their type. This required figuring out what constituted a good hash value and how to generate one with a given set of data members. Thinking about what makes a good hash is no longer the job of the type implementer. They simply have to declare what data members they want to contribute the hash and then they are done.
In the standard C++03 hashing system, if a type needed to use more than one data member in its hash calculation, there was no proper way for them to combine the hashes from multiple data members into one final hash. Poor methods of hash combining such as XORing could result in huge numbers of collisions. The new, modular hashing system offers 'hashAppend' to combine an unlimited number of data members into one, good, hash.
In the standard C++03 hashing system, anyone who wanted to hash a type had to use the algorithm written by the type implementer, or they had to write their own algorithm in a functor (without access to private data members of course). Unfortunately, hashing algorithms are not one size fits all. In some cases, a fast identity hash is fine. In other cases a slower hash that is secure against Denial of Service (DoS) attacks is required. The new, modular hashing system offers a suite of hashing algorithms that have been vetted and are know to have good characteristics for different situations. Swapping out one algorithm for another only requires a type's users to change one line of code, as shown here:
It is the type implementer's job to implement 'hashAppend' (explained below) on their type, much like they have to implement 'swap' on their type. It is important to note that implementing 'hashAppend' on 'MyType' is fully backward compatible with 'bsl::hash<MyType>'. That means that when you implement 'hashAppend' on 'MyNewType', 'bsl::hash<MyNewType>' will automatically pick it up without you ever having to write your own template specialization. For existing types that already have a 'bsl::hash<MyExistingType>' specialization, it is recommended that type implementers delete 'bsl::hash<MyExistingType>' once they have implemented 'hashAppend'. Deleting the 'bsl::hash' template specialization for 'MyExistingType' will not break any code because 'bsl::hash<TYPE>' automatically redirects to 'bslh::Hash<>'. Note that deleting the 'bsl::hash' template specialization for 'MyExistingType' will cause the hash value returned by calls to 'bsl::hash<MyExistingType>' to change, but this is explicitly allowed in the 'bsl::hash' contract.
The fundamental piece of this system, at least for type implementers, is the 'hashAppend' free function. What this free function does is pass the data members of a class which need to be hashed, into a hashing algorithm. It removes the need for the type implementer to actually write the hashing algorithm. An example implementation can be seen below:
A few key features of the 'hashAppend' fuction:
This is the extent of the work for a type implementer. Once a type implementer knows what members need to contribute to a hash value, the type implementer simply calls 'hashAppend' on members as shown above. Those members will then be passed into and used by whatever algorithm a consumer of the type wants to apply.
Type implementers should be familiar with the rules for hash functions. There are two main rules:
For example, if the first rule is violated, 'unordered_map' will encounter runtime errors. If the second rule is violated, collisions will occur and the performance of 'unordered_map' will suffer.
Since the rule about hashing are predicated on 'operator==', the easiest way to determine what to hash is to look at 'operator=='. For example, lets reexamine 'operator==' from our earlier code sample:
In order to ensure that the first rule of hashing if followed, we must only include data members in our hash if one of the following is true:
If we do not meet any of the above criteria, we open ourselves to the possibility that two objects that compare equal will hash to different values. If two equal objects hash to different values, hash-based data structures ('unordered_map' being the primary concern) will break. And example of something not to include in 'hashAppend' would be the 'capacity()' of a vector.
In order to make the second rule of hashing remain true, we should include everything that appears in 'operator==' in our hash. The more data we can put into the hashing algorithm, the higher the chances of us getting unique outputs. By following these suggestions, we produced the 'hashAppend' function from our earlier code example:
It is worth noting that sometimes even data members that don't actually contribute entropy can still help us produce a more unique unique outputs. For example consider 'vector<vector<int> >'. If we just hashed the elements in the vector, then an empty 'vector<vector<int> >' would generate the same hash value as a 'vector<vector<int> >' containing one (or more) empty 'vector<int>' objects. This violates the second rule of hashing (above). By also passing 'vector.length()' (which is used in equality) into the algorithm, the two vectors will hash to different values.
As we can see in code sample above showing 'hashAppend', 'hashAppend' is called on a data member of the user-defined type, 'Point'. This code will not compile if 'Point' is a type for which 'hashAppend' has not been implemented. The best route to take is to have the creator of 'Point' go back and add a 'hashAppend' function. If the creator, or somebody knowledgeable about the type, is not available to add the 'hashAppend' function, there is a work around. Assuming 'Point' supports the old system and has a 'bsl::hash<Point>' specialization, you can just hash the 'Point' and then pass the resulting 'size_t' into 'hashAppend' just like you would with any other integer data member. The following code sample shows how we can modify 'Box's 'hashAppend' function to handle 'Point' without a 'hashAppend' free function for 'Point':
In the above code sample, the first call to 'hashAppend' is now effectively calling 'hashAppend' on an integer type (the resulting hash from 'bsl::hash'), rather than on 'Point'. This trick allows new development to use this modular hashing system without having to wait for existing code to be upgraded.
Pointers, particularly C-Strings (in the 'const char *' form), are an unfortunate exception in hashing. Because the standard mandates that we must be able to hash pointers, there is no way to distinguish a null terminated 'const char *' (C-String) from a regular pointer to a 'char'. Because of this, hashing C-Strings has slightly different semantics. Instead of calling 'hashAppend' on the pointer, we must pass our C-String directly into the hashing algorithm functor. All of the hashing algorithm functors in 'bslh' have the function signature shown below:
As we can see, the hashing algorithm functor takes a pointer to the start of the data and a length in bytes. To hash a C-String, we call the hashing algorithm functor with our pointer and the length of the C-String which we have stored or pre-calculated. An example of this is the 'hashAppend' function for string, shown here:
This technique can be applied to any case where you want to hash contiguous data. It is especially important that the your data is completely contiguous, because padding bits and the like could result in the same data hashing to different values, which violates the first rule of hashing.
The same process as above should be followed when implementing 'hashAppend' on user defined types which already specialize 'bsl::hash'. One extra step that is required is the 'bsl::hash' template specialization must be removed once 'hashAppend' has been implemented. 'unordered_map's will still function after the 'bsl::hash' specialization has been removed, since 'bsl::hash<TYPE>' automatically calls 'bslh::Hash<>' when no specialization exists. Note that deleting the 'bsl::hash' template specialization will cause the hash value returned by 'bsl::hash<YourType>' to change, however, this is explicitly allowed by the contract of 'bsl::hash'.
The primary use case for 'bslh::Hash' is producing hash values for hash tables ('unordered_map's), so we will be focusing on that for the next example, but this section does apply generally to any use of the modular hashing system. The basic background knowledge required for this section is:
Beyond this basic summary, some knowledge of how 'bslh::Hash' and 'bsl::hash' interact is required for type users to get the most out of this system.
Unfortunately, 'bsl::hash' is impossible to completely escape, even with the modular hashing system. The standard mandates that 'unordered_map's with a key of 'SomeType' will call 'bsl::hash<SomeType>' by default. We have done our best to make 'bsl::hash' and 'bslh::Hash' interact nicely in most scenarios, and the section below shows what behavior can be expected from calling 'bsl::hash<SomeType>'.
Note that when 'bsl::hash<SomeType>' redirects to 'bslh::Hash<>', 'bslh::Hash<>' is always using the defualt hashing algorithm. This is fine for most use cases, but if we need to use a special algorithm, such as a secure one to prevent Denial of Service (DoS) attacks in a hash table, we must directly use 'bslh::Hash' in order to swap out the algorithm template parameter.
There are various algorithms that can be swapped into 'bslh::Hash' as template parameters. Algorithms such as SipHash and SpookyHash ('bslh::SipHashAlgorithm' and 'bslh::SpookyHashAlgorithm' respectively) are implemented and can be swapped into 'bslh::Hash'. There are also a number of wrapper classes such as 'bslh::DefaultHashAlgorithm' and 'balh::DefaultSeededHashAlgorithm' which are named to allow you to pick them based on what you need, meaning you don't need an in depth knowledge of the individual hashing algorithms. The usage of these algorithms can be seen below:
Some hashing algorithms, such as 'bslh::SipHashAlgorithm', require seeds to function. 'bslh::SipHashAlgorithm' was designed by its creators to provide protection against hash table DoS attacks. In order to get the most protection, 'bslh::SipHashAlgorithm' requires a cryptographically secure random number in order to produce hashes that will be secure against an attacker. 'bslh::SeededHash' and 'bslh::SeedGenerator' exist to facilitate passing seeds into hashing algorithms.