Outline
Purpose
Provide a utility of hash functions.
Classes
- See also
- bdlc_hashtable2, bdeimp_inthash, bdeimp_strhash, bdlde_crc32
Description
This component defines a utility struct
, bdlb::HashUtil
, that provides two hash functions, hash1
and hash2
. These hash functions have a good funneling property, which can be formalized as documented below (see {Definitions and Formal Statement of No-Funneling Property}). In short, these hashes have the property that even if two inputs differ in only a few bits, they have a negligible probability of producing collisions. In addition, these two hash functions produce full 32-bit results and allow hash table sizes to be a power of two. They are fast and reliable. They should work equally well on all types of keys.
The two hash functions hash1
and hash2
have similar characteristics. hash2
is a bit faster than hash1
, especially for longer keys; it also does not produce many collisions as determined experimentally for the key types given in the example below. On the other hand, it does not have the formal no-funnel guarantees of hash1
which is reasonably fast and provably achieves low collision probability.
Furthermore, this component provides a third hash function hash0
that performs a simple hash by taking the modulus of a user specified size on the input key. This hash function has a poor distribution, as the function exhibits a clustering effect when the input keys only differ slightly. However, hash0
is about 6 to 8 times faster than hash1
and hash2
.
In general, use hash0
when speed matters, and hash1
or hash2
when a more thorough hash is needed (e.g., keys exhibit a pattern that results in many collision when using bdeimp
, or cost of collision is high but hashing is not a bottleneck).
Hashing Fundamental Types
In order to facilitate hashing user-defined types (see next section), this component supports hashing fundamental types in a manner that is both seemingly random (i.e., the hash is good) but predictable (i.e., identical on all platforms, irrespective of endianness). Note that the following expressions for some key
of a fundamental integral type:
static unsigned int hash2(const char *data, int length)
static unsigned int hash1(const char *data, int length)
return values that differ on big- and little-endian platforms if the key size is more than one byte. Instead, the recommended way to hash this key
is simply to:
This works if key
is of one of the following fundamental types:
char, signed char, unsigned char,
short, unsigned short,
int, unsigned int,
long, unsigned long,
float, double,
const void *
unsigned long long Uint64
Definition bsls_types.h:137
long long Int64
Definition bsls_types.h:132
Hashing User-defined Types
This component can be used for hashing a user-defined type as well. The recommended way to do this is to provide a hash
class method for the type, that takes an instance of the class and a size. For instance, let us consider the following class (the meaning and implementation are irrelevant here) with the following private data members:
class UserDefinedTickerType {
enum { CUSIP_LENGTH = 10 };
bool d_isConstant;
char d_cusip[CUSIP_LENGTH];
int d_value;
A hash class method can be provided and documented as follows:
public:
static int hash(const UserDefinedTickerType& object, int size);
};
The implementation of hash
can be very simple, depending on the semantics of the class. Here we assume that d_isConstant
is an implementation detail and that data members d_cusip
and d_value
should participate in the hash and that the character string should be hashed by value. The implementation follows:
int UserDefinedTickerType::hash(const UserDefinedTickerType& object,
int size)
{
}
Note that more intricate combinations of the individual hashes in the aggregate can be appropriate if the individual members in the aggregate my exhibit a pattern.
CAVEAT: When hashing user-defined classes, because of the potential padding in the class footprint, the simple-minded hashes
const UserDefinedTickerType& key;
may actually not produce the same hash values for copies of the same object.
Definitions and Formal Statement of No-Funneling Property
Let us say a given bit in the input affects a given bit in the output (which has v
bits) if two keys differing only at that input bit will differ at the given output bit about half the time. We define a hash to be a funneling hash if there is some subset of size t
of all the input bits that can affect only a subset of size u
of bits in the output, and t > u
and v > u
, and we say that the hash function has a funnel of those t
input bits into those u
bits. In that case, then u
of those t
bits can cancel out the effects of the other t - u
, and so the set of keys differing only in the input bits of the funnel can produce no more than half that number of hash values. (Those 2^t
keys can produce no more than 2^u
out of 2^v
hash values.) Differing in only a few bits is a common pattern in human and computer keys, so a funneling hash is seriously flawed.
This component offers a hash function (hash1
) that has no funnels, except for a funnel of 32 bits into 31 bits, which is a negligible vulnerability. The hash2
does not have known no-funneling properties, but operates on the same principle and is just as good (and faster) in experiments with chaining as in the usage below. In addition, these two hash functions produce full 32-bit results and allow hash table sizes to be a power of two.
For further details and references, the interested reader is invited to consult http://burtleburtle.net/bob/hash/evahash.html
.
Usage
This section illustrates intended use of this component.
Examples Introduction
We illustrate the usage of this component by some experiments (code and results) on the number of collision expected for various usage. For returning the results of an experiment, we will use the following structure:
struct ExperimentalResult {
int d_max;
double d_average;
double d_sigma;
explicit
ExperimentalResult(int max = 0, double avg = 0, double sigma = 0)
: d_max(max), d_average(avg), d_sigma(sigma)
{}
};
For generating the data, we use a parameterized GENERATOR
of which we give three implementations below (sequential integers, character strings, and multiple-field keys). This template parameter needs to validate the following syntactic expressions, which we call in this usage example our GeneratorConcept
. In those requirements, generator
is an instance of GENERATOR
.
const char *data = generator.data();
int length = generator.length();
generator.next();
Example 1: Using Chaining
The following code will check the number of collisions for a chaining-based hash table given a certain distribution of keys:
template <class GENERATOR>
ExperimentalResult computeChainingCollisions(int numElements,
int size,
GENERATOR input)
{
for (int i = 0; i < numElements; ++i, input.next()) {
unsigned int hash = Util::hash1(input.data(),
input.length());
++table[hash % size];
}
double avgLength = 0;
double sigmaLength = 0;
int maxLength = 0;
for (int k = 0; k < size; ++k) {
avgLength += table[k];
sigmaLength += table[k] * table[k];
maxLength = bsl::max(maxLength, table[k]);
}
sigmaLength = bsl::sqrt(sigmaLength / size - avgLength * avgLength);
return ExperimentalResult(maxLength, avgLength, sigmaLength);
}
Definition bslstl_vector.h:1025
bsl::size_t size(const TYPE &array)
Return the number of elements in the specified array.
Example 2: Using Double-Hashing
The following code will check the number of collisions for a double-hashing based hash table (such as bdlc_hashtable2 ) given a certain distribution of keys:
template <class GENERATOR>
ExperimentalResult computeDoubleHashingCollisions(int numElements,
int size,
GENERATOR input)
{
typedef typename GENERATOR::ResultType ResultType;
double avgLength = 0;
double sigmaLength = 0;
int maxLength = 0;
for (int i = 0; i < numElements; ++i, input.next()) {
unsigned int hash1 = Util::hash1(input.data(),
input.length());
int chainLength = 0;
int bucket = hash1 % size;
if (ResultType() == table[bucket]) {
table[bucket] = input.current();
} else {
unsigned int hash2 = Util::hash2(input.data(),
input.length());
hash2 = 1 + hash2 % (size - 1);
while (++chainLength < size) {
bucket = (bucket + hash2) % size;
if (ResultType() == table[bucket]) {
table[bucket] = input.current();
break;
}
}
}
if (chainLength == size) {
bsl::cerr << "\tCould not insert in doubly-hashed table\n";
avgLength = bsl::numeric_limits<double>::infinity();
sigmaLength = bsl::numeric_limits<double>::infinity();
maxLength = static_cast<int>(
bsl::numeric_limits<double>::infinity());
return ExperimentalResult(maxLength, avgLength, sigmaLength);
}
avgLength += chainLength;
sigmaLength += chainLength * chainLength;
maxLength = bsl::max(maxLength, chainLength);
}
avgLength /= numElements;
sigmaLength = bsl::sqrt(sigmaLength / numElements
- avgLength * avgLength);
return ExperimentalResult(maxLength, avgLength, sigmaLength);
}
Example 3: Creating a Generator of Sequential Integers
Here is a simple generator that returns a sequence of integers. This generator has a period of INT_MAX / gcd(increment, INT_MAX+1)
.
class SequentialIntegers {
int d_current;
int d_inc;
public:
typedef int ResultType;
explicit SequentialIntegers(int first = 1, int increment = 1)
: d_current(first), d_inc(increment) {}
void next()
{
d_current += d_inc;
}
ResultType current() const
{
return d_current;
}
{
return reinterpret_cast<const char*>(&d_current);
}
int length() const
{
return sizeof(ResultType);
}
};
BSLS_KEYWORD_CONSTEXPR CONTAINER::value_type * data(CONTAINER &container)
Definition bslstl_iterator.h:1231
Example 4: Creating a Character String Generator
Here is a simple generator that returns a sequence of strings that are a permutation of an initial string. This generator has a period of factorial(initial.length())
where factorial(N)
returns the number of permutations of N
distinct elements.
class SequentialStrings {
size_t d_length;
public:
: d_length(initial.length()), d_current(initial)
{
bsl::sort(d_current.
begin(), d_current.
end());
assert(bsl::unique(d_current.
begin(), d_current.
end()) ==
}
void next()
{
bsl::next_permutation(d_current.
begin(), d_current.
end());
}
ResultType current() const
{
return d_current;
}
{
}
size_t length() const
{
}
};
Definition bslstl_string.h:1281
size_type length() const BSLS_KEYWORD_NOEXCEPT
Definition bslstl_string.h:6601
iterator end() BSLS_KEYWORD_NOEXCEPT
Return the past-the-end iterator for this modifiable string.
Definition bslstl_string.h:5452
CHAR_TYPE * data() BSLS_KEYWORD_NOEXCEPT
Definition bslstl_string.h:6477
void swap(basic_string &other) BSLS_KEYWORD_NOEXCEPT_SPECIFICATION(AllocatorTraits const_iterator begin() const BSLS_KEYWORD_NOEXCEPT
Definition bslstl_string.h:2533
Example 5: Creating a Multiple-Field Keys Generator
It is not uncommon for keys to consist of concatenated keys in multiple combinations. We simulate this by a character string in which each character loops through a specified number of values. The period of this generator is the product of the length of each character range.
struct SequentialVector {
size_t d_length;
public:
: d_ranges(ranges)
, d_length(ranges.size())
, d_current(d_length, (char)0)
{
}
void next()
{
for (size_t i = 0;
i < d_length && ++d_current[i] == d_ranges[i]; ++i) {
d_current[i] = 0;
}
}
ResultType current() const
{
return d_current;
}
{
}
size_t length() const
{
}
};
Example 6: Experimental Results
Checking the above distributions is done in a straightforward manner. We do this at various load factors (the load factor is the percentage of the table that actually holds data; it is defined as N / SIZE
where N
is the number of elements and SIZE
is the size of the hash table). Note that chaining accommodates arbitrary load factors, while double hashing requires that the load factor be strictly less than 1.
int usageExample(int verbose, int veryVerbose, int ) {
const int SIZE = 10000;
const int INC = SIZE / 5;
const int COLS = (4*SIZE)/INC;
{
ExperimentalResult results[3][COLS];
if (verbose) cout << "\nUsing chaining"
<< "\n--------------" << endl;
if (verbose) cout << "Sequential Integers\n";
for (int n = INC, i = 0; n < 4*SIZE; n += INC, ++i) {
if (veryVerbose)
cout << "\tWith load factor " <<
static_cast<double>(n) / SIZE << "\n";
results[0][i] = computeChainingCollisions(
n,
SIZE,
SequentialIntegers());
assert(results[0][i].d_average < 1.5 * n / SIZE);
}
if (verbose) cout <<
"\t... in " << timer.
elapsedTime() <<
"\n";
if (verbose) cout << "Sequential Strings\n";
for (int n = INC, i = 0; n < 4*SIZE; n += INC, ++i) {
if (veryVerbose)
cout << "\tWith load factor " <<
static_cast<double>(n) / SIZE << "\n";
results[1][i] = computeChainingCollisions(
n,
SIZE,
SequentialStrings(initial));
assert(results[1][i].d_average < 1.5 * n / SIZE);
}
if (verbose) cout <<
"\t... in " << timer.
elapsedTime() <<
"\n";
if (verbose) cout << "Sequential Vector\n";
for (int n = INC, i = 0; n < 4*SIZE; n += INC, ++i) {
if (veryVerbose)
cout << "\tWith load factor " <<
static_cast<double>(n) / SIZE << "\n";
results[2][i] = computeChainingCollisions(
n,
SIZE,
SequentialVector(ranges));
assert(results[2][i].d_average < 1.5 * n / SIZE);
}
if (verbose) cout <<
"\t... in " << timer.
elapsedTime() <<
"\n";
if (verbose) {
cout << "\nDisplaying average (max) for chaining:";
cout << "\n--------------------------------------n";
const char *ROW_LABELS[] = { "\nIntegers:",
"\nStrings :",
"\nVector :",
"\nLoad factor:",
};
const int NUM_ROWS = sizeof ROW_LABELS
/ sizeof *ROW_LABELS - 1;
cout << ROW_LABELS[NUM_ROWS] << bsl::setprecision(2);
for (int n = INC; n < 4*SIZE; n += INC) {
cout << "\t" << static_cast<double>(n) / SIZE;
}
for (int j = 0; j < NUM_ROWS; ++j) {
cout << ROW_LABELS[j];
for (int i = 0; i < COLS; ++i) {
cout << "\t" << results[j][i].d_average;
cout << "(" << results[j][i].d_max << ")";
}
cout << "\n";
}
}
}
Definition bsls_stopwatch.h:149
void start(bool collectCpuTimes=false)
Definition bsls_stopwatch.h:348
void reset()
Definition bsls_stopwatch.h:339
double elapsedTime() const
Definition bsls_stopwatch.h:427
We repeat the same steps with double hashing, except that due to the nature of the collision-resolution mechanism, the tolerance for the average must be slightly increased to 2.5 times the load factor, when the load factor is high.
{
const int SIZE = 10007;
const int INC = SIZE / 5;
const int COLS = SIZE/INC;
ExperimentalResult results[3][COLS];
if (verbose) cout << "\nUsing double hashing"
<< "\n--------------------" << endl;
if (verbose) cout << "Sequential Integers\n";
for (int n = INC/2, i = 0; n < SIZE; n += INC, ++i) {
if (veryVerbose)
cout << "\tWith load factor " <<
static_cast<double>(n) / SIZE << "\n";
results[0][i] = computeDoubleHashingCollisions(
n,
SIZE,
SequentialIntegers());
assert(results[0][i].d_average < 2.5 * n / SIZE);
}
if (verbose) cout <<
"\t... in " << timer.
elapsedTime() <<
"\n";
if (verbose) cout << "Sequential Strings\n";
for (int n = INC/2, i = 0; n < SIZE; n += INC, ++i) {
if (veryVerbose)
cout << "\tWith load factor " <<
static_cast<double>(n) / SIZE << "\n";
results[1][i] = computeDoubleHashingCollisions(
n,
SIZE,
SequentialStrings(initial));
assert(results[1][i].d_average < 2.5 * n / SIZE);
}
if (verbose) cout <<
"\t... in " << timer.
elapsedTime() <<
"\n";
if (verbose) cout << "Sequential Vector\n";
for (int n = INC/2, i = 0; n < SIZE; n += INC, ++i) {
if (veryVerbose)
cout << "\tWith load factor " <<
static_cast<double>(n) / SIZE << "\n";
results[2][i] = computeDoubleHashingCollisions(
n,
SIZE,
SequentialVector(ranges));
assert(results[2][i].d_average < 2.5 * n / SIZE);
}
if (verbose) cout <<
"\t... in " << timer.
elapsedTime() <<
"\n";
if (verbose) {
cout << "\nDisplaying average (max) for double-hashing:";
cout << "\n--------------------------------------------\n";
const char *ROW_LABELS[] = { "\nIntegers:",
"\nStrings :",
"\nVector :",
"\nLoad factor:",
};
const int NUM_ROWS = sizeof ROW_LABELS
/ sizeof *ROW_LABELS -1;
cout << ROW_LABELS[NUM_ROWS] << bsl::setprecision(2);
for (int n = INC/2; n < SIZE; n += INC) {
cout << "\t" << static_cast<double>(n) / SIZE;
}
for (int j = 0; j < NUM_ROWS; ++j) {
cout << ROW_LABELS[j];
for (int i = 0; i < COLS; ++i) {
cout << "\t" << results[j][i].d_average;
cout << "(" << results[j][i].d_max << ")";
}
cout << "\n";
}
}
}
return 0;
}
The above code produces the following results. The results for chaining are identical for hash1
(used in the code above) or hash2
(code identical except hash2
is used in place of hash1
in computeChainingCollisions
).
Displaying average(max) for chaining:
---------------------------------------------------------------------------
Load factor: 0.2 0.4 0.6 0.8 1 1.2 1.4
Integers: 0.2(3) 0.4(5) 0.6(6) 0.8(6) 1(7) 1.2(7) 1.4(8)
Strings : 0.2(4) 0.4(4) 0.6(5) 0.8(7) 1(7) 1.2(7) 1.4(7)
Vector : 0.2(5) 0.4(5) 0.6(10) 0.8(10) 1(15) 1.2(15) 1.4(20)
Load factor: 1.6 1.8 2 2.2 2.4 2.6 2.8
Integers: 1.6(9) 1.8(9) 2(10) 2.2(10) 2.4(10) 2.6(10) 2.8(11)
Strings : 1.6(7) 1.8(8) 2(9) 2.2(11) 2.4(11) 2.6(11) 2.8(11)
Vector : 1.6(20) 1.8(24) 2(25) 2.2(29) 2.4(30) 2.6(34) 2.8(35)
Load factor: 3 3.2 3.4 3.6 3.8
Integers: 3(11) 3.2(11) 3.4(12) 3.6(12) 3.8(13)
Strings : 3(12) 3.2(12) 3.4(13) 3.6(13) 3.8(13)
Vector : 3(39) 3.2(40) 3.4(42) 3.6(45) 3.8(46)
Displaying average(max) for double-hashing:
---------------------------------------------------------------------------
Load factor: 0.1 0.3 0.5 0.7 0.9
Integers: 0.046(2) 0.20(4) 0.37(10) 0.71(15) 1.6(59)
Strings : 0.064(2) 0.20(6) 0.40(12) 0.75(18) 1.6(50)
Vector : 0.05(2) 0.19(5) 0.58(9) 1.20(15) 2.4(64)