Provide a utility of hash functions.
More...
Namespaces |
namespace | bdlb |
Detailed Description
- Outline
-
-
- Purpose:
- Provide a utility of hash functions.
-
- Classes:
-
- See also:
- bdlc_hashtable2, bdeimp_inthash, bdeimp_strhash, Component bdlde_crc32
-
- Description:
- This component defines a utility
struct
, bdlb::HashUtil
, that provides two hash functions, hash1
and hash2
. These hash functions have a good funneling property, which can be formalized as documented below (see Definitions and Formal Statement of No-Funneling Property). In short, these hashes have the property that even if two inputs differ in only a few bits, they have a negligible probability of producing collisions. In addition, these two hash functions produce full 32-bit results and allow hash table sizes to be a power of two. They are fast and reliable. They should work equally well on all types of keys.
- The two hash functions
hash1
and hash2
have similar characteristics. hash2
is a bit faster than hash1
, especially for longer keys; it also does not produce many collisions as determined experimentally for the key types given in the example below. On the other hand, it does not have the formal no-funnel guarantees of hash1
which is reasonably fast and provably achieves low collision probability.
- Furthermore, this component provides a third hash function
hash0
that performs a simple hash by taking the modulus of a user specified size on the input key. This hash function has a poor distribution, as the function exhibits a clustering effect when the input keys only differ slightly. However, hash0
is about 6 to 8 times faster than hash1
and hash2
.
- In general, use
hash0
when speed matters, and hash1
or hash2
when a more thorough hash is needed (e.g., keys exhibit a pattern that results in many collision when using bdeimp
, or cost of collision is high but hashing is not a bottleneck).
-
- Hashing Fundamental Types:
- In order to facilitate hashing user-defined types (see next section), this component supports hashing fundamental types in a manner that is both seemingly random (i.e., the hash is good) but predictable (i.e., identical on all platforms, irrespective of endianness). Note that the following expressions for some
key
of a fundamental integral type: return values that differ on big- and little-endian platforms if the key size is more than one byte. Instead, the recommended way to hash this key
is simply to: This works if key
is of one of the following fundamental types: char, signed char, unsigned char,
short, unsigned short,
int, unsigned int,
long, unsigned long,
bsls::Types::Int64, bsls::Types::Uint64,
float, double,
const void *
-
- Hashing User-defined Types:
- This component can be used for hashing a user-defined type as well. The recommended way to do this is to provide a
hash
class method for the type, that takes an instance of the class and a size. For instance, let us consider the following class (the meaning and implementation are irrelevant here) with the following private data members: class UserDefinedTickerType {
enum { CUSIP_LENGTH = 10 };
bool d_isConstant;
char d_cusip[CUSIP_LENGTH];
int d_value;
A hash class method can be provided and documented as follows: public:
static int hash(const UserDefinedTickerType& object, int size);
};
The implementation of hash
can be very simple, depending on the semantics of the class. Here we assume that d_isConstant
is an implementation detail and that data members d_cusip
and d_value
should participate in the hash and that the character string should be hashed by value. The implementation follows: Note that more intricate combinations of the individual hashes in the aggregate can be appropriate if the individual members in the aggregate my exhibit a pattern.
- CAVEAT: When hashing user-defined classes, because of the potential padding in the class footprint, the simple-minded hashes may actually not produce the same hash values for copies of the same object.
-
- Definitions and Formal Statement of No-Funneling Property:
- Let us say a given bit in the input affects a given bit in the output (which has
v
bits) if two keys differing only at that input bit will differ at the given output bit about half the time. We define a hash to be a funneling hash if there is some subset of size t
of all the input bits that can affect only a subset of size u
of bits in the output, and t > u
and v > u
, and we say that the hash function has a funnel of those t
input bits into those u
bits. In that case, then u
of those t
bits can cancel out the effects of the other t - u
, and so the set of keys differing only in the input bits of the funnel can produce no more than half that number of hash values. (Those 2^t
keys can produce no more than 2^u
out of 2^v
hash values.) Differing in only a few bits is a common pattern in human and computer keys, so a funneling hash is seriously flawed.
- This component offers a hash function (
hash1
) that has no funnels, except for a funnel of 32 bits into 31 bits, which is a negligible vulnerability. The hash2
does not have known no-funneling properties, but operates on the same principle and is just as good (and faster) in experiments with chaining as in the usage below. In addition, these two hash functions produce full 32-bit results and allow hash table sizes to be a power of two.
- For further details and references, the interested reader is invited to consult
http://burtleburtle.net/bob/hash/evahash.html
.
-
- Usage:
- We illustrate the usage of this component by some experiments (code and results) on the number of collision expected for various usage. For returning the results of an experiment, we will use the following structure:
struct ExperimentalResult {
int d_max;
double d_average;
double d_sigma;
explicit
ExperimentalResult(int max = 0, double avg = 0, double sigma = 0)
: d_max(max), d_average(avg), d_sigma(sigma)
{}
};
For generating the data, we use a parameterized GENERATOR
of which we give three implementations below (sequential integers, character strings, and multiple-field keys). This template parameter needs to validate the following syntactic expressions, which we call in this usage example our GeneratorConcept
. In those requirements, generator
is an instance of GENERATOR
. const char *data = generator.data();
int length = generator.length();
generator.next();
-
- Example 1: Using Chaining:
- The following code will check the number of collisions for a chaining-based hash table given a certain distribution of keys:
template <class GENERATOR>
ExperimentalResult computeChainingCollisions(int numElements,
int size,
GENERATOR input)
{
bsl::vector<int> table(size);
for (int i = 0; i < numElements; ++i, input.next()) {
unsigned int hash = Util::hash1(input.data(),
input.length());
++table[hash % size];
}
double avgLength = 0;
double sigmaLength = 0;
int maxLength = 0;
for (int k = 0; k < size; ++k) {
avgLength += table[k];
sigmaLength += table[k] * table[k];
maxLength = bsl::max(maxLength, table[k]);
}
avgLength /= size;
sigmaLength = bsl::sqrt(sigmaLength / size - avgLength * avgLength);
return ExperimentalResult(maxLength, avgLength, sigmaLength);
}
-
- Example 2: Using Double-Hashing:
- The following code will check the number of collisions for a double-hashing based hash table (such as
bdlc_hashtable2
) given a certain distribution of keys: template <class GENERATOR>
ExperimentalResult computeDoubleHashingCollisions(int numElements,
int size,
GENERATOR input)
{
typedef typename GENERATOR::ResultType ResultType;
bsl::vector<ResultType> table(size);
double avgLength = 0;
double sigmaLength = 0;
int maxLength = 0;
for (int i = 0; i < numElements; ++i, input.next()) {
unsigned int hash1 = Util::hash1(input.data(),
input.length());
int chainLength = 0;
int bucket = hash1 % size;
if (ResultType() == table[bucket]) {
table[bucket] = input.current();
} else {
unsigned int hash2 = Util::hash2(input.data(),
input.length());
hash2 = 1 + hash2 % (size - 1);
while (++chainLength < size) {
bucket = (bucket + hash2) % size;
if (ResultType() == table[bucket]) {
table[bucket] = input.current();
break;
}
}
}
if (chainLength == size) {
bsl::cerr << "\tCould not insert in doubly-hashed table\n";
avgLength = bsl::numeric_limits<double>::infinity();
sigmaLength = bsl::numeric_limits<double>::infinity();
maxLength = static_cast<int>(
bsl::numeric_limits<double>::infinity());
return ExperimentalResult(maxLength, avgLength, sigmaLength);
}
avgLength += chainLength;
sigmaLength += chainLength * chainLength;
maxLength = bsl::max(maxLength, chainLength);
}
avgLength /= numElements;
sigmaLength = bsl::sqrt(sigmaLength / numElements
- avgLength * avgLength);
return ExperimentalResult(maxLength, avgLength, sigmaLength);
}
-
- Example 3: Creating a Generator of Sequential Integers:
- Here is a simple generator that returns a sequence of integers. This generator has a period of
INT_MAX / gcd(increment, INT_MAX+1)
. class SequentialIntegers {
int d_current;
int d_inc;
public:
typedef int ResultType;
explicit SequentialIntegers(int first = 1, int increment = 1)
: d_current(first), d_inc(increment) {}
void next()
{
d_current += d_inc;
}
ResultType current() const
{
return d_current;
}
const char *data() const
{
return reinterpret_cast<const char*>(&d_current);
}
int length() const
{
return sizeof(ResultType);
}
};
-
- Example 4: Creating a Character String Generator:
- Here is a simple generator that returns a sequence of strings that are a permutation of an initial string. This generator has a period of
factorial(initial.length())
where factorial(N)
returns the number of permutations of N
distinct elements. class SequentialStrings {
int d_length;
bsl::string d_current;
public:
typedef bsl::string ResultType;
explicit SequentialStrings(bsl::string const& initial)
: d_length(initial.length()), d_current(initial)
{
bsl::sort(d_current.begin(), d_current.end());
assert(bsl::unique(d_current.begin(), d_current.end()) ==
d_current.end());
}
void next()
{
bsl::next_permutation(d_current.begin(), d_current.end());
}
ResultType current() const
{
return d_current;
}
const char *data() const
{
return d_current.data();
}
int length() const
{
return d_current.length();
}
};
-
- Example 5: Creating a Multiple-Field Keys Generator:
- It is not uncommon for keys to consist of concatenated keys in multiple combinations. We simulate this by a character string in which each character loops through a specified number of values. The period of this generator is the product of the length of each character range.
struct SequentialVector {
bsl::vector<char> d_ranges;
int d_length;
bsl::string d_current;
public:
typedef bsl::string ResultType;
explicit SequentialVector(bsl::vector<char> const& ranges)
: d_ranges(ranges)
, d_length(ranges.size())
, d_current(d_length, (char)0)
{
}
void next()
{
for (int i = 0;
i < d_length && ++d_current[i] == d_ranges[i]; ++i) {
d_current[i] = 0;
}
}
ResultType current() const
{
return d_current;
}
const char *data() const
{
return d_current.data();
}
int length() const
{
return d_current.length();
}
};
-
- Example 6: Experimental Results:
- Checking the above distributions is done in a straightforward manner. We do this at various load factors (the load factor is the percentage of the table that actually holds data; it is defined as
N / SIZE
where N
is the number of elements and SIZE
is the size of the hash table). Note that chaining accommodates arbitrary load factors, while double hashing requires that the load factor be strictly less than 1. int usageExample(int verbose, int veryVerbose, int ) {
const int SIZE = 10000;
const int INC = SIZE / 5;
const int COLS = (4*SIZE)/INC;
{
ExperimentalResult results[3][COLS];
bsls::Stopwatch timer;
if (verbose) cout << "\nUsing chaining"
<< "\n--------------" << endl;
if (verbose) cout << "Sequential Integers\n";
timer.start();
for (int n = INC, i = 0; n < 4*SIZE; n += INC, ++i) {
if (veryVerbose)
cout << "\tWith load factor " <<
static_cast<double>(n) / SIZE << "\n";
results[0][i] = computeChainingCollisions(
n,
SIZE,
SequentialIntegers());
assert(results[0][i].d_average < 1.5 * n / SIZE);
}
if (verbose) cout << "\t... in " << timer.elapsedTime() << "\n";
timer.reset();
if (verbose) cout << "Sequential Strings\n";
timer.start();
for (int n = INC, i = 0; n < 4*SIZE; n += INC, ++i) {
if (veryVerbose)
cout << "\tWith load factor " <<
static_cast<double>(n) / SIZE << "\n";
bsl::string initial("abcdefghij");
results[1][i] = computeChainingCollisions(
n,
SIZE,
SequentialStrings(initial));
assert(results[1][i].d_average < 1.5 * n / SIZE);
}
if (verbose) cout << "\t... in " << timer.elapsedTime() << "\n";
timer.reset();
if (verbose) cout << "Sequential Vector\n";
timer.start();
for (int n = INC, i = 0; n < 4*SIZE; n += INC, ++i) {
if (veryVerbose)
cout << "\tWith load factor " <<
static_cast<double>(n) / SIZE << "\n";
bsl::vector<char> ranges(6, static_cast<char>(4));
results[2][i] = computeChainingCollisions(
n,
SIZE,
SequentialVector(ranges));
assert(results[2][i].d_average < 1.5 * n / SIZE);
}
if (verbose) cout << "\t... in " << timer.elapsedTime() << "\n";
timer.reset();
if (verbose) {
cout << "\nDisplaying average (max) for chaining:";
cout << "\n--------------------------------------n";
const char *ROW_LABELS[] = { "\nIntegers:",
"\nStrings :",
"\nVector :",
"\nLoad factor:",
};
const int NUM_ROWS = sizeof ROW_LABELS
/ sizeof *ROW_LABELS - 1;
cout << ROW_LABELS[NUM_ROWS] << bsl::setprecision(2);
for (int n = INC; n < 4*SIZE; n += INC) {
cout << "\t" << static_cast<double>(n) / SIZE;
}
for (int j = 0; j < NUM_ROWS; ++j) {
cout << ROW_LABELS[j];
for (int i = 0; i < COLS; ++i) {
cout << "\t" << results[j][i].d_average;
cout << "(" << results[j][i].d_max << ")";
}
cout << "\n";
}
}
}
We repeat the same steps with double hashing, except that due to the nature of the collision-resolution mechanism, the tolerance for the average must be slightly increased to 2.5 times the load factor, when the load factor is high. {
const int SIZE = 10007;
const int INC = SIZE / 5;
const int COLS = SIZE/INC;
ExperimentalResult results[3][COLS];
bsls::Stopwatch timer;
if (verbose) cout << "\nUsing double hashing"
<< "\n--------------------" << endl;
if (verbose) cout << "Sequential Integers\n";
timer.start();
for (int n = INC/2, i = 0; n < SIZE; n += INC, ++i) {
if (veryVerbose)
cout << "\tWith load factor " <<
static_cast<double>(n) / SIZE << "\n";
results[0][i] = computeDoubleHashingCollisions(
n,
SIZE,
SequentialIntegers());
assert(results[0][i].d_average < 2.5 * n / SIZE);
}
if (verbose) cout << "\t... in " << timer.elapsedTime() << "\n";
timer.reset();
if (verbose) cout << "Sequential Strings\n";
timer.start();
for (int n = INC/2, i = 0; n < SIZE; n += INC, ++i) {
if (veryVerbose)
cout << "\tWith load factor " <<
static_cast<double>(n) / SIZE << "\n";
bsl::string initial("abcdefghij");
results[1][i] = computeDoubleHashingCollisions(
n,
SIZE,
SequentialStrings(initial));
assert(results[1][i].d_average < 2.5 * n / SIZE);
}
if (verbose) cout << "\t... in " << timer.elapsedTime() << "\n";
timer.reset();
if (verbose) cout << "Sequential Vector\n";
timer.start();
for (int n = INC/2, i = 0; n < SIZE; n += INC, ++i) {
if (veryVerbose)
cout << "\tWith load factor " <<
static_cast<double>(n) / SIZE << "\n";
bsl::vector<char> ranges(7, static_cast<char>(8));
results[2][i] = computeDoubleHashingCollisions(
n,
SIZE,
SequentialVector(ranges));
assert(results[2][i].d_average < 2.5 * n / SIZE);
}
if (verbose) cout << "\t... in " << timer.elapsedTime() << "\n";
timer.reset();
if (verbose) {
cout << "\nDisplaying average (max) for double-hashing:";
cout << "\n--------------------------------------------\n";
const char *ROW_LABELS[] = { "\nIntegers:",
"\nStrings :",
"\nVector :",
"\nLoad factor:",
};
const int NUM_ROWS = sizeof ROW_LABELS
/ sizeof *ROW_LABELS -1;
cout << ROW_LABELS[NUM_ROWS] << bsl::setprecision(2);
for (int n = INC/2; n < SIZE; n += INC) {
cout << "\t" << static_cast<double>(n) / SIZE;
}
for (int j = 0; j < NUM_ROWS; ++j) {
cout << ROW_LABELS[j];
for (int i = 0; i < COLS; ++i) {
cout << "\t" << results[j][i].d_average;
cout << "(" << results[j][i].d_max << ")";
}
cout << "\n";
}
}
}
return 0;
}
The above code produces the following results. The results for chaining are identical for hash1
(used in the code above) or hash2
(code identical except hash2
is used in place of hash1
in computeChainingCollisions
). Displaying average(max) for chaining:
---------------------------------------------------------------------------
Load factor: 0.2 0.4 0.6 0.8 1 1.2 1.4
Integers: 0.2(3) 0.4(5) 0.6(6) 0.8(6) 1(7) 1.2(7) 1.4(8)
Strings : 0.2(4) 0.4(4) 0.6(5) 0.8(7) 1(7) 1.2(7) 1.4(7)
Vector : 0.2(5) 0.4(5) 0.6(10) 0.8(10) 1(15) 1.2(15) 1.4(20)
Load factor: 1.6 1.8 2 2.2 2.4 2.6 2.8
Integers: 1.6(9) 1.8(9) 2(10) 2.2(10) 2.4(10) 2.6(10) 2.8(11)
Strings : 1.6(7) 1.8(8) 2(9) 2.2(11) 2.4(11) 2.6(11) 2.8(11)
Vector : 1.6(20) 1.8(24) 2(25) 2.2(29) 2.4(30) 2.6(34) 2.8(35)
Load factor: 3 3.2 3.4 3.6 3.8
Integers: 3(11) 3.2(11) 3.4(12) 3.6(12) 3.8(13)
Strings : 3(12) 3.2(12) 3.4(13) 3.6(13) 3.8(13)
Vector : 3(39) 3.2(40) 3.4(42) 3.6(45) 3.8(46)
Displaying average(max) for double-hashing:
---------------------------------------------------------------------------
Load factor: 0.1 0.3 0.5 0.7 0.9
Integers: 0.046(2) 0.20(4) 0.37(10) 0.71(15) 1.6(59)
Strings : 0.064(2) 0.20(6) 0.40(12) 0.75(18) 1.6(50)
Vector : 0.05(2) 0.19(5) 0.58(9) 1.20(15) 2.4(64)