Outline
Purpose
Provide BDE style encapsulation of 3rd party SpookyHash code.
Classes
- See also
- bslh_hash, bslh_spookyhashalgorithm
Description
bslh::SpookyHashAlgorithmImp
provides BDE style encapsulation around Bob Jenkins' canonical SpookyHash implementation. SpookyHash provides a way to hash contiguous data all at once, or discontiguous data in pieces. More information is available at: http://burtleburtle.net/bob/hash/spooky.html
Usage
This section illustrates intended usage of this component.
Example: Creating 128-bit checksums
Suppose we have a library of 4 billion pieces of data and we want to store checksums for this data. For a 64-bit hash, there is a 35% chance of two of these checksums colliding (according to the approximation found here: http://en.wikipedia.org/wiki/Birthday_problem). We want to avoid checksum collision, so we will use the 128-bit hashing functionality provided by SpookyHashAlgorithmImp
.
First, we will declare a class CheckedData
which will store some data as well as the checksum associated with it.
class CheckedData {
size_t d_length;
const char *d_data;
Uint64 d_checksum1;
Uint64 d_checksum2;
public:
CheckedData(const char *data, size_t length);
const char *getData();
bool isDataValid();
};
unsigned long long Uint64
Definition bsls_types.h:137
Then, we define the CheckedData
constructor. Here we will use SpookyHashImp
to calculate a 128-bit checksum.
CheckedData::CheckedData(const char *data, size_t length)
: d_length(length)
, d_data(data)
, d_checksum1(0)
, d_checksum2(0)
{
SpookyHashAlgorithmImp hashAlg(1, 2);
hashAlg.hash128(d_data, d_length, &d_checksum1, &d_checksum2);
}
const char *CheckedData::getData() {
return d_data;
}
#define BSLS_ASSERT(X)
Definition bsls_assert.h:1804
Next, we define isDataValid
. We will generate a checksum from the contained data and then compare it to the checksum we generated when the class was created. If the two hashes match, then we can be reasonably certain that the data is still in a valid state (the chance of an accidental collision is very low). If the checksums do not match, we know that the data has been corrupted. We will not be able to restore it, but we will know not to trust it.
bool CheckedData::isDataValid() {
SpookyHashAlgorithmImp hashAlg(1, 2);
Uint64 checksum1 = 0;
Uint64 checksum2 = 0;
hashAlg.hash128(d_data, d_length, &checksum1, &checksum2);
return (d_checksum1 == checksum1) && (d_checksum2 == checksum2);
}
Then, we store some data in our CheckedData
class for safekeeping.
char data[] = "To be, or not to be--that is the question:"
"Whether 'tis nobler in the mind to suffer"
"The slings and arrows of outrageous fortune"
"Or to take arms against a sea of troubles"
"And by opposing end them. To die, to sleep--"
"No more--and by a sleep to say we end"
"The heartache, and the thousand natural shocks"
"That flesh is heir to. 'Tis a consummation"
"Devoutly to be wished. To die, to sleep--"
"To sleep--perchance to dream: ay, there's the rub,"
"For in that sleep of death what dreams may come"
"When we have shuffled off this mortal coil,"
"Must give us pause. There's the respect"
"That makes calamity of so long life."
"For who would bear the whips and scorns of time,"
"Th' oppressor's wrong, the proud man's contumely"
"The pangs of despised love, the law's delay,"
"The insolence of office, and the spurns"
"That patient merit of th' unworthy takes,"
"When he himself might his quietus make"
"With a bare bodkin? Who would fardels bear,"
"To grunt and sweat under a weary life,"
"But that the dread of something after death,"
"The undiscovered country, from whose bourn"
"No traveller returns, puzzles the will,"
"And makes us rather bear those ills we have"
"Than fly to others that we know not of?"
"Thus conscience does make cowards of us all,"
"And thus the native hue of resolution"
"Is sicklied o'er with the pale cast of thought,"
"And enterprise of great pitch and moment"
"With this regard their currents turn awry"
"And lose the name of action. -- Soft you now,"
"The fair Ophelia! -- Nymph, in thy orisons"
"Be all my sins remembered.";
CheckedData checkedData(data, strlen(data));
Now, we check that the CheckedData
recognizes that it is still valid.
ASSERT(checkedData.isDataValid());
Finally, we tamper with the data and check that our CheckedData
class can detect this.
data[34] = 'z';
ASSERT(!checkedData.isDataValid());
Changes
The third party code begins with the "SpookyHash" header below, and continues until the BloombergLP copyright notice. Changes made to the original code include:
- Added
BloombergLP
and bslh
namespaces
- Renamed
SpookyHash
to SpookyHashAlgorithmImp
- Removed usage of
stdint.h
(which might not be available on all platforms) and updated associated typedef
s
- Added
include
guards
- Made some methods private
- Reformatted comments and added comments
- Updated indenting to BDE style
- Moved
typedef
s within class
- Changed C-style casts to static_cast s
- Reordered methods according to BDE style
- Added inline to
Hash32
and Hash64
- Changed static constants to
enum
s to avoid storage overhead
- Added constructor in place of
init
- Made function names lower case (had to change
Final
to finalize
and Short
to shortHash
to avoid using a keyword)
- Reformatted comments
- Replaced some uses of 'inline' with the new
BSLH_SPOOKYHASHALGORITHMIMP_INLINE
macro, which forces inlining on GCC
Third-Party Documentation
SpookyHash: a 128-bit non cryptographic hash function
By Bob Jenkins, public domain
Oct 31 2010: alpha, framework + SpookyHash::Mix appears right
Oct 31 2011: alpha again, Mix only good to 2^^69 but rest appears right
Dec 31 2011: beta, improved Mix, tested it for 2-bit deltas
Feb 2 2012: production, same bits as beta
Feb 5 2012: adjusted definitions of uint* to be more portable
Mar 30 2012: 3 bytes/cycle, not 4. Alpha was 4 but wasn't thorough enough.
August 5 2012: SpookyV2 (different results)
Up to 3 bytes/cycle for long messages. Reasonably fast for short messages.
All 1 or 2 bit deltas achieve avalanche within 1% bias per output bit.
This was developed for and tested on 64-bit x86-compatible processors. It
assumes the processor is little-endian. There is a macro controlling
whether unaligned reads are allowed (by default they are). This should be
an equally good hash on big-endian machines, but it will compute different
results on them than on little-endian machines.
Google's CityHash has similar specs to SpookyHash, and CityHash is faster on
new Intel boxes. MD4 and MD5 also have similar specs, but they are orders
of magnitude slower. CRCs are two or more times slower, but unlike
SpookyHash, they have nice math for combining the CRCs of pieces to form the
CRCs of wholes. There are also cryptographic hashes, but those are even
slower than MD5.
◆ BSLH_SPOOKYHASHALGORITHMIMP_INLINE
#define BSLH_SPOOKYHASHALGORITHMIMP_INLINE inline |