This is a collection of non-cryptographic hash functions intended to be used as general purpose hash functions. While there is, as of yet, no ultimate test to determine how "good" or "bad" a hash function is, there are statistical tests that can reveal undesired behavior; the results of a variety of such tests are provided for each function to help evaluate the efficacy of each function.
No! The Crap hash functions are indeed crap! Algorithms for general-purpose (translation) shows that they deteriorate badly under certain conditions. What is needed now is a reliable generic test which can detect this sort of behavior.
The numbers are the speed of the hash compared to the fastest speed overall. The first number are the totals for the smallest key sizes (1-20 bytes,7-27 bytes), the second for medium keys (12-33 bytes, 64-256 bytes), and the third for large keys (256kb,1mb). This will help illustrate why some inferior hash functions are still serviceable for smaller keys.
The output of each test will be an image where possible. Numbers are great, but there are too many to display in a space conscious fashion. Images, when done right, also offer the ability to tell at a glance if and where an algorithm has issues.
Mersenne Twister is used when random numbers are needed, and each algorithm uses the same set of random numbers. Avalanche, Seed Avalanche, and Bit Independence Criterion all do 2 million passes over the accumulation matrix to drive out most of the noise in the output images.
Avalanche is a measure of how the output bits change based on each input bit. Ideally each input bit will affect each output bit with 1/2 probability. Realistically, many hash functions do not achieve perfect avalanche and are still useable for many sets of keys.
The avalanche images use a much stricter coloring scheme than has been used before. It also differentiates between an output bit that always changes vs. one that never changes. The color key is
where Red is the output bit never changes when the input bit changes, Yellow is changing 33.3%/66.6% of the time, Green is 49.5%/50.5%, Black is exactly 50%, and White is changing 100% of the time. This gives a narrow <1% window for random noise in the mixing; anything higher and it will stick out like a sore thumb. The image width grows horizontally for the key width (in bits), and vertically for the hash size (in bits, i.e. 32).
Note that good or bad avalanching is in itself not a solid test of a good or bad hash! CRC32 has very bad avalanching yet is still very serviceable. SuperFastHash, on the other hand, has serious problems (problems that show up on alpha-numeric text) despite achieving fairly thorough avalanching.
Seed Avalanching is my term for how the hash output varies based on the initial state of the function. Seeding, or keying, a hash function, is vital to prevent collision attacks which force a hash table in to O(N) behavior. This test works by finding two colliding keys with a seed of 0 and hashing both keys with a variety of seeds and accumulating the differences between the resulting hashes.
Output coloring is the same as Avalanche, but as the seed is being changed instead of each input bit, the horizontal position represents sums of random seeds.
Bit Independence Criterion states that output bits j and k should change independently when any single input bit i is inverted, for all i, j and k. At this time I don't know how to exploit the results of the BIC test, but it can find minor flaws that are not evident from the avalanche tests.
Output coloring is the same as Avalanche, but the ordering is tricky. Horizontal still represents each input bit, but the vertical is now the j,k pairs of output bits. The spacers are drawn at the border when j changes, i.e. the initial rows represent the pairs 1,2, 1,3, 1,4, 1,5..., then the next section starts at 2,3, 2,4, 2,5... all the way down to the final bit pair of 31,32.
Bit Differentials are keys that vary by a set bit pattern, yet produce the same hash, e.g. if the keys 0x4444 and 0x5460, which vary by the three bit pattern 0x1024, produced the same hash. Differentials can be especially bad if the bit patterns occur in common data such as ascii text, rendering the algorithm dangerous for general use.
The differential color key is
Which hash function to choose depends on what you're looking for.
With that said, there are some hash functions/variants that can safely be ignored for _any_ purpose due to speed and mixing issues: Murmur (bad mixing), SuperFastHash (bad mixing), DJB/x31/x17 (too slow, horrible mixing), and FNV (too slow, bad mixing). Even improving these functions with more post-processing or tweaks to the mixing algorithm would still not make them better (speedwise or mixing-wise) than superior functions.
Main site should be found at http://team5150.com/~andrew/noncryptohashzoo.
If you would like a local copy of the archive you may download noncryptohashzoo.zip.