[Math] Curious Properties of 33

computer sciencehash functionnumber theory

Because my explanation has so many words, I'll start with my question and then you can read the explanation if you need to:

The Bernstein Hash method uses the number 33 as a multiplier. From what I've read Bernstein himself has no reasonable explanation as to why 33 has such useful properties. I'm wondering if the mathematics community has any theories on the matter.


The explanation:

I'm a software engineer and I was recently working on a blog post about a hash function I was writing. In the process of designing my hash function, I looked at a lot of implementations of other hash functions.

For non-programmers, the gist is that the hashcodes produced by objects should be well distributed throughout the range of values in a 32 bit integer (-2,147,483,648 through 2,147,483,647).

Let's say I have the string "ABCD." A hash function would loop through each character, get the ASCII value of it, do something to it and aggregate it into a composite hashcode.

For example, in a lot of implementations, they'd take a hashcode initialized to a large prime, multiply it by another prime and the XOR it with the value of 'A' which is 65. Then, they'd take that and multiply it by the same prime and XOR that with the value of 'B'. They'd do this until the end of the string is reached.

I found a clever implementation in the Java framework code that loops over each item and effectively applies this: $i = ((i \ll 5) – i) \oplus j$. I was confused at first until I worked out that $(i \ll 5) – i = 31i$. For a computer, bit shifting is faster than multiplying so this is a clever way to multiply by a prime number.

So, I looked in Microsoft's .Net framework and I found that they do it a little differently. They use $(i \ll 5) + i$ instead! I couldn't figure out for the life of me why they used 33 instead of 31 because from what I understand multiplying by prime numbers is the foundation of many hashing and cryptographic functions.

I found out that this technique is called the Bernstein Hash and that Bernstein himself doesn't know why 33 produces such a good distribution of values as a multiplier in hashing functions.

Best Answer

I've always assumed that the factor 33 (which amounts to a 5-bit shift plus a addition, as you note) was chosen because 5 bits is roughly the significant content of (ASCII) text, hence the "spreading" is rather optimal for typical textual keys. I doubt something more can be said from the strictly mathematical point of view.

"from what I understand multiplying by prime numbers is the foundation of many hashing and cryptographic functions." That sounds rather vague to me: for example, for linear congruential generators, the multiplicator is frequently not prime. Empirically, it's seen that 31 or 33 make very little difference. Bear in mind that hashing functions usually relax the strict requirements of cryptographic functions, in favour of simplicity and performance.

Some reasonable heuristics and experimental results are given here.

See also this old discussion about the 33 vs 31 thing.

Related Question