[Math] Why choose a prime number as the number of slots for a hash function that uses the division method

algorithms · hash function · prime numbers

The division method is one way to create hash functions. The functions take the form:

$h(k) = k \bmod m$

where $k$ is a key and $m$ is the number of slots.
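For concreteness, a minimal sketch of this scheme in Python (the slot count and the sample keys are just illustrative):

```python
def h(k: int, m: int) -> int:
    """Division-method hash: map integer key k to one of m slots."""
    return k % m

m = 13  # a prime number of slots, as commonly recommended
print([h(k, m) for k in (100, 101, 205, 1024)])
```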

Edit: If this is my hash function, why should I choose $m$ to be prime? I understand the problem if I choose $m$ to be a power of $2$, i.e. $m = 2^p$: then $h(k)$ only looks at the $p$ lower bits of $k$, completely ignoring the rest of the bits of $k$.
But why not choose an odd composite number?
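To make the power-of-two problem concrete, here is a small demonstration (the keys are made up): with $m = 2^4$, any keys that agree in their four low bits collide, no matter what their high bits are.

```python
m = 2 ** 4  # 16 slots: k % 16 keeps only the 4 low bits of k
keys = [0b0001_0101, 0b1111_0101, 0b0110_0101]  # all share the low nibble 0101
print([k % m for k in keys])   # [5, 5, 5] -- the high bits are ignored entirely
print([k % 13 for k in keys])  # [8, 11, 10] -- a prime modulus separates them
```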

I have seen this answer https://stackoverflow.com/questions/3613102/why-use-a-prime-number-in-hashcode/3613423#3613423 but it only explains the problems we face if all the keys are of the form $4^p$; it doesn't quite explain why to choose a prime rather than an odd composite number. I also checked https://stackoverflow.com/questions/5152015/why-setting-hashtables-length-to-a-prime-number-is-a-good-practice but I am still not clear on it.

Best Answer

If you know nothing about the distribution of the keys $k$, then there is absolutely no argument one could formulate for (or against) taking $m$ prime. There is much of a "feel-good" element in using primes: many authors promote the practice (often without giving the "technical details" that would explain it); hey, even Knuth mentions prime numbers in the context of hashing (never mind that he does so in a very particular setting that does not match yours); nobody ever gives an argument against using prime numbers; and there are plenty of them around, so why not choose a prime number? Let me qualify one of the points I just made: choosing $m=2$ is almost always a very poor choice, even though $2$ is a perfectly bona fide prime number; it just is too small.

If you are hashing by just reducing modulo $m$ (which is applicable only if your keys are integers, and that severely limits the application contexts), then a hash modulus $m$ is relatively bad (with respect to its size) if your keys are somehow very unevenly distributed modulo $m$. Without knowing anything about the origin of the keys there is no way to tell whether this is the case; if the keys are uniformly distributed in some interval much larger than $m$, it never will be. However, if one surmises that keys have a non-uniform distribution modulo certain unknown "cursed" numbers, then choosing some $m$ amounts to a gamble that this $m$ happens not to be one of those. It is always possible to lose this gamble. One can argue that if $m$ is cursed then so is any multiple of $m$, so choosing $m$ prime in some sense limits the risk to a single gamble, while choosing a composite number exposes you to the risk of a curse on any of its divisors as well. But one cannot make a good argument out of this without substantiating the fear that there are any cursed numbers in the first place. And if a set of (prime) numbers is known to be non-cursed, then taking their product is a safer bet than taking an unknown prime number of about the same magnitude.
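To illustrate the gamble: suppose the keys all happen to arrive as multiples of some stride $d$ (a hypothetical scenario; the stride and moduli below are chosen arbitrarily). A modulus sharing a factor with $d$ can only ever reach a fraction of its slots, while a prime coprime to $d$ reaches all of them:

```python
keys = range(0, 4000, 20)  # keys arrive with an unknown stride of 20

def slots_used(m: int) -> int:
    """Count how many of the m slots the keys actually land in."""
    return len({k % m for k in keys})

print(slots_used(100))  # gcd(20, 100) = 20: only 100/20 = 5 slots ever used
print(slots_used(101))  # 101 is prime and coprime to 20: all 101 slots used
```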

But all this is mostly beside the point, because if you don't know the origin of your keys, then hashing by modular reduction alone is just not a good idea. In practice one should first "scramble" the data in the keys (this is what "hashing" actually means), and then perform a reduction to map to the size of the table. I've written a general-purpose hashing class whose self-adjusting table size is always a power of $2$, which is an advantage for modular reduction and resizing (no need to hunt for an appropriate new size; just double). Used with a bit of attention to writing data-specific hash functions that ensure a uniform distribution (if you know some lower-order bits are likely to be zero, just don't use them), this works quite well in very large practical computations.
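A minimal sketch of that "scramble first, then reduce" idea, assuming a multiplicative mixing step (the constant is the widely used 64-bit golden-ratio multiplier; the class itself is hypothetical, not the author's actual code):

```python
class PowerOfTwoTable:
    """Sketch of a chained hash table: scramble keys, then mask to a
    power-of-2 size (illustrative only, not the author's class)."""

    MULT = 0x9E3779B97F4A7C15  # ~ 2^64 / golden ratio, a common mixing constant

    def __init__(self) -> None:
        self.size = 16  # always a power of 2
        self.buckets = [[] for _ in range(self.size)]

    def _index(self, k: int) -> int:
        # Scramble: a 64-bit multiply spreads every input bit into the
        # high output bits; then reduce with a cheap mask, not a division.
        mixed = (k * self.MULT) & 0xFFFFFFFFFFFFFFFF
        return (mixed >> 32) & (self.size - 1)

    def insert(self, k: int) -> None:
        self.buckets[self._index(k)].append(k)
        if sum(len(b) for b in self.buckets) > self.size:  # load factor > 1
            self._grow()

    def _grow(self) -> None:
        # No hunting for a new prime size: just double and re-insert.
        old = [k for bucket in self.buckets for k in bucket]
        self.size *= 2
        self.buckets = [[] for _ in range(self.size)]
        for k in old:
            self.buckets[self._index(k)].append(k)
```

Because the table size is a power of two, the reduction is a single mask and resizing never requires searching for a suitable prime; the burden of spreading the keys uniformly falls entirely on the scrambling step.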