Why are these hash functions considered a bad choice?


Given a hash table $T[0, \dots, m-1]$ and a set $U = \{0, \dots, n-1\}$ of possible keys $k$, with $m \ll n$.

Let $$h: U \rightarrow \{0, \dots, m-1\}.$$

I am trying to understand why the hash functions $$h_{1}(k) =\Big\lfloor m\Big(\frac{k}{n}\Big)^2 \Big\rfloor$$
and

$$h_{2}(k) =\Big(l \Big\lfloor \frac{m}{n} k \Big\rfloor\Big) \text{mod } m \qquad l \in \mathbb{N}$$

would be considered a bad choice. I understand that a good hash function should have a higher probability of being injective (correct?), but I still can't see how this applies in these specific cases based on the functions' formulas.

I've tried different values for $h_1$ and $h_2$, and it's obvious that when $k_1$ and $k_2$ are close then $h(k_1) = h(k_2)$. Is there a way to formally prove this?

Help?

Best Answer

At least in computer science, a good hash function has as few collisions as possible (fewer collisions make the algorithm faster). A collision happens when we have $k_{1} \neq k_{2}$ with the same hash value, $h(k_{1}) = h(k_{2})$. When a hash function is injective, it is called a "perfect hash function": it has no collisions. So, as you said, a good hash function should have a higher probability of being injective. However, notice that with your condition $m \ll n$ we cannot have an injective $h:U\rightarrow \{0, 1, 2, \dots, m-1\}$, since the cardinality of $U$ $(|U|=n)$ is greater than the cardinality of $T$ $(|T|=m)$; by the pigeonhole principle, some slot must receive at least $\lceil n/m \rceil$ keys.
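As a minimal sketch of that pigeonhole argument (the sizes $m = 8$, $n = 2^8$ and the plain modular hash are my own illustrative choices, not from the question):

```python
m, n = 8, 2 ** 8          # illustrative sizes with m << n

def h(k):
    return k % m          # any map into m slots would behave the same here

values = [h(k) for k in range(n)]
# h cannot be injective on all of U: at most m distinct outputs for n inputs
assert len(set(values)) <= m < n
# pigeonhole: some slot receives at least ceil(n/m) keys (here exactly 32)
print(max(values.count(v) for v in set(values)))
```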

Looking at your hash function $h_{1}$: when $k$ and $m$ are small compared to $n$, there will be many $k\in U$ such that $h_{1}(k) = 0$. Indeed, $h_1(k) = 0$ exactly when $m(k/n)^2 < 1$, i.e. for every $k < n/\sqrt{m}$; for example, if $n=2^{m}$, then $h_{1}(k)= 0$ for every $k < \frac{2^{m}}{\sqrt{m}}$. I think it would be better to use just $$ h(k) = \left\lfloor\dfrac{mk}{n}\right\rfloor $$ (note the whole product inside the floor; $m\lfloor k/n\rfloor$ would be $0$ for every $k < n$). This way you know you are distributing your values with the same number of collisions in each slot, which is good, unless you know beforehand which values you are going to use more. In fact, if you are going to use only a subset $G\subset U$, you can look for an injective function (if $|G| \leq m$, of course). That is the best scenario for hash functions: when you have $\alpha \leq m$ keys and you know in advance which ones they are. A quick experiment contrasting the two functions is sketched below.
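Here the sizes $m = 8$, $n = 2^8$ are again my own illustrative choices, matching the $n = 2^m$ example:

```python
import math
from collections import Counter

m, n = 8, 2 ** 8

def h1(k):
    # h1(k) = floor(m * (k/n)^2), the function from the question
    return math.floor(m * (k / n) ** 2)

def h_mul(k):
    # floor(m*k/n), written with integer arithmetic to avoid float rounding
    return (m * k) // n

print(Counter(h1(k) for k in range(n)))
# slot 0 receives every k < n/sqrt(m), about 90 keys; high slots get very few
print(Counter(h_mul(k) for k in range(n)))
# every slot receives exactly n/m = 32 keys
```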

Your hash function $h_{2}$ looks fine, although you must choose a convenient $l$. For example, if $l=pm$ with $p\in \mathbb{N}$, then $h_{2}(k) = 0$ for all $k\in U$; more generally, if $\gcd(l, m) = d > 1$, only the $m/d$ slots that are multiples of $d$ are ever used.
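A short check of the effect of $l$, again with my illustrative $m = 8$, $n = 2^8$:

```python
m, n = 8, 2 ** 8

def h2(k, l):
    # h2(k) = (l * floor(m*k/n)) mod m
    return (l * ((m * k) // n)) % m

print(sorted({h2(k, m) for k in range(n)}))  # l = m (p = 1): only slot 0
print(sorted({h2(k, 6) for k in range(n)}))  # gcd(6, 8) = 2: only even slots
print(sorted({h2(k, 5) for k in range(n)}))  # gcd(5, 8) = 1: all 8 slots used
```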

About your question "I've tried different values for $h_1$ and $h_2$ and it's obvious that when $k_{1}$ and $k_{2}$ are close then $h(k_1)=h(k_2)$. Is there a way to formally prove this?": we can have $k_{1}$ and $k_{2}$ really close $(k_{2}=k_{1}+1)$ with $h_{2}(k_{1})$ completely different from $h_{2}(k_{2})$. However, $$ \left\lfloor\dfrac{mk_{1}}{n}\right\rfloor = \left\lfloor\dfrac{mk_{2}}{n}\right\rfloor \Rightarrow h_{2}(k_{1}) = h_{2}(k_{2}), $$ and since $m \ll n$, each value of $\left\lfloor\frac{mk}{n}\right\rfloor$ is shared by roughly $n/m$ consecutive keys, so any two keys inside the same block of length about $n/m$ must collide. That is the formal statement behind what you observed.
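Both halves of this can be checked directly (same illustrative $m = 8$, $n = 2^8$, with $l = 5$ picked arbitrarily among the values coprime to $m$):

```python
m, n, l = 8, 2 ** 8, 5

def h2(k):
    return (l * ((m * k) // n)) % m

# keys in the same block of length n/m = 32 share floor(m*k/n), hence collide
assert h2(0) == h2(31)
# adjacent keys straddling a block boundary can land far apart
print(h2(31), h2(32))   # 0 5 -- k2 = k1 + 1, yet very different hash values
```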

Perhaps a better choice for your hash function would be just $h(k)=(lk) \bmod m$, where $l$ is preferably a prime that does not divide $m$, so that $\gcd(l, m) = 1$ and multiplication by $l$ permutes the residues modulo $m$. This way, even if $k_{1}$ and $k_{2}$ are close to each other, they are not likely to have the same hash value (which is also convenient for security reasons).
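For instance (with $m = 101$ and $l = 37$ as arbitrary illustrative choices satisfying $\gcd(l, m) = 1$), consecutive keys land $l$ slots apart instead of clustering:

```python
m, l = 101, 37   # illustrative; any l with gcd(l, m) = 1 behaves similarly

def h(k):
    return (l * k) % m

print([h(k) for k in range(5)])   # [0, 37, 74, 10, 47] -- neighbors spread out
```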

I said that if you know which values from $U$ you are going to use, you can search for a perfect hash function; in practice this search is frequently done by brute-force algorithms.
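A toy version of such a brute-force search might look like the sketch below. Note it searches over the standard family $h(k) = ((ak + b) \bmod p) \bmod m$ rather than $(lk) \bmod m$, because keys congruent modulo $m$ can never be separated by a single multiplier; the key set, $p = 10007$, and the retry count are all illustrative assumptions of mine.

```python
import random

def find_perfect_hash(G, m, p=10_007, tries=10_000):
    """Brute-force search for a, b making h(k) = ((a*k + b) % p) % m
    injective on the known key set G. Assumes p prime, p > max(G), |G| <= m."""
    for _ in range(tries):
        a = random.randrange(1, p)
        b = random.randrange(p)
        if len({((a * k + b) % p) % m for k in G}) == len(G):
            return a, b
    return None   # unlucky run: increase tries or enlarge the table

G = [3, 17, 256, 999, 4242, 8081]   # six keys known in advance
print(find_perfect_hash(G, m=8))    # some (a, b) with no collisions on G
```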

If you don't mind, I would like to recommend the book Open Data Structures by Pat Morin.
