[Math] Expected number of colliding pairs in hashing (example)

expected valuehash functionprobability

Suppose we use a hash function $h$ to hash $n$ distinct keys into an array $T$ of length $m$. Assuming simple uniform hashing — that is, with each key mapped independently and uniformly to a random bucket — what is the expected number of pairs of distinct keys that collide? (As above, distinct keys $x,y$ are said to collide if $h(x)=h(y)$)

  1. $n/m$
  2. $n/m^2$
  3. $n(n-1)/2m$
  4. $n^2/m$
  5. $n(n-1)/m$

This question came up in a quiz in Coursera assignment on Algorithms.

My thought is that there can be $n/m$ expected number of elements in one of any $m$ bucket, so pair of collisions is $\binom{n/m}{2}$ and this multiplied by $m$ gives total pairs colliding, but where is the flaw as answer according to this vision comes as $\color{red}{\frac{n(n-m)}{2m}}$

Best Answer

Let $\mathbf{1}_{ij}$ denote the indicator random variable for the event that keys $i$ and $j$ are mapped to the same bucket. You are then looking for $\mathbb{E}\left[ \sum_{1\leq i\leq j \leq n} \mathbf{1}_{ij} \right].$ By linearity of expectation, that is

$$ \sum_{1\leq i \leq j \leq n} \mathbb{E}[\mathbf{1}_{ij} ]$$

There are $\binom{n}{2} = \frac{n(n-1)}{2}$ such $i,j$ pairs that we are summing over. Each term is $$\mathbb{E}[\mathbf{1}_{ij}] = \mathbb{P}(\text{keys i and j are mapped to the same bucket}).$$

Since they are mapped uniformly and independently into $m$ buckets, the probability that two keys are mapped to the same bucket is $\frac{1}{m}.$ So the 3rd option is the correct answer.


The step going from "there are $n/m$ expected elements in any bucket" to "so the number of colliding pairs is.." does not work. For example, $n/m$ may not be an integer. Using this approach, we would need to figure out the distribution of the bucket counts. Say we have $2$ buckets and $2$ keys. The two buckets can have the following number of keys in them: $(0, 2), (0,2)$ and $(1,1). (1,1)$ occurs with probability $1/2$, the others have probability $1/4$. Then for each case, we apply your logic to get the number of pair collisions. For $(0,2)$, there's $0$ pairs colliding in bucket $1$ and $1$ pair in bucket $2$. So in this case, there is $1$ colliding pair and this case happens with probability $1/4.$ In the $(1,1)$ case there are $0$ colliding pairs in total, and this case has probability $1/2.$ In the $(0,2)$ case we again have $1$ colliding pair, and that case is probability $1/4.$ So the expected value for $n=2, m=2$ is $$1\cdot(1/4) + 0\cdot(1/2) + 1\cdot(1/4) = 1/2. $$

In principle we can do this same computation for general $n,m$ but these are the types of questions where linearity of expectation comes in so useful - when we only want an expected value, linearity of expectation often allows us to dodge actually computing the distribution.