[Math] Why is the expected length of a chain in hashing equal to $n/m$ (the load factor) due to independence of collisions

computer-science, probability

Say we have inserted $n$ keys into $m$ slots and that the Simple Uniform Hashing Assumption holds: each key is equally likely to be hashed into any slot, regardless of where the other keys have already been placed. This means the key insertions are independent of each other, and where one key lands does not affect where any other key goes. Does this independence assumption justify the following reasoning? Let $n_j = T[j]$ be the length of the chain at position $j$. Then:

$$ \mathbb{E}[T[j]] = \sum^n_{i=1}\frac{1}{m}$$

and we can do this summation $n$ times due to independence. This argument seems completely wrong to me, because independence, as far as I remember, guarantees $\Pr[A \cap B] = \Pr[A]\Pr[B]$ (equivalently, $\Pr[A \mid B] = \Pr[A]$), but says nothing about summation. However, that is the explanation given here:

https://youtu.be/0M_kIqhwbFo?t=38m20s

in one of MIT's OCW lectures. Is the explanation the professor gave justified? Is the summation justified by independence, or why else can we add things the way he suggested?

Best Answer

Unless I am missing something in the definition of the question, this indeed has nothing to do with independence. This is only linearity of expectation, which holds even for arbitrarily dependent random variables: $$\mathbb{E}[X+Y] = \mathbb{E}[X] + \mathbb{E}[Y]$$ (provided these expectations all exist).
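To see that linearity really does not require independence, here is a minimal sketch in Python (the setup is my own, not from the lecture): take $X$ to be a fair coin and $Y = X$, so the two variables are as dependent as possible, and compute the expectations exactly by enumerating the two equally likely outcomes.

```python
from fractions import Fraction

# Toy example: X is a fair coin taking values 0 or 1, and Y = X,
# so X and Y are maximally dependent. Expectations are computed
# exactly over the two equally likely outcomes.
outcomes = [0, 1]  # each occurs with probability 1/2

E_X = sum(Fraction(x) for x in outcomes) / len(outcomes)        # E[X] = 1/2
E_Y = E_X                                                       # Y = X, so E[Y] = E[X]
E_sum = sum(Fraction(x + x) for x in outcomes) / len(outcomes)  # E[X + Y]

# Linearity holds despite total dependence: E[X + Y] = E[X] + E[Y] = 1.
assert E_sum == E_X + E_Y
```

(By contrast, something like $\mathbb{E}[XY] = \mathbb{E}[X]\mathbb{E}[Y]$ *does* fail here: $\mathbb{E}[X^2] = 1/2 \neq 1/4$.)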

Here, we fix $j\in[m]$ and consider the random variables $X_1,\dots, X_n$, with $X_i$ corresponding to the indicator of the event "the $i$-th item goes to slot $j$." (This is a random variable taking value $0$ or $1$, and $\mathbb{E}\left[X_i\right]=1\cdot \mathbb{P}\{X_i=1\}+ 0 \cdot \mathbb{P}\{X_i=0\} = \mathbb{P}\{X_i=1\}$.)

Then $$\mathbb{E}[T_j] = \mathbb{E}\left[\sum_{i=1}^n X_i\right]=\sum_{i=1}^n \mathbb{E}\left[X_i\right] = \sum_{i=1}^n \mathbb{P}\{X_i=1\} = \sum_{i=1}^n \frac{1}{m} = \frac{n}{m}.$$
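As a quick sanity check of this formula, here is a small Monte Carlo simulation (the function name and parameters are my own, not from the lecture): hash $n$ keys into $m$ slots uniformly and independently, and average the chain length at one fixed slot over many trials.

```python
import random

def average_chain_length(n, m, slot, trials=10000, seed=0):
    """Monte Carlo estimate of E[T[slot]] under simple uniform hashing:
    each of the n keys lands in one of m slots uniformly at random,
    independently of the others."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        # Count how many of the n keys hash to the fixed slot in this trial.
        total += sum(1 for _ in range(n) if rng.randrange(m) == slot)
    return total / trials

n, m = 100, 10
estimate = average_chain_length(n, m, slot=3)
print(estimate)  # should be close to n/m = 10
```

The estimate concentrates around the load factor $n/m$, as the derivation above predicts, and the choice of `slot` does not matter by symmetry.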