Solved – Random Balls in Random Buckets: What are the characteristics of the distribution

distributionsuniform distribution

I have N buckets, numbered 1 to N.

I draw k random integers, uniformly distributed in the range 1 to N, with replacement, and for each integer I drop a ball into the corresponding bucket. k can be any size; specifically it can be any value from 2 to N, or larger than N.

  • What are the statistical characteristics (probability distribution,
    etc) of the numbers of balls in each bucket at the end?
  • What are the
    statistical characteristics of the numbers of buckets with 0,1,…k
    balls?

This question arises from a need to measure the 'goodness' (in some sense) of a hashing algorithm. Given a sample of distribution of keys among buckets, I need a measure of where it lies on a scale from 'Very good' to 'Very bad', and to be able to work out things like 'what are the chances of more than x balls in a bucket given k and N?'

Obviously, from the way I've written this, and from the random names I've assigned to my variables, I have absolutely no statistical sophistication. Please be gentle; I want to learn. Please feel free, for example, to change the variable names to something more conventional, or anything else.

Best Answer

Presuming each bucket is chosen with equal probability, and the balls are dropped independently, then the number of balls in each bucket would follow a multinomial distribution with $k$ trials and $p_1 = p_2 = \cdots = p_N = 1/N$. The probability of a particular bucket containing $x$ balls would be given by the binomial probability, which in this case is $\Pr(X=x) = \frac{k!}{x!(k-x)!} \frac{(N-1)^{k-x}}{N^k}$. You can use this to obtain $\Pr(X \geq x)$. If $k$ is large and $k/N$ is of reasonable magnitude, then we can use a Poisson distribution approximation (or a normal approximation).

The expected number of buckets containing exactly $x$ balls would just be $N\Pr(X=x)$. Of course, the totals in the buckets are not independent, but the totals of a few (say $M$) buckets will be approximately independent if $k$ is large compared to $M$. Thus we could approximate the variance of the number of buckets containing $x$ balls by $N\Pr(X=x) (1 - \Pr(X=x))$.

For example, if we have 1000 balls and 100 buckets, we'd expect 9.5 buckets to contain exactly 12 balls, but the standard deviation of this number is 2.9, so there is approximately a 95% probability it would lie between 4 and 15. (See R code).

> p = dbinom(12, size=1000, prob=1/100)
> 100*p
[1] 9.516152
> sqrt(100*p*(1-p))
[1] 2.934379
> 100*p + c(-1, 1)*1.96*sqrt(100*p*(1-p))
[1]  3.764769 15.267534