[Math] probability of collision with randomly generated ID

probability

Reworded: If 10,000 events occur each second, and each event needs a unique ID, how many random bits does each ID require to insure collisions are minimized?

$10,000$ events could occur near the same time, so the time stamp is not sufficient for a unique ID. I want to add a number of random bits (but not too many) to make collisions unlikely. This sounds like the birthday paradox problem.

If I add $12$-bits of randomness, that too low because $10,000$ events into $4096$ boxes must have some duplicates.

If I add $18$-bits of randomness, $$p()=1-\left(\frac{262143}{262144}\right)^{(10000(10000-1))/2)} = 100 \%$$

If I add $24$-bits of randomness, $$p()=1-\left(\frac{16777215}{16777216}\right)^{10000(10000-1))/2} = 94.9 \%$$

Did I do this right? My math sense expects this to be more than enough, since each event has $1677$ possible places to go without collision.

At $32$ bits, there is a $1.1\%$ chance, and at $36$ bits the probability of a collision is $727$ parts per million.

I am starting to understand why the standard UUID generators use $128$ bits.

What do you think?

Best Answer

Indeed this is the birthday paradox/problem for a very long calendar. The calculations sketched are not exact ones (see below), but they are good approximations for such large numbers. I've added an example at bottom to show the difference appearing with smaller values.

Suppose there are $N$ boxes (where in the suggested application, $N=2^k$ for simplicity). We would compute the chance that no collision occurs in putting $m=10000$ items in those boxes randomly, and subtract from $1$ to get the probability some (at least one) collision occurs.

Assuming each of the ten-thousand items is assigned a box with uniform and independent probabilities, we can think of the universe of outcomes as $N^{10000}$ points. To count the outcomes which have no more than one item in any box, we need to multiply the combinations of $N$ things taken $10000$ at a time by the number of permutations of those $10000$ distinct boxes:

$$ \text{Pr}(\text{no collision}) = \frac{\binom{N}{10000} 10000!}{N^{10000}} $$

Of course we are dealing with some large numbers here, so either a symbolic math program or an approximation is called for.

Wolfram Alpha says the probability of no collision for $N=2^{18}$ is vanishingly small, about $10^{-84}$. For $N=2^{24}$ the value is $0.0507661...$, which agrees with the value shown in the Question.

Wolfram Alpha also agrees with the value for 32 bits at $0.988427...$ chance of no collision.

Added: It appears the idea of the calculations in the Question is to treat all possible $\binom{10000}{2}$ pairs of entries as if they had an independent chance of colliding. A small calculation will show this is not true, although the agreement for 10000 items is striking.

Suppose eight boxes in which we place four items. Now the method in the Question treats this as six independent pairs of items, each with a $7/8$ths chance of not colliding. For all six pairs not to collide has then probability:

$$ \left( \frac{7}{8} \right)^6 \approx 0.448795319 $$

But the method above computes the chance of no collision when placing four items at random in the eight boxes at:

$$ \frac{\binom{8}{4} 4!}{8^4} = \frac{105}{256} = 0.41015625 $$

To do the calculation in the way typical of "birthday problem" presentations, consider that the first choice can be made in any of eight out of eight ways without a collision, the second choice made in seven out of eight ways (to avoid colliding with the first choice), and so on:

$$ \frac{8}{8} \cdot \frac{7}{8} \cdot \frac{6}{8} \cdot \frac{5}{8} = \frac{105}{256} $$

The difference between "independent pair collisions" and the birthday problem is explained by the probabilities of the pairs colliding not being independent. For example, if items $A$ and $B$ do not collide, and items $A$ and $C$ do not collide, then the chance of $B$ and $C$ not colliding is no longer $\frac{N-1}{N}$ but rather $\frac{N-2}{N-1}$ since both $B$ and $C$ are known to avoid $A$.

Related Question