[Math] What are chance of two randomly generated 4 digit strings being the same

probability

I am building a website where an alphanumeric ID will be generated.

As of now, I have it randomly grabbing a number or letter from this data set:

ABCDEFGHIJKLMNOPQRSTUVWXYZ 0123456789

So there are 36 possibilities.

If my math is right, if I make the ID 4 digits/characters long, the odds of a duplicate would be:

1/36 * 1/36 * 1/36 * 1/36 which equals 1/1,679,616.

In a nutshell, I want to eliminate (within reason) the possibility of a duplicate ID being created, but the shorter the ID the better.

If I went with 4 digits and ended up creating 15,000 IDs (like 4T6H, GF29, etc.), what are the chances that two of them will be the exact same?

If I generated ONE, it's obviously 1 in 1,679,616 (i think). But wondering how the odds are affected if I generate 15,000 of them.

Any insight/corrections on my math would be greatly appreciated.

Best Answer

If there are $n$ possible strings (in our case, $n = 36^4$) and we randomly choose a random string $m$ (in our case, $m = 15\,000$), then the probability that the random strings we choose are all distinct is

$$1 \cdot \left(1 - \frac{1}{n}\right) \left(1 - \frac{2}{n}\right) \cdots \left(1 - \frac{m - 1}{n}\right) = \frac{n!}{n^m (n - m)!} .$$ Because of the magnitude of $n$ and $m$, this is perhaps nontrivial to estimate directly, so we briefly describe an approximation method good enough for order-of-magnitude estimates in our case: For $k \ll n$ the Taylor expansion of $\exp x$ gives $$\exp \left(-\frac{k}{n}\right) \approx 1 - \frac{k}{n},$$ and we can approximate the left-hand side of the first display equation as $$\prod_{k = 0}^{m - 1} \exp\left(-\frac{k}{n}\right) = \exp \left(-\frac{1}{n}\sum_{k = 0}^{m - 1} k \right) = \exp \left[-\frac{(m - 1)(m - 2)}{2n}\right] .$$ For $m \gg 0$, this gives the approximation $$P \approx \exp\left(-\frac{m^2}{2 n}\right)$$ given in the comments.

For our values of $m, n$ this gives that the probability the strings are all different is $$P \approx 8.150 \cdot 10^{-30} .$$

This is reasonably close to the actual value, which one can evaluate with a CAS: $$\color{#bf0000}{\boxed{P \approx 6.700 \cdot 10^{-30}}} .$$

Using our approximation, we see that using $8$-character strings gives an estimated collision probability smaller than $10^{-4}$.

Related Question