Calculate birthday problem for collisions of a random number generator

birthdaycombinationscombinatoricsrandom

I'm creating a script that generates random numbers for invoices as fully described here.

The current version of my script includes logic to prevent collisions, but this means that every time there is a collision, the script has to generate a new number. If the script is running into collisions and regenerating too often, this will be a problem, so I need to figure out how often I can expect a given keyspace to generate collisions, such as the figures calculated here.

For example, given a key space of 100000 for 5 digits (00000-99999), what percentage of collisions can I expect?

What's the formula to work this out, and how can I apply it (my maths skills are very limited)?

Best Answer

I'm going to intepret your question in the following precise way, which might not be what you intended, but is my best guess at what you meant:

Numbers will be chosen uniformly at random, one at a time, from a set of $n$ possible numbers (say the positive integers from $0$ to $n-1$ inclusive). At each step, if the number drawn has already been drawn, then it will be discarded. Otherwise, it will be added to a list of useable customer invoice numbers. This process will repeat until we have $m$ numbers (which will of course all be distinct from each other) in our list of invoice numbers. The idea being that our business has $m$ customers needing invoice numbers.

Let $Y$ be the total number of times we have to generate a number, until we have managed to get $m$ distinct ones. Then $Y$ is a random variable, taking values between $m$ and $\infty$ inclusive. We would like to know the expected value of $Y$, that is, what the "long-run average" value of $Y$ would be, if we were to repeat the whole experiment many times.

This is an instance of the "Coupon collector's problem". It can be shown (see for example pg. 216 of the paper "Birthday paradox, coupon collectors, caching algorithms and self-organising search" by Flajolet, Gardier and Thimonier in Discrete Applied Mathematics 39 (1992), pp. 207-229) that the expected value of $Y$ is given by \begin{equation} \tag{1} E(Y)=n(H_n - H_{n-m}), \end{equation} where $H_n$ is the $n$th harmonic number: \begin{equation} H_n = 1 + \frac{1}{2} + \frac{1}{3} + \cdots + \frac{1}{n}, \end{equation} which Henry mentioned in his answer.

We can rewrite equation (1) as \begin{equation} E(Y)=n \left (\frac{1}{n-m+1} + \frac{1}{n-m+2} + \frac{1}{n-m+3} + \cdots + \frac{1}{n} \right ), \end{equation} but for a cheeky application of Wolfram Alpha, it's probably easier to use (1) as is. For example, let's suppose your keyspace has size one million ($n=10^6$) and that you have three and a half thousand customers ($m=3,500$). Then the expected number of calls to the random number generator is \begin{equation} E(Y)=10^6(H_{10^6}-H_{10^6 - 3500}). \end{equation} Wolfram Alpha says that this is about $3506$: Wolfram Alpha computation 1. You can use that link to play with the parameters $n$ and $m$. If you fix $m$ and increase $n$, you should see the expected number of calls decreases (as one would hope).

Bear in mind that the number of collisions is $Y-m$. However, the expected value of $Y-m$ is the same as the expected value of $Y$, minus $m$: \begin{equation} E(Y-m)=E(Y)-E(m)=E(Y)-m=n(H_n - H_{n-m})-m. \end{equation} Therefore in the example above, you would expect to have only about $3506-3500=6$ collisions on average.

You could do an experiment in code, to carry out many trials of the process, so that you could make a graph of average number of numbers drawn in the first $k$ trials, for $k$ between $1$ and $5000000$, say. You'd find that the points jumped around a lot, but eventually began to converge to the value of E(Y) predicted by the formula. This is an example of the "Law of Large Numbers". It gives an intuition for the meaning of the "expected value of a random variable".

EDIT:

In response to your recent comment above, thank you, and no problem ! It's a back-and-forth until we figure out what's what.

About that table, it looks to me as if the entries are given by the formula they give for $n(p;H)$, rounded to the nearest "nice" number. I found that out by trying a few on Wolfram Alpha. For example, Wolfram Alpha computation 2 gives about 77162.7, which appears in their table as 77000 (second row, second from last column). As I understand, the meaning of these numbers is: The least number of times you must generate a random number, before the probability that you have at least one collision is at least a certain amount. So if you generate 77162 random numbers uniformly from among 2^32 possible ones, then the probability that you'll have a collision is less than 50%, but if you generate 77163, then the probability that you'll have a collision is at least 50%. Note that this formula is only an approximation. The true threshold may not be exactly 77163, which is probably why they feel free to round it quite a lot, to 77000.

It sounds as if maybe you want to know how big the keyspace should be so that the probability of no collision is more than a certain value. That is, with probability at least such-and-such, you generate the $m$ invoice numbers you need, without any collisions. If so, then by rearranging their formula (and using the variable names I used above instead of theirs), we have \begin{equation} n \approx \frac{m^2}{2 \log \left ( \frac{1}{p} \right )}. \end{equation} They have used an approximation to $\log$ to get a simpler formula lower down the page, so let's do the same: \begin{equation} \tag{2} n \approx \frac{m^2}{2(1-p)}. \end{equation} Replacing $\log \left ( \frac{1}{p} \right )$ by $1-p$ is a reasonable approximation when $p$ is close to $1$ (which it will be in your application, if you want a high probability of no collisions). See the first plot here: Wolfram Alpha computation 3.

For example: You need $m=3500$ distinct random invoice numbers. You will choose them one at a time, uniformly at random, independently of each other, from a keyspace of size $n$. You want the probability of experiencing no collisions to be at least 99%. What's the smallest keyspace size $n$ that you can get away with ? It is approximately \begin{equation} \frac{3500^2}{2(1-0.99)} \approx 600000000, \end{equation} see here: Wolfram Alpha computation 4.

SECOND EDIT:

I have just checked the derivation of the approximation they use, and I found out that it gives an upper bound on the probability of no collisions. Therefore I am wrong to use the approximation to give lower bounds on that probability. However, the upper bound should be fairly close to the real value, provided $m$ is much less than $n$, so perhaps the above formula will still give you a useful rule of thumb. The exact value for the probability of no collisions is: \begin{equation} \tag{3} \prod_{k=1}^{m-1} \left ( 1 - \frac{k}{n} \right ) = \frac{1}{n^{m-1}} \frac{(n-1)!}{(n-m)!}. \end{equation} Indeed, if we plug that into Wolfram Alpha, using the $n=600000000$ calculated previously, and $m=3500$ as before, then we find that the probability of no collisions is about 98.98%, which is less than the 99% we'd hoped for: Wolfram Alpha computation 5 But it's pretty close, isn't it ?! If you wanted to insist on passing the 99% threshold, then you could just try larger values of $n$ one by one until you got what you needed. Perhaps you could use an interval bisection scheme, i.e. if you go too far, come half way back, etc. etc.

THIRD EDIT:

If instead of rounding the upper bound given by Wolfram Alpha computation 4 down to $600000000$, we used the actual upper bound, $612500000$, then it turns out to be enough to make the true value of the probability exceed 99%: Wolfram Alpha computation 6.

FOURTH EDIT:

To clarify, the formula (3) (you can use the left-hand side or the right-hand side, whichever is easier for you) will give you the exact probability of no collisions. You could use it to make a table of probabilities corresponding to different values of $n$, and then choose the smallest value of $n$ such that the probability was greater than 99% (or whatever you like).

To know what rough range of values of $n$ you should include in your table, you can start with the right-hand side of formula (2), and go up or down from that value of $n$, as needed. You should beware - the right-hand side of (2) MIGHT sometimes be quite far from the true threshold value of $n$ that you want. I fell into that trap myself - a mistake that I corrected in my "Second Edit". The problem there is like this: Suppose I'm trying to work out how old you are, and I know that your friend is older than you, and that he is less than $56$ years old. If your age is $x$ and his age is $y$ then I have $x < y < 56$, from which I know that you yourself are less than $56$. Suppose that you have another friend who is also older than you, let's say her age is $z$, and suppose I know that she is more than $56$ years old. Then I have $x < z$ and $z > 56$, but these two facts together don't help me at all with estimating your age, $x$: You might be $10$, $21$, $75$, $100$, I just don't know. The first pair of inequalities work together, but the second pair don't.

Related Question