Estimate collision risk of *partially* random strings

combinatoricsconditional probabilityprobabilityrandom

I've been tasked with designing a compact (64 bits) ID, and I am trying to assess the risk of collisions based on various levels of throughput in the system. If all the bits are randomized, then I believe this is just the birthday problem, so we can estimate chance of $\ge1$ collision using:

$$
p = 1-q \approx 1-e^\frac{-n(n-1)}{2\times 2^{64}} \approx 1-e^\frac{-n^2}{2^{65}}
$$

Where $q$ is the probability that no collisions occur, and $n$ is the total number of IDs generated in the system. If I want to estimate risk over 5 years, then I just need to multiply throughput $\mu$ by some unit rate conversions to get the expected $n$ over 5 years.

However, completely random IDs aren't actually optimal for a lot of other reasons (readability, database storage…), so the alternative is to reduce the number of random bits and dedicate a prefix to some sort of date representation or timestamp, and this is where things start to fall apart.

My first thought was to divide the problem by prefix:

  1. Estimate the number of IDs per time-prefix, $n_p$, using $\mu$ and the length of time that a single prefix would cover.
  2. Use $n_p$ to arrive at the chance that there are no collisions $q_p$, using the above formula
  3. Calculate ${q_p}^t$ where $t$ is the number of periods in 5 years. This should give us an estimate of the chance that there are no collisions across all of the periods.
  4. Subtract from 1 to arrive at the risk that there is at least one repetition of a random suffix within one of the sets of ID's that share the same prefix.

$$
q = \prod^t_0{q_p} = {q_p}^t \approx \left({e^\frac{-{n_p}^2}{2^{R+1}}}\right)^t \\
p = 1-q \approx \left({e^\frac{-{n_p}^2}{2^{R+1}}}\right)^t
$$

Where $R$ is length of the randomly generated suffix of the id.

However, this approach breaks down if the average number of IDs per possible time-prefix is small. For example, if we're using a millisecond-precision timestamp, but we're only expecting $\mu = 1/sec$, then $n_p = \frac{1}{1000}$. The only thing I could think to do is to use the above, but setting $n_p = 2$ and $t$ equal to the number of expected time-prefix collisions over the span of 5 years. But one, I'm not sure this is reasonable, and two, I'm not sure how to calculate $t$.

I have a math degree (though I'm admittedly rusty as it's been a decade), so even recommendations on what to read up on would be appreciated. I had a thought to look into how UUID collision risk is calculated, but all I've been able to find is people focusing on the random part of the UUID and using birthday-problem math to demonstrate that the universe isn't old enough to expect a single collision yet. Unfortunately, I can't just throw more random bits at the problem!

Best Answer

38 bits is enough to assign a unique time stamp to each millisecond for the next 8.7 years, leaving 26 bits to randomly assign an individual, which is $1.84\times 10^{19}$ individuals.

If individuals are arriving randomly at $\mu=1$ per sec, then we can assume the waiting time between arrivals is exponentially distributed with rate mu. There is a ms collision if the waiting time is $<1$ ms, and the number of individuals arriving in 1 ms is Poisson distributed with $\lambda=10^{-3}$. We can check the probability that we expect $2, 3, \ldots, 10$ individuals to arrive in the same ms with (in R:) dpois(2:10, 1e-3).

Now, multiply that by $2^{38}$ to get the expected number over 8.7 years, which gives us $P(n=2)=137302$, $P(n=3)=46$, and $P(n\ge 4)=0$ (I checked that, it's about 0.0114 collisions for $n\in 4\ldots 100$, so we can safely stop at $n=3$).

So then we apply the birthday problem equation. We expect 137302 times where 2 individuals arrive in the same ms, and 46 times when 3 individuals arrive in the same ms, giving

\begin{align} p &\approx 1-\prod_n {1-P(\text{collision for $n$ arrivals})} \\ &= 1 - \exp\left(-\frac{2^2}{2^{27}}\right)^{137302} \times \exp\left(-\frac{3^2}{2^{27}}\right)^{46} \\ &= 0.9959134 \end{align}

In other words, there's a $0.409\%$ chance of 1+ collisions in 8.7 years, assuming you have individuals arriving at a rate of 1/sec.

Just to test a maybe more likely rate of $\mu=1$ per min we'd get the expected no. of collisions in the same ms as dpois(2:10, 1e-3/60) * 2^38, which gives 38 collisions where 2 arrive in the same ms, and a probability of a collision of $1.13\times 10^{-6}$.