Threshold for the “number of UUIDs generated per millisecond” at which the collision probability of UUID v4 and UUID v7 is equal

computer scienceprobabilityrandom

I post this question here instead of StackOverflow because the mathematical element is stronger than engineering one. First, let me clarify the definition of terms.

  • UUID v4: Random value of $122$ bits ($2^{122}$ possible values)
  • UUID v7: Random value of $74$ bits ($2^{74}$ possible values), independent every millisecond

UUID v4 starts with an almost zero chance of collision, but as a certain number of UUIDs accumulate, the collision probability increases gradually due to the birthday paradox problem.

On the other hand, if UUID v7 is generated less than once per millisecond, the collision probability is absolutely zero. If it is generated more than twice per millisecond, collisions occur at a certain probability.

For simplicity, let's only consider what happens just after 50 years ($31557600000 \cdot 50$ milliseconds), and avoid treating time $t$ as a variable. If UUIDs are generated at a rate of $x$ per millisecond, after 50 years, $(31557600000 \cdot 50)x$ UUIDs will have accumulated.

  • UUID v4 is affected by the number of accumulated UUIDs, so it is necessary to consider both the collision probability between UUIDs that are about to be created and the collision probability with UUIDs created in the past.
  • For UUID v7, it is enough to consider only the collision probability between UUIDs that are about to be created.

Based on this, I want to know the minimum value of $x$ at which the collision probability of UUID v7 becomes higher than that of UUID v4 after 50 years. Similarly, I want to calculate the patterns for 0 seconds, 5 seconds, 1 year, 5 years, 10 years, 50 years, 100 years, 500 years, and 1000 years, and observe how they change.

I tried to get ChatGPT to solve this problem, but it gave me answers that seemed wrong, or said there were no solutions. Could it be that the number of UUIDs generated per millisecond $x$ is not a parameter, and everything is determined by the time $t$?

I'm not very good at math and I'm having trouble. I would appreciate your help.

Best Answer

Computing a formula

Suppose we generate $g$ UUIDs per millisecond for $T$ milliseconds total. Let $B(n, k) =$ (probability of no collisions in the Birthday Problem when we have $n$ days and $k$ birthdays). The same wikipedia article you linked includes formulas and approximations for $P$ which we'll use further down.

Using UUIDv4, we will have $Tg$ total UUIDs generated. Let $N_4$ be the total number of possible UUIDs, i.e. $N_4 = 2^{122}$. Then the probability of finding no collisions is $P_{v4} = B(N_4, Tg)$.

Using UUIDv7, we will generate $g$ UUIDs during the first millisecond. Let $N_7 = 2^{74}$ be the total number of possible UUIDs; then we get a non-collision probability of $p = B(N_7, g).$ That's just the probability of avoiding collision during one particular millisecond though. When we run the process for $T$ ms, the overall probability of getting zero collisions is the product of chance to get zero collisions in each individual ms: $$P_{v7} = p^T = B(N_7,g)^T.$$

Your original question was: Given some value of $T$, what choice of $g$ will make $P_{v4} = P_{v7}$?

The wikipedia article gives the exact formula $$B(n, b) = \frac{n!}{(n-b)! n^b}.$$

Using the formula to answer your question

The formula above is hard to work with due to the factorials of huge numbers, but fortunately I can just ask Mathematica to do the messy computations for me. Even so, I needed to do the computations in "log space" to keep the size of the numbers under control.

You wanted to consider fixed $t$ and find a cutoff value of $g$, but after playing around a bit I think we can get more insight by first fixing $g$ and looking for the cutoff value of $t$. Note that with $g$ fixed, if $t$ is small then UUIDv4 will win (fewer collisions) since v7's advantage is the independence per ms and small $t$ means there aren't many different milliseconds. If $t$ is big then v7 will win; for example, v4 will eventually just have guaranteed collisions once we've used up all $2^{122}$ possible values.

Here's my code:

$MaxExtraPrecision = 10000;
n4 = 2^122;
n7 = 2^74;

B[n_, k_] := Product[(n - j)/n, {j, 0, k - 1}]
logB[n_, k_] := Sum[Log[n - j] - Log[n], {j, 0, k - 1}]

(*log of P_4*)
lp4 = logB[n4, t g];
(*log of P_7*)
lp7 = t*logB[n7, g];

(*
Given g, this function finds the first t such that lp4[g,t] <= \
lp7[g,t].
This t is the "cutoff point", measured in milliseconds.
*)
binsearch[gg_] := Module[
   {pred, lo, hi, mid},
   pred[tt_] := Simplify[lp4 > lp7 /. {g -> gg, t -> tt}];
   lo = 1;  (* pred[lo] is true *)
   hi = 1; While[pred[hi], hi *= 2;]; 
   (* pred[hi] is false. *)
   While[hi - lo > 1,
    mid = Floor[(lo + hi)/2];
    If[pred[mid], lo = mid, hi = mid];
    ];
   Return[hi];
   ];
gvalues = {1, 2, 3, 5, 10, 10^2, 10^3, 10^4, 10^5, 10^10, 10^20, 
   10^22};
Table[{gvalues[[k]], binsearch[gvalues[[k]]]}, {k, 1, 
   Length[gvalues]}] // TableForm

and here are the results:

g        smallest t such that UUID_v7 wins

1        1
2        140737488355329
3        187649984473772
5        225179981368526
10       253327479039591
100      278660226943550
1000     281193501733946
10^4     281446829212985
10^5     281472161960889
10^10    281474976682509
10^20    281474976710656
10^22    281474976710656

Note that I stopped at $10^{22}$ because we can't use more than $g = 2^{74} \approx 1.89 \times 10^{22}$ hashes or else UUIDv7 will have guaranteed collision in the first ms.

Based on the results above, I conclude that as long as we generate at least $g=2$ UUIDs per millisecond, the cutoff point will be at least $140737488355329$ ms, which comes out to $4459.7$ years. If we are generating several UUIDs per second then the cutoff point will be around $281474976710656$ ms or $8919.4$ years.

In other words, if you are only focused on avoiding hash collisions then you should pretty much always choose v4 instead of v7. There may be other reasons for using UUID v7 though; for example, v7 will generate nearby UUIDs during each millisecond which can help with database index locality and thus improve runtime (see e.g. source). Overall, neither version is a strict upgrade over the other, and you just have to consider the tradeoffs when choosing a UUID algorithm for your specific application.