R Uniform Distribution – How to Fix ‘Runif’ Generated Number Repeats in Less Than 100,000 Steps

rrandom-generationuniform distribution

After executing the code

RNGkind(kind="Mersenne-Twister")  # the default anyway
set.seed(123)
n = 10^5
x = runif(n)
print(x[22662] == x[97974])

TRUE is output!

If I use, e.g., RNGkind(kind="Knuth-TAOCP-2002") similarly happens: I get "only" 99 995 different values in x. Given the periods of both random generators, the results seem highly unlikely.

Am I doing something wrong? I need to generate at least one million random numbers.

I am using Windows 8.1 with R version 3.6.2; Platform: x86_64-w64-mingw32/x64 (64-bit) and RStudio 1.2.5033.


Additional findings:

  1. Having a bag with $n$ different balls, we choose a ball $m$ times and put it back every time. The probability $p_{n, m}$ that all chosen balls are different is equal to ${n\choose m} / (n^m m!)$.
  2. R documentation points to a link where the implementation of Mersenne-Twister for 64-bit machines is available: http://www.math.sci.hiroshima-u.ac.jp/~m-mat/MT/emt64.html

The uniform sampling from $[0, 1]$ interval is obtained via choosing a random 64-bit integer first, so I computed the above probabilities for the 64-bit and (when $p_{2^{64}, 10^5}$ turned out to be rather low) 32-bit case:
$$
p_{2^{64}, 10^5}\doteq 0.9999999999972… \qquad p_{2^{32}, 10^5} \doteq 0.3121…
$$

Then, I tried 1000 random seeds and compute the proportion of the cases when all generated numbers are different: 0.303.

So, currently, I assume that for some reason, 32-bit integers are actually used.

Best Answer

The documentation of R on random number generation has a few sentences at its end, that confirm your expectation of 32-bit integers being used and might explain what you are observing:

Do not rely on randomness of low-order bits from RNGs. Most of the supplied uniform generators return 32-bit integer values that are converted to doubles, so they take at most 2^32 distinct values and long runs will return duplicated values (Wichmann-Hill is the exception, and all give at least 30 varying bits.)

So the implementation in R seems to be different to what is explained on the website of the authors of the Mersenne Twister. Possibly combining this with the Birthday paradox, you would expect duplicates with only 2^16 numbers at a probability of 0.5, and 10^5 > 2^16. Trying the Wichmann-Hill algorithm as suggested in the documentation:

RNGkind(kind="Wichmann-Hill") 
set.seed(123)
n = 10^8
x = runif(n)
length(unique(x))
# 1e8

Note that the original Wichmann-Hill random number generator has the property that its next number can be predicted by its previous, and therefore does not meet non-predictability requirements of a valid PRNG. See this document by Dutang and Wuertz, 2009 (section 3)