After executing the code
RNGkind(kind="Mersenne-Twister") # the default anyway
set.seed(123)
n = 10^5
x = runif(n)
print(x[22662] == x[97974])
TRUE
is output!
If I use, e.g., RNGkind(kind="Knuth-TAOCP-2002")
similarly happens: I get "only" 99 995 different values in x
. Given the periods of both random generators, the results seem highly unlikely.
Am I doing something wrong? I need to generate at least one million random numbers.
I am using Windows 8.1 with R version 3.6.2; Platform: x86_64-w64-mingw32/x64 (64-bit) and RStudio 1.2.5033.
Additional findings:
- Having a bag with $n$ different balls, we choose a ball $m$ times and put it back every time. The probability $p_{n, m}$ that all chosen balls are different is equal to ${n\choose m} / (n^m m!)$.
- R documentation points to a link where the implementation of Mersenne-Twister for 64-bit machines is available: http://www.math.sci.hiroshima-u.ac.jp/~m-mat/MT/emt64.html
The uniform sampling from $[0, 1]$ interval is obtained via choosing a random 64-bit integer first, so I computed the above probabilities for the 64-bit and (when $p_{2^{64}, 10^5}$ turned out to be rather low) 32-bit case:
$$
p_{2^{64}, 10^5}\doteq 0.9999999999972… \qquad p_{2^{32}, 10^5} \doteq 0.3121…
$$
Then, I tried 1000 random seeds and compute the proportion of the cases when all generated numbers are different: 0.303.
So, currently, I assume that for some reason, 32-bit integers are actually used.
Best Answer
The documentation of R on random number generation has a few sentences at its end, that confirm your expectation of 32-bit integers being used and might explain what you are observing:
So the implementation in R seems to be different to what is explained on the website of the authors of the Mersenne Twister. Possibly combining this with the Birthday paradox, you would expect duplicates with only 2^16 numbers at a probability of 0.5, and 10^5 > 2^16. Trying the Wichmann-Hill algorithm as suggested in the documentation:
Note that the original Wichmann-Hill random number generator has the property that its next number can be predicted by its previous, and therefore does not meet non-predictability requirements of a valid PRNG. See this document by Dutang and Wuertz, 2009 (section 3)