MATLAB: Does MATLAB have a Birthday Problem

birthday paradoxmathematicsrandomrandom vectors

The Birthday Paradox or problem asks for the probability that in a room of n people, 2 or more have the same birthday (not date), assuming all years have N = 365 days. It is called a paradox because most people are surprised by the answer when there are (say) 30 people in the room.
Many applications require long sequences (or large vectors) of random numbers. In MATLAB these are supplied by the simple statement r = rand(n,1). This statement fills the vector r(n,1) with double precision random numbers uniformly distributed on (0,1).
Here is my question:
What is the probability that duplicates occur in r = rand(n,1), as n gets large?
I have an answer to this question but I would like to see what others come up with so that I can check the validity of my answer.

Best Answer

Reply from Derek O'Connor:
I must apologize for the omission of the RNG used. It was in my original draft but somehow got left out. I'm using the default Matlab rand(n,1) Here is what the docs (R2008a) say:
"Use the Mersenne Twister algorithm by Nishimura and Matsumoto (the default in MATLABĀ® Versions 7.4 and later). This method generates double-precision values in the closed interval [2^(-53), 1-2^(-53)], with a period of (2^19937-1)/2."
My answer is here, which is based on the RNG information above: http://www.scribd.com/doc/48243415/The-Birthday-Paradox-and-Random-Number-Generation
A summary of my analysis is that the probability of duplicates rises significantly after n = 10^8 and has converged to 1 at n = 10^9.
Here are the results of testing, with 5 runs for each value of n.
n = 1*10^8, Prob = 0.426..., Duplicates = (1, 0, 1, 0, 1). Avg = 0.6
n = 5*10^8, Prob = 0.999..., Duplicates = (18, 18, 18, 16, 20). Avg = 16.0
n = 1*10^9, Prob = 1.000..., Duplicates = (66, 64,43, 48, 62). Avg = 56.6
I am not an expert on the mathematics or implementation of RNGs and tend to accept what the docs tell me until I'm told otherwise.
This is not "just a mathematics" question. I say this at the end of my notes above which starts with the real question behind my original question:
What are the implications of getting duplicates in long vectors? This question cannot be answered in general because the answer depends on how these random vectors are used. If they are being used to test a new sorting algorithm, for example, then it is difficult to see how a few duplicates would matter.
I would not be so sanguine about the use of such vectors in medical research. A lot of DNA research seems to involve large amounts of computation which may use large vectors.\footnote{I don't really know what is done, so this is purely speculative.} This is particularly troubling, given the appalling record that bio-medical researchers have in misusing and abusing probability, statistics, and software.
I would certainly be uneasy if I found out that the Boeing or Airbus I was flying in used long random vectors as part of their avionics software. Hence my often-repeated question to students:
Would you fly in a plane whose software was written by you? "
Regards,
Derek O'Connor.
------------------------
I'm replying to comments here because I can't use that tiny comment window.
|Dear Bruno, Walter, et al.
I realize that I have not been clear in my answers to your questions and that my statement that rand picks "integer multiples of 2^(-53)" is just plain wrong. Before I get into the floating point aspects of the problem (ugh!), let me state clearly what probability model the analysis is based on:
I assumed that r = rand(n,1) picks r1, r2,...,rn from a bag of N = 2^(53)-1 'objects', WITH replacement, independently, and with equal probability Pr(ri) = 1/N. In other words its the standard "random sample of size n from a population of size N, with replacement" model. The standard textbook analysis and results follow, giving 1-P(N,n) = Pr(1 or more duplicates).
I think these are reasonable assumptions to make about any good random number generator, in any programming language. Hence the analysis and probability calculations are not specific to Matlab, and carry over to any programming language. Of course, for each situation we need to know N, but that is the only parameter needed.
But what about the 'objects' in the bag? In this case they are double precision floating point numbers. But who cares? The Matlab program that counts the number of duplicates merely tests for the equality of two objects. These objects could be integers, singles, doubles, or even matrices.
But you sorted these objects before you tested for equality, and this implies a '<' on the set of objects. Yes, the '<' was used to reduce the search for duplicates to O(nlogn). If '<' does not apply to the bag of objects then we must use an O(n^2) search algorithm that uses '=' only.
And now to the floating point aspects of the problem, where I'll meet my Waterloo, more than likely.
An IEEE 64-bit double precision floating point number is represented as
fl(x) = +/- s x 2^e where s (significand) is a 52-bit integer and e (exponent) is an 11-bit integer and the sign is 1 bit. The precision is p = 53 bits, and machine epsilon emach = 2^(1-p) = 2^(-52) is the distance between fl(1)=1 and the next higher fpn. = 1+emach = 1+2^(-52). A floating point number is normalized if the first bit of s is 1. In base 2 this can be a hidden or implied bit which is not actually stored.
Distribution Properties:
  1. There are 2^(p-1) = 2^(52) normalized dpfpns in [1, 1-2^(-52)]
  2. These dpfpns are uniformly distributed with spacing emach = 2^(-52).
  3. These dpfpns have the form x = 1 + k*emach, k = 0,1,2,...,2^(52)-1.
These numbers have the following bit patterns, with fixed exponent 2^0:
k = 0 1.000...000 x 2^0
k = 1 1.000...001 x 2^0
...
k = 2^(52)-3 1.111...101 x 2^0
k = 2^(52)-2 1.111...110 x 2^0
k = 2^(52)-1 1.111...111 x 2^0
If we wanted uniform numbers in [1, 1-2^(-52)] we could randomly pick an integer k between 0 and 2^(52)-1 and return r = 1+k*emach. Alternatively, we may view a random number as one of the 2^(52) possible bit patterns (000...000) ... (111...111).
Matlab's rand however, picks dpfpns from [2^(-53), 1-2^(-53)]. If we assume that all significand bit patterns are generated then using 2 and 3 above we get
x = 2^(-53) + k*2^(-53)*emach = 2^(-53) + k*2^(-105), k = 0,1, 2^(52)-1.
But this does not give the correct result. To get numbers in the range [2^(-53), 1-2^(-53)] we need to change the exponent. As Walter pointed out 53 different exponents are used. And as Bruno pointed out this means non-uniformity.
Right now I'm stuck. I'm missing something obvious but I can't see what it is. Floating point Aghhhh!
Does this invalidate the Birthday problem analysis, calculations and Matlab test results? No, because I'm not assuming anything about the values of the random numbers generated. I test only for equality. However, I'm not sure: should N = 2^(52) or 2^(53)?
Regards,
Derek.
|