MATLAB: Randsample issues and generating random numbers from large populations

populationrandomrandsamplesamplinguniform

The randsample function is pretty handy — I supply a population in a vector and a number of samples I'd like drawn uniformly without replacement. Usually my population is a simple, increasing set of numbers, like 1 to 100000.
But I'm working now on a much larger problem where I'd like to generate, say, 500,000 samples from the population 1:2^60, or something huge. The problem is that the first thing randsample does is create a vector x = zeros(1,2^60), which of course breaks immediately as the maximum variable size is violated.
My question is this — if my needs are simple enough, is it possible to generate the samples without storing the entire population in a vector? Could I generate them one at a time? Or could I somehow get the vector of samples as before but without storing the population?
A cursory idea was to generate smaller random numbers and concatenate them, but then I'm not sure what I'm losing on the theoretical side…
Any advice, statistical or MATLAB would be appreciated.

Best Answer

Warning #1: the below has potentially unlimited running time, with the expected running time increasing exponentially as the sample size approaches the population size.
SS = 500000; %number of samples wanted
Population = 2^60; %see warning #2!!
chunk = ceil(Population * rand(1,SS)); %includes duplicates
[uchunk, ua, ub] = unique(chunk, 'first');
chunk = uchunk(ua); %random order and duplicates after first removed
lc = length(chunk);
while lc < SS
newchunk = ceil(Population * rand(1,SS-lc));
[uchunk, ua, ub] = unique(newchunk, 'first');
newchunk = uchunk(ua);
newchunk( ismember(newchunk, chunk) ) = [];
chunk = [chunk, newchunk];
lc = length(chunk);
end
samples = chunk;
Warning #2: the above code will not work when the population size exceeds 2^53, which is about 10^15. rand() generates from 1 * 2^(-53) to (2^53 - 1) * 2^(-53) -- only (2^53 - 2) different choices.
Generating over a larger population is not just a matter of using a version of rand() with higher resolution: at the upper end of the range, near 1, rand() is already as high a resolution as is supported by double precision floating point numbers. It thus becomes necessary to replace the entire
ceil(Population * rand())
technique. Doing that is practical up to a population size of 2^64 (the range of uint64), which gets you to about a population size of about 1.8E19. If there is not already something in the MATLAB File Exchange to generate random integers, you can add an interface onto MT19937-64.c
If you were to need to go beyond 2^64 for your population size, you would start having to use multiple integers to represent the sample numbers, which would complicate things somewhat.