For an application I'm working on, I need to sample a small set of values from a very large data set: on the order of a few hundred taken from about 60 million (and growing).
Usually I use the technique of seeing if a uniform random number $r \in (0, 1)$ is less than $S \over T$, where $S$ is the number of sample items I still need, and $T$ is the number of items in the set that I haven't considered yet.
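For reference, that per-item $S/T$ test is selection sampling (Knuth's Algorithm S). A minimal Python sketch of it; the function and variable names (`select_sample`, `needed`, `remaining`) are illustrative, not from the original:

```python
import random

def select_sample(data, S):
    """Selection sampling: one uniform draw per item considered.

    Select each item with probability needed/remaining, where `needed` is
    the number of sample slots still to fill and `remaining` is the number
    of items not yet considered. This yields exactly S items, each subset
    of size S being equally likely.
    """
    sample = []
    needed = S
    remaining = len(data)
    for value in data:
        # Equivalent to: random.random() < needed / remaining
        if random.random() * remaining < needed:
            sample.append(value)
            needed -= 1
            if needed == 0:
                break
        remaining -= 1
    return sample
```

The selected items come out in their original order, which is sometimes a useful side effect.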
However, with this new data, I don't have time to roll the die for each value; there are too many. Instead, I want to generate a random number of entries to "skip", pick the value at that position, and repeat. That way I can just roll the die and access the list S times. (S is the size of the sample I want.)
I'm hoping there's a straightforward way to do that and create an unbiased sample, along the lines of the S/T test.
-
To be honest, approximately unbiased would be OK.
-
This is related (more or less a follow-on) to this person's question:
Simple random sample without replacement
- One more side question… the person who first showed this to me called it the "mailman's algorithm", but I'm not sure if he was pulling my leg. Is that right?
EDIT:
Any ideas about solutions that don't involve holding a separate array, or mutating the original? I'd like to just roll a die, skip, select, then roll a die again. Otherwise the problem becomes O(S) for the selections, and O(S) for the data requirements. If my sample size grows over time (and I know it will), I'm kind of back to square one. And in my practical solution, I can only read the source data (although I could stand up something in front of it… still….)
I am sorry about changing the parameters.
I was thinking about it some, and in the spirit of the other solution, all I really have to do is determine where the first of the $S$ selections should land. Then I can forget that value and everything before it, and repeat the process until all the values are selected.
So… let's say I have 100 tennis balls and I'm going to drop them fairly, at most one per bucket, into 60 million lined-up paint buckets. What would be the lowest bucket with a ball after I dropped them all? If I can model that in an unbiased way, then it just becomes rinse and repeat.
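One way to model that first occupied bucket directly is to draw the skip length from its exact distribution: applying the $S/T$ rule item by item gives $P(\text{skip} \ge k) = \binom{T-k}{S} / \binom{T}{S}$, and repeating "draw a skip, select, shrink the problem" is essentially Vitter's sequential-sampling Method A. A sketch under that assumption (function and variable names are mine):

```python
import random

def skip_sample(T, S):
    """Draw S sorted indices from range(T) by generating skip lengths.

    Uses one uniform draw per *selection* (S draws total), not one per item.
    The skip s before each selection satisfies
        P(s >= k) = C(T-k, S) / C(T, S),
    computed incrementally as a running product.
    """
    indices = []
    index = -1
    while S > 0:
        V = random.random()
        s = 0
        quot = (T - S) / T          # P(s >= 1)
        while quot > V and s < T - S:
            s += 1
            quot *= (T - S - s) / (T - s)   # P(s >= s+1)
        index += s + 1              # skip s items, select the next one
        indices.append(index)
        T -= s + 1                  # items no longer under consideration
        S -= 1                      # one fewer sample needed
    return indices
```

The inner loop still costs time proportional to each skip in the worst case, but only $S$ random numbers are consumed, and only the selected positions are ever accessed.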
Best Answer
A classical algorithm to generate a random permutation of an array (with uniform probability) can be adapted for this. Let us assume that your data is stored in an array.
Let `data_size` be the size of your data, and let `curr_sample_size` be the number of samples taken so far, initialized at $0$. While `curr_sample_size` is strictly lower than the desired sample size $S$, do:

- let `pos` be a random position between `curr_sample_size` and `data_size` $- 1$;
- swap the values at positions `curr_sample_size` and `pos` in your data array;
- increase `curr_sample_size` by $1$.

Your sample is now in the first $S$ values of your data array.

If you need to leave your data array in the same order, you can record the list of `pos` values you generated, and swap back afterwards, once you have recovered all the samples at the beginning of the array. Given the list of positions, putting the array back into place is done as follows:

- let `curr_sample_pos` be initialized to $S - 1$;
- for each `pos`, from the last one to the first one, do:
  - swap the values at positions `pos` and `curr_sample_pos` in the data array;
  - decrease `curr_sample_pos` by $1$.
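This partial Fisher-Yates shuffle can be sketched in Python as follows (a sketch of the algorithm above; the names `sample_in_place` and `restore` are mine):

```python
import random

def sample_in_place(data, S):
    """Partial Fisher-Yates shuffle: after the loop, the sample sits in data[:S].

    Returns the sample and the list of swap positions, which allows the
    array to be restored to its original order afterwards.
    """
    positions = []
    for i in range(S):
        # random position between curr_sample_size and data_size - 1, inclusive
        pos = random.randrange(i, len(data))
        data[i], data[pos] = data[pos], data[i]
        positions.append(pos)
    return data[:S], positions

def restore(data, positions):
    """Undo the swaps in reverse order, returning data to its original order."""
    for i in reversed(range(len(positions))):
        pos = positions[i]
        data[i], data[pos] = data[pos], data[i]
```

Note that this takes $O(S)$ time and $O(S)$ extra space for the position list, and it requires a mutable array, which is exactly the constraint the question's EDIT objects to; the skip-based approach avoids both.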