[Math] Simple random sampling without replacement of huge dataset


For an application I'm working on, I need to sample a small set of values from a very large data set: on the order of a few hundred taken from about 60 million (and growing).

Usually I use the technique of checking whether a uniform random number $r \in (0,1)$ is less than $\frac{S}{T}$, where $S$ is the number of sample items I still need and $T$ is the number of items in the set that I haven't considered yet.
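In code, that test looks roughly like this (a minimal Python sketch; the function name `select_sample` is just a placeholder, and it assumes the data is an in-memory sequence):

```python
import random

def select_sample(data, sample_size, rng=random):
    """Walk the data once, keeping each item with probability
    (samples still needed) / (items not yet considered)."""
    sample = []
    needed = sample_size       # S: sample items still needed
    remaining = len(data)      # T: items not yet considered
    for item in data:
        if rng.random() < needed / remaining:
            sample.append(item)
            needed -= 1
            if needed == 0:
                break
        remaining -= 1
    return sample
```

The catch, as noted below, is that it still touches every item, so the pass is O(T) even when $S$ is tiny.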

However, with this new data, I don't have time to roll the die for each value; there are too many. Instead, I want to generate a random number of entries to "skip", pick the value at that position, and repeat. That way I can just roll the die and access the list S times. (S is the size of the sample I want.)

I'm hoping there's a straightforward way to do that and create an unbiased sample, along the lines of the S/T test.

  • To be honest, approximately unbiased would be OK.

  • This is related (more or less a follow-on) to this person's question:

Simple random sample without replacement

  • One more side question… the person who first showed this to me called it the "mailman's algorithm", but I'm not sure if he was pulling my leg. Is that right?

EDIT:

Any ideas about solutions that don't involve holding a separate array, or mutating the original? I'd like to just roll a die, skip, select, then roll a die again. Otherwise the problem becomes O(S) for the selections, and O(S) for the data requirements. If my sample size grows over time (and I know it will), I'm kind of back to square one. And in my practical solution, I can only read the source data (although I could stand up something in front of it… still….)

I am sorry about changing the parameters.

I was thinking about it some, and in the spirit of the other solution, all I really have to do is determine where the first of the $S$ selections should land. Then I can forget that value and everything before it, and repeat the process until all the values are selected.

So… let's say I have 100 tennis balls and I'm going to drop them fairly, at most one per bucket, into 60 million lined-up paint buckets. What would be the lowest bucket with a ball after I dropped them all? If I can model that in an unbiased way, then it just becomes rinse and repeat.
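One way to model that lowest bucket: with $T$ buckets and $S$ balls, the chance that the first $k$ buckets are all empty is $\binom{T-k}{S} / \binom{T}{S}$, so the position of the lowest ball can be drawn by inverting that tail probability. A rough Python sketch (the function name and the bisection search are just one possible choice, not a tuned implementation):

```python
import math
import random

def first_selection(total, sample_size, rng=random):
    """Sample the 1-indexed position of the lowest occupied bucket when
    sample_size balls land uniformly, without replacement, among total buckets.
    Uses P(lowest > k) = C(total - k, sample_size) / C(total, sample_size)."""
    u = rng.random()
    denom = math.comb(total, sample_size)
    lo, hi = 1, total - sample_size + 1   # the lowest ball cannot sit past total - sample_size + 1
    while lo < hi:                        # find the smallest m with P(lowest > m) <= u
        mid = (lo + hi) // 2
        # math.lgamma could replace math.comb if the exact big-integer ratio gets slow
        if math.comb(total - mid, sample_size) / denom <= u:
            hi = mid
        else:
            lo = mid + 1
    return lo
```

After taking the ball at the returned position $m$, the remaining $S-1$ balls form a uniform sample of the $T-m$ buckets past it, so the same draw can be repeated with total and sample_size reduced accordingly.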

Best Answer

A classical algorithm to generate a random permutation of an array (with uniform probability) can be adapted for this. Let us assume that your data is stored in an array.

  • Let data_size be the size of your data
  • Let curr_sample_size be the number of samples taken so far, initialized to $0$
  • While curr_sample_size is strictly lower than the desired sample size $S$, do
    • Let pos be a random position between curr_sample_size and data_size$-1$ (inclusive)
    • Swap the elements at positions curr_sample_size and pos in your data array
    • Increase the value of curr_sample_size by $1$
  • Return the first curr_sample_size values of your data array
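For instance, a rough Python sketch of the steps above (the function name and the returned list of swap positions are additions to make the restore step below concrete; it assumes the data fits in a mutable list):

```python
import random

def sample_by_partial_shuffle(data, sample_size, rng=random):
    """Partial Fisher-Yates shuffle: after the loop, the first sample_size
    entries of data are a uniform sample without replacement."""
    positions = []
    for curr_sample_size in range(sample_size):
        pos = rng.randrange(curr_sample_size, len(data))   # position in [curr_sample_size, data_size - 1]
        data[curr_sample_size], data[pos] = data[pos], data[curr_sample_size]
        positions.append(pos)
    return data[:sample_size], positions
```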

If you need to leave your data array in its original order, you can record the list of pos values you generated and swap back afterwards, once you have read off the samples from the beginning of the array. Given the list of positions, putting the array back into place is done as follows:

  • Let curr_sample_pos be initialized to $S-1$, i.e. the desired sample size minus one
  • For each position pos, from the last one recorded to the first one, do
    • Swap the elements at positions pos and curr_sample_pos in the data array
    • Decrease the value of curr_sample_pos by $1$
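A matching sketch of the restore step, replaying the positions recorded by the sketch above in reverse order:

```python
def undo_partial_shuffle(data, positions):
    """Swap back in reverse order so data returns to its original order."""
    curr_sample_pos = len(positions) - 1    # starts at S - 1
    for pos in reversed(positions):
        data[curr_sample_pos], data[pos] = data[pos], data[curr_sample_pos]
        curr_sample_pos -= 1
```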