R – How to Sample Without Replacement Using a Sampling-With-Replacement Function?

rrejection samplingsampling

I vaguely recall from grad school that the following is a valid approach to do a weighted sampling without replacement:

  1. Start with an initially empty "sampled set".
  2. Draw a (single) weighted sample with replacement with whatever method you have.
  3. Check whether you have already picked it.
  4. If you did, ignore it and move to the next sample. If you didn't, add it to your sampled set.
  5. Repeat 2-4 until the sampled set is of the desired size.

Will this give me a valid weighted sample?

(Context: for very large samples, R's sample(x,n,replace=FALSE,p) takes a really long time, much longer than implementing the strategy above.)

UPDATE 2012-01-05

Thank you everyone for your comments and responses! Regarding code, here's an example that faithfully reproduces what I'm trying to do (although I'm actually calling R from Python with RPy, but it seems the runtime characteristics are unchanged).

First, define the following function, which assumes a sufficiently long sequence of with-replacement samples as input:

get_rejection_sample <- function(replace_sample, n) {
    top = length(replace_sample)
    bottom = 1
    mid = (top+bottom)/2
    u = unique(replace_sample[1:mid])
    while(length(u) != n) {
        if (length(u) < n) {
            bottom = mid
        }
        else {
            top = mid
        }
        mid = (top+bottom)/2
        u = unique(replace_sample[1:mid])
        if (mid == top || mid == bottom) {
            break
        }
    }
    u
}

It simply does a binary search for the number of non-unique samples needed to get enough unique samples. Now, in the interpreter:

> x = 1:10**7
> w = rexp(length(x))
> system.time(y <- sample(x, 500, FALSE, w))
   user  system elapsed 
 12.820   0.048  12.902 
> system.time(y <- get_rejection_sample(sample(x, 1000, TRUE, w), 500))
   user  system elapsed 
  0.311   0.066   0.378 
Warning message:
In sample(x, 1000, TRUE, w) :
  Walker's alias method used: results are different from R < 2.2.0

As you can see, even for just 500 elements the difference is dramatic. I'm actually trying to sample $10^6$ indices from a list of about $1.2 \times 10^7$, which takes hours using replace=FALSE.

Best Answer

You describe a sequence of rejection samples.

By definition, after selecting $k$ objects in the process of obtaining a weighted sample of $m$ objects without replacement from a list of $n$ objects, you draw one of the remaining $n-k$ objects according to their relative weights. In your description of the alternative sampling scheme, at the same stage you will be drawing from all $n$ objects according to their relative weights but you will throw back any object equal to one of the first $k$ and try again: that's the rejection. So all you have to convince yourself of is that the chance of drawing one of the remaining $n-k$ objects is the same in either case.

If this equivalence isn't perfectly clear, there's a straightforward mathematical demonstration. Let the weights of the first $k$ objects be $w_1, \ldots, w_k$ and let the weights of the remaining $n-k$ objects be $w_{k+1}, \ldots, w_n$. Write $w = w_1 + w_2 + \cdots + w_n$ and let $w_{(k)} = w_1 + w_2 + \cdots + w_k$. The chance of drawing object $i$ ($k \lt i \le n$) without replacement from the $n-k$ remaining objects is of course

$$\frac{w_i}{w_{k+1} + w_{k+2} + \cdots + w_n} =\frac{w_i}{w-w_{(k)}}.$$

In the alternative (rejection) scheme, the chance of drawing object $i$ equals

$$\sum_{a=0}^\infty \left[\left(\frac{w_{(k)}}{w}\right)^a \frac{w_i}{w}\right] =\frac{1}{1 - w_{(k)}/w}\frac{w_i}{w} = \frac{w_i}{w-w_{(k)}}$$

exactly as before. The sum arises by partitioning the event by the number of rejections ($a$) that precede drawing element $i$; its terms give the chance of a sequence of $a$ draws from the first $k$ elements followed by drawing the $i^\text{th}$ element: because these draws are independent, their chances multiply. It forms a geometric series which is elementary to put into closed form (the first equality). The second equality is a trivial algebraic reduction.

Related Question