I vaguely recall from grad school that the following is a valid approach to do a weighted sampling without replacement:
- Start with an initially empty "sampled set".
- Draw a (single) weighted sample with replacement with whatever method you have.
- Check whether you have already picked it.
- If you did, ignore it and move to the next sample. If you didn't, add it to your sampled set.
- Repeat 2-4 until the sampled set is of the desired size.
Will this give me a valid weighted sample?
(Context: for very large samples, R's sample(x,n,replace=FALSE,p)
takes a really long time, much longer than implementing the strategy above.)
UPDATE 2012-01-05
Thank you everyone for your comments and responses! Regarding code, here's an example that faithfully reproduces what I'm trying to do (although I'm actually calling R from Python with RPy, but it seems the runtime characteristics are unchanged).
First, define the following function, which assumes a sufficiently long sequence of with-replacement samples as input:
get_rejection_sample <- function(replace_sample, n) {
top = length(replace_sample)
bottom = 1
mid = (top+bottom)/2
u = unique(replace_sample[1:mid])
while(length(u) != n) {
if (length(u) < n) {
bottom = mid
}
else {
top = mid
}
mid = (top+bottom)/2
u = unique(replace_sample[1:mid])
if (mid == top || mid == bottom) {
break
}
}
u
}
It simply does a binary search for the number of non-unique samples needed to get enough unique samples. Now, in the interpreter:
> x = 1:10**7
> w = rexp(length(x))
> system.time(y <- sample(x, 500, FALSE, w))
user system elapsed
12.820 0.048 12.902
> system.time(y <- get_rejection_sample(sample(x, 1000, TRUE, w), 500))
user system elapsed
0.311 0.066 0.378
Warning message:
In sample(x, 1000, TRUE, w) :
Walker's alias method used: results are different from R < 2.2.0
As you can see, even for just 500 elements the difference is dramatic. I'm actually trying to sample $10^6$ indices from a list of about $1.2 \times 10^7$, which takes hours using replace=FALSE
.
Best Answer
You describe a sequence of rejection samples.
By definition, after selecting $k$ objects in the process of obtaining a weighted sample of $m$ objects without replacement from a list of $n$ objects, you draw one of the remaining $n-k$ objects according to their relative weights. In your description of the alternative sampling scheme, at the same stage you will be drawing from all $n$ objects according to their relative weights but you will throw back any object equal to one of the first $k$ and try again: that's the rejection. So all you have to convince yourself of is that the chance of drawing one of the remaining $n-k$ objects is the same in either case.
If this equivalence isn't perfectly clear, there's a straightforward mathematical demonstration. Let the weights of the first $k$ objects be $w_1, \ldots, w_k$ and let the weights of the remaining $n-k$ objects be $w_{k+1}, \ldots, w_n$. Write $w = w_1 + w_2 + \cdots + w_n$ and let $w_{(k)} = w_1 + w_2 + \cdots + w_k$. The chance of drawing object $i$ ($k \lt i \le n$) without replacement from the $n-k$ remaining objects is of course
$$\frac{w_i}{w_{k+1} + w_{k+2} + \cdots + w_n} =\frac{w_i}{w-w_{(k)}}.$$
In the alternative (rejection) scheme, the chance of drawing object $i$ equals
$$\sum_{a=0}^\infty \left[\left(\frac{w_{(k)}}{w}\right)^a \frac{w_i}{w}\right] =\frac{1}{1 - w_{(k)}/w}\frac{w_i}{w} = \frac{w_i}{w-w_{(k)}}$$
exactly as before. The sum arises by partitioning the event by the number of rejections ($a$) that precede drawing element $i$; its terms give the chance of a sequence of $a$ draws from the first $k$ elements followed by drawing the $i^\text{th}$ element: because these draws are independent, their chances multiply. It forms a geometric series which is elementary to put into closed form (the first equality). The second equality is a trivial algebraic reduction.