Solved – Uniform sampling of a set of weighted samples

algorithmsdiscrete dataprobabilitysamplingweighted-sampling

Consider a two-stage sampling scheme: First, use weighted random selection from a list to obtain a set of N unique elements. Next, use uniform random selection to pick one of those elements.

How can I counter the skew caused by the uniqueness condition when obtaining the first weighted set, so that the output selection frequency corresponds well with the weights defined for each element?

I obtain the first set by repeating a simple weighted sampling: Construct a cumulative list of weights, select a random integer in [0, sum(cumweights)), find corresponding index in cumulative list of weights, then return the corresponding element. I repeat this N times, where N is the number of elements in the list. Then I throw away duplicates.

Here's the comparison between keeping and throwing away duplicates, when I run a simulation 1'000'000 times:

Element  Weight  W/duplicates  WO/duplicates
A             5         0.050          0.066
B            10         0.100          0.125
C            20         0.200          0.219
D            30         0.300          0.283
E            35         0.350          0.307

The application here is a DNS load balancer with a configured set of (IP,weight) tuples. A query should return a set of unique IPs to the client in such a way that over time, each IP is selected with a probability corresponding to its weight — as if a single sample was taken. The various DNS resolvers involved in the process (OS stub resolver, local recursor, etc) can be assumed to shuffle the IP set uniformly (although this is resolver-specific).

Best Answer

A second answer

This answer is suitable if the number of distinct weights is small compared to the number of observations.

I found the relation of your example to your actual data variables somewhat unclear. Therefore, I'm going to formulate your problem in the following way. Items have some measure of "size" $m$. It it not the "weight" that you show in your example, because that weight depends on what other items are present. I'll group items into "types" $i = 1,2\ldots I$, such that items of type $i$ have the same size measure $m_i$.

Assumed:

  1. There are $N$ items in the population and $N_i$ items of type $i$, both known.

  2. $N_i\gg 1$. (There are no very rare types.)

Define $M_i = m_i \times N_i$, the total of sizes be contributed by items of type $i$. Define $M_+ = \sum_i M_i$, the total of all sizes in the population.

Let the relative contribution of items of type $i$ to $M_+$ be:

$$ R_i = M_i/M_+ $$

Define $n$ to be the desired size of the sample you want to deliver to the client.

To assure that the relative contribution of type $i$ items in the sample is $R_i$, the number of sample observations of type $i$ must be:

$$ n_i = \left[ R_i \times n \right] $$

where $\left[ x \right]$ denotes the integer nearest to x.

So the sampling method will be: From the $N_i$ items of type $i$, select a $n_i$ items by simple random sampling without replacement.

Note: If any of the quantities above, especially $R_i$, are not known in advance, estimate them by drawing a preliminary sample.

Related Question