[Math] Probability of Choosing an Item in Weighted Random Sampling Without Replacement

probabilitysamplingstatistics

Consider the problem of randomly sampling $k$ distinct items from a population of $n$ items without replacement. If all items have the
same weight, then the probability that a specific item is among the $k$ selected items is $\binom{n-1}{k-1} / \binom{n}{k}$.

Now suppose that items are weighted and the probability of each item
being selected is determined by its relative weight:

Input: Set $N=\{1,\dots,n\}$ items with weights $W=\{w_1,\dots,w_n\}$
Output: Set $S$ of $k$ randomly selected items without replacement

Repeat k times

  • probability of $i \in N \setminus S$ being selected = $\frac{w_i}{\sum_{j\in N \setminus S}w_j}$
  • randomly select an item from $N \setminus S$ and add it to $S$.

What is the probability that a specific item is among the $k$ selected items in weighted random sampling without replacement?

Best Answer

Comment (and solution of a simple special case.) This has been here for a while, apparently without helpful comments. This appears to be a generalization of a 'multivariate hypergeometric' distribution.

You might start with a simplified set of weights. Let an urn contain balls labeled from 1 through 8. And suppose their respective weights are $w = (2, 2, 1, 1, 1, 1, 1, 1)/10.$ If you withdraw $k = 2$ balls from the urn without replacement, what is the probability you get the ball labeled '$1$'?

Get 1 on the first draw: $P(\text{1 on 1st}) = (2/10)(8/8) = .2.$

Get 1 on the second draw: Either 21, or something other than 1 or 2 on the first, then 1 on the second. $P(\text{2 then 1}) = (2/10)(2/8) = .05.$ $P(\text{3 then 1}) = (1/10)(2/9) = 2/90 \approx 0.0222.$ $P(\text{1 on 2nd}) = 0.05 + 6(2/90) \approx 0.05 + 0.1333 = 0.1833.$

Finally, $P(\text{1 in two draws}) \approx 0.2 + 0.1833 = 0.3833.$

Even this simple problem turned out to surprise me by its intricacy and lack of symmetry. But perhaps, you can find patterns to simplify more complicated outcomes.


R statistical software does weighted random sampling in a way that would allow you to check some of your analytic solutions. As a prototype, here is a simulation of the simple example just above. Results are mainly accurate to three places.

m = 10^6;  d1 = d2 = numeric(m)
n = 2;  pop = 1:8;  w = c(2,2,1,1,1,1,1,1)/10
for (i in 1:m)  {
   draw = sample(pop, n, prob=w)
   d1[i] = draw[1];  d2[i] = draw[2]  }
mean(d1 ==1 | d2 ==1)  # '|' signifies union
## 0.383483

round(table(d1)/m,3)
## d1
##     1     2     3     4     5     6     7     8 
## 0.200 0.199 0.100 0.100 0.100 0.100 0.100 0.100 
round(table(d2)/m,3)
## d2
##     1     2     3     4     5     6     7     8 
## 0.184 0.184 0.105 0.105 0.106 0.106 0.105 0.105 

round(table(d1,d2)/m,3)
##   d2
## d1     1     2     3     4     5     6     7     8
##  1 0.000 0.050 0.025 0.025 0.025 0.025 0.025 0.025
##  2 0.050 0.000 0.025 0.025 0.025 0.025 0.025 0.025
##  3 0.022 0.022 0.000 0.011 0.011 0.011 0.011 0.011
##  4 0.022 0.022 0.011 0.000 0.011 0.011 0.011 0.011
##  5 0.022 0.022 0.011 0.011 0.000 0.011 0.011 0.011
##  6 0.022 0.022 0.011 0.011 0.011 0.000 0.011 0.011
##  7 0.022 0.022 0.011 0.011 0.011 0.011 0.000 0.011
##  8 0.022 0.022 0.011 0.011 0.011 0.011 0.011 0.000
Related Question