Consider the problem of randomly sampling $k$ distinct items from a population of $n$ items without replacement. If all items have the
same weight, then the probability that a specific item is among the $k$ selected items is $\binom{n-1}{k-1} / \binom{n}{k}$.
Now suppose that items are weighted and the probability of each item
being selected is determined by its relative weight:
Input: Set $N=\{1,\dots,n\}$ items with weights $W=\{w_1,\dots,w_n\}$
Output: Set $S$ of $k$ randomly selected items without replacementRepeat k times
- probability of $i \in N \setminus S$ being selected = $\frac{w_i}{\sum_{j\in N \setminus S}w_j}$
- randomly select an item from $N \setminus S$ and add it to $S$.
What is the probability that a specific item is among the $k$ selected items in weighted random sampling without replacement?
Best Answer
Comment (and solution of a simple special case.) This has been here for a while, apparently without helpful comments. This appears to be a generalization of a 'multivariate hypergeometric' distribution.
You might start with a simplified set of weights. Let an urn contain balls labeled from 1 through 8. And suppose their respective weights are $w = (2, 2, 1, 1, 1, 1, 1, 1)/10.$ If you withdraw $k = 2$ balls from the urn without replacement, what is the probability you get the ball labeled '$1$'?
Get
1
on the first draw: $P(\text{1 on 1st}) = (2/10)(8/8) = .2.$Get
1
on the second draw: Either21
, or something other than1
or2
on the first, then1
on the second. $P(\text{2 then 1}) = (2/10)(2/8) = .05.$ $P(\text{3 then 1}) = (1/10)(2/9) = 2/90 \approx 0.0222.$ $P(\text{1 on 2nd}) = 0.05 + 6(2/90) \approx 0.05 + 0.1333 = 0.1833.$Finally, $P(\text{1 in two draws}) \approx 0.2 + 0.1833 = 0.3833.$
Even this simple problem turned out to surprise me by its intricacy and lack of symmetry. But perhaps, you can find patterns to simplify more complicated outcomes.
R statistical software does weighted random sampling in a way that would allow you to check some of your analytic solutions. As a prototype, here is a simulation of the simple example just above. Results are mainly accurate to three places.