Probability Theory – How to Choose Specific Item in Weighted Sampling Without Replacement

probability, probability theory, sampling

Given $n$ items with weights $w_1, \ldots, w_n$, what is the probability that item $i$ is chosen in a $k$-out-of-$n$ "weighted random sampling without replacement" experiment? Can a closed-form solution that depends only on $w_i / w_\cdot$ be derived, where $w_\cdot = \sum_j w_j$?

EDIT: A solution that depends only on $w_i / w_\cdot$ is impossible. Take $n=3$, $k=2$, $i=1$, $w_1 = 1$, and consider two cases: (1) $w_2 = 1, w_3 = 1 + \varepsilon$, (2) $w_2 = 2, w_3 = \varepsilon > 0$. In case (1) the probability is almost $2/3$; in case (2) it is almost $1$. Yet in both cases $w_1 = 1$ and $w_\cdot = 3 + \varepsilon$.
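The two cases can be checked with exact arithmetic. The sketch below is only an illustration: it assumes $k = 2$ draws, and `prob_in_two_draws` is a hypothetical helper name that adds the probability of picking the item first to the probability of picking it second.

```python
from fractions import Fraction


def prob_in_two_draws(weights, i):
    """Exact probability that item i is among 2 weighted draws without replacement."""
    total = sum(weights)
    # Item i is picked on the first draw.
    p_first = weights[i] / total
    # Some other item j is picked first, then item i from the remaining weight.
    p_second = sum(
        (w_j / total) * (weights[i] / (total - w_j))
        for j, w_j in enumerate(weights)
        if j != i
    )
    return p_first + p_second


eps = Fraction(1, 1000)                    # assumed small value for epsilon
case1 = [Fraction(1), Fraction(1), 1 + eps]
case2 = [Fraction(1), Fraction(2), eps]

p1 = prob_in_two_draws(case1, 0)           # close to 2/3
p2 = prob_in_two_draws(case2, 0)           # close to 1
print(float(p1), float(p2))
```

Both weight vectors have the same $w_1$ and the same total, yet the two probabilities differ by roughly a third, so no function of $w_1 / w_\cdot$ alone can produce both.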

What I have tried so far to solve the problem:

Let $P^n_k(w, i)$ be the probability that item $i$ is chosen in a $k$-out-of-$n$ experiment with weight vector $w$.
In the first draw, the item is chosen with probability $w_i / w_\cdot$. Otherwise, some other item $j$ is chosen, and we are looking for the probability of choosing item $i$ in a $(k-1)$-out-of-$(n-1)$ experiment, with the same weight vector except that the weight of the selected item $j$ is removed. Hence:

$$
P^n_k(w, i) = w_i / w_\cdot + \sum_{j \neq i} w_j / w_\cdot \cdot P^{n-1}_{k-1}(w - j, i)
$$
$$
= w_\cdot^{-1} \left( w_i + \sum_{j \neq i} w_j \cdot P^{n-1}_{k-1}(w - j, i) \right)
$$

with $w-j$ being "the vector $w$ without the $j$th element".
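As a sanity check, here is how the recurrence unrolls for $k = 2$. The base case is not stated above, so this assumes $P^{n}_{0}(w, i) = 0$, which gives $P^{n}_{1}(w, i) = w_i / w_\cdot$. Removing item $j$ leaves total weight $w_\cdot - w_j$, so:

$$
P^n_2(w, i) = \frac{w_i}{w_\cdot} + \sum_{j \neq i} \frac{w_j}{w_\cdot} \cdot \frac{w_i}{w_\cdot - w_j} = \frac{w_i}{w_\cdot} \left( 1 + \sum_{j \neq i} \frac{w_j}{w_\cdot - w_j} \right)
$$

The sum involves every other weight individually, which is consistent with the observation that no formula in $w_i / w_\cdot$ alone can work.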

How to solve this recurrence relation? (If it is correct at all…)
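Even without a closed form, the recurrence can be evaluated directly for small $n$. The following is a minimal sketch, not an efficient method: `inclusion_probability` is a hypothetical name, the base case $P^n_0 = 0$ is an assumption, and the running time grows exponentially with $n$ because every subset of removed items may be visited.

```python
from fractions import Fraction
from functools import lru_cache


def inclusion_probability(weights, i, k):
    """Exact probability that item i is among k weighted draws without replacement."""
    ws = [Fraction(w) for w in weights]
    target = ws[i]
    # Sort the other weights so that equal multisets share one cache entry.
    others = tuple(sorted(w for j, w in enumerate(ws) if j != i))

    @lru_cache(maxsize=None)
    def rec(rest, draws_left):
        if draws_left == 0:          # assumed base case: no draws left
            return Fraction(0)
        total = target + sum(rest)
        prob = target / total        # item i is picked on this draw
        for idx, w_j in enumerate(rest):
            # Another item j is picked first; recurse on the vector without j.
            prob += (w_j / total) * rec(rest[:idx] + rest[idx + 1:], draws_left - 1)
        return prob

    return rec(others, k)


print(inclusion_probability([1, 1, 2], 0, 2))  # 7/12
```

A useful consistency check is that the inclusion probabilities of all items sum to $k$, since exactly $k$ items are selected.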

Best Answer

I'm pulling this from Pavlos S. Efraimidis, Paul G. Spirakis, Weighted random sampling with a reservoir, Information Processing Letters, Volume 97, Issue 5, 16 March 2006, Pages 181-185, ISSN 0020-0190, 10.1016/j.ipl.2005.11.003.

There, the authors begin by describing a basic weighted random sampling algorithm with the following definition:

Input: A population $V$ of $n$ weighted items

Output: A set $S$ with a WRS (weighted random sample) of size $m$

1. Repeat Steps 2 and 3 for $k=1,2,\ldots,m$.

2. The probability that item $v_i \in V - S$ is selected in round $k$ is: $$p_i(k) = \frac{w_i} {\sum_{v_j \in V - S} {w_j}}$$

3. Randomly select an item of $V - S$ according to the probabilities $p_i(k)$ and insert it into $S$.
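The steps above can be sketched in a few lines of Python. This is an illustrative reading of the sequential procedure, not the reservoir algorithm the paper goes on to develop; the function name and the `rng` parameter are assumptions made for the example.

```python
import random


def weighted_sample_without_replacement(items, weights, m, rng=random):
    """Draw m distinct items; each round picks proportionally to remaining weights."""
    if m > len(items):
        raise ValueError("sample size m cannot exceed the population size")
    remaining = list(zip(items, weights))      # V - S, with weights attached
    sample = []                                # S
    for _round in range(m):                    # Step 1: rounds k = 1..m
        pool_weights = [w for _item, w in remaining]
        # Step 2: selection probability is w_i / (sum of weights still in V - S).
        idx = rng.choices(range(len(remaining)), weights=pool_weights, k=1)[0]
        # Step 3: move the selected item from V - S into S.
        item, _weight = remaining.pop(idx)
        sample.append(item)
    return sample


print(weighted_sample_without_replacement(list("abcd"), [4, 3, 2, 1], 2))
```

Passing a seeded `random.Random` instance as `rng` makes the draws reproducible.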

The authors go on to explain how they arrive at the probability, but I'll summarize. Label the items so that they are drawn in the order $v_n, v_{n-1}, \ldots, v_1$. The probability that $v_n$ is selected first is its own weight divided by the sum of all weights: $$\frac{w_n}{w_1 + w_2+w_3+\ldots+w_n}$$ Easy enough. Now, the probability that each subsequent item is chosen is its own weight divided by the sum of the weights still remaining. Multiplying these factors for $i = n, n-1, \ldots, 1$, we arrive at the authors' final summary equation for the probability of a permutation $\Pi$:

$$ P(\Pi) = \prod_{i=1}^{n} {\frac{w_i} {w_1 + w_2 +\ldots+w_i}} $$

In other words, with the weights stored in an array $w$, each factor of the product is the weight at index $i$ divided by the sum of the weights at indexes $1$ through $i$: $$\frac{w(i)}{\sum_{j=1}^{i} w(j)}$$
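To make the product concrete, here is a small sketch that computes the probability of one particular draw order. The name `ordered_sample_probability` is hypothetical, and the code assumes integer weights so that exact fractions can be used.

```python
from fractions import Fraction


def ordered_sample_probability(weights, order):
    """Probability that items are drawn exactly in `order` (indexes into weights)."""
    remaining = sum(weights)
    prob = Fraction(1)
    for idx in order:
        # Each factor: the item's weight over the total weight not yet drawn.
        prob *= Fraction(weights[idx], remaining)
        remaining -= weights[idx]
    return prob


weights = [2, 2, 1, 1, 1, 1, 1, 1]
print(ordered_sample_probability(weights, [1, 0]))  # 1/20
```

Note that this is the probability of a single ordering. The probability that a given item appears anywhere in the sample, which is what the question asks for, is the sum of such products over every ordering that contains the item.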

This isn't particularly difficult work, but I didn't want to create the proof myself, hence the quoting of Efraimidis & Spirakis.

Related Solutions

[Math] Probability of Choosing an Item in Weighted Random Sampling Without Replacement

Comment (and solution of a simple special case.) This has been here for a while, apparently without helpful comments. This appears to be a generalization of a 'multivariate hypergeometric' distribution.

You might start with a simplified set of weights. Let an urn contain balls labeled from 1 through 8. And suppose their respective weights are $w = (2, 2, 1, 1, 1, 1, 1, 1)/10.$ If you withdraw $k = 2$ balls from the urn without replacement, what is the probability you get the ball labeled '$1$'?

Get 1 on the first draw: $P(\text{1 on 1st}) = (2/10)(8/8) = .2.$

Get 1 on the second draw: Either 2 then 1, or one of the balls 3 through 8 on the first draw, then 1 on the second. $P(\text{2 then 1}) = (2/10)(2/8) = .05.$ $P(\text{3 then 1}) = (1/10)(2/9) = 2/90 \approx 0.0222,$ and the same holds for each of the six balls 3 through 8. $P(\text{1 on 2nd}) = 0.05 + 6(2/90) \approx 0.05 + 0.1333 = 0.1833.$

Finally, $P(\text{1 in two draws}) \approx 0.2 + 0.1833 = 0.3833.$

Even this simple problem surprised me by its intricacy and lack of symmetry. But perhaps you can find patterns that simplify more complicated cases.
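The hand calculation above can also be confirmed exactly by enumerating every ordered pair of draws. This is a minimal sketch using exact fractions; ball labels 1 through 8 are mapped to indexes 0 through 7, and the weights are scaled to integers, which does not change the probabilities.

```python
from fractions import Fraction
from itertools import permutations

weights = [2, 2, 1, 1, 1, 1, 1, 1]   # ball labels 1..8 map to indexes 0..7
total = sum(weights)

# Joint probability of drawing ball a first and ball b second.
joint = {}
for a, b in permutations(range(len(weights)), 2):
    joint[(a, b)] = Fraction(weights[a], total) * Fraction(weights[b], total - weights[a])

p_first = sum(p for (a, b), p in joint.items() if a == 0)    # ball 1 on 1st draw
p_second = sum(p for (a, b), p in joint.items() if b == 0)   # ball 1 on 2nd draw
print(p_first, p_second, p_first + p_second)  # 1/5 11/60 23/60
```

The exact answer $23/60 \approx 0.3833$ agrees with the approximate value obtained by hand.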

R statistical software does weighted random sampling in a way that would allow you to check some of your analytic solutions. As a prototype, here is a simulation of the simple example just above. With a million iterations, the results are accurate to about three decimal places.

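m = 10^6;  d1 = d2 = numeric(m)   # m simulated experiments; d1, d2 hold the 1st and 2nd draws
k = 2;  pop = 1:8;  w = c(2,2,1,1,1,1,1,1)/10   # sample size, ball labels, weights
for (i in 1:m)  {
   draw = sample(pop, k, prob=w)   # weighted sampling; replace=FALSE is the default
   d1[i] = draw[1];  d2[i] = draw[2]  }
mean(d1 ==1 | d2 ==1)  # share of runs with ball 1 on either draw; '|' is elementwise OR
## 0.383483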

round(table(d1)/m,3)
## d1
##     1     2     3     4     5     6     7     8 
## 0.200 0.199 0.100 0.100 0.100 0.100 0.100 0.100 
round(table(d2)/m,3)
## d2
##     1     2     3     4     5     6     7     8 
## 0.184 0.184 0.105 0.105 0.106 0.106 0.105 0.105 

round(table(d1,d2)/m,3)
##   d2
## d1     1     2     3     4     5     6     7     8
##  1 0.000 0.050 0.025 0.025 0.025 0.025 0.025 0.025
##  2 0.050 0.000 0.025 0.025 0.025 0.025 0.025 0.025
##  3 0.022 0.022 0.000 0.011 0.011 0.011 0.011 0.011
##  4 0.022 0.022 0.011 0.000 0.011 0.011 0.011 0.011
##  5 0.022 0.022 0.011 0.011 0.000 0.011 0.011 0.011
##  6 0.022 0.022 0.011 0.011 0.011 0.000 0.011 0.011
##  7 0.022 0.022 0.011 0.011 0.011 0.011 0.000 0.011
##  8 0.022 0.022 0.011 0.011 0.011 0.011 0.011 0.000

Conditional probability of two dependent continuous random variables

Let $\{W_i\}$ be the set of $W_i$ for $i=2,3,4,5$.

$V_1$ and $V_2$ are not independent, but they are conditionally independent given $\{W_i\}$. Therefore:

$$p(V_1, V_2 \lvert \{W_i\}) = p(V_1 \lvert \{W_i\}) p(V_2 \lvert \{W_i\})$$

From the law of total probability:

$$p(V_1, V_2 ) = \int p(V_1, V_2 \lvert \{W_i\}) \, p( \{ W_i \} ) \, d\{W_i\}$$

Then you can easily compute $p(V_2)$, since the $W_i$ are mutually independent Gaussians and their sum is therefore another Gaussian, $\sum W_i \sim N(\sum \mu_i, \sum \sigma^2_i)$, where $\mu_i, \sigma^2_i$ are the means and variances of the $W_i$.

Then, from the definition of conditional probability: $$p(V_1 \lvert V_2) = p(V_1, V_2) /p(V_2)$$
