Probability – Conditional Probability of Finding a Defective Item Amongst $k\times m$ Items

bayes-theorem, conditional probability, probability

There are $k$ packages, each with $m$ items. Exactly one of the $k \cdot m$ items is defective. To find the defect, $n$ items are randomly selected from each package and inspected. I wish to determine the probabilities that (a) the defect is in the first package ($B_1$), given that it is not found among the items sampled from the first package ($F_1^c$), and (b) the defect is in the second package ($B_2$), given the same event $F_1^c$.

To do so, I have determined the following:

(a) Intuitively, $P(B_1) = 1/k$ (because each box has an equal probability of containing the defect), $P(F_1|B_1) = n/m$ (because we're sampling $n$ out of $m$ items) and $P(F_1) = n/(mk)$ (from the law of total probability).
Thus, Bayes' Rule leads to
\begin{align} P(B_1|F_1^c) = \frac{P(F_1^c|B_1)P(B_1)}{P(F_1^c)} = \frac{(1-n/m)(1/k)}{1-n/(mk)}. \end{align}
(b) Intuitively, $P(B_2) = 1/k$ and $P(F_1^c|B_2) = 1$, because, if the object is located in package 2, then it will not be found in package 1. Again, Bayes' Rule leads to,
\begin{align} P(B_2|F_1^c) = \frac{P(F_1^c|B_2)P(B_2)}{P(F_1^c)} = \frac{1/k}{1-n/(mk)}. \end{align}
Both final expressions can be simplified a little. Is this reasoning to determine the conditional probabilities correct?

Best Answer

$$ P(B_1\mid F^c_1) = \frac{P(B_1, F^c_1)}{P(F^c_1)} =\frac{P(F^c_1\mid B_1)P(B_1)}{P(F^c_1)}. $$ If the defect is in package 1, the probability of finding it equals the fraction of the package inspected; equivalently, the number of defects among the $n$ sampled items is $\operatorname{Hypergeo}(m, 1, n)$, so $$ P(F_1\mid B_1) = n/m. $$ Further, by the law of total probability, $$ P(F_1) = P(F_1\mid B_1)P(B_1) + P(F_1\mid B^c_1)P(B^c_1) = \frac{n}{m}\frac{1}{k} + 0 = \frac{n}{mk}, $$ so, since $P(B_j) = 1/k$, $$ P(B_1\mid F^c_1) = \frac{(1 - n/m)/k}{1 - n/(mk)} = \frac{m-n}{km -n}, $$ i.e. "remaining objects in 1" / "total objects remaining".

The second part works the same way: $$ P(B_2\mid F^c_1) = \frac{P(B_2, F^c_1)}{P(F^c_1)} =\frac{P(F^c_1\mid B_2)P(B_2)}{P(F^c_1)}. $$ If the defect is in package 2 it cannot be found in package 1, so $P(F^c_1\mid B_2) = 1 - 0 = 1$ and $$ P(B_2\mid F^c_1) = \frac{1/k}{1 - n/(mk)} = \frac{m}{km -n}, $$ or, "objects in 2" / "total objects remaining".

In conclusion, these answers agree with the original question and with intuition.
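As a rough numerical check of the two closed forms above, here is a minimal Python sketch. The function names and the choice $k=3$, $m=10$, $n=4$ are illustrative assumptions, not part of the original question. The first function applies Bayes' rule with exact fractions; the second estimates the same conditional probabilities by simulation.

```python
from fractions import Fraction
import random


def posterior_exact(k, m, n):
    """Return (P(B_1 | F_1^c), P(B_2 | F_1^c)) from Bayes' rule, as exact fractions."""
    p_b = Fraction(1, k)                      # prior: each package equally likely
    p_not_found = 1 - Fraction(n, m * k)      # P(F_1^c) = 1 - n/(mk)
    p_b1 = (1 - Fraction(n, m)) * p_b / p_not_found
    p_b2 = 1 * p_b / p_not_found              # P(F_1^c | B_2) = 1
    return p_b1, p_b2


def posterior_simulated(k, m, n, trials=100_000, seed=0):
    """Monte Carlo estimate of the same two conditional probabilities (needs k >= 2)."""
    rng = random.Random(seed)
    not_found = in_1 = in_2 = 0
    for _ in range(trials):
        defect = rng.randrange(k * m)         # items 0..m-1 form package 1, m..2m-1 package 2, ...
        inspected = rng.sample(range(m), n)   # the n items sampled from package 1
        if defect in inspected:
            continue                          # F_1 occurred; condition on F_1^c only
        not_found += 1
        package = defect // m
        in_1 += package == 0
        in_2 += package == 1
    return in_1 / not_found, in_2 / not_found


exact = posterior_exact(3, 10, 4)
print(exact)                                  # should equal (m-n)/(km-n) and m/(km-n)
print(posterior_simulated(3, 10, 4))
```

For these illustrative values the exact results are $6/26 = 3/13$ and $10/26 = 5/13$, and the simulated frequencies should land close to them.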

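Another, perhaps simpler, formulation uses two indicator vectors: observed $X = (X_1, \dots, X_k)$ and unobserved $Y = (Y_1, \dots, Y_k)$, with one entry of each per package. Here $X_j = 1$ if the defect is among the $n$ sampled items of package $j$, and $Y_j = 1$ if it is among the $m-n$ unsampled items of package $j$; exactly one of these $2k$ entries takes the value 1 and all others are zero. Define $M=mk$ and set $n_y = m-n$ so that $P(Y_j=1) = n_y/M$ and $P(X_j=1) = n/M$ for every package $j$.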

This formulation separates the boxes from each other so $P(B_1\mid F^c_1) = P(Y_1 = 1\mid X_1 = 0) = n_y/(M-n)$. This can be visualized as the area occupied by a single $Y$ divided by the area that the 1 could be in. Finally, $$ P(B_2\mid F^c_1) = P(X_2 + Y_2 = 1\mid X_1 = 0) = (n_y+n)/(M-n) = m/(km-n). $$
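A quick consistency check, sketched under the assumption (by symmetry) that every package $j \ge 2$ has the same posterior as package 2: the conditional probabilities over all $k$ packages should sum to one. $$ P(B_1\mid F^c_1) + \sum_{j=2}^{k} P(B_j\mid F^c_1) = \frac{m-n}{km-n} + (k-1)\frac{m}{km-n} = \frac{km-n}{km-n} = 1. $$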

Related Solutions

[Math] Probability of Choosing an Item in Weighted Random Sampling Without Replacement

Comment (and solution of a simple special case). This has been here for a while, apparently without helpful comments. This appears to be a generalization of a 'multivariate hypergeometric' distribution.

You might start with a simplified set of weights. Let an urn contain balls labeled from 1 through 8. And suppose their respective weights are $w = (2, 2, 1, 1, 1, 1, 1, 1)/10.$ If you withdraw $k = 2$ balls from the urn without replacement, what is the probability you get the ball labeled '$1$'?

Get 1 on the first draw: $P(\text{1 on 1st}) = (2/10)(8/8) = 0.2.$

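Get 1 on the second draw: Either 2 then 1, or a ball other than 1 or 2 on the first draw, then 1 on the second. After a ball is removed, the remaining weights are renormalized, so the second factor is the weight of ball 1 divided by the total weight left in the urn. $P(\text{2 then 1}) = (2/10)(2/8) = 0.05.$ $P(\text{3 then 1}) = (1/10)(2/9) = 2/90 \approx 0.0222,$ and the same holds for each of the six balls 3 through 8. $P(\text{1 on 2nd}) = 0.05 + 6(2/90) \approx 0.05 + 0.1333 = 0.1833.$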

Finally, $P(\text{1 in two draws}) \approx 0.2 + 0.1833 = 0.3833.$
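The hand calculation above can be reproduced exactly by enumerating all ordered draws. The following is a minimal Python sketch; the function name `prob_contains` is an illustrative choice, and it assumes (as the calculation above does) that each successive draw picks a remaining ball with probability proportional to its weight.

```python
from fractions import Fraction
from itertools import permutations


def prob_contains(weights, target, draws):
    """Exact probability that ball `target` (0-based index) appears among
    `draws` successive weighted draws without replacement."""
    total = Fraction(0)
    for seq in permutations(range(len(weights)), draws):
        remaining = sum(weights)              # total weight still in the urn
        p = Fraction(1)
        for idx in seq:
            p *= Fraction(weights[idx], remaining)
            remaining -= weights[idx]         # renormalize after removing the ball
        if target in seq:
            total += p
    return total


weights = [2, 2, 1, 1, 1, 1, 1, 1]            # integer weights; only ratios matter
p = prob_contains(weights, target=0, draws=2)
print(p, float(p))                            # 23/60, roughly 0.3833
```

Brute-force enumeration grows quickly with the number of balls and draws, so this is only practical for small cases like the one here.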

Even this simple problem turned out to surprise me by its intricacy and lack of symmetry. But perhaps, you can find patterns to simplify more complicated outcomes.

R statistical software does weighted random sampling in a way that would allow you to check some of your analytic solutions. As a prototype, here is a simulation of the simple example just above. Results are mainly accurate to three places.

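m = 10^6;  d1 = d2 = numeric(m)   # m simulated experiments; d1, d2 hold 1st and 2nd draws
n = 2;  pop = 1:8;  w = c(2,2,1,1,1,1,1,1)/10   # draw n of 8 balls with weights w
for (i in 1:m)  {
   draw = sample(pop, n, prob=w)   # weighted sampling without replacement
   d1[i] = draw[1];  d2[i] = draw[2]  }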
mean(d1 ==1 | d2 ==1)  # '|' signifies union
## 0.383483

round(table(d1)/m,3)
## d1
##     1     2     3     4     5     6     7     8 
## 0.200 0.199 0.100 0.100 0.100 0.100 0.100 0.100 
round(table(d2)/m,3)
## d2
##     1     2     3     4     5     6     7     8 
## 0.184 0.184 0.105 0.105 0.106 0.106 0.105 0.105 

round(table(d1,d2)/m,3)
##   d2
## d1     1     2     3     4     5     6     7     8
##  1 0.000 0.050 0.025 0.025 0.025 0.025 0.025 0.025
##  2 0.050 0.000 0.025 0.025 0.025 0.025 0.025 0.025
##  3 0.022 0.022 0.000 0.011 0.011 0.011 0.011 0.011
##  4 0.022 0.022 0.011 0.000 0.011 0.011 0.011 0.011
##  5 0.022 0.022 0.011 0.011 0.000 0.011 0.011 0.011
##  6 0.022 0.022 0.011 0.011 0.011 0.000 0.011 0.011
##  7 0.022 0.022 0.011 0.011 0.011 0.011 0.000 0.011
##  8 0.022 0.022 0.011 0.011 0.011 0.011 0.011 0.000

[Math] Conditional probability and testing twice

Denote events $D$, $\bar D$ as the patient having the defect and not having the defect, respectively. Let $P_i$ denote the event that test $i$ is positive, and $\bar P_i$ the event that test $i$ is negative.

Then the desired probability is $$\Pr[P_2 \mid P_1] = \frac{\Pr[P_2 \cap P_1]}{\Pr[P_1]} = \frac{\Pr[P_2 \cap P_1 \mid D]\Pr[D] + \Pr[P_2 \cap P_1 \mid \bar D]\Pr[\bar D]}{\Pr[P_1 \mid D]\Pr[D] + \Pr[P_1 \mid \bar D]\Pr[\bar D]}.$$ Since the $P_i$ are conditionally independent given the defect status, we have $$\Pr[P_2 \cap P_1 \mid D] = (\Pr[P_i \mid D])^2, \quad \Pr[P_2 \cap P_1 \mid \bar D] = (\Pr[P_i \mid \bar D])^2.$$ Then, given $$\Pr[D] = 0.01, \quad \Pr[P_i \mid D] = 0.999, \quad \Pr[P_i \mid \bar D] = 0.05,$$ we obtain $$\Pr[P_2 \mid P_1] = \frac{(0.999)^2(0.01) + (0.05)^2(1-0.01)}{(0.999)(0.01) + (0.05)(1-0.01)} \approx 0.209363.$$ This number is small because defects are rare and the false positive rate is much higher than the prevalence. Therefore, a first positive result is more likely to be a false positive than a true one, and a second test is not very likely to come back positive.
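The arithmetic can be checked with a short Python sketch. The function and parameter names (`sensitivity` for $\Pr[P_i \mid D]$, `false_positive_rate` for $\Pr[P_i \mid \bar D]$) are labels introduced here for illustration. The second function is an equivalent route, assumed rather than taken from the answer: update the defect probability after the first positive test, then average the second test's positive rate over that posterior.

```python
def prob_second_positive(prevalence, sensitivity, false_positive_rate):
    """Pr[P_2 | P_1] as the ratio Pr[P_1 and P_2] / Pr[P_1]."""
    p_both = (sensitivity ** 2 * prevalence
              + false_positive_rate ** 2 * (1 - prevalence))
    p_first = (sensitivity * prevalence
               + false_positive_rate * (1 - prevalence))
    return p_both / p_first


def prob_second_positive_via_posterior(prevalence, sensitivity, false_positive_rate):
    """Same quantity, computed through the posterior Pr[D | P_1]."""
    p_first = (sensitivity * prevalence
               + false_positive_rate * (1 - prevalence))
    posterior = sensitivity * prevalence / p_first   # Pr[D | P_1]
    return sensitivity * posterior + false_positive_rate * (1 - posterior)


direct = prob_second_positive(0.01, 0.999, 0.05)
two_step = prob_second_positive_via_posterior(0.01, 0.999, 0.05)
print(round(direct, 6), round(two_step, 6))          # both about 0.209363
```

The two-step version makes the explanation above concrete: after one positive test the probability of a defect is only about 0.17, so the second test is still dominated by the non-defective case.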
