[Math] Probability distribution of defective parts

probability distributions

Suppose there are 1 million parts, of which 1% are defective, i.e. 10,000 of the 1 million parts are defective. Now suppose we take samples of different sizes from the 1 million, such as 10%, 30%, 50%, 70%, and 90% of the parts, and we need to calculate the probability of finding at most 5000 defective parts in each sample. Since 1% of the parts are defective, the probability of success p is 0.01 and of failure q is 0.99. The issue is that for sample sizes below 50% of the 1 million parts, the probability of finding at most 5000 defective parts always comes out as 0; at 50% it is 0.5; and for sample sizes above 50% it is 1. So across all sample sizes we get only three probability values: 0, 0.5, and 1. There are no intermediate values between 0 and 0.5 or between 0.5 and 1, although the sample size changes linearly. Can someone please point out the issue in this problem? I would be very grateful.

Best Answer

The problem: First we state what we understand to be the problem. The probability that a randomly chosen item is defective is $0.01$. We take possibly very large samples, samples that are enormous by essentially all sampling standards. We want to find the probability that the number of defectives in the sample is $\le 5000$.

A comment mentions the binomial distribution. It is not clear that this is the appropriate distribution. If the number of defectives is exactly $10000$, then we are dealing with the hypergeometric distribution. However, for the kinds of calculations we are making, for all practical purposes it doesn't matter.

The normal approximation: Let random variable $X$ be the number of bads in a sample of size $n$. We want $\Pr(X\le 5000)$.

The random variable $X$ has mean $n(0.01)$ and variance $n(0.01)(0.99)$. So the standard deviation is $\sqrt{n}\sqrt{(0.01)(0.99)}$. For most practical purposes this is $\sqrt{n}/10$. (This number would require modification for another probability of a bad.)

The probability that $X\le 5000$ is extremely well approximated by $$\Pr\left(Z\le \frac{5000-(0.01)n}{\sqrt{n}/10}\right),\tag{1}$$ where $Z$ is standard normal.

Some calculations: We do some calculations, for the various huge sample sizes mentioned in the OP.

$10$ percent: By (1), we want the probability that $Z\le 126.5$. This is $1$ for all practical and impractical purposes.

$30$ percent: By (1), we want the probability that $Z\le 36.5$. Again, this is $1$ for all practical purposes.

$40$ percent: Our probability is approximately the probability that $Z\le 15.8$. Again this is $1$ for all practical purposes.

$50$ percent: We want the probability that $Z\le 0$. This is $0.5$.

$60$ percent: We want the probability that $Z\le -12.9$. This is virtually equal to $0$.

Higher percentages give the same result, essentially $0$.
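The calculations above can be reproduced with a few lines of Python. This is only a minimal sketch of formula (1), using the $\sqrt{n}/10$ approximation for the standard deviation; the names `z_score` and `prob_at_most_5000` are illustrative and do not come from the original post.

```python
from math import sqrt
from statistics import NormalDist

N = 1_000_000      # population size from the question
P_BAD = 0.01       # probability that an item is defective
THRESHOLD = 5000   # we want Pr(X <= THRESHOLD)


def z_score(n):
    """Right-hand side of formula (1), with the sd approximated by sqrt(n)/10."""
    return (THRESHOLD - P_BAD * n) / (sqrt(n) / 10)


def prob_at_most_5000(n):
    """Normal approximation to Pr(X <= 5000) for a sample of size n."""
    return NormalDist().cdf(z_score(n))


for percent in (10, 30, 40, 50, 60):
    n = N * percent // 100
    print(f"{percent:>3}%  z = {z_score(n):8.1f}  Pr = {prob_at_most_5000(n):.4f}")
```

In double precision the probabilities for the $z$-values far from $0$ print as exactly `1.0000` or `0.0000`, which is the "three values only" effect described in the question.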

More calculations: The probability that $X\le 5000$ was virtually $1$ for the various percentages we calculated, up to but not including $50\%$, became $0.5$ at $50\%$, and had already dropped to virtually $0$ at $60\%$. We do some exploration of the fine structure around $50\%$.

Look for example at $49\%$. Again by (1), the probability that $X\le 5000$ is approximately the probability that $Z\le 1.428$. This is about $0.92$, significantly away from $1$. By symmetry, with a sample size of $51\%$, we have $\Pr(X\le 5000)\approx 0.08$.

Finally, let's do the computation for $49.5\%$. We get $\Pr(X\le 5000)\approx 0.76$.
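As a cross-check on the values near $50\%$, one can sum the binomial probabilities directly and compare with the normal approximation. The sketch below assumes the binomial model used throughout this answer (not the hypergeometric), works in log space via `math.lgamma` to avoid overflow, and uses the exact standard deviation $\sqrt{np(1-p)}$ rather than $\sqrt{n}/10$. The function names are hypothetical.

```python
from math import exp, lgamma, log, log1p, sqrt
from statistics import NormalDist


def binomial_cdf(k_max, n, p):
    """Pr(X <= k_max) for X ~ Binomial(n, p), summing the pmf in log space."""
    log_p, log_q = log(p), log1p(-p)
    total = 0.0
    for k in range(k_max + 1):
        log_pmf = (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
                   + k * log_p + (n - k) * log_q)
        total += exp(log_pmf)  # terms far in the tail underflow harmlessly to 0.0
    return total


def normal_cdf_approx(k_max, n, p):
    """Normal approximation with mean n p and sd sqrt(n p (1 - p))."""
    return NormalDist(mu=n * p, sigma=sqrt(n * p * (1 - p))).cdf(k_max)


for n in (490_000, 495_000, 500_000, 510_000):
    exact = binomial_cdf(5000, n, 0.01)
    approx = normal_cdf_approx(5000, n, 0.01)
    print(f"n = {n}: binomial {exact:.4f}, normal {approx:.4f}")
```

The two columns should agree to within about a percentage point, which is why the normal approximation is adequate for this problem.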

Remark: So interesting stuff happens only if we are looking at samples near the $50\%$ range. The phenomenon around $50\%$ is almost a "$0$-$1$" phenomenon, though on closer examination the transition turns out to be smooth.

Note that the normal approximation is not always appropriate for problems like this. For example, with the same probability $0.01$ of bad, to find the probability of exactly $3$ bad in a sample of $500$ I would suggest either the Poisson approximation to the binomial, or direct calculation of the binomial.
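For the small example just mentioned (exactly $3$ bad in a sample of $500$), both suggested methods take only a line or two. This is a sketch using the standard library; the variable names are illustrative.

```python
from math import comb, exp, factorial

n, p, k = 500, 0.01, 3

# Direct binomial calculation: C(n, k) p^k (1 - p)^(n - k)
binomial = comb(n, k) * p**k * (1 - p) ** (n - k)

# Poisson approximation with rate lambda = n p
lam = n * p
poisson = exp(-lam) * lam**k / factorial(k)

print(f"binomial: {binomial:.4f}")
print(f"Poisson:  {poisson:.4f}")
```

Both values come out near $0.14$, so for this sample size the Poisson approximation tracks the direct calculation closely.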

Generalities: We look at a more general situation. Let the population size be $N$, with $N$ large (in your case $N$ is $10^6$). Let the probability of a bad be $p$, where $p$ is small (in your case $p=0.01$). If $X$ is the number of bads in a sample of size $n$, then the standard deviation of $X$ is $\sqrt{np(1-p)}$.

We are looking at very large sample sizes $n$. Let $n=\alpha N$, so that the number $\alpha$ is the ratio $\frac{n}{N}$. For instance, if we are looking at a $40\%$ sample, then $\alpha=0.4$.

The expected number of bads in the population is $Np$. Your problem is to find the probability that $X\le \frac{1}{2}Np$. Note that $$\Pr\left(X\le \frac{1}{2}Np\right)=\Pr\left(\frac{X-np}{\sqrt{np(1-p)}}\le \frac{\frac{1}{2}Np-np}{\sqrt{np(1-p)}}\right).\tag{2}$$

The random variable $\frac{X-np}{\sqrt{np(1-p)}}$ is close to standard normal. Replace $n$ by $\alpha N$. The right-hand side of the inequality becomes $\frac{\left(\frac{1}{2}-\alpha\right)Np}{\sqrt{\alpha Np(1-p)}}$, and after cancelling a factor of $\sqrt{Np}$ we find that we want $$\Pr\left(Z \le \frac{\sqrt{p}}{\sqrt{\alpha}\sqrt{1-p}}\left(\frac{1}{2} -\alpha\right)\sqrt{N}\right).\tag{3}$$

We can do further simplification. For large $N$ and small $p$, the probability will be nearly $1$ for $\alpha \lt 1/2$ but not too close to $1/2$, and nearly $0$ for $\alpha \gt 1/2$ but not too close to $1/2$. So only values of $\alpha$ near $1/2$ matter, and there $\frac{1}{\sqrt{\alpha}}$ is close to $\sqrt{2}$. And $\sqrt{1-p}$ is close to $1$. So our probability is well approximated by $$\Pr\left(Z\le\left(\frac{1}{2}-\alpha\right)\sqrt{2pN}\right).\tag{4}$$
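To see how much the last simplification costs, the sketch below compares the standardized threshold taken directly from (2) with $n=\alpha N$ against the simplified expression in (4). The function names are hypothetical, and the values of $\alpha$ are chosen only for illustration.

```python
from math import sqrt


def z_from_formula_2(alpha, p, N):
    """Standardized threshold from (2) with n = alpha * N, no further approximation."""
    n = alpha * N
    return (0.5 * N * p - n * p) / sqrt(n * p * (1 - p))


def z_from_formula_4(alpha, p, N):
    """Simplified threshold (1/2 - alpha) * sqrt(2 p N) from formula (4)."""
    return (0.5 - alpha) * sqrt(2 * p * N)


p, N = 0.01, 10**6
for alpha in (0.47, 0.49, 0.50, 0.51, 0.53):
    print(f"alpha = {alpha:.2f}: (2) gives {z_from_formula_2(alpha, p, N):7.3f}, "
          f"(4) gives {z_from_formula_4(alpha, p, N):7.3f}")
```

The two agree exactly at $\alpha = 1/2$ and drift apart slowly as $\alpha$ moves away from it, which is the regime where the probability is already close to $0$ or $1$.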

This formula is easy to work with. As an example, from the normal tables we find that $\Pr(Z\le 4)\approx 0.99997$. Let $p=0.01$ and $N=10^6$. Let's find out what $\alpha$ should be so that $\Pr(X\le 5000)\approx 0.99997$. Setting $\left(\frac{1}{2}-\alpha\right)\sqrt{2pN}=4$ gives $\alpha=0.472$, about a $47\%$ sample.
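Solving formula (4) for $\alpha$ is a one-line rearrangement, $\alpha = \frac{1}{2} - z/\sqrt{2pN}$. The sketch below carries it out for $z=4$; the helper name `alpha_for_z` is an assumption made for illustration.

```python
from math import sqrt
from statistics import NormalDist


def alpha_for_z(z, p, N):
    """Solve (1/2 - alpha) * sqrt(2 p N) = z for alpha, per formula (4)."""
    return 0.5 - z / sqrt(2 * p * N)


p, N = 0.01, 10**6
z = 4.0
alpha = alpha_for_z(z, p, N)
print(f"Pr(Z <= {z}) = {NormalDist().cdf(z):.5f}")
print(f"alpha = {alpha:.3f}, i.e. a sample of about {alpha * N:,.0f} parts")
```

Calling `alpha_for_z(-4.0, p, N)` gives the mirror-image sample fraction above $50\%$, at which the probability has dropped to the corresponding small value.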

Conclusion: Unless the sample size is quite close to half the population size, the probability is for all practical purposes fully determined: it is either $1$ or $0$. There is not, however, a sudden jump at $50\%$. The shift is rapid but smooth. With your numbers, for all practical purposes the only interesting interval is the one from about $47\%$ to $53\%$. For similar situations with different numbers (but $N$ still large, and $p$ small), Formula (4) should give very good estimates.

Perhaps the simplest general explanation of the phenomenon is that for $p$ of the size we have been looking at, or smaller, the variance of $X$ is relatively low. For $p=0.01$, it is about one twenty-fifth of what the variance would be for $p=1/2$, so the standard deviation is about one-fifth as large. Thus even for the $40\%$ case, $4000$ is a lot of standard deviation units away from $5000$.
