[Math] Probability distribution of defective parts

probability distributions

Suppose there are 1 million parts, of which 1% are defective, i.e. 10000 of the 1 million parts are defective. Now suppose we take samples of different sizes from the 1 million, such as 10%, 30%, 50%, 70%, and 90% of the parts, and we need to calculate the probability of finding at most 5000 defective parts in each sample. Since 1% of the parts are defective, the success probability is $p = 0.01$ and the failure probability is $q = 0.99$. The issue is that for sample sizes below 50% of the 1 million parts, the probability of finding at most 5000 defective parts always comes out as 0; at 50% it is 0.5; and for sample sizes above 50% it is 1. So we only get three probability values across all sample sizes: 0, 0.5, and 1. There are no intermediate values between 0 and 0.5 or between 0.5 and 1, even though the sample size changes linearly. Can someone please point out the issue in this problem? I would be very grateful.

Best Answer

The problem: First we state what we understand to be the problem. The probability that a randomly chosen item is defective is $0.01$. We take possibly very large samples, samples that are enormous by essentially all sampling standards. We want to find the probability that the number of defectives in the sample is $\le 5000$.

A comment mentions the binomial distribution. It is not clear that this is the appropriate distribution. If the number of defectives is exactly $10000$, then we are dealing with the hypergeometric distribution. However, for the kinds of calculations we are making, for all practical purposes it doesn't matter.

The normal approximation: Let random variable $X$ be the number of bads in a sample of size $n$. We want $\Pr(X\le 5000)$.

The random variable $X$ has mean $n(0.01)$ and variance $n(0.01)(0.99)$. So the standard deviation is $\sqrt{n}\sqrt{(0.01)(0.99)}$. For most practical purposes this is $\sqrt{n}/10$. (This number would require modification for another probability of a bad.)

The probability that $X\le 5000$ is extremely well approximated by $$\Pr\left(Z\le \frac{5000-(0.01)n}{\sqrt{n}/10}\right),\tag{1}$$ where $Z$ is standard normal.

Some calculations: We do some calculations, for the various huge sample sizes mentioned in the OP.

$10$ percent: By (1), we want the probability that $Z\le 126.5$. This is $1$ for all practical and impractical purposes.

$30$ percent: By (1), we want the probability that $Z\le 36.5$. Again, this is $1$ for all practical purposes.

$40$ percent: Our probability is approximately the probability that $Z\le 15.8$. Again this is $1$ for all practical purposes.

$50$ percent: We want the probability that $Z\le 0$. This is $0.5$.

$60$ percent: By (1), we want the probability that $Z\le -12.9$. This is virtually equal to $0$.

Higher percentages give the same result, essentially $0$.

More calculations: The probability that $X\le 5000$ was virtually $1$ for the various percentages we calculated, up to but not including $50\%$, became $0.5$ at $50\%$, and had already dropped to virtually $0$ at $60\%$. We do some exploration of the fine structure around $50\%$.

Look for example at $49\%$. Again by (1), the probability that $X\le 5000$ is approximately the probability that $Z\le 1.428$. This is about $0.92$, significantly away from $1$. By symmetry, with a sample size of $51\%$, we have $\Pr(X\le 5000)\approx 0.08$.

Finally, let's do the computation for $49.5\%$. We get $\Pr(X\le 5000)\approx 0.76$.
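The values above can be reproduced with a short script. This is only a minimal sketch in Python using the standard library; the names `phi`, `z_score`, and `prob_at_most_5000` are my own choices, and the standard deviation uses the same $\sqrt{n}/10$ simplification as formula (1), so it is tied to $p=0.01$.

```python
import math

POPULATION = 1_000_000  # population size from the question


def phi(z):
    """Standard normal CDF, written with erfc so that far tails stay accurate."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))


def z_score(n, limit=5000):
    """Argument of formula (1): (limit - 0.01 n) / (sqrt(n) / 10)."""
    return (limit - 0.01 * n) / (math.sqrt(n) / 10.0)


def prob_at_most_5000(fraction):
    """Normal approximation to Pr(X <= 5000) when sampling this fraction of the parts."""
    return phi(z_score(fraction * POPULATION))


if __name__ == "__main__":
    for pct in (10, 30, 40, 49, 49.5, 50, 51, 60):
        n = pct / 100 * POPULATION
        print(f"{pct:>5}%  z = {z_score(n):8.3f}  Pr(X <= 5000) ~ {phi(z_score(n)):.4f}")
```

Running it over a finer grid of percentages between $47\%$ and $53\%$ shows the probability moving smoothly from nearly $1$ to nearly $0$.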

Remark: So interesting stuff happens only if we are looking at samples near the $50\%$ range. The phenomenon around $50\%$ is almost a "$0$-$1$" phenomenon, though on closer examination the transition turns out to be smooth.

Note that the normal approximation is not always appropriate for problems like this. For example, with the same probability $0.01$ of bad, to find the probability of exactly $3$ bad in a sample of $500$ I would suggest either the Poisson approximation to the binomial, or direct calculation of the binomial.
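For the small-sample example just mentioned, the two suggested calculations are easy to compare. The following is a hedged sketch using only Python's standard library; the helper names `binomial_pmf` and `poisson_pmf` are mine, and the Poisson mean is taken to be $np = 5$.

```python
import math


def binomial_pmf(k, n, p):
    """Exact binomial probability of exactly k successes in n trials."""
    return math.comb(n, k) * p**k * (1.0 - p) ** (n - k)


def poisson_pmf(k, lam):
    """Poisson probability of exactly k events when the mean is lam."""
    return math.exp(-lam) * lam**k / math.factorial(k)


n, p, k = 500, 0.01, 3
exact = binomial_pmf(k, n, p)
approx = poisson_pmf(k, n * p)  # Poisson approximation with mean n*p = 5

if __name__ == "__main__":
    print(f"binomial: {exact:.5f}   Poisson: {approx:.5f}")
```

The two values agree to about three decimal places, which is why either route is reasonable here.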

Generalities: We look at a more general situation. Let the population size be $N$, with $N$ large (in your case $N$ is $10^6$). Let the probability of a bad be $p$, where $p$ is small (in your case $p=0.01$.) If $X$ is the number of bads in a sample of size $n$, then the standard deviation of $X$ is $\sqrt{np(1-p)}$.

The mean number of bads in the population is $pN$. Let $n=\alpha N$. We are looking at very large sample sizes $n$. The number $\alpha$ is the ratio $\frac{n}{N}$. For instance, if we are looking at a $40\%$ sample, then $\alpha=0.4$.

The expected number of bads in the population is $Np$. Your problem is to find the probability that $X\le \frac{1}{2}Np$. Note that $$\Pr\left(X\le \frac{1}{2}Np\right)=\Pr\left(\frac{X-np}{\sqrt{np(1-p)}}\le \frac{\frac{1}{2}Np-np}{\sqrt{np(1-p)}}\right).\tag{2}$$

The random variable $\frac{X-np}{\sqrt{np(1-p)}}$ is close to standard normal. Replace $n$ by $\alpha N$. The right-hand side of the inequality in (2) then becomes $$\frac{\frac{1}{2}Np-\alpha Np}{\sqrt{\alpha Np(1-p)}}=\frac{\sqrt{p}}{\sqrt{\alpha}\sqrt{1-p}}\left(\frac{1}{2}-\alpha\right)\sqrt{N},$$ so we want $$\Pr\left(Z \le \frac{\sqrt{p}}{\sqrt{\alpha}\sqrt{1-p}}\left(\frac{1}{2} -\alpha\right)\sqrt{N}\right).\tag{3}$$

We can do further simplification. For large $N$ and small $p$, the probability will be nearly $1$ for $\alpha \lt 1/2$ but not too close to $1/2$, and nearly $0$ for $\alpha \gt 1/2$ but not too close to $1/2$. So the only interesting values of $\alpha$ are close to $1/2$, where $\frac{1}{\sqrt{\alpha}}$ is close to $\sqrt{2}$. And $\sqrt{1-p}$ is close to $1$. So our probability is well approximated by $$\Pr\left(Z\le\left(\frac{1}{2}-\alpha\right)\sqrt{2pN}\right).\tag{4}$$

This formula is easy to work with. As an example, from the normal tables we find that $\Pr(Z\le 4)\approx 0.99997$. Let $p=0.01$ and $N=10^6$. Let's find out what $\alpha$ should be so that $\Pr(X\le 5000)\approx 0.99997$. Setting $\left(\frac{1}{2}-\alpha\right)\sqrt{2pN}=4$ gives $\alpha=0.472$, about a $47\%$ sample.
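Formula (4) is simple enough to invert directly. As a rough sketch (the function names `alpha_for_z` and `z_unsimplified` are hypothetical, not from the answer), the code below solves $\left(\frac{1}{2}-\alpha\right)\sqrt{2pN}=z$ for $\alpha$ and then compares with the unsimplified $z$-value from (2), to see how much the simplification costs.

```python
import math


def alpha_for_z(z, p, N):
    """Invert formula (4): solve (1/2 - alpha) * sqrt(2 p N) = z for alpha."""
    return 0.5 - z / math.sqrt(2.0 * p * N)


def z_unsimplified(alpha, p, N):
    """z-value from formula (2) with n = alpha * N and no further simplification."""
    n = alpha * N
    return (0.5 * N * p - n * p) / math.sqrt(n * p * (1.0 - p))


alpha = alpha_for_z(4.0, p=0.01, N=10**6)

if __name__ == "__main__":
    print(f"alpha from (4): {alpha:.4f}")
    print(f"unsimplified z at that alpha: {z_unsimplified(alpha, 0.01, 10**6):.3f}")
```

The unsimplified $z$-value at that $\alpha$ comes out slightly above $4$, so the simplified formula is a little conservative here.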

Conclusion: Unless the sample is quite close to half the population, the probability is for all practical purposes fully determined. The shift around $50\%$ is rapid, but it is not sudden. With your numbers, for all practical purposes the only interesting interval is the one from about $47\%$ to $53\%$. For similar situations with different numbers (but $N$ still large, and $p$ small), Formula (4) should give very good quality estimates.

Perhaps the simplest general explanation of the phenomenon is that for $p$ of the size we have been looking at, or smaller, the variance of $X$ is relatively low. For $p=0.01$, the variance $np(1-p)$ is about one twenty-fifth of what it would be for $p=1/2$, so the standard deviation is about one-fifth. Thus even for the $40\%$ case, the mean $4000$ is a lot of standard deviation units (about $16$) away from $5000$.

Related Solutions

[Math] Calculate expectation from empirical cdf

Depending on how you've got your empirical cdf (discretized, formula) you might opt for Henning's answer, or also (given that the variable is positive) use this identity (obtained by integration by parts):

$$E(X) = \int_0^\infty (1- F(x)) \; dx$$
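For an empirical cdf built from a finite sample, the integrand $1-F(x)$ is a step function, so the integral reduces to a finite sum. Here is a minimal sketch assuming the cdf comes from raw nonnegative observations; the function name `mean_from_ecdf` is my own.

```python
def mean_from_ecdf(sample):
    """Integrate 1 - F(x) over [0, inf) for the empirical CDF of a nonnegative sample."""
    xs = sorted(sample)
    n = len(xs)
    if n == 0:
        raise ValueError("sample must not be empty")
    if xs[0] < 0:
        raise ValueError("the identity used here needs a nonnegative variable")
    total = 0.0
    prev = 0.0
    for i, x in enumerate(xs):
        # On [prev, x) exactly i observations lie at or below the point,
        # so F = i/n and 1 - F = (n - i)/n on that interval.
        total += (x - prev) * (n - i) / n
        prev = x
    return total


if __name__ == "__main__":
    data = [0.5, 1.0, 1.0, 2.5, 4.0]
    print(mean_from_ecdf(data), sum(data) / len(data))
```

For raw data this just recovers the sample mean, which makes it a handy sanity check before applying the same integral to a discretized or fitted cdf.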

[Math] Probability question involving simulations of picking balls from a bag

All ${100 \choose 30}$ patterns are equally likely. There is probably a better approach to your first question, but this produces a nice answer:

Suppose that $R(r,b)$ is the expected number of runs given that the first ball is red and $B(r,b)$ is the expected number of runs given that the first ball is blue, in each case when there are $r$ red balls and $b$ blue balls. Then $R(r,0)=1$ and $B(0,b)=1$ and by symmetry $R(r,b)=B(b,r)$. More generally we have the recurrence $$R(r,b)=\frac{r-1}{r+b-1} R(r-1,b) + \frac{b}{r+b-1} (1+R(b,r-1))$$ which has the solution $R(r,b)= \dfrac{2rb+r-1}{r+b-1}$ with $R(1,0)=1$. So the expected number of runs overall when there are $r$ red balls and $b$ blue balls is $$\frac{r}{r+b}R(r,b)+\frac{b}{r+b}R(b,r)=1+\frac{2rb}{r+b}.$$

In your question you had $r=70$ and $b=30$, giving an expected number of runs of $43$.
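The recurrence and its closed form can be checked with exact rational arithmetic. The following is a sketch only; the function names are mine, and the recursion is written exactly as in the recurrence for $R(r,b)$ above, using $B(r-1,b)=R(b,r-1)$.

```python
from fractions import Fraction
from functools import lru_cache


@lru_cache(maxsize=None)
def runs_given_first_red(r, b):
    """Expected number of runs when the first ball is red (r >= 1 red, b >= 0 blue)."""
    if b == 0:
        return Fraction(1)  # all balls red: a single run
    remaining = r + b - 1  # balls left after the first red one
    # Next ball blue: a new run starts, and by symmetry we swap the colours.
    value = Fraction(b, remaining) * (1 + runs_given_first_red(b, r - 1))
    if r > 1:
        # Next ball red: the current run simply continues.
        value += Fraction(r - 1, remaining) * runs_given_first_red(r - 1, b)
    return value


def expected_runs(r, b):
    """Overall expected number of runs, averaging over the colour of the first ball."""
    total = Fraction(0)
    if r > 0:
        total += Fraction(r, r + b) * runs_given_first_red(r, b)
    if b > 0:
        total += Fraction(b, r + b) * runs_given_first_red(b, r)
    return total


if __name__ == "__main__":
    print(expected_runs(70, 30))
```

Because everything is a `Fraction`, the comparison with $1+\frac{2rb}{r+b}$ is exact rather than approximate.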

It is slightly unfortunate that $43$ is odd, as it complicates the distribution of runs (e.g. there might be $21$ or $22$ red runs with probabilities $0.3$ and $0.7$).

Added: In general, if there are $n$ balls of a particular colour and they are distributed between $k$ positive runs, then a simple stars and bars argument says that the probability that a given run has length $m$ is $$\Pr(M=m|n,k)= \dfrac{\displaystyle {n-m-1 \choose k-2}}{\displaystyle {n-1 \choose k-1}}$$ and each of the $k$ runs has an identical (but not independent) distribution. So for example if you have $70$ red balls in $22$ runs or $30$ blue balls in $21$ runs, the probability distribution for the length $M$ of a red run or of a blue run would be about

m   red 70,22   blue 30,21
1   0.3043478   0.6896552
2   0.2148338   0.2216749
3   0.1507043   0.0656814
4   0.1050363   0.0176835
5   0.0727174   0.0042440
6   0.0499932   0.0008842
7   0.0341224   0.0001538
8   0.0231152   0.0000210
9   0.0155364   0.0000020
10  0.0103576   0.0000001
11  0.0068466   not possible
12  0.0044857   
13  0.0029118   
14  0.0018718   
15  0.0011912   
16  0.0007500   
17  0.0004670   
18  0.0002874   
19  0.0001747   
20  0.0001048   
21  (smaller)
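The table above can be regenerated from the stars and bars formula. This is a small sketch under the assumption $k\ge 2$; the function name `constrained_run_length_pmf` is hypothetical.

```python
from fractions import Fraction
from math import comb


def constrained_run_length_pmf(m, n, k):
    """Pr(M = m | n, k) = C(n-m-1, k-2) / C(n-1, k-1) for n balls split into k runs."""
    if k < 2 or not 1 <= m <= n - k + 1:
        raise ValueError("need k >= 2 and 1 <= m <= n - k + 1")
    return Fraction(comb(n - m - 1, k - 2), comb(n - 1, k - 1))


if __name__ == "__main__":
    for m in range(1, 11):
        red = float(constrained_run_length_pmf(m, 70, 22))
        blue = float(constrained_run_length_pmf(m, 30, 21))
        print(f"{m:<3} {red:.7f}   {blue:.7f}")
```

The upper limit $n-k+1$ is why a blue run of length $11$ is not possible: $30$ blue balls in $21$ runs leave at most $10$ for any one run.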

But we can go further than that, and see the distribution of runs unconstrained by how many runs there are. As an example consider two red balls and one blue ball: the patterns $R_2B_1$, $R_1B_1R_1$, and $B_1R_2$ are equally likely and so with repeated experiments over time you expect to see as many runs of $R_2$ as $R_1$ in this particular case. More generally, if there are $r$ red balls and $b$ blue balls, the expected proportion of red runs which would be of length $m$ in that sense, is

$$ \dfrac{b}{r} \dfrac{ {r \choose m} }{ {r+b-1 \choose m}}$$

where reversing $r$ and $b$ would give the expected proportion of blue runs of length $m$ in that sense. With $r=70$ and $b=30$ this would give about

m   red runs    blue runs
1   0.3030303   0.7070707
2   0.2133581   0.2092352
3   0.1495706   0.0603978
4   0.1043878   0.0169869
5   0.0725221   0.0046490
6   0.0501482   0.0012364
7   0.0345106   0.0003191
8   0.0236323   0.0000798
9   0.0161011   0.0000193
10  0.0109130   0.0000045
11  0.0073571   0.0000010
12  0.0049326   0.0000002
13  0.0032884   (smaller)
14  0.0021795   
15  0.0014359   
16  0.0009402   
17  0.0006117   
18  0.0003954   
19  0.0002538   
20  0.0001618   
21  (smaller)

which is not that far away from the constrained case.
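The unconstrained proportions can be computed the same way. The sketch below is illustrative only; the function name `run_length_share` and its argument names are mine, with `own` the number of balls of the colour in question and `other` the number of the other colour.

```python
from fractions import Fraction
from math import comb


def run_length_share(m, own, other):
    """Expected share of runs of length m: (other/own) * C(own, m) / C(own+other-1, m)."""
    return Fraction(other, own) * Fraction(comb(own, m), comb(own + other - 1, m))


if __name__ == "__main__":
    r, b = 70, 30
    for m in range(1, 13):
        red = float(run_length_share(m, r, b))
        blue = float(run_length_share(m, b, r))
        print(f"{m:<3} {red:.7f}   {blue:.7f}")
```

With two red balls and one blue ball it gives $1/2$ for both $m=1$ and $m=2$, matching the small example, and the shares over all possible $m$ sum to exactly $1$.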
