Suppose I have $N$ probabilities $(p_1, p_2,\dots,p_N)$ that represent the chance that a corresponding test was passed. How do I apply the Bernoulli distribution to determine the expected number of passes?
Solved – Expected number of successes from $N$ Bernoulli trials with different $p$
bernoulli-distribution, poisson-binomial-distribution
Related Solutions
The distribution you are asking about is called the Poisson binomial distribution, with a rather complicated pmf (see Wikipedia for a broader description)
$$ \Pr(X=x) = \sum\limits_{A\in F_x} \prod\limits_{i\in A} p_i \prod\limits_{j\in A^c} (1-p_j) $$
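For small $N$, the pmf can be computed by brute force. A minimal sketch in R (dpbinom_exact is a hypothetical helper, not from any package) that enumerates all $2^N$ pass/fail patterns and sums the probabilities of those with exactly $x$ passes:

dpbinom_exact <- function(x, prob) {
  n <- length(prob)
  patterns <- expand.grid(rep(list(0:1), n))  # all 2^n success/failure patterns
  # probability of each pattern: prod of p_i for successes, (1 - p_i) for failures
  probs <- apply(patterns, 1, function(a) prod(ifelse(a == 1, prob, 1 - prob)))
  sapply(x, function(k) sum(probs[rowSums(patterns) == k]))
}

dpbinom_exact(0:3, c(0.2, 0.5, 0.9))  # sums to 1 over 0:3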
Generally, the problem is that using it directly for a larger number of trials would be very slow. There are also other methods of calculating the pmf, e.g. recursive formulas, but they are numerically unstable. The easiest way around those problems is to use approximation methods (described e.g. by Hong, 2013). If we define
$$ \mu = \sum_{i=1}^n p_i $$
$$ \sigma = \sqrt{ \sum_{i=1}^n p_i(1-p_i) } $$
$$ \gamma = \sigma^{-3} \sum_{i=1}^n p_i (1 - p_i) (1 - 2p_i)$$
then we can approximate the pmf with a Poisson distribution, via the law of small numbers or Le Cam's theorem,
$$ \Pr(X = x) \approx \frac{\mu^x \exp(-\mu)}{x!} $$
but it seems that generally the Binomial approximation behaves better (Choi and Xia, 2002),
$$ \Pr(X = x) \approx \mathrm{Binom} \left( n, \frac{\mu}{n} \right)$$
or you can use the Normal approximation of the pmf
$$ f(x) \approx \frac{1}{\sigma}\, \phi \left( \frac{x - \mu}{ \sigma} \right) $$
or the cdf can be approximated using the so-called refined Normal approximation (Volkova, 1996)
$$ F(x) \approx \max\left(0, \ g \left( \frac{x + 0.5 - \mu}{ \sigma} \right) \right)$$
where $g(x) = \Phi(x) + \gamma(1-x^2) \frac{\phi(x)}{6}$.
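In R this could be sketched as follows (ppbinom_rna is a hypothetical name; it simply plugs $\mu$, $\sigma$, and $\gamma$ into the formula above):

ppbinom_rna <- function(x, prob) {
  mu <- sum(prob)
  sigma <- sqrt(sum(prob * (1 - prob)))
  gamma <- sum(prob * (1 - prob) * (1 - 2 * prob)) / sigma^3
  g <- function(z) pnorm(z) + gamma * (1 - z^2) * dnorm(z) / 6
  pmax(0, g((x + 0.5 - mu) / sigma))  # clamped at zero, as in the formula
}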
Another alternative is of course a Monte Carlo simulation.
A simple dpbinom R function would be:
dpbinom <- function(x, prob, log = FALSE,
                    method = c("MC", "PA", "NA", "BA"),
                    nsim = 1e4) {
  stopifnot(all(prob >= 0 & prob <= 1))
  method <- match.arg(method)
  if (method == "PA") {
    # Poisson approximation: lambda = sum of the p_i
    dpois(x, sum(prob), log)
  } else if (method == "NA") {
    # Normal approximation with mean mu and standard deviation sigma
    dnorm(x, sum(prob), sqrt(sum(prob * (1 - prob))), log)
  } else if (method == "BA") {
    # Binomial approximation: n trials with the average success probability
    dbinom(x, length(prob), mean(prob), log)
  } else {
    # Monte Carlo: simulate nsim sums of independent Bernoulli draws
    tmp <- table(colSums(replicate(nsim, rbinom(length(prob), 1, prob))))
    tmp <- tmp / sum(tmp)
    p <- as.numeric(tmp[as.character(x)])
    p[is.na(p)] <- 0  # values of x never observed in the simulation
    if (log) log(p) else p
  }
}
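For example (hypothetical usage; the Monte Carlo values vary with the seed):

set.seed(123)
p <- c(0.1, 0.5, 0.8, 0.9)
dpbinom(0:4, p, method = "MC")  # Monte Carlo estimate of the pmf
dpbinom(0:4, p, method = "BA")  # Binomial approximation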
Most of the methods (and more) are also implemented in the R poibin package.
Chen, L.H.Y. (1974). On the Convergence of Poisson Binomial to Poisson Distributions. The Annals of Probability, 2(1), 178-180.
Chen, S.X. and Liu, J.S. (1997). Statistical applications of the Poisson-Binomial and conditional Bernoulli distributions. Statistica Sinica, 7, 875-892.
Chen, S.X. (1993). Poisson-Binomial distribution, conditional Bernoulli distribution and maximum entropy. Technical Report, Department of Statistics, Harvard University.
Chen, X.H., Dempster, A.P. and Liu, J.S. (1994). Weighted finite population sampling to maximize entropy. Biometrika, 81, 457-469.
Wang, Y.H. (1993). On the number of successes in independent trials. Statistica Sinica, 3(2), 295-312.
Hong, Y. (2013). On computing the distribution function for the Poisson binomial distribution. Computational Statistics & Data Analysis, 59, 41-51.
Volkova, A.Y. (1996). A refinement of the central limit theorem for sums of independent random indicators. Theory of Probability and its Applications, 40, 791-794.
Choi, K.P. and Xia, A. (2002). Approximating the number of successes in independent trials: Binomial versus Poisson. The Annals of Applied Probability, 12(4), 1139-1148.
Le Cam, L. (1960). An Approximation Theorem for the Poisson Binomial Distribution. Pacific Journal of Mathematics, 10(4), 1181-1197.
I would say that the closest to your description is the multivariate hypergeometric distribution
$$ f(k_1,\dots,k_c) = \frac{\prod_{i=1}^c {K_i \choose k_i}}{ N\choose n} $$
where you sample $n$ marbles appearing in $c$ colors, with $K_i$ marbles of each color, so you can sample no more than $N = \sum_i K_i$ marbles in total.
Since you seem to be asking about the total number of successes (as in the Poisson-binomial distribution in your example), I understand that you count $c-1$ of the colors as "successes" and the $c$-th color as "failure". To obtain the distribution of the total number of successes in this case, notice that the multivariate hypergeometric distribution is marginally hypergeometric, so $k_\text{tot} = \sum_{i=1}^{c-1} k_i$ follows a hypergeometric distribution where you sample $n$ marbles from an urn containing $K_\text{tot} = \sum_{i=1}^{c-1} K_i$ "success" marbles; this is the same as subtracting $k_c$ from $n$, where $k_c$ itself is drawn from a univariate hypergeometric distribution.
Saying this in plain English: simply imagine that you painted all the marbles of colors $1,\dots,c-1$ black, leaving the $c$-th color untouched; now you are drawing from an urn containing only two kinds of marbles (black vs. the $c$-th color) and you do not need to care about all the kinds of marbles you started with.
So the case with a finite number of marbles and drawing without replacement is simple. On the other hand, if you are dealing with an infinite number of marbles, then sampling with replacement does not differ from sampling without replacement, so you are dealing with a Poisson-binomial distribution.
Notice that the initial probabilities of drawing the balls are $p_i = K_i/N$, so if you are dealing with a finite number of balls, then knowing the initial probabilities is the same as knowing the counts.
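To illustrate in base R (the counts below are made up), the "painted black" reduction means that dhyper gives the distribution of the total number of successes directly:

K <- c(5, 3, 2, 10)             # marbles per color; the last color is "failure"
N <- sum(K)
Ktot <- sum(K[-length(K)])      # colors 1..c-1 painted black ("successes")
n <- 6                          # draws without replacement
dhyper(0:n, Ktot, N - Ktot, n)  # pmf of the total number of successes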
Best Answer
If you add up $n$ Bernoulli random variables with different success probabilities, then the sum has a Poisson binomial distribution.
If the success probabilities are $p_i$, then the mean of the Poisson binomial distribution is $n \bar{p}$, where $\bar{p}=\frac{\sum_i p_i}{n}$, and its variance is $n \bar{p} (1-\bar{p}) - n \operatorname{var}(p)$ (note: for $\operatorname{var}(p)$ there must be $n$, not $n-1$, in the denominator!).
Note that if $\operatorname{var}(p)$ is relatively small, then the variance reduces to the binomial variance, which is expected, because a small $\operatorname{var}(p)$ means that the success probabilities of the Bernoullis are more or less equal.
I have some code to simulate this.
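A minimal sketch of such a simulation (reconstructed, since the original block is not shown here; the name simulate_pb is hypothetical):

simulate_pb <- function(ps, nsim = 1e5) {
  # simulate nsim sums of independent Bernoulli(p_i) draws and compare the
  # empirical mean and variance with the formulas above
  n <- length(ps)
  sums <- replicate(nsim, sum(rbinom(n, 1, ps)))
  pbar <- mean(ps)
  c(mean_sim = mean(sums),
    mean_theory = n * pbar,
    var_sim = var(sums),
    var_theory = n * pbar * (1 - pbar) - (length(ps) - 1) * var(ps))
}

simulate_pb(c(0.2, 0.5, 0.8, 0.9))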
In the return of the R function there is (length(ps)-1)*var(ps); this is because the R function var() has $n-1$ in the denominator, so $-n \times \operatorname{var}(p)$ in the formula above should be 'translated' to -length(ps) * var(ps) * (length(ps)-1)/length(ps), which simplifies to -var(ps) * (length(ps)-1).
See also: Intuitive explanation for dividing by $n-1$ when calculating standard deviation?