Solved – Calculate the cumulative probability of multiple different probabilities P(1..k) with n trials where n > k

binomial-distribution · multinomial-distribution · probability

I've been able to model the probability of multiple successes where the probability of each success is the same, but I am unable to figure out how to model the combined probability where the probabilities of individual successes are different.

For example:

If I roll 10 six-sided dice, the probability that at least five of them show a four or higher is given by a cumulative binomial distribution:

Number of trials = n = 10
Probability of success = P = 0.5
Number of successes = x >= 5

Binomial Distribution: b(x; n, P) = nCx * P^x * (1 – P)^(n – x)

and the Cumulative Binomial Distribution:

b(x >= 5; 10, 0.5) = b(x = 5; 10, 0.5) + b(x = 6; 10, 0.5) + .. + b(x = 10; 10, 0.5) = 0.623046875

(source: http://stattrek.com/probability-distributions/binomial.aspx)
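That cumulative figure is easy to verify with a short stdlib-only Python sketch (the answer below uses R; this is just an independent check of the 0.623046875 value):

```python
from math import comb

def binom_pmf(x, n, p):
    """Binomial pmf: nCx * p^x * (1-p)^(n-x)."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

# P(X >= 5) for n = 10 trials, p = 0.5 (a four or higher on one die)
prob = sum(binom_pmf(x, 10, 0.5) for x in range(5, 11))
print(prob)  # 0.623046875
```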

So far, so good. My problem now is figuring out how to calculate the probability if the target values are, for example: two fours or greater, two fives or greater and one six out of 10 trials.

Gut feel: the probability for each set of equivalent successes (sixes, fives and fours) should be evaluated in isolation and, once evaluated, the number of trials should be reduced by the size of the set. This also implies that the order makes a difference and, for that purpose, I'm assuming that obtaining a six is a higher priority than obtaining a five, and so on. This would then mean that the probability of obtaining a six is calculated first using 10 trials, followed by the fives using 9 trials and the fours using 7 trials.

b(x >= 1; 10, 0.166667) = 0.838495063131323

b(x >= 2; 9, 0.333333) = 0.856932373512168

b(x >= 2; 7, 0.5) = 0.9375

What I don't know is how to put the individual probabilities back together again into a combined probability. This method also doesn't feel right, since it seems like you are conducting a total of 26 trials to achieve these results.

I sense I may be over-complicating the calculation since it may just be a normal cumulative binomial function with changing probabilities:

b(x >= 5; 10, {0.166667;0.333333;0.333333;0.5;0.5}) = b(x = 0; 10, 0.166667) + b(x = 1; 10, 0.333333) + b(x = 2; 10, 0.333333) + b(x = 3; 10, 0.5) + b(x = 4; 10, 0.5) + b(x = 5; 10, ???)

As you can see, I have no idea what the probability should be for the last term where x = 5… I'm assuming that the formula should iterate 6 times from 0 to 5, since that is the number of iterations for b(x >= 5; 10, 0.5).

I have not been able to find an equivalent question here and would appreciate any assistance!

Best Answer

In general, you have to compute these multinomial probabilities individually and add them.

Consider a die with $d$ sides, each appearing with probability $p_i, i=1, 2, \ldots, d$. The experiment consists of rolling this die $n$ times independently and recording the number of times $x_i$ that each face $i$ appears.

Suppose you want to find the chance that the $x_i$ simultaneously are one of the counts in a specified set $S_i$. For instance, the problem asks for the chance that $x_4\in S_4=\{2,3,\ldots,10\}$ (two or more fours), $x_5\in S_5=\{2,3,\ldots,10\}$ (two or more fives), $x_6\in S_6=\{1,2,\ldots,10\}$ (one or more sixes), and $x_1,x_2,x_3$ may have any value in $S_1=S_2=S_3=\{0,1,\ldots, 10\}$.

Let's abstractly name this chance $f(n, \mathbf{p}, \mathbf{S})$ (where $\mathbf p = (p_1,p_2, \ldots, p_d)$ and $\mathbf{S} = (S_1, S_2, \ldots, S_d)$).

The outcomes have a multinomial distribution. Thus

$$f(n, \mathbf{p}, \mathbf{S}) = \sum_{a_i\in S_i; a_1+\cdots+a_d=n} \binom{n}{a_1,a_2,\ldots,a_d}p_1^{a_1} p_2^{a_2} \cdots p_d^{a_d}$$

where the multinomial coefficient can be computed as

$$\binom{n}{a_1,a_2,\ldots,a_d} = \frac{n!}{a_1!a_2!\cdots a_d!}.$$
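For concreteness, the coefficient can be evaluated directly; a Python sketch (not part of the original answer), applied to one admissible outcome of the example, $a = (2,1,1,2,2,2)$, which sums to $10$:

```python
from math import factorial

def multinomial(n, counts):
    """n! / (a1! * a2! * ... * ad!), valid when sum(counts) == n."""
    assert sum(counts) == n
    result = factorial(n)
    for a in counts:
        result //= factorial(a)  # always divides evenly
    return result

print(multinomial(10, [2, 1, 1, 2, 2, 2]))  # 226800
```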

Evaluate this sum one index at a time. For instance, beginning with the last index $a_d$ we obtain

$$\eqalign{ f(n, \mathbf{p}, \mathbf{S}) &= \sum_{a_d\in S_d} \binom{n}{a_d} p_d^{a_d}\sum_{a_i\in S_i; a_1+\cdots+a_{d-1}=n-a_d} \binom{n-a_d}{a_1,a_2,\ldots,a_{d-1}}p_1^{a_1} p_2^{a_2} \cdots p_{d-1}^{a_{d-1}}\\ &= \sum_{a_d\in S_d} \binom{n}{a_d} p_d^{a_d} f(n-a_d, \mathbf{p}^\prime, \mathbf{S}^\prime) }$$

where the primes indicate the last element has been deleted from each vector. This is a recursive program. It bottoms out at $d=1$ (a single face remaining), where

$$f(n, p, S) = \left\{\matrix{p^n & \text{if } n \in S\\ 0 & \text{otherwise.}}\right.$$
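To make the recursion concrete, here is a direct stdlib-only Python translation (not part of the original answer; it peels off the first face rather than the last, and uses exact rational arithmetic via `Fraction` instead of working with logarithms, which is affordable at this problem size):

```python
from fractions import Fraction
from math import comb

def f(n, p, S):
    """Chance that n rolls of a die with face probabilities p give
    face counts x_i lying in the sets S[i], computed recursively."""
    if len(S) == 1:  # base case d = 1: all n remaining rolls show this face
        return p[0]**n if n in S[0] else Fraction(0)
    total = Fraction(0)
    for a in S[0]:  # peel off the first face's count
        if 0 <= a <= n:
            total += comb(n, a) * p[0]**a * f(n - a, p[1:], S[1:])
    return total

n, d = 10, 6
p = [Fraction(1, d)] * d
# x1..x3 unconstrained; two or more fours, two or more fives, one or more sixes
S = [range(n + 1)] * 3 + [range(2, n + 1), range(2, n + 1), range(1, n + 1)]
print(float(f(n, p, S)))  # 10593330/6**10, approx. 0.1751943
```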

This is potentially a time-consuming calculation, because when the $S_i$ are numerous, there may be a great many cases to examine. For small $d$ and (depending on $d$) small $n$ it is practicable: the example in the question takes no measurable time. For large $d$, large $n$, and numerous $S_i$, a cleverer algorithm is needed to screen out cases that contribute so little to the answer that they can be ignored.

Because the coefficients $\binom{n}{a_d}$ can be very large and the powers $p_d^{a_d}$ very small, it's a good idea to sum logarithms rather than perform these multiplications directly. That is the method implemented in the code below.
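The value of working on the log scale shows up for large $n$; a Python sketch (my illustration, not the answer's code), using the log-gamma function for the log-binomial coefficient:

```python
from math import lgamma, log, exp

def log_choose(n, k):
    """log of the binomial coefficient C(n, k), via log-gamma."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)

# P(X = 5000) for Binomial(10000, 0.5): C(10000, 5000) overflows a double
# and 0.5**10000 underflows to 0.0, yet the sum of their logs is moderate.
log_term = log_choose(10000, 5000) + 10000 * log(0.5)
val = exp(log_term)
print(val)  # approx. 0.00798, close to 1/sqrt(pi * 5000)
```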

For the example described in the question, the result is $10\,593\,330/6^{10} \approx 0.1751943$. The following R code computed it. It includes a simulation afterwards that runs the experiment a million times and then compares the frequency with which $\mathbf x$ fell in $\mathbf S$ to the value previously computed; a Z-score for the difference is also returned. Typically a Z-score smaller than about $2.5$ in magnitude is good confirmation, especially when the simulation is large enough to have observed the event many times. Here is its output:

Theory Simulated         Z 
0.1752    0.1752    0.0255 

The simulation agreed almost exactly with the computed value.

# Return the logarithm of the chance that a die described by probability
# vector `p`, after `n` independent rolls, will exhibit face counts given
# by the numbers in the (nonempty) list `s`.
#
f <- function(n, p, s) {
  if (length(s)==1) return(ifelse(n %in% s[[1]], n * log(p[1]), -Inf)) # Base case

  a <- s[[1]]
  a <- a[0 <= a & a <= n]
  if (length(a)==0) return(-Inf) # There's nothing satisfying the criteria

  z <-  a * log(p[1]) + lchoose(n, a) + sapply(a, function(a) f(n-a, p[-1], s[-1]))
  return(log(sum(exp(z))))
}
#
# Examples and testing.
#
n <- 10
d <- 6
p <- rep(1/d, d)
s <- list(0:n, 0:n, 0:n, 2:n, 2:n, 1:n)
# s <- lapply(p, function(i) sample(0:n, ceiling(0.9*n))) # For testing
theory <- exp(f(n, p, s))
#
# Simulation.
#
n.sample <- 1e6
set.seed(17)
x <- rmultinom(n.sample, n, p)
e <- paste(sapply(1:length(s), function(i) paste0("x[", i, ", ] %in% s[[", i, "]]")), 
           collapse=" & ")
s <- eval(parse(text=paste0("mean(", e, ")")))
round(c(Theory=theory, Simulated=s, 
        Z=(s-theory) / sqrt(theory*(1-theory)/n.sample)), ceiling(log10(n.sample)/2+1))