Probability – How to Calculate Error for Estimated Number of Events with Known Probabilities?

confidence intervalestimationprobability

I have a long series of entities:

x₁, x₂, x₃, … x_n

for each of which, there is a probability of an event occurring. The probability for each x may be different, but they are all independent, and they are all known:

p₁, p₂, p₃, … p_n

I might provide a point-estimate for how many of said event occurred in that series by simply summing the probabilities p, but is it possible to also calculate a confidence interval or some information about the error in a close-form fashion?

The resistance to using some sampling method is that the series of x many be prohibitively long, and there may be many such series upon which the same estimates need be delivered.

EDIT – Refactoring of the question

For a toy example of the problem, imagine I have a large batch of n widgets, each of which has a known probability of failing within the year. Each widget may have a different known probability of failing, some high, some low (say they belong to different widget classes, those classes well-studied in the past). I'm tasked with estimating the number of widgets that will fail at the end of the year for that batch.

I believe I can provide a point estimate by simply summing the probabilities of each widget in the batch. Stakeholders also request a confidence interval, which may not be possible, but I should be able to provide a sense of error/uncertainty/variance around that point estimate.

My understanding is that I can treat my widgets failing/not-failing as a series of Bernoulli trials, and that likely the "number that fail" can be modelled as a Poisson Binomial Distribution, but from there I get stuck.

Best Answer

I wanted to sum up what ultimately I've found works well for this case and roll up what's been discussed in some of the comments.

Since I am basically wanting to know how many events occurred out of series of independent Bernoulli trials, we can model this as a Poisson Binomial distribution.

The mean of that distribution (which is my estimate) is the sum of all the probabilities.

The variance of that distribution is also known: the sum of all (1 - p_i)p_i.

That already gives me the standard deviation (the square root of the variance), so I've gotten a little bit of what the stakeholder wants.

We know that often the Poisson Binomial approaches normal with larger number of parameters (constituent probabilities from the trials), and luckily because of the nature of my problem, I will almost always have a large number of parameters. When that's the case, we can get an interval by simply taking the mean +/- 1.96 * sdev.

That's not an assumption I can just make easily though, so I did a bit of modelling to find out find out what are the typical probabilities for a given widget batch (the example scenario above), and what are the more extreme probability distributions, each at the different likely parameter/probability sizes.

I then calculated the Poisson Binomial mean, sdev, and 95% intervals using the approaches noted and ran monte carlo against them, doing this many times (with different draws from the middle and the extremes) to try and get as much coverage of the possible probability distributions I could run into. I compared the mean, sdev, and interval estimates against those retrieved from monte carlo by their percent error (treating the monte carlo result as the "right" answer).

Long story short, the mean and sdev values are in line with Poisson Binomial quite well. And the confidence intervals, while possibly being up to 5% off in the more extreme cases, are well within our business need for low parameter sizes, and have almost no error for higher parameter sizes (which is most of our cases).

So ultimately I used what I can from the known moments from the Poisson Binomial distribution directly, and then back those into intervals assuming normality (which works well enough for our specific use case).

Related Solutions

Probability Estimation – How to Efficiently Estimate Individuals with n+ Successes from Bernoulli Trials

I'm not convinced that the stipulated size for your data is too large for exact computation using the Poisson-binomial distribution. Nevertheless, since you are looking for some reasonable approximation method, you are probably going to have to fall back onto the normal approximation from the CLT (which ought to be slightly better then what you are doing now. Alternatively, you could consider using a hybrid method that I'll explain below.

Suppose we have a Poisson-binomial random variable $X \sim \text{PoisBin}(\boldsymbol{\theta})$ with probabilities $\boldsymbol{\theta} = (\theta_1,...,\theta_k)$ for $k$ independent trials. We can write this as $X = \sum_i U_i$ where $U_1,...,U_k \sim \text{IID Bern}(\theta_i)$, which allows us to easily compute the mean and variance:

$$\begin{align} \mathbb{E}(X) &= \mathbb{E} \bigg( \sum_{i=1}^k U_i \bigg) \\[6pt] &= \sum_{i=1}^k \mathbb{E} ( U_i ) \\[6pt] &= \sum_{i=1}^k \theta_i, \\[6pt] \mathbb{V}(X) &= \mathbb{V} \bigg( \sum_{i=1}^k U_i \bigg) \\[6pt] &= \sum_{i=1}^k \mathbb{V} ( U_i ) \\[6pt] &= \sum_{i=1}^k \theta_i(1-\theta_i). \\[6pt] \end{align}$$

Now, if $k$ is not too small, we can invoke the CLT to approximate the distribution by a normal distribution with the same mean and variance, giving the resulting approximation:

$$X \overset{\text{approx}}{\sim} \text{N}(\mu, \sigma^2) \quad \quad \quad \quad \quad \mu \equiv \sum_{i=1}^k \theta_i \quad \quad \quad \quad \quad \sigma^2 \equiv \sum_{i=1}^k \theta_i(1-\theta_i).$$

There are various ways you can obtain the specific approximating distribution you want to use (e.g., using continuous approximation with or without continuity correction, discrete approximation, etc.) and these will give you different formulae for the approximate probability of interest. I quite like to use the approximation obtained by taking the density at the exact discrete points and then normalising, which gives:

$$\mathbb{P}(X \geqslant r) \approx \frac{\sum_{x=r}^k \text{N}(x|\mu, \sigma^2)}{\sum_{x=0}^k \text{N}(x|\mu, \sigma^2)}.$$

(Note here that I use the notation $r$ instead of $n$ for the cut-off value of interest; it is bad practice to use the latter as notation, since it is the standard notation for the number of data points in the analysis.) It ought to be computationally feasible to compute this approximate probability for each of your rows. Each computation requires you to first compute $\mu$ and $\sigma^2$ from the above formulae and then compute the approximate probability

Implementation in R: You can implement this approximation in R quite simply using the existing facilities for the normal distribution. It is best to undertake computations in log-space for accuracy. Here is a function that will compute the approximate probability/log-probability for an input probability vector theta and input value n (with an option log to return the result as a log-probability instead of a probability).

#Create a function to compute the approximate probability
prob.approx <- function(theta, r, log = FALSE) {
  k <- length(theta)
  if (r > k) { if (log) { return(-Inf) } else { return(0) } }
  MEAN <- sum(theta)
  VAR  <- sum(theta*(1-theta))
  LOGS <- dnorm(0:k, mean = MEAN, sd = sqrt(VAR), log = TRUE)
  LOGP <- matrixStats::logSumExp(LOGS[(r+1):(k+1)]) - matrixStats::logSumExp(LOGS)
  if (log) { LOGP } else { exp(LOGP) } }

We can experiment to see how fast this runs using the microbenchmark package. First I will set up some mock data where the number of elements in the probability vectors is Poisson distributed with a mean of 40 values. Then I will run a benchmarking test on a procedure that computes the approximating probability for $n=1000$ values (using a value for $r$ that is usually, but not always, below the number of trials).

#Set up some mock data (don't count this part in speed test)
set.seed(1)
n    <- 1000
LAM  <- 40
LENG <- rpois(n, lambda = LAM)
DATA <- vector(n, mode = 'list')
for (i in 1:n) { DATA[[i]] <- runif(LENG[i]) }

#Set up procedure
PROC <- function(data, r, log = FALSE) {
  n   <- length(data)
  OUT <- rep(0, n)
  for (i in 1:n) { OUT[i] <-  prob.approx(data[[i]], r, log = log) }
  OUT }

#Benchmark the procedure
library(microbenchmark)
set.seed(1)
microbenchmark(PROC(DATA, 30))

Unit: milliseconds
           expr     min      lq     mean   median      uq     max neval
 PROC(DATA, 30) 12.0072 12.2244 12.95248 12.27545 12.6327 26.7406   100

As can be seen from the output of this benchmarking test, it took about 13 milliseconds to compute the approximating probabilities for $n=1000$ data values. Consequently, it should be possible to compute for $n=10^6$ values in about 13 seconds. While I have not used quite the same spread of vector lengths as what you describe in your question, this method ought to be computationally feasible in your problem.

Best Answer

Related Solutions

Probability Estimation – How to Efficiently Estimate Individuals with n+ Successes from Bernoulli Trials

Related Question