Probability Estimation – How to Efficiently Estimate the Number of Individuals with n+ Successes from Bernoulli Trials

bernoulli-distribution, estimation, poisson-binomial-distribution, probability

I have a situation where I need to estimate the number of persons exposed to a given event n or more times. For each person, I have an array of probabilities representing independent Bernoulli trials where a successful trial means 1 exposure of that individual to the event in question.

My data, for instance, looks something like this:

Person A: [0.34, 0.89, 0.01, 0.50]
Person B: [0.22, 0.45, 0.60]
Person C: [0.10, 0.10, 0.99, 0.32, 0.22]

To make it more concrete, suppose I knew the outcomes of the trials above to be:

Person A: [0, 1, 1, 1]
Person B: [0, 1, 1]
Person C: [0, 0, 1, 0, 0]

…and the n I am calculating against is 2, then my answer is "2 people" because Person A and Person B were exposed to the event 2 or more times.

I know that the number of successes in a series of independent Bernoulli trials (with different probabilities) follows the Poisson binomial distribution, so for any given person I could use standard methods for that distribution.

However, my understanding is that most such methods run into trouble or become untenable when the number of probabilities is large. In my case, the length of the probability array for a given person may vary widely (say, 2 to 200 or more), and the number of persons can be prohibitively large (millions). Additionally, I'm not just estimating "the number of events," but rather "the number of people with n+ events."

Is there a way to efficiently estimate this (i.e., without Monte Carlo), even if the estimate has a well-understood error behavior?
For instance, at the moment I'm running some simulations where I simply sum up the probabilities per person and count the persons where that sum exceeds my n. This is sometimes close to the result a Monte Carlo simulation would give, but I don't yet understand the error behavior, or whether there is a better method.
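
For reference, here is a small base-R sketch of the heuristic just described (sum the probabilities per person and threshold), next to a crude Monte Carlo check on the three example persons above; the object names are purely illustrative.

#Example probability vectors from above
persons <- list(
  A = c(0.34, 0.89, 0.01, 0.50),
  B = c(0.22, 0.45, 0.60),
  C = c(0.10, 0.10, 0.99, 0.32, 0.22))
r <- 2

#Heuristic: count persons whose summed probabilities reach r
sum(sapply(persons, sum) >= r)

#Crude Monte Carlo estimate of the expected number of such persons
set.seed(1)
SIMS <- replicate(1e4, sum(sapply(persons, function(p) sum(runif(length(p)) < p) >= r)))
mean(SIMS)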

Best Answer

I'm not convinced that the stipulated size of your data is too large for exact computation using the Poisson-binomial distribution. Nevertheless, since you are looking for a reasonable approximation method, you will probably have to fall back on the normal approximation from the CLT (which ought to be slightly better than what you are doing now). Alternatively, you could consider using a hybrid method that I'll explain below.
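
For context on the exact option: the Poisson-binomial PMF can be computed exactly by successive convolution at $O(k^2)$ cost per person, which is often workable for vectors of a few hundred trials. Here is a minimal sketch (the function name poisbin.pmf is just illustrative, not from any package).

#Exact Poisson-binomial PMF by successive convolution
poisbin.pmf <- function(theta) {
  pmf <- 1                                                     #Distribution of zero trials: P(X = 0) = 1
  for (p in theta) { pmf <- c(pmf, 0)*(1-p) + c(0, pmf)*p }    #Convolve in one Bernoulli(p) trial at a time
  pmf }

#Exact P(X >= 2) for Person A from the question
sum(poisbin.pmf(c(0.34, 0.89, 0.01, 0.50))[-(1:2)])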

Suppose we have a Poisson-binomial random variable $X \sim \text{PoisBin}(\boldsymbol{\theta})$ with probabilities $\boldsymbol{\theta} = (\theta_1,...,\theta_k)$ for $k$ independent trials. We can write this as $X = \sum_i U_i$ where $U_1,...,U_k$ are independent with $U_i \sim \text{Bern}(\theta_i)$, which allows us to easily compute the mean and variance:

$$\begin{align} \mathbb{E}(X) &= \mathbb{E} \bigg( \sum_{i=1}^k U_i \bigg) \\[6pt] &= \sum_{i=1}^k \mathbb{E} ( U_i ) \\[6pt] &= \sum_{i=1}^k \theta_i, \\[6pt] \mathbb{V}(X) &= \mathbb{V} \bigg( \sum_{i=1}^k U_i \bigg) \\[6pt] &= \sum_{i=1}^k \mathbb{V} ( U_i ) \\[6pt] &= \sum_{i=1}^k \theta_i(1-\theta_i). \\[6pt] \end{align}$$

Now, if $k$ is not too small, we can invoke the CLT to approximate the distribution by a normal distribution with the same mean and variance, giving the resulting approximation:

$$X \overset{\text{approx}}{\sim} \text{N}(\mu, \sigma^2) \quad \quad \quad \quad \quad \mu \equiv \sum_{i=1}^k \theta_i \quad \quad \quad \quad \quad \sigma^2 \equiv \sum_{i=1}^k \theta_i(1-\theta_i).$$
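
To make this concrete, here is a small sketch for Person A from the question with cut-off $r = 2$, using the continuity-corrected normal tail (one of the variants discussed just below); the particular person and cut-off are just for illustration.

#Normal approximation for Person A with r = 2
theta <- c(0.34, 0.89, 0.01, 0.50)
MEAN  <- sum(theta)                                              #mu = 1.74
VAR   <- sum(theta*(1-theta))                                    #sigma^2 = 0.5822
pnorm(2 - 0.5, mean = MEAN, sd = sqrt(VAR), lower.tail = FALSE)  #Continuity-corrected P(X >= 2)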

There are various ways you can obtain the specific approximating distribution you want to use (e.g., using continuous approximation with or without continuity correction, discrete approximation, etc.) and these will give you different formulae for the approximate probability of interest. I quite like to use the approximation obtained by taking the density at the exact discrete points and then normalising, which gives:

$$\mathbb{P}(X \geqslant r) \approx \frac{\sum_{x=r}^k \text{N}(x|\mu, \sigma^2)}{\sum_{x=0}^k \text{N}(x|\mu, \sigma^2)}.$$

(Note here that I use the notation $r$ instead of $n$ for the cut-off value of interest; it is bad practice to use the latter as notation, since it is the standard notation for the number of data points in the analysis.) It ought to be computationally feasible to compute this approximate probability for each of your rows. Each computation requires you first to compute $\mu$ and $\sigma^2$ from the above formulae and then compute the approximate probability.


Implementation in R: You can implement this approximation in R quite simply using the existing facilities for the normal distribution. It is best to undertake the computations in log-space for accuracy. Here is a function that will compute the approximate probability/log-probability for an input probability vector theta and input cut-off value r (with an option log to return the result as a log-probability instead of a probability).

#Create a function to compute the approximate probability
#(uses logSumExp from the matrixStats package)
prob.approx <- function(theta, r, log = FALSE) {
  k <- length(theta)
  if (r > k) { if (log) { return(-Inf) } else { return(0) } }
  MEAN <- sum(theta)                                           #Mean of the Poisson-binomial variable
  VAR  <- sum(theta*(1-theta))                                 #Variance of the Poisson-binomial variable
  LOGS <- dnorm(0:k, mean = MEAN, sd = sqrt(VAR), log = TRUE)  #Log-density at the support points 0,...,k
  LOGP <- matrixStats::logSumExp(LOGS[(r+1):(k+1)]) - matrixStats::logSumExp(LOGS)  #Normalised log tail probability
  if (log) { LOGP } else { exp(LOGP) } }
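
As a quick usage check (my own illustration, not part of the benchmark below), the function can be applied to Person A from the question with $r = 2$:

#Approximate P(X >= 2) for Person A
prob.approx(c(0.34, 0.89, 0.01, 0.50), r = 2)
prob.approx(c(0.34, 0.89, 0.01, 0.50), r = 2, log = TRUE)   #Same value as a log-probability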

We can experiment to see how fast this runs using the microbenchmark package. First I will set up some mock data where the number of elements in the probability vectors is Poisson distributed with a mean of 40 values. Then I will run a benchmarking test on a procedure that computes the approximating probability for $n=1000$ values (using a value for $r$ that is usually, but not always, below the number of trials).

#Set up some mock data (don't count this part in speed test)
set.seed(1)
n    <- 1000
LAM  <- 40
LENG <- rpois(n, lambda = LAM)
DATA <- vector(mode = 'list', length = n)
for (i in 1:n) { DATA[[i]] <- runif(LENG[i]) }

#Set up procedure
PROC <- function(data, r, log = FALSE) {
  n   <- length(data)
  OUT <- rep(0, n)
  for (i in 1:n) { OUT[i] <-  prob.approx(data[[i]], r, log = log) }
  OUT }

#Benchmark the procedure
library(microbenchmark)
set.seed(1)
microbenchmark(PROC(DATA, 30))

Unit: milliseconds
           expr     min      lq     mean   median      uq     max neval
 PROC(DATA, 30) 12.0072 12.2244 12.95248 12.27545 12.6327 26.7406   100

As can be seen from the output of this benchmarking test, it took about 13 milliseconds to compute the approximating probabilities for $n=1000$ data values. Consequently, it should be possible to compute them for $n=10^6$ values in about 13 seconds. While I have not used quite the same spread of vector lengths as you describe in your question, this method ought to be computationally feasible for your problem.
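
Finally, to connect back to the quantity you actually want: the count of persons with $r$ or more exposures is itself a sum of independent Bernoulli indicators (one per person), so you can estimate it by summing the per-person probabilities, with a standard deviation obtained in the same way as above. A minimal sketch using the PROC output:

#Per-person approximate probabilities of 30+ exposures
PROBS <- PROC(DATA, 30)

#Expected number of persons with 30+ exposures, and its standard deviation
sum(PROBS)
sqrt(sum(PROBS*(1-PROBS)))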