I'm not convinced that the stipulated size of your data is too large for exact computation using the Poisson-binomial distribution. Nevertheless, since you are looking for a reasonable approximation method, you are probably going to have to fall back on the normal approximation from the CLT (which ought to be slightly better than what you are doing now). Alternatively, you could consider using a hybrid method that I'll explain below.
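For reference, the exact tail probability can be computed by convolving the trials into the distribution one at a time; here is a minimal sketch of that approach (the function name poisbin.exact is mine, not from any package). It is quadratic in the number of trials, but that is cheap for short probability vectors.
#Exact P(X >= r) by iterative convolution (illustrative sketch)
poisbin.exact <- function(theta, r) {
  PMF <- 1                                                    #distribution of X with no trials
  for (p in theta) { PMF <- c(PMF, 0)*(1-p) + c(0, PMF)*p }   #fold in one trial at a time
  sum(PMF[(r+1):length(PMF)]) }                               #tail sum P(X >= r)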
Suppose we have a Poisson-binomial random variable $X \sim \text{PoisBin}(\boldsymbol{\theta})$ with probabilities $\boldsymbol{\theta} = (\theta_1,...,\theta_k)$ for $k$ independent trials. We can write this as $X = \sum_{i=1}^k U_i$, where $U_1,...,U_k$ are independent with $U_i \sim \text{Bern}(\theta_i)$, which allows us to easily compute the mean and variance:
$$\begin{align}
\mathbb{E}(X)
&= \mathbb{E} \bigg( \sum_{i=1}^k U_i \bigg) \\[6pt]
&= \sum_{i=1}^k \mathbb{E} ( U_i ) \\[6pt]
&= \sum_{i=1}^k \theta_i, \\[6pt]
\mathbb{V}(X)
&= \mathbb{V} \bigg( \sum_{i=1}^k U_i \bigg) \\[6pt]
&= \sum_{i=1}^k \mathbb{V} ( U_i ) \\[6pt]
&= \sum_{i=1}^k \theta_i(1-\theta_i). \\[6pt]
\end{align}$$
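As a quick illustration of these formulae (with made-up values, just to make them concrete), consider three trials with probabilities 0.2, 0.5 and 0.9:
#Tiny worked example of the mean/variance formulae (illustrative values)
theta <- c(0.2, 0.5, 0.9)
sum(theta)                  #mean: 0.2 + 0.5 + 0.9 = 1.6
sum(theta*(1-theta))        #variance: 0.16 + 0.25 + 0.09 = 0.5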
Now, if $k$ is not too small, we can invoke the CLT to approximate the distribution by a normal distribution with the same mean and variance, which gives the approximation:
$$X \overset{\text{approx}}{\sim} \text{N}(\mu, \sigma^2)
\quad \quad \quad \quad \quad
\mu \equiv \sum_{i=1}^k \theta_i
\quad \quad \quad \quad \quad
\sigma^2 \equiv \sum_{i=1}^k \theta_i(1-\theta_i).$$
There are various ways you can obtain the specific approximating distribution you want to use (e.g., a continuous approximation with or without continuity correction, a discrete approximation, etc.), and these will give you different formulae for the approximate probability of interest. I quite like the approximation obtained by taking the normal density at the exact discrete support points and then normalising, which gives:
$$\mathbb{P}(X \geqslant r)
\approx \frac{\sum_{x=r}^k \text{N}(x|\mu, \sigma^2)}{\sum_{x=0}^k \text{N}(x|\mu, \sigma^2)}.$$
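For comparison, the standard continuity-corrected form of the normal approximation mentioned above would instead give:
$$\mathbb{P}(X \geqslant r) \approx 1 - \Phi \bigg( \frac{r - \tfrac{1}{2} - \mu}{\sigma} \bigg),$$
where $\Phi$ denotes the standard normal distribution function.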
(Note here that I use the notation $r$ instead of $n$ for the cut-off value of interest; it is bad practice to use the latter as notation, since it is the standard notation for the number of data points in the analysis.) It ought to be computationally feasible to compute this approximate probability for each of your rows. Each computation requires you to first compute $\mu$ and $\sigma^2$ from the above formulae and then compute the approximate probability from the expression above.
Implementation in R: You can implement this approximation quite simply using the existing facilities for the normal distribution. It is best to undertake the computations in log-space for accuracy. Here is a function that will compute the approximate probability/log-probability for an input probability vector theta and an input cut-off value r (with an option log to return the result as a log-probability instead of a probability).
#Create a function to compute the approximate probability
prob.approx <- function(theta, r, log = FALSE) {
  k <- length(theta)
  if (r > k) { if (log) { return(-Inf) } else { return(0) } }
  MEAN <- sum(theta)
  VAR  <- sum(theta*(1-theta))
  LOGS <- dnorm(0:k, mean = MEAN, sd = sqrt(VAR), log = TRUE)
  LOGP <- matrixStats::logSumExp(LOGS[(r+1):(k+1)]) - matrixStats::logSumExp(LOGS)
  if (log) { LOGP } else { exp(LOGP) } }
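As a quick usage example (with an arbitrary probability vector, purely for illustration; this requires the matrixStats package used above):
#Example usage of the function (illustrative values only)
set.seed(1)
THETA <- runif(40)
prob.approx(THETA, r = 25)               #approximate P(X >= 25)
prob.approx(THETA, r = 25, log = TRUE)   #same result as a log-probability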
We can experiment to see how fast this runs using the microbenchmark package. First I will set up some mock data where the number of elements in each probability vector is Poisson distributed with a mean of 40. Then I will run a benchmarking test on a procedure that computes the approximating probability for $n=1000$ values (using a value for $r$ that is usually, but not always, below the number of trials).
#Set up some mock data (don't count this part in speed test)
set.seed(1)
n <- 1000
LAM <- 40
LENG <- rpois(n, lambda = LAM)
DATA <- vector(n, mode = 'list')
for (i in 1:n) { DATA[[i]] <- runif(LENG[i]) }
#Set up procedure
PROC <- function(data, r, log = FALSE) {
  n <- length(data)
  OUT <- rep(0, n)
  for (i in 1:n) { OUT[i] <- prob.approx(data[[i]], r, log = log) }
  OUT }
#Benchmark the procedure
library(microbenchmark)
set.seed(1)
microbenchmark(PROC(DATA, 30))
Unit: milliseconds
expr min lq mean median uq max neval
PROC(DATA, 30) 12.0072 12.2244 12.95248 12.27545 12.6327 26.7406 100
As can be seen from the output of this benchmarking test, it took about 13 milliseconds to compute the approximating probabilities for $n=1000$ data values. Consequently, it should be possible to compute them for $n=10^6$ values in about 13 seconds. While I have not used quite the same spread of vector lengths as you describe in your question, this method ought to be computationally feasible for your problem.
Best Answer
I wanted to sum up what I ultimately found works well for this case and roll up what has been discussed in some of the comments.
Since I basically want to know how many events occurred out of a series of independent Bernoulli trials, we can model this as a Poisson Binomial distribution.
The mean of that distribution (which is my estimate) is the sum of all the probabilities.
The variance of that distribution is also known: the sum of all $p_i(1-p_i)$.
That already gives me the standard deviation (the square root of the variance), so I've gotten a little bit of what the stakeholder wants.
We know that the Poisson Binomial often approaches a normal distribution as the number of parameters (constituent probabilities from the trials) grows, and luckily, because of the nature of my problem, I will almost always have a large number of parameters. When that's the case, we can get an interval by simply taking the mean +/- 1.96 * sdev.
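In R, that boils down to a couple of lines; here is a minimal sketch, where p is a stand-in for the per-trial probabilities of one widget batch:
#Moment-based estimate and normal-approximation interval (illustrative sketch)
p    <- runif(40)                     #stand-in probabilities for one batch
EST  <- sum(p)                        #Poisson Binomial mean
SDEV <- sqrt(sum(p*(1-p)))            #Poisson Binomial standard deviation
c(lower = EST - 1.96*SDEV, estimate = EST, upper = EST + 1.96*SDEV)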
That's not an assumption I can just make easily, though, so I did a bit of modelling to find out what the typical probabilities are for a given widget batch (the example scenario above), and what the more extreme probability distributions look like, each at the different likely parameter/probability sizes.
I then calculated the Poisson Binomial mean, sdev, and 95% intervals using the approaches noted and ran Monte Carlo simulations against them, doing this many times (with different draws from the middle and the extremes) to try to cover as much of the range of probability distributions I could run into. I compared the mean, sdev, and interval estimates against those retrieved from Monte Carlo by their percent error (treating the Monte Carlo result as the "right" answer).
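For anyone wanting to replicate that kind of check, here is a rough sketch of the comparison (illustrative only, not the exact code I used):
#Sketch of the Monte Carlo check: simulate a batch many times and compare
set.seed(1)
p    <- runif(40)                                           #stand-in batch probabilities
SIMS <- replicate(1e5, sum(rbinom(length(p), size = 1, prob = p)))
MC   <- c(mean = mean(SIMS), sdev = sd(SIMS), quantile(SIMS, c(0.025, 0.975)))
PB   <- c(mean = sum(p), sdev = sqrt(sum(p*(1-p))),
          sum(p) + c(-1.96, 1.96)*sqrt(sum(p*(1-p))))
round(rbind(monte.carlo = MC, pois.binom = PB), 3)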
Long story short, the mean and sdev values line up with the Poisson Binomial quite well. And the confidence intervals, while possibly up to 5% off in the more extreme cases, are well within our business need for low parameter sizes, and have almost no error for higher parameter sizes (which covers most of our cases).
So ultimately I used what I could from the known moments of the Poisson Binomial distribution directly, and then backed those into intervals assuming normality (which works well enough for our specific use case).