Solved – the CDF of the sum of weighted Bernoulli random variables

Tags: approximation, bernoulli-distribution, cumulative-distribution-function, distributions, poisson-binomial-distribution

Let's say we have a random variable $Y$ defined as the sum of $N$ Bernoulli variables $X_i$, each with a different success probability $p_i$ and a different (fixed) weight $w_i$. The weights are positive, mostly between ~0.001 and 1,000, with a handful of outliers.
If it would make the problem more tractable, it would be OK to discard both the very large and very small outliers.

Formally,
$Y = \sum_{i=1}^N w_i X_i$

where $\Pr(X_i=1)=p_i$ and $\Pr(X_i=0)=1-p_i$.

$N$ can range from ~10 to 10,000. I would like to quickly compute an approximation to $\Pr(Y \le k)$ (where $k$ is given).

The data I'm handling comes from real-life bets, each with implied odds $p_i$ and wager $w_i$. Here are plots showing the rough distributions of the probabilities and weights:

[Two plots omitted: the distribution of the probabilities ("Values of P") and the distribution of the weights.]

One way to compute the approximation is through Monte Carlo simulation, but that can get slow for large values of $N$. One can also use the Central Limit Theorem (CLT) for large $N$, but the accuracy is quite poor for small $N$ (or if there are a handful of large weights); a minimal sketch of both baselines appears further down. Finally, there are ways to improve on the CLT approach by using asymptotic expansions, e.g.:

Volkova, A. Y. (1996). A refinement of the central limit theorem for sums of independent random indicators. Theory of Probability and its Applications 40, 791-794.

However, as far as I've seen, the refined approximations only specify how to approximate the distribution of $Y$ when all weights $w_i$ are equal to 1 (i.e., the standard Poisson binomial distribution).
Is there a way to compute an answer that works for both small and large $N$, with a smaller error than a pure CLT approach, without having to resort to Monte Carlo simulation? In other words, is it possible to extend the improved closed-form approximations to allow weights not equal to 1?
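
For concreteness, here is a minimal sketch of the two baselines mentioned above (Monte Carlo and CLT). It assumes NumPy/SciPy; the example probabilities and weights are made up for illustration, not my actual bet data.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

def mc_cdf(k, p, w, n_sims=100_000):
    """Monte Carlo estimate of P(Y <= k): simulate the Bernoulli draws directly."""
    # Each row is one realisation of X_1..X_N.
    # (For very large N you would simulate in smaller batches to limit memory.)
    draws = rng.random((n_sims, len(p))) < p
    y = draws.astype(float) @ w            # Y = sum_i w_i X_i for each simulation
    return float(np.mean(y <= k))

def clt_cdf(k, p, w):
    """Normal (CLT) approximation: match the mean and variance of Y."""
    mu = np.sum(w * p)
    sigma = np.sqrt(np.sum(w**2 * p * (1 - p)))
    return float(norm.cdf(k, loc=mu, scale=sigma))

# Made-up example data, just to compare the two estimates.
p = rng.uniform(0.05, 0.95, size=200)
w = rng.lognormal(mean=0.0, sigma=1.5, size=200)
k = np.sum(w * p)                          # evaluate near the mean of Y
print(mc_cdf(k, p, w), clt_cdf(k, p, w))
```

The Monte Carlo version is the one that gets slow as $N$ and the number of simulations grow; the CLT version is the one whose error I would like to beat.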


Best Answer

One possibility is to use the saddlepoint approximation. For that we need the mgf (moment generating function) and its logarithm, the cgf (cumulant generating function). The mgf of a Bernoulli variable with parameter $p_i$ is
$$ \DeclareMathOperator{\E}{\mathbb{E}} M_i(t)=\E e^{t X_i}=(1-p_i)+p_i e^t. $$
Since the $X_i$ are independent,
$$ \E e^{tY}=\prod_i \E e^{t w_i X_i}=\prod_i M_i(t w_i), $$
so the cgf is
$$ K(t)=\log \E e^{tY}=\sum_{i=1}^N \log\left( 1-p_i+p_i e^{t w_i} \right). $$
From here we can develop the saddlepoint approximation following How does saddlepoint approximation work?.
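
Here is a minimal sketch of that recipe in Python, using the Lugannani–Rice form of the saddlepoint CDF approximation and assuming NumPy/SciPy. The function name `saddlepoint_cdf`, the clipping of the $p_i$ away from 0 and 1, the bracketing of the root find, and the algebraically rearranged forms of $K'$ and $K''$ are implementation choices of this sketch, not part of the derivation above.

```python
import numpy as np
from scipy.optimize import brentq
from scipy.stats import norm

def saddlepoint_cdf(k, p, w):
    """Lugannani-Rice saddlepoint approximation to P(Y <= k)
    for Y = sum_i w_i X_i with X_i ~ Bernoulli(p_i) independent.
    Y is discrete, so this is a smooth approximation to a step function."""
    p = np.clip(np.asarray(p, float), 1e-12, 1 - 1e-12)  # keep 0 < p_i < 1 so the cgf behaves
    w = np.asarray(w, float)

    # Handle k outside or on the boundary of the support [0, sum(w)].
    if k <= 0:
        return float(np.prod(1 - p)) if k == 0 else 0.0
    if k >= np.sum(w):
        return 1.0

    # cgf K(t) = sum_i log(1 - p_i + p_i e^{t w_i}), written via logaddexp for stability.
    def K(t):
        return np.sum(np.logaddexp(np.log1p(-p), np.log(p) + t * w))

    # K'(t) = sum_i p_i w_i e^{t w_i} / (1 - p_i + p_i e^{t w_i}), rearranged to avoid overflow.
    def K1(t):
        return np.sum(w * p / (p + (1 - p) * np.exp(-t * w)))

    # K''(t) = sum_i p_i (1 - p_i) w_i^2 e^{t w_i} / (1 - p_i + p_i e^{t w_i})^2.
    def K2(t):
        denom = (1 - p) * np.exp(-t * w / 2) + p * np.exp(t * w / 2)
        return np.sum(p * (1 - p) * w**2 / denom**2)

    mean = np.sum(p * w)
    if abs(k - mean) < 1e-9 * mean:
        return 0.5  # the Lugannani-Rice formula is singular at the mean; fall back to 1/2

    # Solve K'(s) = k for the saddlepoint s; K' is strictly increasing in s.
    lo, hi = -1.0, 1.0
    while K1(lo) > k:
        lo *= 2.0
    while K1(hi) < k:
        hi *= 2.0
    s = brentq(lambda t: K1(t) - k, lo, hi)

    # Lugannani-Rice: Phi(w_hat) + phi(w_hat) * (1/w_hat - 1/u_hat).
    w_hat = np.sign(s) * np.sqrt(2.0 * (s * k - K(s)))
    u_hat = s * np.sqrt(K2(s))
    return float(norm.cdf(w_hat) + norm.pdf(w_hat) * (1.0 / w_hat - 1.0 / u_hat))
```

Each call costs one scalar root find plus a handful of $O(N)$ sums, so it stays cheap even for $N$ in the thousands; it is worth spot-checking against Monte Carlo on your own data, especially in the far tails.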