Let's say we have a random variable $Y$ defined as the sum of $N$ Bernoulli variables $X_i$, each with a different, success probability $p_i$ and a different (fixed) weight $w_i$. The weights are positive and between ~0.001-1,000 with a handful of outliers.
If it would make the problem more tractable, it would be OK to discard both the very large and very small outliers.
Formally,
$Y = \sum X_i w_i$
Where $\Pr(X_i=1)=p_i$ and $\Pr(X_i=0)=1-p_i$.
$N$ can range from ~10-10,000. I would like to quickly compute a approximation to $\Pr(Y<=k)$ (where $k$ is given).
The data I'm handling comes from real life bets, each with implied odds $p_i$ and wager $w_i$. Here are plots showing the rough distribution of the probabilities and weights:
One way to compute the approximation is through Monte Carlo simulation, but that can get slow for large values of $N$. One can also use the Central Limit Theorem to compute this for large $N$, but the accuracy is quite poor for small $N$ (or if there are a handful of large weights). Finally, there are ways to improve on the CLT approach by using asymptoptic expansions, e.g.:
Volkova, A. Y. (1996). A refinement of the central limit theorem for sums of independent random indicators. Theory of Probability and its Applications 40, 791-794.
However, as far as I've seen, the refined approximations only specify how to compute $Y$ if all weights $W_i$ are equal to 1 (a standard Poisson Binomial distribution).
Is there a way to compute an answer that works for small and large $N$ with a smaller error than a pure CLT approach without having to resort to Monte Carlo Simulation? In other words, is it possible to extend the improved closed-form approximations to allow weights not equal to 1?
Some related questions that don't quite answer mine:
- CLT can be used for weighted sum of different Bernoulli variables? (Pure CLT is too inaccurate)
- How to find the distribution of the weighted sum of independent Bernoulli random variables for positive non-integer weights (No answer)
- https://math.stackexchange.com/questions/1546366/distribution-of-weighted-sum-of-bernoulli-rvs (The bounds mentioned in [Raghavan, 1988] are very, very wide for small N or when computing the CDF near the mean)
Best Answer
One possibility is to use the saddlepoint approximation. For that we need the mgf (moment generating function) and its logarithm the cgf (cumulant generating function.) The mgf of a Bernoulli variable with parameter $p_i$ is $$ \DeclareMathOperator{\E}{\mathbb{E}} M_i(t)=\E e^{t X_i}=(1-p_i)+p_i e^t $$ Then $$ \E e^{tY}=\prod_i \E e^{t w_i}=\prod_i M_i(t w_i) $$ and the cgf is $$ K(t)=\sum_{i=1}^n \log\left( 1-p_i+e^{t w_i} \right) $$ Then we can develop the saddlepoint approximation following How does saddlepoint approximation work?.