Bernoulli Random Variables – How to Efficiently Model the Sum?

Tags: binomial-distribution, distributions, poisson-binomial-distribution, r, random-variable

I am modeling a random variable ($Y$) which is the sum of roughly 15,000–40,000 independent Bernoulli random variables ($X_i$), each with a different success probability ($p_i$). Formally, $Y=\sum X_i$ where $\Pr(X_i=1)=p_i$ and $\Pr(X_i=0)=1-p_i$.

I am interested in quickly answering queries such as $\Pr(Y\leq k)$ (where $k$ is given).

Currently, I use random simulations to answer such queries. I randomly draw each $X_i$ according to its $p_i$, then sum all $X_i$ values to get $Y'$. I repeat this process a few thousand times and return the fraction of simulations in which $Y'\leq k$.
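In R, this simulation approach might look like the following minimal sketch (the vector `p` of success probabilities, the threshold `k`, and the number of simulations are assumptions for illustration):

```r
# Monte Carlo estimate of Pr(Y <= k): draw each X_i with its own p_i,
# sum to get Y', and report the fraction of simulations with Y' <= k.
simulate_cdf <- function(p, k, n_sim = 5000) {
  y <- replicate(n_sim, sum(rbinom(length(p), size = 1, prob = p)))
  mean(y <= k)
}

# Hypothetical usage with made-up probabilities:
# p <- runif(20000, min = 0, max = 0.01)
# simulate_cdf(p, k = 80)
```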

Obviously, this is not totally accurate (although accuracy greatly increases as the number of simulations increases). Also, it seems I have enough information about the distribution to avoid using simulations altogether. Can you think of a reasonable way to get the exact probability $\Pr(Y\leq k)$?

p.s.

I use Perl & R.

EDIT

Following the responses, I thought some clarification might be needed. I will briefly describe the setting of my problem. Given is a circular genome with circumference $c$ and a set of $n$ ranges mapped to it. For example, $c = 3\times 10^9$ and ranges $= \{[100,200], [50,1000], [3\times 10^9-1, 1000], \dots\}$. Note that all ranges are closed (both ends are inclusive), and that we only deal with integers (whole units).

I am looking for regions on the circle that are undercovered by the given $n$ mapped ranges. To test whether a given range of length $x$ on the circle is undercovered, I test the hypothesis that the $n$ ranges are mapped randomly. Under this null, the probability that a mapped range of length $q > x$ fully covers the given range of length $x$ is $(q-x)/c$. This probability becomes quite small when $c$ is large and/or $q$ is small. What I am interested in is the number of ranges (out of $n$) that cover the given range; this is how $Y$ is formed.
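Under this setup, the vector of success probabilities follows directly from the range lengths. A minimal R sketch, assuming `q` is the vector of mapped-range lengths, `x` the length of the tested region, and `c` the circumference, and using the $(q-x)/c$ formula above:

```r
# Per-range probability of fully covering a tested region of length x,
# using the (q - x)/c formula above; ranges with q <= x get probability 0.
coverage_probs <- function(q, x, c) {
  pmax(q - x, 0) / c
}

# Hypothetical usage:
# p <- coverage_probs(q = range_lengths, x = 500, c = 3e9)
```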

I test my null hypothesis against a one-sided alternative (undercoverage). Also note that I am testing multiple hypotheses (different $x$ lengths) and make sure to correct for this.
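For the multiple-testing correction, base R's `p.adjust` can be applied to the vector of one-sided p-values; the placeholder values and the choice of method below are illustrative assumptions, not part of the original setup:

```r
# Hypothetical example: adjust one-sided p-values (one per tested region)
# for multiple testing. Which method is appropriate depends on whether you
# want to control the FDR or the family-wise error rate.
p_values  <- c(0.001, 0.04, 0.20, 0.0005)               # placeholder values
p_adj_bh  <- p.adjust(p_values, method = "BH")          # Benjamini-Hochberg (FDR)
p_adj_bon <- p.adjust(p_values, method = "bonferroni")  # Bonferroni (FWER)
```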

Best Answer

If it often resembles a Poisson, have you tried approximating it by a Poisson with parameter $\lambda = \sum p_i$?
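In R this approximation is essentially a one-liner; a sketch, assuming `p` is the vector of success probabilities:

```r
# Poisson approximation of Pr(Y <= k), assuming p is the vector of
# success probabilities p_i and k is the query threshold.
approx_cdf <- function(p, k) {
  ppois(k, lambda = sum(p))
}
```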

EDIT: I've found a theoretical result to justify this, as well as a name for the distribution of $Y$: it's called the Poisson binomial distribution. Le Cam's inequality tells you how closely it is approximated by the distribution of a Poisson with parameter $\lambda = \sum p_i$. It tells you that the quality of this approximation is governed by the sum of the squares of the $p_i$s, to paraphrase Steele (1994). So if all your $p_i$s are reasonably small, as it now appears they are, it should be a pretty good approximation.

EDIT 2: How small is 'reasonably small'? Well, that depends on how good you need the approximation to be! The Wikipedia article on Le Cam's theorem gives the precise form of the result I referred to above: the sum of the absolute differences between the probability mass function (pmf) of $Y$ and the pmf of the above Poisson distribution is no more than twice the sum of the squares of the $p_i$s. Another result from Le Cam (1960) may be easier to use: this sum is also no more than 18 times the largest $p_i$. There are quite a few more such results... see Serfling (1978) for one review.
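If you want to check how tight the approximation is for your particular data, both bounds are trivial to compute; a sketch, again assuming `p` holds your probabilities:

```r
# Upper bounds on the sum of absolute differences between the pmf of Y
# and the Poisson(sum(p)) pmf, per the results cited above.
lecam_bound_sq  <- function(p) 2 * sum(p^2)   # twice the sum of squared p_i
lecam_bound_max <- function(p) 18 * max(p)    # 18 times the largest p_i
```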