[Math] variance of compound binomial distributions

binomial distribution, pr.probability, st.statistics

The question below is motivated by a problem I'm observing in my experimental data.

I have m boxes, each of which is supposed to contain k molecules of mRNA. The measurement process consists of labeling all the molecules with a box-specific tag, mixing them, amplifying them to detectable levels, and deconvoluting based on the tags. Tag labeling is a lossy process with an estimated efficiency of 5-10%, and since we are essentially counting successes, a binomial model with n=k and 0.05<=p<=0.1 seems fitting. However, a plot of variance/mean against the mean does not follow the straight line expected from a binomial with constant p. If we assume that p varies, things work out well (see the bottom of the post for data).

The intuition is that we are compounding distributions. As an example, assume that m/2 boxes are sampled with p=0.05 and m/2 boxes with p=0.1. Then the distribution over all m boxes would be an 'overlap' of 2 binomials, each with the same n but a different p. Intuitively, I would expect the mean to be bounded between the means of the two distributions (as the mean is a point found between the two other points), while I would expect the variance to be larger than either of the two variances (as variance is related to the width of the distribution, and the 'overlapping' width is by definition wider than each of its components). Simulations support my argument, but I don't know how to go about formalizing it, and I don't know how to use this to assist in modeling the data (especially how to use it to estimate the variability in tagging efficiency).

What do you say?

The graph below shows real data (each point is generated from 48 boxes with equal k) versus the toy model suggested above (100 boxes with p=0.05 and 100 boxes with p=0.1, for different n).
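For readers who want to reproduce the toy model, here is a minimal sketch of how it could be simulated. It is not the code behind the graph: the function names, the seed, and the particular values of n are assumptions made for illustration, and each binomial draw is built as a sum of n Bernoulli trials so that only the standard library is needed.

```python
import random
from statistics import mean, variance


def simulate_toy_model(n, probs=(0.05, 0.10), boxes_per_p=100, rng=None):
    """Return one count per box for the toy model.

    Every box starts with n molecules; each molecule is tagged (and therefore
    counted) independently with the box's own probability p.
    """
    rng = rng or random.Random()
    counts = []
    for p in probs:
        for _ in range(boxes_per_p):
            # One Binomial(n, p) draw, written as a sum of n Bernoulli trials.
            counts.append(sum(rng.random() < p for _ in range(n)))
    return counts


def variance_to_mean(counts):
    """Sample variance divided by sample mean, the quantity on the y-axis."""
    return variance(counts) / mean(counts)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed (arbitrary choice) for repeatability
    for n in (50, 100, 200, 400):  # illustrative values of n, not from the data
        counts = simulate_toy_model(n, rng=rng)
        print(n, round(mean(counts), 2), round(variance_to_mean(counts), 2))
```

With a single shared p the ratio would stay near 1-p for every n, so a ratio that keeps growing with n is the signature of a varying p.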

[EDIT: per comments 'concatenated distribution' was changed to 'compound distribution']

[Figure: real data versus the toy model]

Best Answer

I think you want the law of total variance:

The variance of $\operatorname{Bin}(n,X)$, where $X$ is uniform on $[a,b]$, is $$\begin{aligned} \operatorname{Var}(\operatorname{Bin}(n,X)) &= E_X[\operatorname{Var}(\operatorname{Bin}(n,X)\mid X)] + \operatorname{Var}_X(E[\operatorname{Bin}(n,X)\mid X]) \\ &= E[nX(1-X)] + \operatorname{Var}(nX) \\ &= n(E[X]-E[X^2]) + n^2 \operatorname{Var}(X) \\ &= n \left( \frac{a+b}{2} - \frac{a^2+ab+b^2}{3}\right) + n^2 \frac{(b-a)^2}{12} \end{aligned}$$

For example, if $[a,b] =[1/20,1/10]$, this is $\frac{83n}{1200} + \frac{n^2}{4800} \approx 0.0692 n + 0.000208n^2$.
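The two coefficients in this example can be checked with exact rational arithmetic. The sketch below is only a sanity check of the formula above; the helper names are made up for illustration, and the only inputs are the standard moments of a uniform distribution, $E[X]=(a+b)/2$ and $E[X^2]=(a^2+ab+b^2)/3$.

```python
from fractions import Fraction


def total_variance_coefficients(mean_x, mean_x_squared):
    """Return (c1, c2) such that Var(Bin(n, X)) = c1*n + c2*n**2.

    c1 = E[X] - E[X^2] comes from E[Var(Bin(n, X) | X)],
    c2 = Var(X) = E[X^2] - E[X]^2 comes from Var(E[Bin(n, X) | X]).
    """
    c1 = mean_x - mean_x_squared
    c2 = mean_x_squared - mean_x ** 2
    return c1, c2


def uniform_moments(a, b):
    """E[X] and E[X^2] for X uniform on [a, b]."""
    return (a + b) / 2, (a * a + a * b + b * b) / 3


a, b = Fraction(1, 20), Fraction(1, 10)
c1, c2 = total_variance_coefficients(*uniform_moments(a, b))
print(c1, float(c1))  # coefficient of n, as a fraction and as a decimal
print(c2, float(c2))  # coefficient of n^2, as a fraction and as a decimal
```

Because `total_variance_coefficients` only needs $E[X]$ and $E[X^2]$, the same helper can be reused for any other distribution of $X$ whose first two moments are known.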

If $X$ takes the values $a$ and $b$ with probabilities equal to $1/2$, then

$$\begin{aligned} \operatorname{Var}(\operatorname{Bin}(n,X)) &= E_X[\operatorname{Var}(\operatorname{Bin}(n,X)\mid X)] + \operatorname{Var}_X(E[\operatorname{Bin}(n,X)\mid X]) \\ &= n\left(\frac{a(1-a)+b(1-b)}{2}\right) + n^2 \frac{(b-a)^2}{4} \end{aligned}$$

If $a=1/20$ and $b=1/10$, this is $\frac{11n}{160} + \frac{n^2}{1600} = 0.06875 n + 0.000625n^2$.
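On the question of estimating the variability in tagging efficiency: the same two steps work for any distribution of $X$ with mean $\mu$ and variance $\sigma^2$, since $E[X^2]=\sigma^2+\mu^2$ gives

$$\operatorname{Var}(\operatorname{Bin}(n,X)) = n\mu(1-\mu) + n(n-1)\sigma^2 .$$

Dividing by the mean $n\mu$ gives variance/mean $= (1-\mu) + (n-1)\sigma^2/\mu$, which grows linearly in $n$ whenever $\sigma^2>0$. Solving for $\sigma^2$ suggests a simple method-of-moments estimate. The sketch below is one possible way to do that, not a vetted estimator: it assumes $n$ (the $k$ of the question) is known and identical for all boxes, the function name is made up, and the check uses synthetic counts drawn from the two-point example rather than real data.

```python
import random
from statistics import mean, variance


def estimate_p_moments(counts, n):
    """Method-of-moments estimates of E[p] and Var(p) from per-box counts.

    Assumes every box starts with the same known number n of molecules.
    Uses Var(count) = n*mu*(1 - mu) + n*(n - 1)*sigma2, solved for sigma2.
    """
    if n < 2:
        raise ValueError("n must be at least 2 to separate the two variance terms")
    mu_hat = mean(counts) / n
    # Variance in excess of what a single binomial with p = mu_hat would give.
    excess = variance(counts) - n * mu_hat * (1 - mu_hat)
    # Sampling noise can make the excess negative; clamp the estimate at zero.
    sigma2_hat = max(excess, 0.0) / (n * (n - 1))
    return mu_hat, sigma2_hat


# Synthetic check: half the boxes have p = 0.05, half have p = 0.10,
# so the true values are mu = 0.075 and sigma2 = (0.10 - 0.05)**2 / 4.
rng = random.Random(0)  # fixed seed (arbitrary choice) for repeatability
n = 200                 # illustrative n, not taken from the data
counts = [sum(rng.random() < p for _ in range(n))
          for p in (0.05, 0.10) for _ in range(1000)]
mu_hat, sigma2_hat = estimate_p_moments(counts, n)
print(mu_hat, sigma2_hat)
```

This only recovers the first two moments of $p$; it says nothing about the shape of its distribution, and with few boxes per point (such as the 48 in the question) the variance estimate will be noisy.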