[Math] Distribution of repeated binomial processes where success probability and number of trials change each time

binomial-distribution, probability, probability-theory

I have an experiment where each "run" of the experiment has a binomial distribution. In a given run I have a number of trials $N_i$ and probability of success $p_i$. The result is the number of successes $S_i$, which is a sample from the binomial distribution. For this single run of the experiment, I know the variance is $N_i p_i(1-p_i)$.

In a different run the probability of success and the number of trials changes. Call these $N_j$ and $p_j$.

The number of trials and success probabilities are in turn drawn from their own distributions, so each $N_j$ and $p_j$ is a sample from its own distribution.

If I know the distribution of the success probabilities and the distribution of the number of trials, then what is the distribution of the entire set of runs? I'm most interested in the mean and the variance of the set of runs.

In essence, I have a set of samples all drawn from different (but related) binomial distributions. I want to know the mean and variance of this set.
I think this can be thought of as a compound distribution:
https://en.wikipedia.org/wiki/Compound_probability_distribution

For the purpose of this question, let's say that the success probabilities $p_i$ follow a Beta distribution with mean $\mu_p$ and variance $\sigma^2_p$, and the distribution of the number of trials is Gaussian: $N\sim \mathcal{N}(\mu_N,\sigma^2_N)$.
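As a concrete numerical sketch of this setup, the runs can be simulated directly. Rounding and clipping the Gaussian draw to a non-negative integer is my own assumption (a trial count must be an integer), and the parameter values are arbitrary illustrations:

```python
import random

random.seed(0)

# Sketch of the setup: N_i from a Gaussian (rounded and clipped to a
# non-negative integer -- my assumption), p_i from a Beta, then
# S_i ~ Binomial(N_i, p_i) is one run of the experiment.
mu_N, sigma_N = 50, 10      # Gaussian parameters for the trial counts
a, b = 2, 5                 # Beta(2, 5): mean 2/7 (arbitrary choice)

runs = []
for _ in range(50_000):
    N = max(0, round(random.gauss(mu_N, sigma_N)))
    p = random.betavariate(a, b)
    S = sum(random.random() < p for _ in range(N))  # one Binomial(N, p) draw
    runs.append(S)

mean_runs = sum(runs) / len(runs)
var_runs = sum((S - mean_runs) ** 2 for S in runs) / len(runs)
# mean_runs should be close to mu_N * E[p] = 50 * (2/7)
```

The empirical mean and variance of this set of runs are exactly the quantities the question asks to express in terms of the means and variances of $N$ and $p$.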

I was initially thinking to solve this as a special case of the Poisson binomial distribution, where I sum over the total number of trials and I get something like
$\sigma^2 = \sum_{i=1}^{M_{trials}}N_ip_i(1-p_i)$ for the variance and $\mu = \sum_{i=1}^{M_{trials}}N_ip_i$ for the mean. But this isn't really useful, since I have lots of different "runs" and I know the distributions of the number of trials and the success probabilities. It seems like I should be able to get something more compact: ideally, an expression for the variance of the set of runs in terms of the means and variances of $N$ and $p$.

For a set of runs, each with variance $N_i p_i(1-p_i)$, should I calculate the variance of the quantity $N_i p_i(1-p_i)$ instead of taking the sum? That would be the variance of the variance, and it doesn't really seem like the correct thing to do. I'm stuck on how I can express the sum $\sigma^2 = \sum_{i=1}^{N_{total}}N_ip_i(1-p_i)$ as something more compact when I know the distributions of $N$ and $p$.

One thing that I have been stumbling on is that my variance, $\sigma^2 = \sum_{i=1}^{N_{total}}N_ip_i(1-p_i)$ appears to be expressed as a sum of random variables: $N,p$. In reality, though, it is expressed as a sum of samples of random variables.

Best Answer

In general, suppose $p_1, p_2, \ldots, p_m \in (0,1)$ are IID realizations from some probability distribution with mean $\mu_p$ and standard deviation $\sigma_p$, and $n_1, n_2, \ldots, n_m \in \mathbb Z^+$ are IID realizations from another probability distribution with mean $\mu_n$ and standard deviation $\sigma_n$, and for each $i = 1, 2, \ldots, m$ we have random variables $$X_i \sim \operatorname{Binomial}(n_i, p_i).$$ If we are interested in the distribution of $S = \sum_{i=1}^m X_i$, then by linearity of expectation $$\operatorname{E}[S] = \sum_{i=1}^m \operatorname{E}[X_i].$$ In turn, for each $X_i$, the law of total expectation gives $$\operatorname{E}[X_i] = \operatorname{E}[\operatorname{E}[X_i \mid n_i, p_i]] = \operatorname{E}[n_i p_i] = \operatorname{E}[n_i]\operatorname{E}[p_i] = \mu_n \mu_p;$$ thus $$\operatorname{E}[S] = m\mu_n \mu_p.$$ This assumes that $n_i$ and $p_i$ are independent for each $i$; together with the IID sampling across runs, it follows that the $X_i$ are independent.

The variance calculation is done in a similar fashion: $$\operatorname{Var}[S] \overset{\text{ind}}{=} \sum_{i=1}^m \operatorname{Var}[X_i],$$ whence by the law of total variance $$\begin{align*} \operatorname{Var}[X_i] &= \operatorname{Var}[\operatorname{E}[X_i \mid n_i, p_i]] + \operatorname{E}[\operatorname{Var}[X_i \mid n_i, p_i]] \\ &= \operatorname{Var}[n_i p_i] + \operatorname{E}[n_i p_i (1-p_i)] \\ &= (\sigma_n^2 \sigma_p^2 + \sigma_n^2 \mu_p^2 + \sigma_p^2 \mu_n^2) + \mu_n \operatorname{E}[p_i(1-p_i)] \\ &= (\sigma_n^2 \sigma_p^2 + \sigma_n^2 \mu_p^2 + \sigma_p^2 \mu_n^2) + \mu_n \bigl(\mu_p - (\sigma_p^2 + \mu_p^2)\bigr). \end{align*}$$
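These moment formulas reduce to a few lines of arithmetic; here is a minimal sketch (the function name is my own, not from the post):

```python
def compound_binomial_moments(m, mu_n, var_n, mu_p, var_p):
    """Mean and variance of S = sum of m independent Binomial(n_i, p_i)
    draws, where the n_i and p_i are independently sampled from
    distributions with the given means and variances."""
    # E[S] = m * E[n] * E[p]
    mean_S = m * mu_n * mu_p
    # Var[n p] for independent n, p (product-variance identity)
    var_np = var_n * var_p + var_n * mu_p**2 + var_p * mu_n**2
    # E[n p (1 - p)] = mu_n * (mu_p - (var_p + mu_p**2))
    mean_cond_var = mu_n * (mu_p - (var_p + mu_p**2))
    # Law of total variance per run, summed over m independent runs
    var_S = m * (var_np + mean_cond_var)
    return mean_S, var_S

# Single run with n ~ Poisson(11) and p ~ Beta(7, 3):
mean_S, var_S = compound_binomial_moments(
    m=1, mu_n=11, var_n=11, mu_p=0.7, var_p=21 / 1100)
# var_S evaluates to 1001/100 = 10.01
```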

To understand the variance of $n_i p_i$, note that for two independent random variables $A$, $B$, with means and standard deviations $\mu_A, \sigma_A, \mu_B, \sigma_B$, respectively, $$\begin{align*}\operatorname{Var}[AB] &= \operatorname{E}[(AB)^2] - \operatorname{E}[AB]^2 \\ &= \operatorname{E}[A^2 B^2] - \operatorname{E}[A]^2 \operatorname{E}[B]^2 \\ &= \operatorname{E}[A^2]\operatorname{E}[B^2] - \mu_A^2 \mu_B^2 \\ &= (\operatorname{Var}[A] + \operatorname{E}[A]^2)(\operatorname{Var}[B] + \operatorname{E}[B]^2) - \mu_A^2 \mu_B^2 \\ &= (\sigma_A^2 + \mu_A^2)(\sigma_B^2 + \mu_B^2) - \mu_A^2 \mu_B^2 \\ &= \sigma_A^2 \sigma_B^2 + \sigma_A^2 \mu_B^2 + \sigma_B^2 \mu_A^2. \end{align*}$$
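A quick Monte Carlo sanity check of this product-variance identity, with two arbitrary independent Gaussians standing in for $A$ and $B$:

```python
import random

random.seed(1)

# A ~ N(2, 3^2), B ~ N(-1, 0.5^2), chosen arbitrarily for illustration
mu_a, s_a, mu_b, s_b = 2.0, 3.0, -1.0, 0.5

products = [random.gauss(mu_a, s_a) * random.gauss(mu_b, s_b)
            for _ in range(200_000)]
mean_ab = sum(products) / len(products)
var_mc = sum((x - mean_ab) ** 2 for x in products) / len(products)

# Identity: Var[AB] = sA^2 sB^2 + sA^2 mB^2 + sB^2 mA^2 = 12.25 here
var_formula = s_a**2 * s_b**2 + s_a**2 * mu_b**2 + s_b**2 * mu_a**2
```

The sample variance of the products should agree with the closed-form value up to Monte Carlo error.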

Note that my computation of the variance differs from yours. I have substantiated my result by simulating $m = 10^6$ observations of $X_i$ with $n_i \sim \operatorname{Poisson}(\lambda)$ and $p_i \sim \operatorname{Beta}(a,b)$, for $\lambda = 11$ and $(a,b) = (7,3)$. The formula above gives $\operatorname{Var}[X_i] = 1001/100$, and the simulation agrees; your expression does not match. The reason your computation does not work is that the total variance of each $X_i$ is not merely the expectation of the conditional variance of $X_i$ given $n_i$ and $p_i$; the other term in the law of total variance, which captures the variability of the conditional expectation of $X_i$, must also be included. In other words, there is variation in $X_i$ coming from the binomial variance even when $n_i$ and $p_i$ are fixed, but there is also additional variation in the location of $X_i$ arising from the fact that $n_i$ and $p_i$ are not fixed.
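That simulation check can be reproduced with the standard library alone; this sketch uses Knuth's Poisson sampler and fewer draws than the answer's $10^6$, to keep it fast:

```python
import math
import random

random.seed(2)

def poisson(lam):
    """Knuth's Poisson sampler: fine for small rates like lam = 11."""
    threshold = math.exp(-lam)
    k, prod = 0, 1.0
    while True:
        prod *= random.random()
        if prod <= threshold:
            return k
        k += 1

lam, a, b = 11, 7, 3
m = 100_000                      # fewer than the answer's 10^6 draws
xs = []
for _ in range(m):
    n = poisson(lam)
    p = random.betavariate(a, b)
    xs.append(sum(random.random() < p for _ in range(n)))  # Binomial(n, p)

mean_mc = sum(xs) / m
var_mc = sum((x - mean_mc) ** 2 for x in xs) / m

# Theoretical variance from the law-of-total-variance formula
mu_n = var_n = lam                               # Poisson moments
mu_p = a / (a + b)                               # 7/10
var_p = a * b / ((a + b) ** 2 * (a + b + 1))     # 21/1100
var_theory = (var_n * var_p + var_n * mu_p**2 + var_p * mu_n**2
              + mu_n * (mu_p - (var_p + mu_p**2)))
# var_theory = 1001/100 = 10.01; var_mc should land close to it
```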
