Distributions – Sum of Beta-Bernoulli Variables

Tags: bernoulli-distribution, bernoulli-process, beta-distribution, beta-binomial-distribution, binomial-distribution

Assume you have $x_i \sim \operatorname{Bernoulli}(p_i)$ with $p_i \sim \operatorname{Beta}(\alpha,\beta)$.

I am exploring the distribution of $Z = X_1 + \dots + X_n$.

According to this page, it is
$Z \sim \operatorname{BetaBinomial}(n,\alpha,\beta)$

but according to this page, with simulation, it is $Z \sim \operatorname{Binomial}(n,\frac{\alpha}{\alpha+\beta})$

Two different distributions. Which one is correct?

Equally important: what are the assumptions behind each, so that I know when each applies and can simulate either one in the right context? Thanks!

Best Answer

Short summary: if the $p_i$s are independent, it's the binomial. If the $p_i$s are all equal, it's the beta-binomial.

By $X_i \sim \textrm{Bernoulli}(p_i)$, you must mean the conditional distribution $X_i\mid p_i \sim \textrm{Bernoulli}(p_i)$. The marginal distribution of $X_i$ (that is, the distribution obtained by averaging over different values of $p_i$s) is obtained, e.g., by noting that $X_i$ is Bernoulli, and computing the expectation using the tower law: \begin{equation} \mathbb{E}(X_i) = \mathbb{E}(\mathbb{E}(X_i \mid p_i)) = \mathbb{E}(p_i) = \frac{\alpha}{\alpha+\beta}. \end{equation} So, $X_i \sim \textrm{Bernoulli}(\frac{\alpha}{\alpha+\beta})$, for all $i$. However, this does not yet determine the distribution of $Z$ as you have not specified enough information to deduce the joint distribution of $X_i$s. Two additional things are needed:

  1. The conditional distribution of $X_i$s given the $p_i$s, $p(X_1,\ldots,X_n \mid p_1,\ldots,p_n)$.
  2. The joint distribution of the $p_i$s: $p(p_1,\ldots,p_n)$.
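The marginal result derived above can be checked by simulation. This is a minimal sketch; the parameter values $\alpha = 2$, $\beta = 5$ are arbitrary choices for illustration.

```python
# Sketch: marginally, X_i ~ Bernoulli(alpha / (alpha + beta)).
# alpha = 2, beta = 5 are illustrative values, not from the question.
import numpy as np

rng = np.random.default_rng(0)
alpha, beta, n_sim = 2.0, 5.0, 200_000

p = rng.beta(alpha, beta, size=n_sim)  # p_i ~ Beta(alpha, beta)
x = rng.binomial(1, p)                 # X_i | p_i ~ Bernoulli(p_i)

print(x.mean())  # compare to alpha / (alpha + beta) = 2/7
```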

For the first part, I assume you intend the $X_i$s to be conditionally independent given the $p_i$s. The difference between the two distributions mentioned in your question stems from the second point.

If we assume the $p_i$s to be mutually independent, then the $X_i$s will be mutually independent, too, as each $X_i$ depends on only one $p_i$ and the $X_i$s are conditionally independent given the $p_i$s. Then, $Z$ is just a sum of i.i.d. Bernoulli random variables. But this is the definition of the binomial distribution, and thus indeed \begin{equation} Z \sim \textrm{Binomial}\left(n,\frac{\alpha}{\alpha+\beta}\right). \end{equation} Note that in this case it did not make much sense to define the $p_i$s in the first place: each $p_i$ influences only the corresponding $X_i$, so nothing is learned about any $p_i$ except via the value of its own $X_i$. Thus, the $p_i$s are superfluous in the sense that exactly the same model would have been easier to specify by simply stating that the $X_i$s are independent Bernoulli random variables with parameter $\alpha/(\alpha+\beta)$.
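A simulation of this independent-$p_i$ case can confirm that $Z$ has the binomial mean and variance (a sketch, with illustrative values $\alpha = 2$, $\beta = 5$, $n = 10$):

```python
# Sketch: with an independent p_i drawn for each trial,
# Z = X_1 + ... + X_n matches Binomial(n, alpha / (alpha + beta)).
# alpha = 2, beta = 5, n = 10 are illustrative values.
import numpy as np

rng = np.random.default_rng(1)
alpha, beta, n, n_sim = 2.0, 5.0, 10, 200_000
mu = alpha / (alpha + beta)

p = rng.beta(alpha, beta, size=(n_sim, n))  # a fresh p_i for every trial
z = rng.binomial(1, p).sum(axis=1)          # Z per simulated experiment

# Compare to the Binomial(n, mu) moments: mean n*mu, variance n*mu*(1-mu).
print(z.mean(), z.var())
```

The key line is the shape `(n_sim, n)`: every one of the $n$ trials in every experiment gets its own beta draw, which is exactly the independence assumption.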

The beta-binomial distribution is instead defined so that there is only one $p$ drawn from the beta distribution, and all $X_i$s are Bernoulli with this common $p$. This arises as a special case of your setup by specifying, in point 2 above, a joint distribution under which the $p_i$s are all equal to a single $\operatorname{Beta}(\alpha,\beta)$ draw.

By defining some other dependence structures for the $p_i$s (not independent, but not constrained to be equal either), other distributions for $Z$ would be obtained, but I don't know if any of these have special names.
