Bayesian Inference – Differences Between Prior Distribution and Prior Predictive Distribution

bayesian, data-mining, hierarchical-bayesian, inference, machine-learning

While studying Bayesian statistics, I am having trouble understanding the difference between the prior distribution and the prior predictive distribution. The prior distribution is reasonably clear to me, but I find the purpose of the prior predictive distribution vague, and I don't see how it differs from the prior distribution.

Best Answer

Predictive here means predictive for observations. The prior distribution is a distribution for the parameters whereas the prior predictive distribution is a distribution for the observations.

If $X$ denotes the observations and we use the model (or likelihood) $p(x \mid \theta)$ for $\theta \in \Theta$, then a prior distribution is a distribution for $\theta$, for example $p_\beta(\theta)$, where $\beta$ is a set of hyperparameters. Note that $\beta$ appears as a subscript rather than being conditioned on: the hyperparameters are considered fixed. This is not the case in hierarchical models, but that is not the point here.

The prior predictive distribution is the distribution of $X$ "averaged" over all possible values of $\theta$:

\begin{align*} p_\beta(x) &= \int_\Theta p_\beta(x , \theta) \, d\theta \\ &= \int_\Theta p(x \mid \theta) \, p_\beta(\theta) \, d\theta \end{align*}

This distribution is called *prior* predictive because it does not rely on any observation.
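To make the "averaging" concrete, here is a minimal Monte Carlo sketch of the prior predictive, assuming a hypothetical normal-normal model (the model, hyperparameter values, and variable names are mine, not from the question): sample $\theta$ from the prior, then sample $x$ from the likelihood, and the resulting $x$'s are draws from $p_\beta(x)$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical model for illustration:
#   prior:      theta ~ Normal(mu0, tau)    (mu0, tau play the role of beta)
#   likelihood: x | theta ~ Normal(theta, sigma)
mu0, tau, sigma = 0.0, 2.0, 1.0
n_samples = 100_000

theta = rng.normal(mu0, tau, size=n_samples)   # draws from the prior p_beta(theta)
x = rng.normal(theta, sigma)                   # draws from the prior predictive p_beta(x)

# Sanity check: for this model the prior predictive is
# Normal(mu0, sqrt(tau^2 + sigma^2)), so the sample std of x should match.
print(x.std(), np.hypot(tau, sigma))
```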

We can define the posterior predictive distribution in the same way: if we have a sample $X = (X_1, \dots, X_n)$, the posterior predictive distribution is:

\begin{align*} p_\beta(x \mid X) &= \int_\Theta p_\beta(x ,\theta \mid X) \, d\theta \\ &= \int_\Theta p(x \mid \theta,X) \, p_\beta(\theta \mid X) \, d\theta \\ &= \int_\Theta p(x \mid \theta) \, p_\beta(\theta \mid X) \, d\theta. \end{align*} The last line is based on the assumption that the upcoming observation is independent of $X$ given $\theta$.

Thus the posterior predictive distribution is constructed the same way as the prior predictive distribution, except that the likelihood is now weighted by $p_\beta(\theta \mid X)$, our "updated" knowledge about $\theta$, rather than by $p_\beta(\theta)$.
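The same simulation view applies here: draw $\theta$ from $p_\beta(\theta \mid X)$ instead of $p_\beta(\theta)$, then draw $x$ from the likelihood. A sketch for the same hypothetical normal-normal model as above (the posterior has a closed form here; in general the $\theta$ draws could come from MCMC):

```python
import numpy as np

rng = np.random.default_rng(1)

# Same hypothetical normal-normal model as in the previous sketch.
mu0, tau, sigma = 0.0, 2.0, 1.0
X_obs = rng.normal(1.5, sigma, size=20)        # hypothetical observed sample

# Closed-form normal posterior for theta given X_obs.
prec = 1 / tau**2 + len(X_obs) / sigma**2      # posterior precision
mu_post = (mu0 / tau**2 + X_obs.sum() / sigma**2) / prec
theta_post = rng.normal(mu_post, prec**-0.5, size=100_000)

# Same averaging as before, but weighted by p(theta | X):
x_new = rng.normal(theta_post, sigma)          # posterior predictive draws
print(x_new.mean(), x_new.std())
```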

Example: Beta-Binomial

Suppose our model is $X \mid \theta \sim {\rm Bin}(n,\theta)$, i.e. $P(X = x \mid \theta) = \binom{n}{x}\theta^x(1-\theta)^{n-x}$.
Here $\Theta = [0,1]$.

We also assume a beta prior distribution for $\theta$, ${\rm Beta}(a,b)$, where $(a,b)$ is the set of hyperparameters.

The prior predictive distribution, $p_{a,b}(x)$, is the beta-binomial distribution with parameters $(n,a,b)$.

This discrete distribution gives the probability of observing $x$ successes in $n$ trials, given the hyperparameters $(a,b)$ on the probability of success.
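For completeness, the beta-binomial pmf follows directly from the integral defining the prior predictive, using the beta function $B(\cdot,\cdot)$ (a standard computation, spelled out here for convenience):

\begin{align*} p_{a,b}(x) &= \int_0^1 \binom{n}{x} \theta^x (1-\theta)^{n-x} \, \frac{\theta^{a-1}(1-\theta)^{b-1}}{B(a,b)} \, d\theta \\ &= \binom{n}{x} \frac{B(x+a,\, n-x+b)}{B(a,b)}, \qquad x = 0, \dots, n. \end{align*}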

Now suppose we observe $n_1$ Bernoulli draws $(x_1, \dots, x_{n_1})$ totalling $m = \sum_i x_i$ successes.

Since the binomial and beta distributions are conjugate, we have: \begin{align*} p(\theta \mid X=m) &\propto \theta^m (1 - \theta)^{n_1-m} \times \theta^{a-1}(1-\theta)^{b-1}\\ &\propto \theta^{a+m-1}(1-\theta)^{b+n_1-m-1}, \end{align*} which is the kernel of a ${\rm Beta}(a+m,\, b+n_1-m)$ density.

Thus $\theta \mid X = m$ follows a ${\rm Beta}(a+m,\, b+n_1-m)$ distribution.

Then the posterior predictive distribution for $n_2$ additional trials, $p_{a,b}(x \mid X = m)$, is also a beta-binomial distribution, but this time with parameters $(n_2,\, a+m,\, b+n_1-m)$ rather than $(n_2, a, b)$.

In summary: with a ${\rm Beta}(a,b)$ prior distribution and a ${\rm Bin}(n,\theta)$ likelihood, if we observe $m$ successes out of $n_1$ trials, the posterior predictive distribution is a beta-binomial with parameters $(n_2,\, a+m,\, b+n_1-m)$. Note that $n_2$ and $n_1$ play different roles here, since the posterior predictive distribution answers the question:

Given my current knowledge about $\theta$ after observing $m$ successes out of $n_1$ trials, i.e. $\theta \mid X \sim {\rm Beta}(a+m,\, b+n_1-m)$, what is the probability of observing $k$ successes in $n_2$ additional trials?
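For a quick numerical comparison of the two predictive distributions, here is a sketch using SciPy's betabinom (available since SciPy 1.4); the values of $a$, $b$, $n_1$, $m$, $n_2$ are illustrative assumptions, not from the answer above:

```python
from scipy.stats import betabinom

# Illustrative numbers (assumptions, not from the answer above)
a, b = 2, 2          # hyperparameters of the Beta(a, b) prior
n1, m = 10, 7        # observed: m successes out of n1 trials
n2 = 5               # number of additional (future) trials

prior_pred = betabinom(n2, a, b)                 # BetaBin(n2, a, b)
post_pred = betabinom(n2, a + m, b + n1 - m)     # BetaBin(n2, a+m, b+n1-m)

for k in range(n2 + 1):
    print(k, round(prior_pred.pmf(k), 4), round(post_pred.pmf(k), 4))
```

With these numbers the posterior predictive shifts mass toward larger $k$, reflecting the high observed success rate, while the prior predictive stays symmetric around $n_2/2$.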

I hope this is useful and clear.
