While studying Bayesian statistics, I am having trouble understanding the difference between the prior distribution and the prior predictive distribution. The prior distribution is reasonably clear to me, but I find it vague what the prior predictive distribution is used for and why it is different from the prior distribution.
Bayesian Inference – Differences Between Prior Distribution and Prior Predictive Distribution
bayesian, data mining, hierarchical-bayesian, inference, machine learning
Best Answer
Predictive here means predictive for observations. The prior distribution is a distribution for the parameters whereas the prior predictive distribution is a distribution for the observations.
If $X$ denotes the observations and we use the model (or likelihood) $p(x \mid \theta)$ for $\theta \in \Theta$, then a prior distribution is a distribution for $\theta$, for example $p_\beta(\theta)$ where $\beta$ is a set of hyperparameters. Note that there is no conditioning on $\beta$: the hyperparameters are considered fixed. (This is not the case in hierarchical models, but that is not the point here.)
The prior predictive distribution is the distribution of $X$ "averaged" over all possible values of $\theta$:
\begin{align*} p_\beta(x) &= \int_\Theta p(x , \theta) d\theta \\ &= \int_\Theta p(x \mid \theta) p_\beta(\theta) d\theta \end{align*}
This distribution is prior as it does not rely on any observation.
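The integral above can be approximated by simulation: draw $\theta$ from the prior, then draw $x$ from the likelihood given that $\theta$. Here is a minimal Monte Carlo sketch, using an illustrative Beta$(2,5)$ prior and a Bin$(10,\theta)$ likelihood (these particular numbers are my own choice, not from the answer):

```python
# Monte Carlo sketch of the prior predictive p_beta(x):
# draw theta from the prior, then x from the likelihood given theta.
# The Beta(2, 5) prior and Bin(10, theta) likelihood are illustrative.
import numpy as np

rng = np.random.default_rng(0)
a, b, n = 2.0, 5.0, 10

theta = rng.beta(a, b, size=100_000)   # samples from the prior p_beta(theta)
x = rng.binomial(n, theta)             # one observation per theta draw

# Empirical prior predictive probabilities P(X = k), k = 0..n
prior_pred = np.bincount(x, minlength=n + 1) / x.size
print(prior_pred)
```

Note that no data enter this computation: only the prior and the likelihood are used, which is what makes the distribution "prior" predictive.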
We can define the posterior predictive distribution in the same way: if we have a sample $X = (X_1, \dots, X_n)$, the posterior predictive distribution is:
\begin{align*} p_\beta(x \mid X) &= \int_\Theta p(x ,\theta \mid X) d\theta \\ &= \int_\Theta p(x \mid \theta,X) p_\beta(\theta \mid X)d\theta \\ &= \int_\Theta p(x \mid \theta) p_\beta(\theta \mid X)d\theta. \end{align*} The last line is based on the assumption that the upcoming observation is independent of $X$ given $\theta$.
Thus the posterior predictive distribution is constructed the same way as the prior predictive distribution but while in the latter we weight with $p_\beta(\theta)$ in the former we weight with $p_\beta(\theta \mid X)$ that is with our "updated" knowledge about $\theta$.
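In simulation terms, the only change from the prior predictive sketch is where the $\theta$ draws come from. A hedged sketch, assuming we already have draws from the posterior (here stand-in Beta draws, as one would get from a conjugate update or MCMC; the numbers are illustrative):

```python
# Sketch of posterior predictive sampling: given draws of theta from the
# posterior p_beta(theta | X), simulate one new observation per draw.
# The Bin(10, theta) likelihood and the Beta(9, 18) stand-in posterior
# draws are illustrative assumptions, not from the answer.
import numpy as np

rng = np.random.default_rng(1)
n = 10
posterior_theta = rng.beta(9.0, 18.0, size=50_000)  # placeholder posterior draws
x_new = rng.binomial(n, posterior_theta)            # posterior predictive samples

post_pred = np.bincount(x_new, minlength=n + 1) / x_new.size
print(post_pred)
```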
Example: Beta-Binomial
Suppose our model is $X \mid \theta \sim {\rm Bin}(n,\theta)$, i.e. $P(X = x \mid \theta) = \binom{n}{x}\theta^x(1-\theta)^{n-x}$.
Here $\Theta = [0,1]$.
We also assume a beta prior distribution for $\theta$, $\beta(a,b)$, where $(a,b)$ is the set of hyperparameters.
The prior predictive distribution, $p_{a,b}(x)$, is the beta-binomial distribution with parameters $(n,a,b)$.
This discrete distribution gives the probability of getting $k$ successes out of $n$ trials given the hyper-parameters $(a,b)$ on the probability of success.
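Carrying out the integral $\int_0^1 \binom{n}{x}\theta^x(1-\theta)^{n-x} \, p_{a,b}(\theta)\, d\theta$ in this conjugate case gives the beta-binomial pmf in closed form, $p_{a,b}(x) = \binom{n}{x} B(x+a,\, n-x+b)/B(a,b)$, where $B$ is the Beta function. A stdlib-only sketch (the values of $n$, $a$, $b$ are illustrative):

```python
# Closed-form prior predictive for the beta-binomial model:
# p_{a,b}(x) = C(n, x) * B(x + a, n - x + b) / B(a, b),
# computed via log-gamma for numerical stability.
from math import comb, lgamma, exp

def log_beta(p, q):
    """log of the Beta function B(p, q) via log-gamma."""
    return lgamma(p) + lgamma(q) - lgamma(p + q)

def beta_binomial_pmf(k, n, a, b):
    """P(X = k) under the prior predictive (beta-binomial) distribution."""
    return comb(n, k) * exp(log_beta(k + a, n - k + b) - log_beta(a, b))

n, a, b = 10, 2.0, 5.0   # illustrative values
pmf = [beta_binomial_pmf(k, n, a, b) for k in range(n + 1)]
print(pmf)
```

As a sanity check, with a uniform prior $a = b = 1$ this pmf reduces to the discrete uniform $1/(n+1)$ for every $k$, a classical result for the beta-binomial.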
Now suppose we observe $n_1$ draws $(x_1, \dots, x_{n_1})$ with $m$ successes.
Since the binomial and beta distributions are conjugate we have: \begin{align*} p(\theta \mid X=m) &\propto \theta^m (1 - \theta)^{n_1-m} \times \theta^{a-1}(1-\theta)^{b-1}\\ &= \theta^{a+m-1}(1-\theta)^{n_1+b-m-1}, \end{align*} which is the kernel of a $\beta(a+m,n_1+b-m)$ density.
Thus $\theta \mid X$ follows a beta distribution with parameters $(a+m,n_1+b-m)$.
Then, $p_{a,b}(x \mid X = m)$ is also a beta-binomial distribution but this time with parameters $(n_2,a+m,b+n_1-m)$ rather than $(n_2,a,b)$.
To summarize: with a $\beta(a,b)$ prior distribution and a ${\rm Bin}(n,\theta)$ likelihood, if we observe $m$ successes out of $n_1$ trials, the posterior predictive distribution is a beta-binomial with parameters $(n_2,a+m,b+n_1-m)$. Note that $n_2$ and $n_1$ play different roles here: $n_1$ is the number of trials already observed, which updates our knowledge about $\theta$, while $n_2$ is the number of upcoming trials whose outcome we want to predict.
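The two routes to the posterior predictive, the closed-form beta-binomial$(n_2, a+m, b+n_1-m)$ and Monte Carlo simulation through the posterior, can be checked against each other. A sketch with illustrative numbers (all values of $a$, $b$, $n_1$, $m$, $n_2$ are my own choices):

```python
# Check by simulation that the posterior predictive for n2 new trials
# matches beta-binomial(n2, a + m, b + n1 - m). All numbers illustrative.
import numpy as np
from math import comb, lgamma, exp

a, b = 2.0, 5.0     # prior hyperparameters
n1, m = 20, 7       # observed trials and successes
n2 = 8              # upcoming trials to predict

rng = np.random.default_rng(2)
theta = rng.beta(a + m, b + n1 - m, size=200_000)  # posterior draws
x = rng.binomial(n2, theta)
mc = np.bincount(x, minlength=n2 + 1) / x.size     # Monte Carlo estimate

def log_beta(p, q):
    return lgamma(p) + lgamma(q) - lgamma(p + q)

# Exact beta-binomial(n2, a + m, b + n1 - m) pmf
exact = [comb(n2, k) * exp(log_beta(k + a + m, n2 - k + b + n1 - m)
                           - log_beta(a + m, b + n1 - m))
         for k in range(n2 + 1)]
print(max(abs(mc - np.array(exact))))  # Monte Carlo error, should be small
```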
I hope this is useful and clear.