[Math] How to derive the posterior predictive distribution

bayesian, probability, probability distributions, statistics

I have often seen the posterior predictive distribution mentioned in the context of machine learning and Bayesian inference. The definition is as follows:

$ p(D'|D) = \int_\theta p(D'|\theta)\,p(\theta|D)\,\mathrm d\theta$

How/why does the integral on the right equal the probability distribution on the left? In other words, which laws of probability can I use to derive $p(D'|D)$ given the integral?
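For concreteness, here is a small worked example (my own illustration, not from the original post), assuming a Beta-Bernoulli model: with prior $\theta \sim \mathrm{Beta}(\alpha,\beta)$ and data $D$ consisting of $n$ coin flips with $k$ heads, the posterior is $\mathrm{Beta}(\alpha+k,\ \beta+n-k)$, and for a single new observation $D' = \{x'\}$ the integral can be evaluated in closed form:

$$
p(x'=1 \mid D) = \int_0^1 p(x'=1\mid\theta)\,p(\theta\mid D)\,\mathrm d\theta
= \int_0^1 \theta\,\frac{\theta^{\alpha+k-1}(1-\theta)^{\beta+n-k-1}}{B(\alpha+k,\ \beta+n-k)}\,\mathrm d\theta
= \frac{\alpha+k}{\alpha+\beta+n}.
$$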

Edit – After further consideration, I think I am able to see much of the derivation. That is,

$p(D'|D) = \int_\theta p(D', \theta | D)\,\mathrm d\theta$ via the law of total probability
$p(D'|D) = \int_\theta p(D' | D, \theta)\,p(\theta | D)\,\mathrm d\theta$ via the chain rule

But I don't understand why $D$ may be dropped from the conditioning variables in the integral's first term.

Best Answer

To show this, one can follow a somewhat standard argument. In what follows, for notational convenience, I have replaced your "$D$"s with "$S$"s. By the law of total expectation (in terms of conditional expectation) and Fubini's theorem, applied to any bounded measurable function $f$ defined on the relevant sample space $\Omega$, we observe that
$$
\begin{aligned}
\int_{\Omega}f(s')\,p(s'\mid s)\,\mathrm ds' &= \mathbb E[f(S')\mid S=s]=\mathbb E\bigl[\mathbb E[f(S')\mid \Theta,\,S=s]\mid S=s\bigr]\\
&=\int_{\theta}\left(\int_{\Omega}f(s')\,p(s'\mid \theta,\,s)\,\mathrm ds'\right)p(\theta\mid s)\,\mathrm d\theta \\
&= \int_{\Omega}f(s')\left(\int_{\theta}p(s'\mid \theta,\,s)\,p(\theta\mid s)\,\mathrm d\theta\right)\mathrm ds'
\end{aligned}
$$

Since the far l.h.s. is equal to the far r.h.s. for all bounded measurable functions $f$, we conclude that $$ p(s'\mid s)=\int_{\theta}p(s'\mid \theta,\,s)\,p(\theta\mid s)\,\mathrm d\theta $$

Finally, when $S'$ and $S$ are conditionally independent given $\Theta$ (for example, when the observations are i.i.d. draws from $p(\cdot\mid\theta)$), we have $p(s'\mid \theta, s) = p(s'\mid\theta)$. That is exactly why $D$ may be dropped from the conditioning in your first term, and it recovers the formula you quoted.
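As a quick numerical sanity check (my own addition, not part of the original answer), one can approximate the right-hand side by averaging $p(D'\mid\theta)$ over posterior draws and compare it with the closed-form Beta-Bernoulli predictive $\frac{\alpha+k}{\alpha+\beta+n}$. The model and parameter values below are illustrative assumptions, not taken from the post.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative Beta-Bernoulli setup (assumed for this example):
# prior theta ~ Beta(alpha, beta), data D = n coin flips with k heads.
alpha, beta, n, k = 2.0, 2.0, 10, 7

# Posterior p(theta | D) is Beta(alpha + k, beta + n - k) by conjugacy.
post_a, post_b = alpha + k, beta + n - k

# Monte Carlo estimate of p(x' = 1 | D) = ∫ p(x' = 1 | theta) p(theta | D) dtheta:
# draw theta from the posterior and average p(x' = 1 | theta) = theta.
theta_draws = rng.beta(post_a, post_b, size=1_000_000)
mc_estimate = np.mean(theta_draws)

# Closed-form posterior predictive for a single new flip.
closed_form = post_a / (post_a + post_b)

print(f"Monte Carlo : {mc_estimate:.4f}")   # both should be ≈ (alpha + k) / (alpha + beta + n)
print(f"Closed form : {closed_form:.4f}")
```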
