In Bayesian inference, why are some terms dropped from the posterior predictive?

bayesian, inference, posterior, predictive-models

In Kevin Murphy's "Conjugate Bayesian analysis of the Gaussian distribution," he writes that the posterior predictive distribution is

$$
p(x \mid D) = \int p(x \mid \theta) p(\theta \mid D) d \theta
$$

where $D$ is the data on which the model is fit and $x$ is unseen data. What I don't understand is why the dependence on $D$ disappears in the first term in the integral. Using basic rules of probability, I would have expected:

$$
\begin{align}
p(a) &= \int p(a \mid c) p(c) dc
\\
p(a \mid b) &= \int p(a \mid c, b) p(c \mid b) dc
\\
&\downarrow
\\
p(x \mid D) &= \int \overbrace{p(x \mid \theta, D)}^{\star} p(\theta \mid D) d \theta
\end{align}
$$

Question: Why does the dependence on $D$ in term $\star$ disappear?


For what it's worth, I've seen this kind of formulation (dropping variables in conditionals) in other places. For example, in Ryan Adams's "Bayesian Online Changepoint Detection," he writes the posterior predictive as

$$
p(x_{t+1} \mid r_t) = \int p(x_{t+1} \mid \theta) p(\theta \mid r_{t}, x_{t}) d \theta
$$

where again, since $D = \{x_t, r_t\}$, I would have expected

$$
p(x_{t+1} \mid x_t, r_t) = \int p(x_{t+1} \mid \theta, x_t, r_t) p(\theta \mid r_{t}, x_{t}) d \theta
$$

Best Answer

This is based on the assumption that $x$ is conditionally independent of $D$ given $\theta$. This is a reasonable assumption in many cases, because all it says is that the training and test data ($D$ and $x$, respectively) are generated independently from the same set of unknown parameters $\theta$. Given this independence assumption, $p(x \mid \theta, D) = p(x \mid \theta)$, and so $D$ drops out of the more general form that you expected.
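
If it helps to see this numerically, here is a minimal sketch in the conjugate Gaussian setting (my own toy setup, assuming a known likelihood variance and a Gaussian prior on the mean; the specific numbers are arbitrary). The Monte Carlo average of $p(x \mid \theta)$ over posterior samples matches the closed-form posterior predictive, and note that the likelihood factor never references $D$ directly:

```python
# Monte Carlo check of  p(x|D) = \int p(x|theta) p(theta|D) dtheta
# for a Gaussian likelihood N(theta, sigma^2) with known sigma and a
# conjugate prior theta ~ N(mu0, tau0^2).  Illustrative sketch only.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# "Training" data D, drawn i.i.d. from the likelihood at some true theta.
true_theta, sigma = 1.5, 2.0
D = rng.normal(true_theta, sigma, size=50)

# Closed-form conjugate posterior p(theta|D) = N(mu_n, tau_n^2).
mu0, tau0 = 0.0, 3.0
n, xbar = len(D), D.mean()
tau_n2 = 1.0 / (1.0 / tau0**2 + n / sigma**2)          # posterior variance
mu_n = tau_n2 * (mu0 / tau0**2 + n * xbar / sigma**2)  # posterior mean

# Approximate the integral: sample theta ~ p(theta|D), then average the
# likelihood p(x|theta).  Given theta, this factor makes no reference to D.
thetas = rng.normal(mu_n, np.sqrt(tau_n2), size=100_000)
x_grid = np.linspace(-6.0, 8.0, 15)
mc_pred = np.array([norm.pdf(x, loc=thetas, scale=sigma).mean()
                    for x in x_grid])

# Closed-form posterior predictive: N(mu_n, tau_n^2 + sigma^2).
exact_pred = norm.pdf(x_grid, loc=mu_n, scale=np.sqrt(tau_n2 + sigma**2))

print(np.abs(mc_pred - exact_pred).max())  # tiny Monte Carlo error
```

All of the information in $D$ reaches the prediction through the posterior samples of $\theta$; that is exactly what the conditional independence assumption buys you.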

In your second example, a similar conditional independence assumption is being applied, but now across time: given $\theta$, the next observation is independent of the history (spelled out below). These assumptions may be stated explicitly elsewhere in the text, or they may be left implicit because they are clear to anyone sufficiently familiar with the context of the problem (although that doesn't necessarily mean that in your particular examples, which I'm not familiar with, the authors were right to assume this familiarity).
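
Written out, both steps are the same one-line application of conditional independence (the second line is my reading of the changepoint setup, stated here only to make the parallel explicit):

$$
\begin{align}
x \perp D \mid \theta &\implies p(x \mid \theta, D) = p(x \mid \theta)
\\
x_{t+1} \perp \{x_t, r_t\} \mid \theta &\implies p(x_{t+1} \mid \theta, x_t, r_t) = p(x_{t+1} \mid \theta)
\end{align}
$$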
