Probability – Measure Theoretic Formulation of Bayes’ Theorem Explained

bayesian, mathematical-statistics, measure-theory, probability

I am trying to find a measure-theoretic formulation of Bayes' theorem. When used in statistical inference, Bayes' theorem is usually stated as:

$$p\left(\theta|x\right) = \frac{p\left(x|\theta\right) \cdot p\left(\theta\right)}{p\left(x\right)}$$

where:

  • $p\left(\theta|x\right)$: the posterior density of the parameter.
  • $p\left(x|\theta\right)$: the statistical model (or likelihood).
  • $p\left(\theta\right)$: the prior density of the parameter.
  • $p\left(x\right)$: the evidence.
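
To fix ideas before going measure-theoretic, here is a minimal numerical sketch of how I read this density form, using a beta-binomial model of my own choosing (scipy assumed available); the posterior computed from the formula matches the known conjugate answer.

```python
from scipy import stats, integrate

# Illustrative beta-binomial model: theta ~ Beta(a, b), x | theta ~ Binomial(n, theta).
a, b, n, x = 2.0, 2.0, 10, 7

def prior(theta):                        # p(theta)
    return stats.beta.pdf(theta, a, b)

def likelihood(x, theta):                # p(x | theta)
    return stats.binom.pmf(x, n, theta)

# Evidence p(x) = integral over [0, 1] of p(x | theta) p(theta) dtheta.
evidence, _ = integrate.quad(lambda t: likelihood(x, t) * prior(t), 0.0, 1.0)

# Posterior density at a point via Bayes' theorem, checked against the known
# conjugate posterior Beta(a + x, b + n - x): the two printed values agree.
theta0 = 0.6
print(likelihood(x, theta0) * prior(theta0) / evidence)
print(stats.beta.pdf(theta0, a + x, b + n - x))
```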

Now, how would we define Bayes' theorem in a measure-theoretic way?

So, I started by defining a probability space:

$$\left(\Theta, \mathcal{F}_\Theta, \mathbb{P}_\Theta\right)$$

such that $\theta \in \Theta$.

I then defined another probability space:

$$\left(X, \mathcal{F}_X, \mathbb{P}_X\right)$$

such that $x \in X$.

From here on I don't know what to do. The joint probability space would be:

$$\left(\Theta \times X, \mathcal{F}_\Theta \otimes \mathcal{F}_X, ?\right)$$

but I don't know what the measure should be.

Bayes' theorem should be written as follows:

$$? = \frac{? \cdot \mathbb{P}_\Theta}{\mathbb{P}_X}$$

where:

$$\mathbb{P}_X = \int_{\theta \in \Theta} ? \, \mathrm{d}\mathbb{P}_\Theta$$

but, as you can see, I don't know the other measures or in which probability spaces they reside.

I stumbled upon this thread, but it was of little help, and I don't know how the following measure-theoretic generalization of Bayes' rule was reached:

$$P_{\Theta \mid y}(A) = \int_{x \in A} \frac{\mathrm{d}P_{\Omega \mid x}}{\mathrm{d}P_\Omega}(y) \, \mathrm{d}P_\Theta$$

I'm self-studying measure-theoretic probability and lack guidance, so excuse my ignorance.

Best Answer

One precise formulation of Bayes' Theorem is the following, taken verbatim from Schervish's Theory of Statistics (1995).

The conditional distribution of $\Theta$ given $X=x$ is called the posterior distribution of $\Theta$. The next theorem shows us how to calculate the posterior distribution of a parameter in the case in which there is a measure $\nu$ such that each $P_\theta \ll \nu$.

Theorem 1.31 (Bayes' theorem). Suppose that $X$ has a parametric family $\mathcal{P}_0$ of distributions with parameter space $\Omega$. Suppose that $P_\theta \ll \nu$ for all $\theta \in \Omega$, and let $f_{X\mid\Theta}(x\mid\theta)$ be the conditional density (with respect to $\nu$) of $X$ given $\Theta = \theta$. Let $\mu_\Theta$ be the prior distribution of $\Theta$. Let $\mu_{\Theta\mid X}(\cdot \mid x)$ denote the conditional distribution of $\Theta$ given $X = x$. Then $\mu_{\Theta\mid X} \ll \mu_\Theta$, a.s. with respect to the marginal of $X$, and the Radon-Nikodym derivative is $$ \tag{1} \label{1} \frac{d\mu_{\Theta\mid X}}{d\mu_\Theta}(\theta \mid x) = \frac{f_{X\mid \Theta}(x\mid \theta)}{\int_\Omega f_{X\mid\Theta}(x\mid t) \, d\mu_\Theta(t)} $$ for those $x$ such that the denominator is neither $0$ nor infinite. The prior predictive probability of the set of $x$ values such that the denominator is $0$ or infinite is $0$, hence the posterior can be defined arbitrarily for such $x$ values.
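
Before unpacking the setup behind this statement (see Edit 1 below), it may help to see the theorem in the simplest possible case: a finite parameter space, where every Radon-Nikodym derivative reduces to a ratio of probability mass functions. The following is my own illustrative sketch, not Schervish's, with $\nu$ taken to be counting measure on the sample space (numpy assumed available).

```python
import numpy as np

# Toy model: Omega = {0, 1, 2}, X takes values in {0, 1},
# nu = counting measure on {0, 1}, so f_{X|Theta}(x|theta) = P_theta({x}).
prior = np.array([0.5, 0.3, 0.2])        # mu_Theta on Omega
lik = np.array([[0.9, 0.1],              # row theta, column x:
                [0.5, 0.5],              # f_{X|Theta}(x | theta)
                [0.2, 0.8]])

x = 1
evidence = np.sum(lik[:, x] * prior)      # integral of f_{X|Theta}(x|t) d(mu_Theta)(t)
posterior = lik[:, x] * prior / evidence  # mu_{Theta|X}({theta} | x)

# Theorem 1.31 says d(mu_{Theta|X})/d(mu_Theta) = f_{X|Theta}(x|theta) / evidence.
# In the discrete case the left-hand side is the ratio of posterior to prior pmfs:
print(np.allclose(posterior / prior, lik[:, x] / evidence))   # True
```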


Edit 1. The setup for this theorem is as follows:

  1. There is some underlying probability space $(S, \mathcal{S}, \Pr)$ with respect to which all probabilities are computed.
  2. There is a standard Borel space $(\mathcal{X}, \mathcal{B})$ (the sample space) and a measurable map $X : S \to \mathcal{X}$ (the sample or data).
  3. There is a standard Borel space $(\Omega, \tau)$ (the parameter space) and a measurable map $\Theta : S \to \Omega$ (the parameter).
  4. The distribution of $\Theta$ is $\mu_\Theta$ (the prior distribution); this is the probability measure on $(\Omega, \tau)$ given by $\mu_\Theta(A) = \Pr(\Theta \in A)$ for all $A \in \tau$.
  5. The distribution of $X$ is $\mu_X$ (the marginal distribution mentioned in the theorem); this is the probability measure on $(\mathcal{X}, \mathcal{B})$ given by $\mu_X(B) = \Pr(X \in B)$ for all $B \in \mathcal{B}$.
  6. There is a probability kernel $P : \Omega \times \mathcal{B} \to [0, 1]$, denoted $(\theta, B) \mapsto P_\theta(B)$ which represents the conditional distribution of $X$ given $\Theta$. This means that

    • for each $B \in \mathcal{B}$, the map $\theta \mapsto P_\theta(B)$ from $\Omega$ into $[0, 1]$ is measurable,
    • $P_\theta$ is a probability measure on $(\mathcal{X}, \mathcal{B})$ for each $\theta \in \Omega$, and
    • for all $A \in \tau$ and $B \in \mathcal{B}$, $$ \Pr(\Theta \in A, X \in B) = \int_A P_\theta(B) \, d\mu_\Theta(\theta). $$

    This is the parametric family of distributions of $X$ given $\Theta$; a numerical sketch of this kernel setup appears just after this list.

  7. We assume that there exists a measure $\nu$ on $(\mathcal{X}, \mathcal{B})$ such that $P_\theta \ll \nu$ for all $\theta \in \Omega$, and we choose a version $f_{X\mid\Theta}(\cdot\mid\theta)$ of the Radon-Nikodym derivative $d P_\theta / d \nu$ (strictly speaking, the guaranteed existence of this Radon-Nikodym derivative might require $\nu$ to be $\sigma$-finite). This means that $$ P_\theta(B) = \int_B f_{X\mid\Theta}(x \mid \theta) \, d\nu(x) $$ for all $B \in \mathcal{B}$. It follows that $$ \Pr(\Theta \in A, X \in B) = \int_A \int_B f_{X \mid \Theta}(x \mid \theta) \, d\nu(x) \, d\mu_\Theta(\theta) $$ for all $A \in \tau$ and $B \in \mathcal{B}$. We may assume without loss of generality (e.g., see exercise 9 in Chapter 1 of Schervish's book) that the map $(x, \theta) \mapsto f_{X\mid \Theta}(x\mid\theta)$ of $\mathcal{X}\times\Omega$ into $[0, \infty]$ is measurable. Then by Tonelli's theorem we can change the order of integration: $$ \Pr(\Theta \in A, X \in B) = \int_B \int_A f_{X \mid \Theta}(x \mid \theta) \, d\mu_\Theta(\theta) \, d\nu(x) $$ for all $A \in \tau$ and $B \in \mathcal{B}$. In particular, the marginal probability of a set $B \in \mathcal{B}$ is $$ \mu_X(B) = \Pr(X \in B) = \int_B \int_\Omega f_{X \mid \Theta}(x \mid \theta) \, d\mu_\Theta(\theta) \, d\nu(x), $$ which shows that $\mu_X \ll \nu$, with Radon-Nikodym derivative $$ \frac{d\mu_X}{d\nu} = \int_\Omega f_{X \mid \Theta}(x \mid \theta) \, d\mu_\Theta(\theta). $$
  8. There exists a probability kernel $\mu_{\Theta \mid X} : \mathcal{X} \times \tau \to [0, 1]$, denoted $(x, A) \mapsto \mu_{\Theta \mid X}(A \mid x)$, which represents the conditional distribution of $\Theta$ given $X$ (i.e., the posterior distribution). This means that
    • for each $A \in \tau$, the map $x \mapsto \mu_{\Theta \mid X}(A \mid x)$ from $\mathcal{X}$ into $[0, 1]$ is measurable,
    • $\mu_{\Theta \mid X}(\cdot \mid x)$ is a probability measure on $(\Omega, \tau)$ for each $x \in \mathcal{X}$, and
    • for all $A \in \tau$ and $B \in \mathcal{B}$, $$ \Pr(\Theta \in A, X \in B) = \int_B \mu_{\Theta \mid X}(A \mid x) \, d\mu_X(x). $$
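
To make the kernel machinery concrete, here is a small numerical check of the defining identity in item 6, $\Pr(\Theta \in A, X \in B) = \int_A P_\theta(B) \, d\mu_\Theta(\theta)$, in a normal-normal model of my own choosing (scipy assumed available). With $\mu_\Theta = N(0, 1)$ and $P_\theta = N(\theta, 1)$, the pair $(\Theta, X)$ is jointly normal, so the left-hand side is available directly.

```python
import numpy as np
from scipy import stats, integrate

# Illustrative model: Theta ~ N(0, 1) (prior mu_Theta), kernel P_theta = N(theta, 1).
# Then (Theta, X) is jointly normal with Var(X) = 2 and Cov(Theta, X) = 1.
a, b = 0.5, 1.0                          # A = (-inf, a], B = (-inf, b]

# Right-hand side: integral over A of P_theta(B) d(mu_Theta)(theta),
# where P_theta(B) = Phi(b - theta).
rhs, _ = integrate.quad(
    lambda t: stats.norm.cdf(b - t) * stats.norm.pdf(t), -np.inf, a)

# Left-hand side: Pr(Theta <= a, X <= b) from the joint normal distribution.
joint = stats.multivariate_normal(mean=[0.0, 0.0],
                                  cov=[[1.0, 1.0], [1.0, 2.0]])
print(joint.cdf([a, b]), rhs)            # the two values agree numerically
```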

Edit 2. Given the setup above, the proof of Bayes' theorem is relatively straightforward.

Proof. Following Schervish, let $$ C_0 = \left\{x \in \mathcal{X} : \int_\Omega f_{X \mid \Theta}(x \mid t) \, d\mu_\Theta(t) = 0\right\} $$ and $$ C_\infty = \left\{x \in \mathcal{X} : \int_\Omega f_{X \mid \Theta}(x \mid t) \, d\mu_\Theta(t) = \infty\right\} $$ (these are the sets of potentially problematic $x$ values for the denominator of the right-hand-side of \eqref{1}). We have $$ \mu_X(C_0) = \Pr(X \in C_0) = \int_{C_0} \int_\Omega f_{X \mid \Theta}(x \mid t) \, d\mu_\Theta(t) \, d\nu(x) = 0, $$ and $$ \mu_X(C_\infty) = \int_{C_\infty} \int_\Omega f_{X \mid \Theta}(x \mid t) \, d\mu_\Theta(t) \, d\nu(x) = \begin{cases} \infty, & \text{if $\nu(C_\infty) > 0$,} \\ 0, & \text{if $\nu(C_\infty) = 0$.} \end{cases} $$ Since $\mu_X(C_\infty) = \infty$ is impossible ($\mu_X$ is a probability measure), it follows that $\nu(C_\infty) = 0$, whence $\mu_X(C_\infty) = 0$ as well. Thus, $\mu_X(C_0 \cup C_\infty) = 0$, so the set of all $x \in \mathcal{X}$ such that the denominator of the right-hand-side of \eqref{1} is zero or infinite has zero marginal probability.
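
As a quick sanity check on the $C_0$ argument before continuing (my own illustrative example, not Schervish's): take $\mu_\Theta = \mathrm{Uniform}(0, 1)$, $P_\theta = \mathrm{Uniform}(\theta, \theta + 1)$, and $\nu$ Lebesgue measure. The evidence density vanishes outside $(0, 2)$, and indeed simulated draws of $X$ never land out there (numpy assumed available).

```python
import numpy as np

rng = np.random.default_rng(0)

# theta ~ Uniform(0, 1), X | theta ~ Uniform(theta, theta + 1); nu = Lebesgue.
# Evidence density: integral of 1[theta <= x <= theta + 1] d(theta) over [0, 1],
# which equals x on [0, 1], 2 - x on [1, 2], and 0 elsewhere, so C_0 is
# (essentially) the complement of (0, 2).
theta = rng.uniform(0.0, 1.0, size=1_000_000)
x = rng.uniform(theta, theta + 1.0)

# mu_X(C_0) = Pr(X in C_0) is zero: no simulated draw escapes (0, 2).
print(np.mean((x <= 0.0) | (x >= 2.0)))  # 0.0
```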

Next, consider that, if $A \in \tau$ and $B \in \mathcal{B}$, then $$ \Pr(\Theta \in A, X \in B) = \int_B \int_A f_{X \mid \Theta}(x \mid \theta) \, d\mu_\Theta(\theta) \, d\nu(x) $$ and simultaneously $$ \begin{aligned} \Pr(\Theta \in A, X \in B) &= \int_B \mu_{\Theta \mid X}(A \mid x) \, d\mu_X(x) \\ &= \int_B \left( \mu_{\Theta \mid X}(A \mid x) \int_\Omega f_{X \mid \Theta}(x \mid t) \, d\mu_\Theta(t) \right) \, d\nu(x). \end{aligned} $$ It follows that $$ \mu_{\Theta \mid X}(A \mid x) \int_\Omega f_{X \mid \Theta}(x \mid t) \, d\mu_\Theta(t) = \int_A f_{X \mid \Theta}(x \mid \theta) \, d\mu_\Theta(\theta) $$ for all $A \in \tau$ and $\nu$-a.e. $x \in \mathcal{X}$, and hence $$ \mu_{\Theta \mid X}(A \mid x) = \int_A \frac{f_{X \mid \Theta}(x \mid \theta)}{\int_\Omega f_{X \mid \Theta}(x \mid t) \, d\mu_\Theta(t)} \, d\mu_\Theta(\theta) $$ for all $A \in \tau$ and $\mu_X$-a.e. $x \in \mathcal{X}$. Thus, for $\mu_X$-a.e. $x \in \mathcal{X}$, $\mu_{\Theta\mid X}(\cdot \mid x) \ll \mu_\Theta$, and the Radon-Nikodym derivative is $$ \frac{d\mu_{\Theta \mid X}}{d \mu_\Theta}(\theta \mid x) = \frac{f_{X \mid \Theta}(x \mid \theta)}{\int_\Omega f_{X \mid \Theta}(x \mid t) \, d\mu_\Theta(t)}, $$ as claimed, completing the proof.
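
The conclusion can also be checked numerically in the normal-normal model from the sketch above (again my own illustration, scipy assumed): with $\mu_\Theta = N(0, 1)$ and $P_\theta = N(\theta, 1)$, the posterior is known in closed form to be $N(x/2, 1/2)$, and integrating the Radon-Nikodym derivative from the theorem against the prior reproduces its probabilities.

```python
import numpy as np
from scipy import stats, integrate

x, a = 1.3, 0.4                          # observed x; test set A = (-inf, a]

prior_pdf = stats.norm(0.0, 1.0).pdf     # Lebesgue density of mu_Theta
lik = lambda t: stats.norm.pdf(x, loc=t, scale=1.0)   # f_{X|Theta}(x | t)

# Evidence: integral of f_{X|Theta}(x|t) d(mu_Theta)(t).
evidence, _ = integrate.quad(lambda t: lik(t) * prior_pdf(t), -np.inf, np.inf)

# Posterior probability of A via the theorem:
# mu_{Theta|X}(A|x) = integral over A of [f_{X|Theta}(x|theta)/evidence] d(mu_Theta).
post_A, _ = integrate.quad(
    lambda t: lik(t) * prior_pdf(t) / evidence, -np.inf, a)

# Known conjugate posterior for this model: N(x/2, 1/2).
print(post_A, stats.norm.cdf(a, loc=x / 2.0, scale=np.sqrt(0.5)))  # agree
```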


Lastly, how do we reconcile the colloquial version of Bayes' theorem found so commonly in statistics/machine learning literature, namely, $$ \tag{2} \label{2} p(\theta \mid x) = \frac{p(\theta) p(x \mid \theta)}{p(x)}, $$ with \eqref{1}?

On the one hand, the left-hand-side of \eqref{2} is supposed to represent a density of the conditional distribution of $\Theta$ given $X$ with respect to some unspecified dominating measure on the parameter space. In fact, none of the dominating measures for the four different densities in \eqref{2} (all named $p$) are explicitly mentioned.

On the other hand, the left-hand-side of \eqref{1} is the density of the conditional distribution of $\Theta$ given $X$ with respect to the prior distribution.

If, in addition, the prior distribution $\mu_\Theta$ has a density $f_\Theta$ with respect to some (let's say $\sigma$-finite) measure $\lambda$ on the parameter space $\Omega$, then $\mu_{\Theta \mid X}(\cdot\mid x)$ is also absolutely continuous with respect to $\lambda$ for $\mu_X$-a.e. $x \in \mathcal{X}$, and if $f_{\Theta \mid X}$ represents a version of the Radon-Nikodym derivative $d\mu_{\Theta\mid X}/d\lambda$, then \eqref{1} yields $$ \begin{aligned} f_{\Theta \mid X}(\theta \mid x) &= \frac{d \mu_{\Theta \mid X}}{d\lambda}(\theta \mid x) \\ &= \frac{d \mu_{\Theta \mid X}}{d\mu_\Theta}(\theta \mid x) \frac{d \mu_{\Theta}}{d\lambda}(\theta) \\ &= \frac{d \mu_{\Theta \mid X}}{d\mu_\Theta}(\theta \mid x) f_\Theta(\theta) \\ &= \frac{f_\Theta(\theta) f_{X\mid \Theta}(x\mid \theta)}{\int_\Omega f_{X\mid\Theta}(x\mid t) \, d\mu_\Theta(t)} \\ &= \frac{f_\Theta(\theta) f_{X\mid \Theta}(x\mid \theta)}{\int_\Omega f_\Theta(t) f_{X\mid\Theta}(x\mid t) \, d\lambda(t)}. \end{aligned} $$ The translation between this new form and \eqref{2} is $$ \begin{aligned} p(\theta \mid x) &= f_{\Theta \mid X}(\theta \mid x) = \frac{d \mu_{\Theta \mid X}}{d\lambda}(\theta \mid x), &&\text{(posterior)}\\ p(\theta) &= f_\Theta(\theta) = \frac{d \mu_\Theta}{d\lambda}(\theta), &&\text{(prior)} \\ p(x \mid \theta) &= f_{X\mid\Theta}(x\mid\theta) = \frac{d P_\theta}{d\nu}(x), &&\text{(likelihood)} \\ p(x) &= \int_\Omega f_\Theta(t) f_{X\mid\Theta}(x\mid t) \, d\lambda(t). &&\text{(evidence)} \end{aligned} $$
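
Finally, a quick numerical illustration of this chain-rule factorization, reusing the beta-binomial sketch from near the top of the page (scipy assumed): here $\lambda$ is Lebesgue measure on $[0, 1]$ and $\nu$ is counting measure on $\{0, \dots, n\}$, and the closed-form Lebesgue posterior density equals the product of $d\mu_{\Theta\mid X}/d\mu_\Theta$ from \eqref{1} with the prior density $f_\Theta$.

```python
from scipy import stats, integrate

# Beta-binomial: lambda = Lebesgue on [0, 1], nu = counting measure on {0,...,n}.
a, b, n, x = 2.0, 2.0, 10, 7
f_prior = lambda t: stats.beta.pdf(t, a, b)     # f_Theta = d(mu_Theta)/d(lambda)
f_lik = lambda t: stats.binom.pmf(x, n, t)      # f_{X|Theta} = d(P_theta)/d(nu)

# Evidence p(x) = integral of f_Theta(t) f_{X|Theta}(x|t) d(lambda)(t).
p_x, _ = integrate.quad(lambda t: f_prior(t) * f_lik(t), 0.0, 1.0)

theta = 0.6
# Chain rule: f_{Theta|X} = [d(mu_{Theta|X})/d(mu_Theta)] * f_Theta,
# where the bracketed factor is f_{X|Theta}(x|theta) / p(x) by equation (1).
lhs = stats.beta.pdf(theta, a + x, b + n - x)   # exact conjugate posterior density
rhs = (f_lik(theta) / p_x) * f_prior(theta)
print(lhs, rhs)                                 # the two values agree
```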
