I am trying to find a measure-theoretic formulation of Bayes' theorem. When used in statistical inference, Bayes' theorem is usually stated as:
$$p\left(\theta|x\right) = \frac{p\left(x|\theta\right) \cdot p\left(\theta\right)}{p\left(x\right)}$$
where:
- $p\left(\theta|x\right)$: the posterior density of the parameter.
- $p\left(x|\theta\right)$: the statistical model (or likelihood).
- $p\left(\theta\right)$: the prior density of the parameter.
- $p\left(x\right)$: the evidence.
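To make the four ingredients concrete before the measure theory starts, here is a minimal numerical sketch on a discretized parameter grid. The model is an assumption chosen for illustration: $x \mid \theta \sim \mathrm{Bernoulli}(\theta)$ with a uniform prior on $(0,1)$ and observed $x = 1$; all variable names are hypothetical.

```python
# Hypothetical example: Bayes' rule on a discretized parameter grid.
# Assumed model: x | theta ~ Bernoulli(theta), prior theta ~ Uniform(0, 1),
# observed x = 1. All densities are evaluated at grid midpoints.

n = 1000
thetas = [(i + 0.5) / n for i in range(n)]   # midpoints of a grid on (0, 1)
prior = [1.0 for _ in thetas]                # p(theta): uniform prior density
x = 1
likelihood = [t if x == 1 else 1 - t for t in thetas]  # p(x | theta)

# Evidence p(x) = integral of p(x | theta) p(theta) d theta  (midpoint rule)
evidence = sum(lik * pr for lik, pr in zip(likelihood, prior)) / n

# Posterior density p(theta | x) = p(x | theta) p(theta) / p(x)
posterior = [lik * pr / evidence for lik, pr in zip(likelihood, prior)]
```

Here the posterior comes out proportional to $\theta$ (a Beta(2, 1) density), and summing `posterior` over the grid confirms it integrates to 1, i.e. the evidence is exactly the normalizing constant.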
Now, how would we state Bayes' theorem in a measure-theoretic way?
So, I started by defining a probability space:
$$\left(\Theta, \mathcal{F}_\Theta, \mathbb{P}_\Theta\right)$$
such that $\theta \in \Theta$.
I then defined another probability space:
$$\left(X, \mathcal{F}_X, \mathbb{P}_X\right)$$
such that $x \in X$.
From here on I don't know what to do; the joint probability space would be:
$$\left(\Theta \times X, \mathcal{F}_\Theta \otimes \mathcal{F}_X, ?\right)$$
but I don't know what the measure should be.
Bayes' theorem should then read as follows:
$$? = \frac{? \cdot \mathbb{P}_\Theta}{\mathbb{P}_X}$$
where:
$$\mathbb{P}_X = \int_{\Theta} ? \, \mathrm{d}\mathbb{P}_\Theta$$
but as you can see, I don't know what the other measures are or on which probability spaces they live.
I stumbled upon this thread, but it was of little help, and I don't know how the following measure-theoretic generalization of Bayes' rule was reached:
$$P_{\Theta \mid y}(A) = \int_{x \in A} \frac{\mathrm{d}P_{\Omega \mid x}}{\mathrm{d}P_\Omega}(y) \, \mathrm{d}P_\Theta$$
I'm self-studying measure-theoretic probability and lack guidance, so excuse my ignorance.
Best Answer
One precise formulation of Bayes' theorem is the following, adapted from Schervish's Theory of Statistics (1995). In the notation set up below, it states that, for $\mu_X$-almost every $x \in \mathcal{X}$, the posterior distribution $\mu_{\Theta \mid X}(\cdot \mid x)$ is absolutely continuous with respect to the prior distribution $\mu_\Theta$, with Radon-Nikodym derivative $$ \tag{1} \label{1} \frac{d\mu_{\Theta \mid X}}{d\mu_\Theta}(\theta \mid x) = \frac{f_{X \mid \Theta}(x \mid \theta)}{\int_\Omega f_{X \mid \Theta}(x \mid t) \, d\mu_\Theta(t)}. $$
Edit 1. The setup for this theorem is as follows:
The parameter space is $\Omega$, equipped with a $\sigma$-field $\tau$, and the prior distribution of $\Theta$ is a probability measure $\mu_\Theta$ on $(\Omega, \tau)$. The sample space is $\mathcal{X}$, equipped with a $\sigma$-field $\mathcal{B}$.
There is a probability kernel $P : \Omega \times \mathcal{B} \to [0, 1]$, denoted $(\theta, B) \mapsto P_\theta(B)$, which represents the conditional distribution of $X$ given $\Theta$. This means that
- for each fixed $\theta \in \Omega$, the map $B \mapsto P_\theta(B)$ is a probability measure on $(\mathcal{X}, \mathcal{B})$, and
- for each fixed $B \in \mathcal{B}$, the map $\theta \mapsto P_\theta(B)$ is $\tau$-measurable.
This is the parametric family of distributions of $X$ given $\Theta$. Each $P_\theta$ is assumed to be absolutely continuous with respect to a single $\sigma$-finite measure $\nu$ on $(\mathcal{X}, \mathcal{B})$, with density $f_{X \mid \Theta}(x \mid \theta) = \frac{dP_\theta}{d\nu}(x)$, measurable in the pair $(x, \theta)$. The marginal distribution of $X$ is then $\mu_X(B) = \int_\Omega P_\theta(B) \, d\mu_\Theta(\theta)$ for $B \in \mathcal{B}$.
Edit 2. Given the setup above, the proof of Bayes' theorem is relatively straightforward.
Proof. Following Schervish, let $$ C_0 = \left\{x \in \mathcal{X} : \int_\Omega f_{X \mid \Theta}(x \mid t) \, d\mu_\Theta(t) = 0\right\} $$ and $$ C_\infty = \left\{x \in \mathcal{X} : \int_\Omega f_{X \mid \Theta}(x \mid t) \, d\mu_\Theta(t) = \infty\right\} $$ (these are the sets of potentially problematic $x$ values for the denominator of the right-hand-side of \eqref{1}). We have $$ \mu_X(C_0) = \Pr(X \in C_0) = \int_{C_0} \int_\Omega f_{X \mid \Theta}(x \mid t) \, d\mu_\Theta(t) \, d\nu(x) = 0, $$ and $$ \mu_X(C_\infty) = \int_{C_\infty} \int_\Omega f_{X \mid \Theta}(x \mid t) \, d\mu_\Theta(t) \, d\nu(x) = \begin{cases} \infty, & \text{if $\nu(C_\infty) > 0$,} \\ 0, & \text{if $\nu(C_\infty) = 0$.} \end{cases} $$ Since $\mu_X(C_\infty) = \infty$ is impossible ($\mu_X$ is a probability measure), it follows that $\nu(C_\infty) = 0$, whence $\mu_X(C_\infty) = 0$ as well. Thus, $\mu_X(C_0 \cup C_\infty) = 0$, so the set of all $x \in \mathcal{X}$ such that the denominator of the right-hand-side of \eqref{1} is zero or infinite has zero marginal probability.
Next, consider that, if $A \in \tau$ and $B \in \mathcal{B}$, then $$ \Pr(\Theta \in A, X \in B) = \int_B \int_A f_{X \mid \Theta}(x \mid \theta) \, d\mu_\Theta(\theta) \, d\nu(x) $$ and simultaneously $$ \begin{aligned} \Pr(\Theta \in A, X \in B) &= \int_B \mu_{\Theta \mid X}(A \mid x) \, d\mu_X(x) \\ &= \int_B \left( \mu_{\Theta \mid X}(A \mid x) \int_\Omega f_{X \mid \Theta}(x \mid t) \, d\mu_\Theta(t) \right) \, d\nu(x). \end{aligned} $$ It follows that $$ \mu_{\Theta \mid X}(A \mid x) \int_\Omega f_{X \mid \Theta}(x \mid t) \, d\mu_\Theta(t) = \int_A f_{X \mid \Theta}(x \mid \theta) \, d\mu_\Theta(\theta) $$ for all $A \in \tau$ and $\nu$-a.e. $x \in \mathcal{X}$, and hence $$ \mu_{\Theta \mid X}(A \mid x) = \int_A \frac{f_{X \mid \Theta}(x \mid \theta)}{\int_\Omega f_{X \mid \Theta}(x \mid t) \, d\mu_\Theta(t)} \, d\mu_\Theta(\theta) $$ for all $A \in \tau$ and $\mu_X$-a.e. $x \in \mathcal{X}$. Thus, for $\mu_X$-a.e. $x \in \mathcal{X}$, $\mu_{\Theta\mid X}(\cdot \mid x) \ll \mu_\Theta$, and the Radon-Nikodym derivative is $$ \frac{d\mu_{\Theta \mid X}}{d \mu_\Theta}(\theta \mid x) = \frac{f_{X \mid \Theta}(x \mid \theta)}{\int_\Omega f_{X \mid \Theta}(x \mid t) \, d\mu_\Theta(t)}, $$ as claimed, completing the proof.
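The Radon-Nikodym identity just proved can be checked numerically in a conjugate example. The setup below is an assumption for illustration: $x \mid \theta \sim \mathrm{Bernoulli}(\theta)$ with prior $\mu_\Theta = \mathrm{Beta}(2,2)$ and observed $x = 1$, so the posterior is $\mathrm{Beta}(3,2)$ and the derivative $d\mu_{\Theta\mid X}/d\mu_\Theta(\theta \mid 1)$ should equal the ratio of the posterior and prior densities, namely $2\theta$.

```python
from math import factorial

# Sanity check of the Radon-Nikodym formula for an assumed conjugate pair:
# x | theta ~ Bernoulli(theta), prior Beta(2, 2), observed x = 1,
# so the posterior is Beta(3, 2) and d mu_{Theta|X}/d mu_Theta should be 2*theta.

def beta_pdf(t, a, b):
    # Beta(a, b) density for integer a, b, via Gamma(n) = (n-1)!.
    const = factorial(a + b - 1) / (factorial(a - 1) * factorial(b - 1))
    return const * t ** (a - 1) * (1 - t) ** (b - 1)

n = 2000
grid = [(i + 0.5) / n for i in range(n)]   # midpoint grid on (0, 1)

# Denominator of the formula: integral of f(x=1 | t) d mu_Theta(t),
# i.e. the prior mean of theta under Beta(2, 2), which is 1/2.
evidence = sum(t * beta_pdf(t, 2, 2) for t in grid) / n

theta = 0.3
rn_formula = theta / evidence                              # f(x|theta) / evidence
rn_ratio = beta_pdf(theta, 3, 2) / beta_pdf(theta, 2, 2)   # posterior / prior density
```

Both `rn_formula` and `rn_ratio` agree (up to quadrature error) with $2\theta = 0.6$, matching the Radon-Nikodym derivative the proof identifies.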
Lastly, how do we reconcile the colloquial version of Bayes' theorem found so commonly in statistics/machine learning literature, namely, $$ \tag{2} \label{2} p(\theta \mid x) = \frac{p(\theta) p(x \mid \theta)}{p(x)}, $$ with \eqref{1}?
On the one hand, the left-hand-side of \eqref{2} is supposed to represent a density of the conditional distribution of $\Theta$ given $X$ with respect to some unspecified dominating measure on the parameter space. In fact, none of the dominating measures for the four different densities in \eqref{2} (all named $p$) are explicitly mentioned.
On the other hand, the left-hand-side of \eqref{1} is the density of the conditional distribution of $\Theta$ given $X$ with respect to the prior distribution.
If, in addition, the prior distribution $\mu_\Theta$ has a density $f_\Theta$ with respect to some (let's say $\sigma$-finite) measure $\lambda$ on the parameter space $\Omega$, then $\mu_{\Theta \mid X}(\cdot\mid x)$ is also absolutely continuous with respect to $\lambda$ for $\mu_X$-a.e. $x \in \mathcal{X}$, and if $f_{\Theta \mid X}$ represents a version of the Radon-Nikodym derivative $d\mu_{\Theta\mid X}/d\lambda$, then \eqref{1} yields $$ \begin{aligned} f_{\Theta \mid X}(\theta \mid x) &= \frac{d \mu_{\Theta \mid X}}{d\lambda}(\theta \mid x) \\ &= \frac{d \mu_{\Theta \mid X}}{d\mu_\Theta}(\theta \mid x) \frac{d \mu_{\Theta}}{d\lambda}(\theta) \\ &= \frac{d \mu_{\Theta \mid X}}{d\mu_\Theta}(\theta \mid x) f_\Theta(\theta) \\ &= \frac{f_\Theta(\theta) f_{X\mid \Theta}(x\mid \theta)}{\int_\Omega f_{X\mid\Theta}(x\mid t) \, d\mu_\Theta(t)} \\ &= \frac{f_\Theta(\theta) f_{X\mid \Theta}(x\mid \theta)}{\int_\Omega f_\Theta(t) f_{X\mid\Theta}(x\mid t) \, d\lambda(t)}. \end{aligned} $$ The translation between this new form and \eqref{2} is $$ \begin{aligned} p(\theta \mid x) &= f_{\Theta \mid X}(\theta \mid x) = \frac{d \mu_{\Theta \mid X}}{d\lambda}(\theta \mid x), &&\text{(posterior)}\\ p(\theta) &= f_\Theta(\theta) = \frac{d \mu_\Theta}{d\lambda}(\theta), &&\text{(prior)} \\ p(x \mid \theta) &= f_{X\mid\Theta}(x\mid\theta) = \frac{d P_\theta}{d\nu}(x), &&\text{(likelihood)} \\ p(x) &= \int_\Omega f_\Theta(t) f_{X\mid\Theta}(x\mid t) \, d\lambda(t). &&\text{(evidence)} \end{aligned} $$
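The role of the unspecified dominating measure $\lambda$ is easiest to see when it is counting measure on a finite parameter set: then every density in \eqref{2} is a probability mass function and the evidence integral becomes a sum. A minimal sketch, under the assumed two-point prior and Bernoulli model below (all values hypothetical):

```python
# Sketch: lambda = counting measure on a two-point parameter set.
# Assumed model: x | theta ~ Bernoulli(theta), prior puts mass 1/2 on each theta.
prior = {0.2: 0.5, 0.7: 0.5}                              # p(theta) = d mu_Theta / d lambda
x = 1
likelihood = {t: t if x == 1 else 1 - t for t in prior}   # p(x | theta) = d P_theta / d nu

# Evidence: the integral over lambda collapses to a finite sum.
evidence = sum(prior[t] * likelihood[t] for t in prior)   # p(x)

# Posterior pmf with respect to the same counting measure lambda.
posterior = {t: prior[t] * likelihood[t] / evidence for t in prior}
```

Observing $x = 1$ shifts mass toward the larger $\theta$: the posterior weights are $0.1/0.45 = 2/9$ and $0.35/0.45 = 7/9$, exactly the colloquial formula \eqref{2} with sums in place of integrals.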