Bayes' theorem and the Radon-Nikodym derivative

bayes-theorem, conditional-probability, conditional-expectation, information-theory

I am working through an introduction to information theory, and there is a problem I am unable to resolve. I will use the author's notation.

As reported in [Kullback, *Information Theory and Statistics*], on a measurable space $(\Omega, \mathcal{F})$ carrying two equivalent probability measures $\mu_1$ and $\mu_2$, the author rewrites Bayes' theorem as $$\mathbb{P}(H_i|x)=\frac{\mathbb{P}(H_i)f_i(x)}{\mathbb{P}(H_1)f_1(x)+\mathbb{P}(H_2)f_2(x)},\qquad i=1,2$$ where $f_i(x)$ is the Radon-Nikodym derivative of the measure $\mu_i$ with respect to $\lambda$: $\mu_i(E)=\int_E f_i(x)\,d\lambda(x)$, where $\lambda$ is a probability measure equivalent to both $\mu_1$ and $\mu_2$.

I can't understand why the author rewrites the conditional probability $\mathbb{P}(x|H_i)$ as the Radon-Nikodym derivative $f_i(x)$.
I am pretty sure that the "abstract" Bayes formula (with conditional expectations) is involved, but actually I cannot understand why $f_i(x)=\mathbb{P}(x|H_i)$.

Many thanks.

Best Answer

The RN derivative is basically the same as a pdf (this is exactly true when $\lambda$ is the Lebesgue measure and $X$ is a continuous RV). The standard heuristic about continuous random variables is that since $P(X = x)$ is $0$ for every $x,$ an observation of $x$ should really be interpreted as $X \in x \pm \delta x$ for $\delta x \to 0.$ By definition of the pdf, $P(X \in x \pm \delta x) = f(x) \cdot (2\delta x) + o(\delta x),$ where $f$ is the pdf.
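This heuristic is easy to check numerically. Below is a minimal sketch (the choice of a standard normal for $X$ is my own, just for illustration): the probability that $X$ lands in $x \pm \delta x$, divided by the interval length $2\delta x$, is very close to the pdf at $x$.

```python
import math

# Standard normal pdf, and cdf via the error function
def pdf(x):
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def cdf(x):
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

x, dx = 1.3, 1e-5
# P(X in x +/- dx), computed exactly from the cdf ...
interval_prob = cdf(x + dx) - cdf(x - dx)
# ... is f(x) * (2*dx) + o(dx), so the ratio recovers the density
print(interval_prob / (2 * dx))  # close to pdf(1.3)
print(pdf(1.3))
```

The agreement improves as `dx` shrinks, which is exactly the content of $P(X \in x \pm \delta x) = f(x)\cdot(2\delta x) + o(\delta x)$.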

With this in hand, we can see that the above is just an instantiation of the usual Bayes' law - for any events $A$ and $B_1,B_2$ (all with non-zero probability), $$P(B_i|A) = \frac{P(A|B_i) P(B_i)}{ P(A|B_1) P(B_1) + P(A|B_2) P(B_2)}.$$

Indeed, the events $B_1$ and $B_2$ are just $H_1$ and $H_2$ above, and $X$ has the pdf $f_1$ under $H_1$ and $f_2$ under $H_2$. From the above heuristic, we should have $$P(H_i|X=x) = \lim_{\delta x \to 0}\frac{(f_i(x) (2\delta x) + o(\delta x))P(H_i)}{(f_1(x) (2\delta x) + o(\delta x))P(H_1) + (f_2(x) (2\delta x) + o(\delta x))P(H_2)} \\ = \lim_{\delta x \to 0} \frac{f_i(x) P(H_i) + o(1)}{f_1(x) P(H_1) + f_2(x) P(H_2) + o(1) } = \frac{f_i(x) P(H_i)}{f_1(x) P(H_1) + f_2(x) P(H_2)}$$
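The limit computation above can also be verified numerically. In this sketch (the two-Gaussian model and the prior values are my own assumptions, chosen only to make the check concrete), the posterior computed from shrinking interval probabilities matches the density form of Bayes' rule:

```python
import math

def pdf(x, mu):  # N(mu, 1) density
    return math.exp(-(x - mu) ** 2 / 2) / math.sqrt(2 * math.pi)

def cdf(x, mu):  # N(mu, 1) distribution function
    return 0.5 * (1 + math.erf((x - mu) / math.sqrt(2)))

p1, p2 = 0.3, 0.7     # priors P(H_1), P(H_2) -- illustrative values
mu1, mu2 = 0.0, 1.0   # assumed model: X ~ N(mu_i, 1) under H_i
x, dx = 0.4, 1e-6

def interval(mu):     # P(X in x +/- dx | H_i)
    return cdf(x + dx, mu) - cdf(x - dx, mu)

# Posterior via Bayes' law on the event {X in x +/- dx} ...
post_interval = p1 * interval(mu1) / (p1 * interval(mu1) + p2 * interval(mu2))
# ... agrees with the density version, since the (2*dx) factors cancel
post_density = p1 * pdf(x, mu1) / (p1 * pdf(x, mu1) + p2 * pdf(x, mu2))
print(post_interval, post_density)
```

The key point the code makes visible is the cancellation: the $2\delta x$ factor appears in every term of the numerator and denominator, so it drops out before the limit is even taken.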

The heuristics above can be made rigorous without much trouble, but doing so requires some machinery to be set up. For most applied contexts these details are truly irrelevant.

Pertaining specifically to the problem in your question: $P(x|H_i)$, or more precisely $P(X = x|H_i)$, is not interpreted as $f_i(x)$; rather, the ratio $P(X = x|H_1)/P(X = x|H_2)$ can formally be expressed as $f_1(x)/f_2(x)$. (Of course, this is a $0/0$ form, so it must be interpreted in some limiting sense. The appropriate sense is exactly the one expressed by the heuristic above.)
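That limiting sense can again be illustrated numerically. In this sketch (the two unit-variance Gaussians are my own assumed example), the ratio of interval probabilities converges to the ratio of densities as the interval shrinks, even though each individual probability tends to $0$:

```python
import math

def pdf(x, mu):  # N(mu, 1) density
    return math.exp(-(x - mu) ** 2 / 2) / math.sqrt(2 * math.pi)

def cdf(x, mu):  # N(mu, 1) distribution function
    return 0.5 * (1 + math.erf((x - mu) / math.sqrt(2)))

x = 0.4
f_ratio = pdf(x, 0.0) / pdf(x, 1.0)  # the target ratio f_1(x)/f_2(x)

for dx in (1e-1, 1e-3, 1e-6):
    # P(X in x +/- dx | H_1) and | H_2): each is O(dx), a 0/0 form in the limit,
    # but their ratio stabilizes at f_1(x)/f_2(x)
    num = cdf(x + dx, 0.0) - cdf(x - dx, 0.0)
    den = cdf(x + dx, 1.0) - cdf(x - dx, 1.0)
    print(dx, num / den)

print(f_ratio)
```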
