Donsker and Varadhan inequality proof without absolute continuity assumption

information theory, machine learning, measure theory, probability theory

I've been attempting to understand the proof of the Donsker-Varadhan dual form of the Kullback-Leibler divergence, defined by
$$
\operatorname{KL}(\mu \| \lambda)
= \begin{cases}
\int_X \log\left(\frac{d\mu}{d\lambda}\right) \, d\mu, & \text{if $\mu \ll \lambda$ and $\log\left(\frac{d\mu}{d\lambda}\right) \in L^1(\mu)$,} \\
\infty, & \text{otherwise.}
\end{cases}
$$

with Donsker-Varadhan dual form
$$
\operatorname{KL}(\mu \| \lambda)
= \sup_{\Phi \in \mathcal{C}} \left(\int_X \Phi \, d\mu - \log\int_X \exp(\Phi) \, d\lambda\right).
$$
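As a sanity check, the duality is easy to verify numerically in the discrete case (this is illustrative only, not a proof): when $\lambda$ has full support, the supremum is attained at $\Phi = \log(d\mu/d\lambda)$, and any other $\Phi$ gives a smaller value. The specific probability vectors below are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two probability vectors on a 3-point space; lam has full support,
# so mu << lam holds automatically.
mu = np.array([0.5, 0.3, 0.2])
lam = np.array([0.2, 0.3, 0.5])

# KL(mu || lam) computed directly from the definition.
kl = np.sum(mu * np.log(mu / lam))

def dv_objective(phi):
    """Donsker-Varadhan objective: integral of Phi d(mu) - log integral of e^Phi d(lam)."""
    return np.sum(phi * mu) - np.log(np.sum(np.exp(phi) * lam))

# The supremum is attained at Phi* = log(d mu / d lam).
phi_star = np.log(mu / lam)
assert np.isclose(dv_objective(phi_star), kl)

# Any other Phi gives a value no larger than KL (up to float tolerance).
for _ in range(1000):
    phi = rng.normal(size=3)
    assert dv_objective(phi) <= kl + 1e-12
```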

Many of the steps in the proof are helpfully outlined in *Reconciling Donsker-Varadhan definition of KL divergence with the "usual" definition*, and I can follow along readily.


However, a crucial first step is establishing that (for any function $\Phi$)
$$\tag{1}\label{ineq}
\operatorname{KL}(\mu\|\lambda)\ge \int \Phi \, d\mu-\log\int e^{\Phi}\,d\lambda,$$

said to be an immediate consequence of Jensen's inequality. I can prove this easily in the case when $\mu \ll \lambda$ and $\lambda \ll \mu$:

$$ \operatorname{KL}(\mu\|\lambda) - \int \Phi \, d\mu = \int \left[ -\log\left(\frac{e^{\Phi}}{d\mu / d\lambda}\right) \right] d\mu \ge -\log \int \frac{e^{\Phi}}{d\mu / d\lambda} \, d\mu = -\log\int\exp(\Phi)\,d\lambda.$$
However, this last step appears to rely crucially on the existence of $d\lambda/d\mu$, and hence on $\lambda \ll \mu$, which isn't assumed by the overall theorem. In the proofs I have been able to find in the machine learning literature, this assumption seems to be made implicitly, but I don't believe it is necessary, and it is quite restrictive.


My question is: how can we prove $\eqref{ineq}$ without assuming $\lambda \ll \mu$?
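For concreteness, here is a small numerical example (my own illustration, not a proof) of exactly the situation the question is about: $\mu \ll \lambda$ holds but $\lambda \ll \mu$ fails, since $\mu$ assigns no mass to one point, and yet the bound still holds for every $\Phi$.

```python
import numpy as np

# mu << lam holds, but lam << mu FAILS: mu puts no mass on the third point.
mu = np.array([0.5, 0.5, 0.0])
lam = np.array([1 / 3, 1 / 3, 1 / 3])

# KL computed on the support of mu (with the convention 0 * log 0 = 0).
support = mu > 0
kl = np.sum(mu[support] * np.log(mu[support] / lam[support]))

# The Donsker-Varadhan lower bound still holds for arbitrary Phi,
# even though d(lam)/d(mu) does not exist.
rng = np.random.default_rng(1)
for _ in range(1000):
    phi = rng.normal(size=3)
    bound = np.sum(phi * mu) - np.log(np.sum(np.exp(phi) * lam))
    assert bound <= kl + 1e-12
```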

Best Answer

Notice that if $D(\mu\|\lambda)$ is infinite then there's nothing to show, so I'll assume it is finite in what follows. (Here $D$ denotes $\operatorname{KL}$.)

First, take a bounded $\Phi$. Then $e^\Phi > 0$ everywhere, $\Phi$ is $\mu$-integrable, and $e^\Phi$ is $\lambda$-integrable. Consider the probability measure $\mathrm{d}\lambda' = \frac{e^\Phi}{Z} \mathrm{d}\lambda,$ where $Z = \int e^\Phi \,\mathrm{d}\lambda$. Notice that $\lambda \ll \lambda'$; since finiteness of $D(\mu\|\lambda)$ gives $\mu \ll \lambda$, it follows that $\mu \ll \lambda'$, with $ \frac{\mathrm{d}\mu}{\mathrm{d}\lambda'} = Ze^{-\Phi} \frac{\mathrm{d}\mu}{\mathrm{d}\lambda}.$

Now observe that \begin{align} D(\mu \|\lambda') &= \int \log\left(Z{e^{-\Phi}} \frac{\mathrm{d}\mu}{\mathrm{d}\lambda} \right) \mathrm{d}\mu \\ &= \log Z -\int \Phi \mathrm{d}\mu + D(\mu\|\lambda).\end{align} By Gibbs' inequality, $D(\mu \|\lambda') \ge 0,$ and so we conclude that $$ D(\mu\|\lambda) \ge \int\Phi \mathrm{d}\mu - \log Z = \int \Phi \mathrm{d}\mu - \log \int e^\Phi \mathrm{d}\lambda. $$
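In the discrete case, both the tilting identity $D(\mu\|\lambda') = \log Z - \int\Phi\,\mathrm{d}\mu + D(\mu\|\lambda)$ and the resulting bound can be checked numerically. A minimal sketch (the probability vectors and $\Phi$ are arbitrary choices):

```python
import numpy as np

mu = np.array([0.5, 0.3, 0.2])
lam = np.array([0.2, 0.3, 0.5])
phi = np.array([1.0, -0.5, 2.0])  # an arbitrary bounded Phi

def kl(p, q):
    """KL divergence between discrete distributions with full support."""
    return np.sum(p * np.log(p / q))

Z = np.sum(np.exp(phi) * lam)
lam_prime = np.exp(phi) * lam / Z  # tilted measure d(lam') = (e^Phi / Z) d(lam)

# The identity D(mu || lam') = log Z - int Phi d(mu) + D(mu || lam):
lhs = kl(mu, lam_prime)
rhs = np.log(Z) - np.sum(phi * mu) + kl(mu, lam)
assert np.isclose(lhs, rhs)

# Gibbs' inequality D(mu || lam') >= 0 then yields the DV lower bound:
assert kl(mu, lam) >= np.sum(phi * mu) - np.log(Z)
```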


This argument generalises directly to functions $\Phi$ that are bounded below $\lambda$-a.s., since boundedness below is what ensures $\lambda \ll \lambda'$. (If $\int e^\Phi \,\mathrm{d}\lambda = \infty$, the right-hand side of the bound is $-\infty$ and there is nothing to prove.)

For a $\Phi$ that is unbounded from below, approximate it by a decreasing sequence of functions $\Phi_n$, each bounded below, converging pointwise to $\Phi$ and such that $e^{\Phi_1}$ is $\lambda$-integrable. For instance, we can take $\Phi_n = \max(\Phi, 1-n)$, so that $\Phi_1 = \max(0, \Phi)$, in which case $$ \int e^{\Phi_1} \,\mathrm{d}\lambda = \lambda(\Phi \le 0) + \int e^{\Phi} \mathbf{1}\{\Phi > 0\} \,\mathrm{d}\lambda \le 1 + \int e^{\Phi} \,\mathrm{d}\lambda.$$

Now, by monotone convergence (applied to the increasing sequence $e^{\Phi_1} - e^{\Phi_n}$), $\int e^{\Phi_n} \,\mathrm{d}\lambda \to \int e^{\Phi} \,\mathrm{d}\lambda$. And of course, since $\Phi \le \Phi_n$, $\int \Phi \,\mathrm{d}\mu \le \int \Phi_n \,\mathrm{d}\mu.$ But then, for every $n$, \begin{align} D(\mu \|\lambda) &\ge \int \Phi_n \,\mathrm{d}\mu - \log \int e^{\Phi_n} \,\mathrm{d}\lambda\\ &\ge \int \Phi \,\mathrm{d}\mu - \log\int e^{\Phi_n} \,\mathrm{d}\lambda, \end{align} and the conclusion follows on taking limits (using the continuity of $\log$).
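The truncation step can likewise be illustrated numerically. In the sketch below (an illustration, with an arbitrarily chosen grid size and densities), $\lambda$ is a grid approximation of the uniform measure on $(0,1]$, $d\mu/d\lambda \propto x$, and $\Phi(x) = \log x$ is unbounded below; the truncations $\Phi_n = \max(\Phi, 1-n)$ each give a valid lower bound on $D(\mu\|\lambda)$, and $\int e^{\Phi_n}\,\mathrm{d}\lambda$ decreases toward $\int e^{\Phi}\,\mathrm{d}\lambda$.

```python
import numpy as np

# Grid approximation of the uniform measure lambda on (0, 1]
# (grid size is an arbitrary choice; x = 0 is avoided).
N = 100_000
x = np.arange(1, N + 1) / (N + 1)
lam_w = np.full_like(x, 1.0 / N)     # lambda weights
mu_w = 2 * x * lam_w                 # d(mu)/d(lambda) proportional to x
mu_w /= mu_w.sum()

phi = np.log(x)                      # unbounded below as x -> 0
kl = np.sum(mu_w * np.log(mu_w / lam_w))

# Truncations Phi_n = max(Phi, 1 - n): bounded below, decreasing to Phi,
# and each one satisfies the DV lower bound.
for n in [1, 2, 5, 10, 20]:
    phi_n = np.maximum(phi, 1 - n)
    bound_n = np.sum(phi_n * mu_w) - np.log(np.sum(np.exp(phi_n) * lam_w))
    assert kl >= bound_n - 1e-12

# Integrals of e^{Phi_n} decrease as n grows (monotone convergence):
ints = [np.sum(np.exp(np.maximum(phi, 1 - n)) * lam_w) for n in [1, 5, 50]]
assert ints[0] >= ints[1] >= ints[2]
```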