Bayes’ Theorem Inequality – Optimality and Variational Inequalities

Tags: entropy, it.information-theory, pr.probability, st.statistics, variational-inequalities

$\DeclareMathOperator\Ent{Ent}\newcommand{\prior}{\mathrm{prior}}\newcommand\Data{\mathrm{Data}}$I came across this paper on the optimality of Bayes' theorem
https://sinews.siam.org/Portals/Sinews2/Issue%20Pdfs/sn_July-August2021.pdf
and could not figure out where one inequality comes from.

Denote by $m$ the parameter to estimate and by $D_{KL}$ the Kullback–Leibler (KL) divergence.

One statement near the end is that, by "a duality formulation of variational inference" in a cited reference, the following inequality holds
$$
-\log E_{\pi_{\prior}}[e^{f(m)}] \leq E_\rho[f(m)] + D_{KL}(\rho \mid \pi_{\prior}) \;.
$$

The inequality is stated under mild conditions on $\pi_{\prior}(m)$ and $\rho(m)$ and for a large class of $f(m)$.
(When $f(m) = \log \pi_\mathrm{like}(\Data\mid m)$, equality holds only when $\rho$ is the Bayes posterior distribution.)

Q: Where does the inequality come from? In the reference http://www.cmap.polytechnique.fr/~merlet/articles/probas_massart_stf03.pdf ,
Section 2.3.1 (Duality and variational formulas) gives
$$
\Ent_P[Y] = \sup \{ E_P[UY] : U:\Omega\rightarrow \bar{\mathbb{R}},\, E_P[e^U] = 1 \} \;.
$$

I still don't see how this helps with deriving the inequality.
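For concreteness, here is a small numerical sanity check of the duality formula on a finite sample space (only a sketch: the measure $P$, the nonnegative $Y$, and the candidate functions $U$ below are made up for illustration):

```python
import numpy as np

# Numerical check of Ent_P[Y] = sup { E_P[UY] : E_P[e^U] = 1 } on a finite space.
# P, Y, and the candidate U's below are made up for illustration.
rng = np.random.default_rng(0)
p = np.array([0.2, 0.5, 0.3])    # the measure P
Y = np.array([1.5, 0.4, 2.0])    # a nonnegative Y

# Ent_P[Y] = E_P[Y log Y] - E_P[Y] log E_P[Y]
EY = p @ Y
ent = p @ (Y * np.log(Y)) - EY * np.log(EY)

# The supremum is attained at U* = log(Y / E_P[Y]), which satisfies E_P[e^{U*}] = 1.
U_star = np.log(Y / EY)
assert np.isclose(p @ np.exp(U_star), 1.0)
assert np.isclose(p @ (U_star * Y), ent)

# Any other U normalized so that E_P[e^U] = 1 gives E_P[UY] <= Ent_P[Y].
for _ in range(1000):
    U = rng.normal(size=3)
    U -= np.log(p @ np.exp(U))   # enforce E_P[e^U] = 1
    assert p @ (U * Y) <= ent + 1e-12
```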

Best Answer

First, let us do some cleaning here.

  1. Let $\pi:=\pi_{\mathrm{prior}}$, $Y:=\rho/\pi$, and $F:=f(m)$.
  2. You copied the inequality in question incorrectly. In your first-linked paper, the inequality is $-\ln E_\pi e^F\le E_\rho(-F) + D_{KL}(\rho\mid \pi)$, that is, $$E_\rho F-\ln E_\pi e^F\le D_{KL}(\rho\mid\pi). \tag{0}$$
  3. Your first-linked paper uses some facts from your second-linked book. However, the definition of "the Kullback–Leibler information" in that book is non-standard: it is $$K(\rho,\pi):=E_\pi Y\ln Y-E_\pi Y\,E_\pi\ln Y=E_\pi(Y\ln Y-\ln Y) =E_\rho\ln\frac\rho\pi+E_\pi\ln\frac\pi\rho,$$ which is the symmetrized version of the standard Kullback–Leibler (KL) information $D_{KL}(\rho\parallel\pi)=E_\rho\ln\frac\rho\pi=E_\pi Y\ln Y$ (your first-linked paper uses the standard definition of the KL information). To avoid further confusion, we are not going to use the book in the rest of this answer.

Thus, the corrected version of your inequality in question is this: $$E_\rho F-\ln E_\pi e^F\le E_\pi Y\ln Y$$ or, equivalently, $$E_\pi YF-\ln E_\pi e^F\le E_\pi Y\ln Y. \tag{1}$$
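Before proving (1), here is a quick numerical sanity check of it on a small finite sample space (only a sketch: the size $n$, the measures $\pi$, $\rho$, and the function $F$ below are made up for illustration):

```python
import numpy as np

# Numerical sanity check of (1) on a finite sample space.
# n, pi, rho, F below are made up for illustration.
rng = np.random.default_rng(1)
n = 5

for _ in range(1000):
    pi = rng.random(n) + 0.1
    pi /= pi.sum()                      # "prior" pi
    rho = rng.random(n) + 0.1
    rho /= rho.sum()                    # candidate rho
    F = rng.normal(scale=3.0, size=n)   # arbitrary bounded F

    Y = rho / pi                        # density of rho w.r.t. pi, so E_pi[Y] = 1
    lhs = pi @ (Y * F) - np.log(pi @ np.exp(F))   # E_pi[Y F] - ln E_pi[e^F]
    rhs = pi @ (Y * np.log(Y))                    # E_pi[Y ln Y] = D_KL(rho || pi)
    assert lhs <= rhs + 1e-10
```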

Since $E_\pi Y=1$, the left-hand side of (1) will not change if $F$ is replaced by $F-c$, for any real constant $c$. So, taking $c=\ln E_\pi e^F$, we may assume without loss of generality that $E_\pi e^F=1$, and then (1) becomes $$E_\pi YF\le E_\pi Y\ln Y. \tag{2}$$ By convexity,
$$e^F\ge e^u+(F-u)e^u$$ for all real $F$ and $u$ (the tangent line to the exponential at $u$). Substituting $u=\ln Y$ gives $e^F\ge Y+(F-\ln Y)Y$; taking expectations with respect to $\pi$, we get $$E_\pi Y=1=E_\pi e^F\ge E_\pi Y+E_\pi (F-\ln Y)Y=1+E_\pi(F-\ln Y)Y,$$ so that (2) follows.
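The two steps above (the shift invariance of the left-hand side of (1) and the pointwise convexity bound) can also be checked numerically; again only a sketch, with made-up $\pi$, $\rho$, $F$:

```python
import numpy as np

# Checks of the two steps above: shift invariance of the LHS of (1),
# and the pointwise convexity bound. pi, rho, F are made up for illustration.
rng = np.random.default_rng(2)
n = 5
pi = rng.random(n) + 0.1
pi /= pi.sum()
rho = rng.random(n) + 0.1
rho /= rho.sum()
Y = rho / pi
F = rng.normal(size=n)

def lhs(G):
    # left-hand side of (1): E_pi[Y G] - ln E_pi[e^G]
    return pi @ (Y * G) - np.log(pi @ np.exp(G))

# Shifting F by a constant leaves the LHS of (1) unchanged (since E_pi[Y] = 1) ...
assert np.isclose(lhs(F), lhs(F - 2.7))

# ... so we may normalize F (take c = ln E_pi[e^F]) to get E_pi[e^F] = 1.
F = F - np.log(pi @ np.exp(F))
assert np.isclose(pi @ np.exp(F), 1.0)

# Pointwise convexity bound e^F >= e^u + (F - u) e^u at u = ln Y:
assert np.all(np.exp(F) >= Y + (F - np.log(Y)) * Y - 1e-12)

# Taking E_pi gives 1 >= 1 + E_pi[(F - ln Y) Y], i.e. (2): E_pi[Y F] <= E_pi[Y ln Y].
assert pi @ (Y * F) <= pi @ (Y * np.log(Y)) + 1e-12
```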

So, (0) holds, as desired.
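As a side note on the remark in the question that equality is attained at the Bayes posterior: when $\rho\propto \pi e^F$ (the Gibbs measure, i.e. the Bayes posterior when $F$ is the log-likelihood), the two sides of (0) coincide. A small numerical illustration, with made-up $\pi$ and $F$:

```python
import numpy as np

# Equality check: with rho proportional to pi * e^F (the Gibbs measure, i.e. the
# Bayes posterior when F is the log-likelihood), both sides of (0) coincide.
# pi and F below are made up for illustration.
rng = np.random.default_rng(3)
n = 5
pi = rng.random(n) + 0.1
pi /= pi.sum()
F = rng.normal(size=n)                   # stand-in for log pi_like(Data | m)

rho = pi * np.exp(F)
rho /= rho.sum()                         # rho = pi e^F / E_pi[e^F]

lhs = rho @ F - np.log(pi @ np.exp(F))   # E_rho[F] - ln E_pi[e^F]
rhs = rho @ np.log(rho / pi)             # D_KL(rho || pi)
assert np.isclose(lhs, rhs)              # equality in (0)
```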

Of course, (0) will continue to hold with $K(\rho,\pi)$ in place of $D_{KL}(\rho\mid\pi)$, because the symmetrized KL information satisfies $K(\rho,\pi)=D_{KL}(\rho\mid\pi)+D_{KL}(\pi\mid\rho)\ge D_{KL}(\rho\mid\pi)$.
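(A one-line numerical confirmation of that comparison, with made-up $\pi$ and $\rho$:)

```python
import numpy as np

# K(rho, pi) = D_KL(rho || pi) + D_KL(pi || rho) >= D_KL(rho || pi).
# pi and rho below are made up for illustration.
rng = np.random.default_rng(4)
pi = rng.random(5) + 0.1
pi /= pi.sum()
rho = rng.random(5) + 0.1
rho /= rho.sum()

d_kl = rho @ np.log(rho / pi)          # standard D_KL(rho || pi)
k_sym = d_kl + pi @ np.log(pi / rho)   # symmetrized K(rho, pi)
assert k_sym >= d_kl                   # since D_KL(pi || rho) >= 0
```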
