Probability Theory – Why Relative Entropy Decreases Under Pushforward

entropy, measure-theory, probability-theory

I am reading the paper at https://arxiv.org/abs/1006.3028 (J. Lehec, "Representation formula for the entropy and functional inequalities"). The main concept here is the relative entropy of a probability measure $\mu$ with respect to a probability measure $\gamma$, defined as
$$H(\mu | \gamma)=\int \log\left( \frac{d\mu}{d\gamma}\right) d\mu, $$
or $+\infty$ if $\mu$ is not absolutely continuous with respect to $\gamma$ (that is, if the density $\frac{d\mu}{d\gamma}$ does not exist). This is also known as the Kullback–Leibler divergence.
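For concreteness, in the finite (discrete) case the definition reduces to the sum $\sum_i \mu_i \log(\mu_i/\gamma_i)$. Here is a minimal numerical sketch of that special case, with helper names of my own choosing:

```python
import numpy as np

def kl_divergence(mu, gamma):
    """H(mu | gamma) for discrete probability vectors.

    Returns +inf when mu is not absolutely continuous w.r.t. gamma,
    i.e. when mu puts mass on a point where gamma does not.
    """
    mu, gamma = np.asarray(mu, float), np.asarray(gamma, float)
    if np.any((gamma == 0) & (mu > 0)):
        return np.inf
    s = mu > 0                      # convention: 0 * log 0 = 0
    return float(np.sum(mu[s] * np.log(mu[s] / gamma[s])))

print(kl_divergence([0.5, 0.3, 0.2], [0.25, 0.25, 0.5]))    # strictly positive
print(kl_divergence([0.25, 0.25, 0.5], [0.25, 0.25, 0.5]))  # 0, since mu = gamma
```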


Remark on sign conventions. This definition seems to be the more common one in information theory. With this definition, $H(\mu| \gamma)$ is a nonnegative convex function of $\mu$. The common physicists' definition, on the other hand, has the opposite sign; with it, the relative entropy is a nonpositive concave function of $\mu$.


The first inequality in the second section reads
$$\tag{1}
H(\mu\circ T^{-1} | \gamma\circ T^{-1})\le H(\mu | \gamma)$$

for all measurable maps $T$.
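In the finite setting the inequality is easy to test empirically: a measurable map $T$ is just any function between finite sets, and the pushforward sums the mass over each fiber $T^{-1}(\{j\})$. The following randomized check is only an illustration (not a proof), with hypothetical helper names:

```python
import numpy as np

rng = np.random.default_rng(0)

def kl(mu, gamma):
    s = mu > 0
    return float(np.sum(mu[s] * np.log(mu[s] / gamma[s])))

def pushforward(p, T, m):
    """Image measure p o T^{-1} on {0,...,m-1}: sum the mass of each fiber T^{-1}({j})."""
    q = np.zeros(m)
    np.add.at(q, T, p)
    return q

n, m = 10, 3
for _ in range(1000):
    mu = rng.dirichlet(np.ones(n))
    gamma = rng.dirichlet(np.ones(n))
    T = rng.integers(0, m, size=n)   # an arbitrary map {0,...,n-1} -> {0,...,m-1}
    assert kl(pushforward(mu, T, m), pushforward(gamma, T, m)) <= kl(mu, gamma) + 1e-12
print("the pushforward never increased the relative entropy")
```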

Main question. What is the fastest proof of (1)?

Following the references in the paper I actually found a proof. In the book "Large Deviations and Applications" by Varadhan (reference [24], Section 10) I see that the relative entropy can be characterized as
$$
H(\mu|\gamma)=\inf\left\{ c\,:\, \int F\, d\mu \le c + \log \int e^F\, d\gamma,\ \forall F \text{ bounded and measurable}\right\}.$$

Using this characterization, (1) follows. I wonder if there is a way to avoid the characterization, though.
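In the finite case the characterization is also easy to see numerically: for every bounded $F$ one has $\int F\,d\mu - \log\int e^F\,d\gamma \le H(\mu|\gamma)$, with equality at $F=\log\frac{d\mu}{d\gamma}$, so the infimum of admissible $c$ is exactly $H(\mu|\gamma)$. A small illustrative sketch (the helper names are mine):

```python
import numpy as np

rng = np.random.default_rng(1)
mu, gamma = rng.dirichlet(np.ones(5)), rng.dirichlet(np.ones(5))
H = float(np.sum(mu * np.log(mu / gamma)))

def dv_value(F, mu, gamma):
    """int F dmu - log int e^F dgamma; never exceeds H(mu | gamma)."""
    return float(F @ mu - np.log(np.exp(F) @ gamma))

# random bounded F stay below H(mu | gamma) ...
assert all(dv_value(rng.normal(size=5), mu, gamma) <= H + 1e-12 for _ in range(1000))
# ... and F = log(dmu/dgamma) attains it, so inf{c} = H(mu | gamma)
print(H, dv_value(np.log(mu / gamma), mu, gamma))
```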


NOTE. The characterization is an immediate consequence of the convex duality described in this question, which is an application of Jensen's inequality.


Secondary question. The word "entropy" makes me think of the second law of thermodynamics, and it suggests some quantity that is monotonic in time. Now, the map $\mu\mapsto \mu\circ T^{-1}$ can be interpreted as a step in time for the discrete dynamical system $x\mapsto T(x)$. Can (1) be seen as a version of the second law of thermodynamics for such discrete systems?

Best Answer

Suppose $\mu$ and $\gamma$ are probability measures on $(X,\mathscr{F})$, $\mu\ll\gamma$, and $T:(X,\mathscr{F})\rightarrow(Y,\mathscr{G})$ measurable.

Then of course $\mu\circ T^{-1}\ll\gamma\circ T^{-1}$: if $\gamma\circ T^{-1}(A)=\gamma(T^{-1}(A))=0$, then $\mu\circ T^{-1}(A)=\mu(T^{-1}(A))=0$.

Claim:
$$ \mathbb{E}_\gamma\Big[\frac{d\mu}{d\gamma}\big|\sigma(T)\Big]=\frac{d(\mu\circ T^{-1})}{d(\gamma\circ T^{-1})}\circ T$$

Let $h:(Y,\mathscr{G})\rightarrow(\mathbb{R},\mathscr{B}(\mathbb{R}))$ be a measurable function such that $\mathbb{E}_\gamma\Big[\frac{d\mu}{d\gamma}\big|\sigma(T)\Big]=h\circ T$ (any function $\phi$ that is measurable with respect to $\sigma(T)$ admits a representation of the form $\phi=h_\phi\circ T$ for some measurable function $h_\phi$ on $Y$). Then, for any $B\in\mathscr{G}$, $$\begin{align} \int_Y \mathbb{1}_B\,\frac{d(\mu\circ T^{-1})}{d(\gamma\circ T^{-1})}\, d(\gamma\circ T^{-1})&=\int_Y \mathbb{1}_B\,d(\mu\circ T^{-1})=\int_X \mathbb{1}_B\circ T\,d\mu\\ &=\int_X\mathbb{1}_{T^{-1}(B)}\frac{d\mu}{d\gamma}\,d\gamma=\int_X\big(\mathbb{1}_{B}\circ T \big)\,\mathbb{E}_\gamma\Big[\frac{d\mu}{d\gamma}\big|\sigma(T)\Big]\,d\gamma\\ &=\int_X \big(\mathbb{1}_B\circ T\big)\, h\circ T\,d\gamma =\int_Y\mathbb{1}_B\,h\,d(\gamma\circ T^{-1}), \end{align} $$ where the second equality on the second line is the defining property of conditional expectation, since $\mathbb{1}_B\circ T=\mathbb{1}_{T^{-1}(B)}$ is $\sigma(T)$-measurable. This proves that $\frac{d(\mu\circ T^{-1})}{d(\gamma\circ T^{-1})}=h$ holds $(\gamma\circ T^{-1})$-almost surely, and hence $\frac{d(\mu\circ T^{-1})}{d(\gamma\circ T^{-1})}\circ T=\mathbb{E}_\gamma\big[\frac{d\mu}{d\gamma}\big|\sigma(T)\big]$.
$\Box$
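In the discrete case the Claim can be verified directly: conditioning on $\sigma(T)$ amounts to taking the $\gamma$-weighted average of $\frac{d\mu}{d\gamma}$ over each fiber of $T$, which equals $\frac{\mu(T^{-1}(\{j\}))}{\gamma(T^{-1}(\{j\}))}$. A small numerical sketch under that finite-state assumption (the variable names are mine, not from the argument above):

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 12, 4
mu, gamma = rng.dirichlet(np.ones(n)), rng.dirichlet(np.ones(n))
T = np.concatenate([np.arange(m), rng.integers(0, m, size=n - m)])  # surjective, so no empty fiber

density = mu / gamma                      # d(mu)/d(gamma), pointwise on {0,...,n-1}

# E_gamma[ d(mu)/d(gamma) | sigma(T) ]: gamma-weighted average of the density over each fiber
cond_exp = np.empty(n)
for j in range(m):
    fiber = (T == j)
    cond_exp[fiber] = (density[fiber] @ gamma[fiber]) / gamma[fiber].sum()

# d(mu o T^{-1}) / d(gamma o T^{-1}), composed with T
push_mu, push_gamma = np.zeros(m), np.zeros(m)
np.add.at(push_mu, T, mu)
np.add.at(push_gamma, T, gamma)
push_density_T = (push_mu / push_gamma)[T]

assert np.allclose(cond_exp, push_density_T)
print("claim verified on a random finite example")
```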

Let $\eta(x)=x \log(x)\mathbb{1}_{(0,\infty)}(x)$ on $[0,\infty)$. It is easy to check that $\eta$ is convex on $[0,\infty)$, and that for any pair of probability measures $\mu$, $\gamma$ with $\mu\ll\gamma$, $$H(\mu|\gamma):=\int_X\log\big(\frac{d\mu}{d\gamma}\big)\,d\mu=\int_X\log\big(\frac{d\mu}{d\gamma}\big)\,\frac{d\mu}{d\gamma}\,d\gamma=\int_X\eta\big(\frac{d\mu}{d\gamma}\big)\,d\gamma.$$ Finally, applying Jensen's inequality for conditional expectations yields

$$\begin{align} H(\mu\circ T^{-1}|\gamma\circ T^{-1})&=\int_Y\eta\left(\frac{d(\mu\circ T^{-1})}{d(\gamma\circ T^{-1})}\right)\,d(\gamma\circ T^{-1})\\ &=\int_X\eta\Big(\frac{d(\mu\circ T^{-1})}{d(\gamma\circ T^{-1})}\circ T\Big)\,d\gamma\\ &=\int_X\eta\Big(\mathbb{E}_\gamma\big[ \frac{d\mu}{d\gamma}\big|\sigma(T)\big]\Big)\,d\gamma\\ &\leq\int_X\mathbb{E}_\gamma\big[ \eta\big(\frac{d\mu}{d\gamma}\big)\big|\sigma(T)\big]\,d\gamma\\ &=\int_X\eta\big(\frac{d\mu}{d\gamma}\big)\,d\gamma=H(\mu|\gamma), \end{align}$$ which is the desired inequality. Here the second equality is the change-of-variables formula for image measures, the third uses the Claim, the inequality is conditional Jensen, and the last line follows from the tower property of conditional expectation.
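The same finite toy model makes the chain above concrete: $\int_X\eta\big(\mathbb{E}_\gamma[\frac{d\mu}{d\gamma}|\sigma(T)]\big)\,d\gamma$ equals $H(\mu\circ T^{-1}|\gamma\circ T^{-1})$, and conditional Jensen caps it by $\int_X\eta\big(\frac{d\mu}{d\gamma}\big)\,d\gamma=H(\mu|\gamma)$. An illustrative sketch under the same finite-state assumption:

```python
import numpy as np

rng = np.random.default_rng(3)
n, m = 12, 4
mu, gamma = rng.dirichlet(np.ones(n)), rng.dirichlet(np.ones(n))
T = np.concatenate([np.arange(m), rng.integers(0, m, size=n - m)])  # surjective map

eta = lambda x: x * np.log(x)             # safe here: all arguments are strictly positive
f = mu / gamma                            # d(mu)/d(gamma)

cond = np.empty(n)                        # E_gamma[ f | sigma(T) ]
for j in range(m):
    fib = (T == j)
    cond[fib] = (f[fib] @ gamma[fib]) / gamma[fib].sum()

lhs = float(eta(cond) @ gamma)            # = H(mu o T^{-1} | gamma o T^{-1})
rhs = float(eta(f) @ gamma)               # = H(mu | gamma)
assert lhs <= rhs + 1e-12
print(lhs, "<=", rhs)
```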
