Convexity of relative entropy for probability measures without densities

Tags: entropy, information-theory, probability-theory, renyi-entropy

I am trying to prove convexity of the relative entropy for general measures (without using densities wrt the Lebesgue measure, but just Radon–Nikodym derivatives).

Given two measures, $\mu$ and $\nu$, define the relative entropy in the usual way: if $\mu \ll \nu$ we have

$$\mathcal{H}(\mu\mid\mid\nu)=\int \log\left(\frac{d\mu}{d\nu}\right)\,d\mu$$

Otherwise, $\mathcal{H}(\mu \mid\mid \nu)=\infty$.
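(For concreteness, a special case not in the original question: if $\mu\ll\nu$ and both measures are supported on a countable set, then $\frac{d\mu}{d\nu}(x)=\mu(\{x\})/\nu(\{x\})$ wherever $\nu(\{x\})>0$, and the definition reduces to the familiar sum
$$\mathcal{H}(\mu\mid\mid\nu)=\sum_{x:\,\mu(\{x\})>0}\mu(\{x\})\,\log\frac{\mu(\{x\})}{\nu(\{x\})}.)$$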
Now, I'd like to show that for every $\alpha \in (0,1)$ and every four probability measures $\mu_1,\mu_2,\nu_1,\nu_2$ we have
$$\mathcal{H}(\alpha \mu_1+(1-\alpha)\mu_2 \mid\mid \alpha \nu_1+(1-\alpha)\nu_2) \leq \alpha \mathcal{H}(\mu_1\mid\mid\nu_1)+(1-\alpha)\mathcal{H}(\mu_2\mid\mid\nu_2)$$

with strict inequality if $\mu_i \ll \nu_i$ for every $i$ and $\alpha \mu_1+(1-\alpha)\mu_2 \ll \alpha \nu_1+(1-\alpha)\nu_2$.
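As a quick numerical sanity check of the inequality in the discrete case (where the Radon–Nikodym derivatives are just ratios of point masses), here is a small Python sketch; the particular distributions `mu1, mu2, nu1, nu2` are arbitrary illustrative choices, not taken from the question.

```python
import numpy as np

def rel_entropy(mu, nu):
    """H(mu || nu) for discrete distributions on a common finite support."""
    mu, nu = np.asarray(mu, dtype=float), np.asarray(nu, dtype=float)
    if np.any((nu == 0) & (mu > 0)):   # mu not absolutely continuous wrt nu
        return np.inf
    m = mu > 0                          # convention: 0 * log(0/q) = 0
    return float(np.sum(mu[m] * np.log(mu[m] / nu[m])))

# arbitrary illustrative distributions on a 3-point space (hypothetical choices)
mu1, mu2 = np.array([0.5, 0.3, 0.2]), np.array([0.2, 0.2, 0.6])
nu1, nu2 = np.array([0.4, 0.4, 0.2]), np.array([0.3, 0.3, 0.4])

for a in np.linspace(0.05, 0.95, 19):
    lhs = rel_entropy(a * mu1 + (1 - a) * mu2, a * nu1 + (1 - a) * nu2)
    rhs = a * rel_entropy(mu1, nu1) + (1 - a) * rel_entropy(mu2, nu2)
    assert lhs <= rhs + 1e-12, (a, lhs, rhs)

print("convexity inequality holds at all sampled alpha")
```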

I know this is a common and well-known result, but all the proofs I can find online assume that the measures admit a density wrt a common reference measure (e.g. the Lebesgue measure), while I'd like the proof to work without such an assumption.

A reference for this proof (and for other properties of the relative entropy in this general setting) would be deeply appreciated.

Thanks!

Best Answer

Fix probability measures $\mu_0,\mu_1,\nu_0,\nu_1$ on the same measurable space with $\mu_0\ll\nu_0$ and $\mu_1\ll\nu_1$, i.e. the interesting case. For $\alpha\in[0,1]$ let $\mu_\alpha=(1-\alpha)\mu_0+\alpha\mu_1$ and $\nu_\alpha=(1-\alpha)\nu_0+\alpha\nu_1$. Notice that $\mu_\alpha\ll\nu_\alpha$.
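Indeed, for $\alpha\in(0,1)$ (the endpoints being trivial), for any measurable set $A$,
$$\nu_\alpha(A)=0 \;\Longrightarrow\; \nu_0(A)=\nu_1(A)=0 \;\Longrightarrow\; \mu_0(A)=\mu_1(A)=0 \;\Longrightarrow\; \mu_\alpha(A)=0.$$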

For the following argument, I prefer to work with random variables, so let $B_\alpha\in\{0,1\}$ be Bernoulli with success probability $\alpha$. Let $X_\alpha$ given $B_\alpha$ have the law $\mu_{B_\alpha}$, and let $Y_\alpha$ given $B_\alpha$ have the law $\nu_{B_\alpha}$.

Using this machinery, I can simply write $\mathcal H(X_\alpha\|Y_\alpha)=\mathcal H((1-\alpha)\mu_0+\alpha\mu_1\,\|\,(1-\alpha)\nu_0+\alpha\nu_1)$. The right-hand side of the claimed inequality is the conditional relative entropy of $X_\alpha$ given $B_\alpha$ with respect to $Y_\alpha$ given $B_\alpha$. Moreover, since trivially $\mathcal H(B_\alpha\|B_\alpha)=0$, this conditional relative entropy equals the joint relative entropy; that is, $\mathcal H(B_\alpha,X_\alpha\|B_\alpha,Y_\alpha)=(1-\alpha)\mathcal H(X_0\|Y_0)+\alpha\mathcal H(X_1\|Y_1)$. This identity is verified by checking that the Radon–Nikodym derivative of the law of $(B_\alpha,X_\alpha)$ with respect to the law of $(B_\alpha,Y_\alpha)$ coincides with the $(X_0,Y_0)$-derivative on $\{B_\alpha=0\}$ and with the $(X_1,Y_1)$-derivative on $\{B_\alpha=1\}$.
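Spelled out (a sketch; I write $\mathrm{Law}(\cdot)$ for the distribution of a random element, notation not used above): on the product of $\{0,1\}$ with the underlying space,
$$\frac{d\,\mathrm{Law}(B_\alpha,X_\alpha)}{d\,\mathrm{Law}(B_\alpha,Y_\alpha)}(i,x)=\frac{d\mu_i}{d\nu_i}(x),\qquad i\in\{0,1\},$$
and integrating the logarithm of this derivative against $\mathrm{Law}(B_\alpha,X_\alpha)$ gives
$$\mathcal H(B_\alpha,X_\alpha\,\|\,B_\alpha,Y_\alpha)=(1-\alpha)\int\log\frac{d\mu_0}{d\nu_0}\,d\mu_0+\alpha\int\log\frac{d\mu_1}{d\nu_1}\,d\mu_1=(1-\alpha)\mathcal H(\mu_0\|\nu_0)+\alpha\mathcal H(\mu_1\|\nu_1).$$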

Now, let us introduce some notation for the conditional relative entropy, say $\mathcal H(X_\alpha|B_\alpha\,\|\,Y_\alpha|B_\alpha)$. We just discussed that, by the chain rule for the relative entropy, $\mathcal H(B_\alpha,X_\alpha\|B_\alpha,Y_\alpha)=\mathcal H(B_\alpha\|B_\alpha)+\mathcal H(X_\alpha|B_\alpha\,\|\,Y_\alpha|B_\alpha)=\mathcal H(X_\alpha|B_\alpha\,\|\,Y_\alpha|B_\alpha)$. The chain rule goes both ways, so we also have $\mathcal H(B_\alpha,X_\alpha\|B_\alpha,Y_\alpha)=\mathcal H(X_\alpha\|Y_\alpha)+\mathcal H(B_\alpha|X_\alpha\,\|\,B_\alpha|Y_\alpha)$. Hence the convexity of the relative entropy is just a special case of the chain rule, combined with the non-negativity of the conditional relative entropy $\mathcal H(B_\alpha|X_\alpha\,\|\,B_\alpha|Y_\alpha)\ge 0$.
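Collecting the steps above, the whole argument fits in one line:
$$(1-\alpha)\mathcal H(\mu_0\|\nu_0)+\alpha\mathcal H(\mu_1\|\nu_1)=\mathcal H(B_\alpha,X_\alpha\|B_\alpha,Y_\alpha)=\mathcal H(X_\alpha\|Y_\alpha)+\mathcal H(B_\alpha|X_\alpha\,\|\,B_\alpha|Y_\alpha)\;\ge\;\mathcal H(\mu_\alpha\|\nu_\alpha).$$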

Unfortunately, the conditional relative entropy is usually only defined for finite supports, via kernels, or under other restrictions, since it seems that we need access to some kind of conditional distribution. However, this is not the case: the conditional relative entropy can be defined in general, as discussed here, by taking the chain rule as its definition. Non-negativity is also shown here, which amounts to a straightforward application of Jensen's inequality.
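For reference, the Jensen argument in the unconditional case reads as follows: with $\varphi(t)=t\log t$ convex and probability measures $\rho\ll\sigma$,
$$\mathcal H(\rho\,\|\,\sigma)=\int \varphi\!\left(\frac{d\rho}{d\sigma}\right)d\sigma \;\ge\; \varphi\!\left(\int \frac{d\rho}{d\sigma}\,d\sigma\right)=\varphi(1)=0.$$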
