[Math] Understanding Gibbs’s inequality

ca.classical-analysis-and-odes inequalities it.information-theory real-analysis

Short version

Gibbs's inequality is a simple inequality for real numbers, usually
understood information-theoretically. In the jargon, it states that
for two probability measures on a finite set, the relative entropy is
always nonnegative.

I'd like to hear about non-information-theoretic ways of understanding it.
I'd be particularly pleased if there were some nice geometric
interpretation.

Statement and proof of Gibbs's inequality

For natural numbers $n$, let $\mathbf{P}_n$ denote the set of probability
measures on an $n$-element set: that is,
$$
\mathbf{P}_n = \{ p \in \mathbb{R}^n : p_1, \ldots, p_n \geq 0, \sum p_i =
1 \}.
$$
Theorem (Gibbs). Let $p \in \mathbf{P}_n$. Then, for $q$
varying in $\mathbf{P}_n$, the quantity $\prod q_i^{p_i}$ is maximized by
$q = p$.

Usually this is stated in logarithmic form: $-\sum p_i \log q_i \geq -\sum
p_i \log p_i$ for all $p, q \in \mathbf{P}_n$. But I'd like to reach a direct understanding of the product form.
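
For readers who want to poke at the product form numerically, here is a minimal sketch. It assumes strictly positive probabilities, and the helper names `weighted_product` and `random_distribution` are made up for illustration. It fixes a random $p$, samples many $q$, and counts how often $\prod q_i^{p_i}$ exceeds its value at $q = p$.

```python
import math
import random

def weighted_product(p, q):
    """Return prod_i q_i ** p_i, the product form of the quantity in the theorem."""
    return math.prod(qi ** pi for pi, qi in zip(p, q))

def random_distribution(n, rng):
    """Draw a strictly positive probability vector of length n (hypothetical helper)."""
    weights = [rng.random() + 1e-9 for _ in range(n)]
    total = sum(weights)
    return [w / total for w in weights]

rng = random.Random(0)
p = random_distribution(4, rng)
best_at_p = weighted_product(p, p)

# Count sampled q whose product beats the value at q = p (beyond rounding error).
violations = sum(
    weighted_product(p, random_distribution(4, rng)) > best_at_p + 1e-12
    for _ in range(10_000)
)
print(violations)
```

A random search like this is only a sanity check, not a proof; it just makes the claim "maximized by $q = p$" concrete.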

There are at least two extremely easy proofs. Ignoring zero probabilities, they run as follows. The first: since $\log$ is
concave, $\sum p_i \log (q_i/p_i) \leq \log \sum p_i (q_i/p_i) = 0$. The
second: since $\log x \leq x - 1$ for all $x > 0$, we have $\sum p_i \log
(q_i/p_i) \leq \sum p_i (q_i/p_i - 1) = 0$.
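
The two bounds can be compared side by side on a concrete pair $(p, q)$. The sketch below assumes strictly positive entries and uses a made-up helper `random_distribution`. It computes the common left-hand side $\sum p_i \log(q_i/p_i)$ together with the Jensen bound and the tangent-line bound, both of which should come out as zero up to rounding.

```python
import math
import random

def random_distribution(n, rng):
    """Draw a strictly positive probability vector of length n (hypothetical helper)."""
    weights = [rng.random() + 1e-9 for _ in range(n)]
    total = sum(weights)
    return [w / total for w in weights]

rng = random.Random(1)
p = random_distribution(5, rng)
q = random_distribution(5, rng)
ratios = [qi / pi for pi, qi in zip(p, q)]

# Left-hand side shared by both proofs: sum_i p_i log(q_i / p_i).
lhs = sum(pi * math.log(r) for pi, r in zip(p, ratios))

# First proof: Jensen's inequality for the concave log.
jensen_bound = math.log(sum(pi * r for pi, r in zip(p, ratios)))

# Second proof: the tangent-line bound log x <= x - 1, applied term by term.
tangent_bound = sum(pi * (r - 1) for pi, r in zip(p, ratios))

print(lhs, jensen_bound, tangent_bound)
```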

The question

Can Gibbs's inequality, in the product form stated above, be understood
geometrically? Or if not geometrically, is there an intuitive
interpretation other than the information-theoretic one? (I have nothing
against information theory — it's just that I'd like to have multiple ways
of thinking about it.)

There is a hint that Gibbs's inequality can be interpreted as some kind of
isoperimetric inequality. Take $p$ to be the uniform
distribution. Then the inequality states that for $q \in \mathbf{P}_n$,
the quantity $(q_1 q_2 \cdots q_n)^{1/n}$ is maximized by taking $q$ to be
uniform. We might as well remove the power $1/n$, and then the result is:
among all $n$-dimensional boxes of prescribed total edge-length, the cube
has the greatest volume.

But I see no way of extending the isoperimetric interpretation to
non-uniform $p$. For example, take $p = (2/3, 1/3)$. Then Gibbs's inequality
states that among all $q \in \mathbf{P}_2$, the maximum value of $q_1^{2/3}
q_2^{1/3}$, or equivalently of its cube $q_1^2 q_2$, is attained by $q = (2/3, 1/3)$. This doesn't seem geometrically obvious to me in the way that the uniform case does.
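
The example with $p = (2/3, 1/3)$ is easy to confirm by brute force, even if that says nothing about the geometry. A small sketch, where the grid size `steps` is an arbitrary choice made so that $2/3$ lands exactly on the grid:

```python
def objective(q1):
    """q_1^2 * q_2 with q_2 = 1 - q_1, for q = (q_1, q_2) in P_2."""
    return q1 ** 2 * (1.0 - q1)

steps = 3000  # grid resolution, chosen so that 2/3 is a grid point
grid = [k / steps for k in range(steps + 1)]

# Pick the grid point with the largest objective value.
best_q1 = max(grid, key=objective)
print(best_q1, 1.0 - best_q1)
```

This agrees with the one-line calculus check: the derivative of $q_1^2 - q_1^3$ is $2q_1 - 3q_1^2$, which vanishes on $(0, 1)$ only at $q_1 = 2/3$.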

Best Answer

The trick is to use appropriate units or scaling for the different edges of the rectangular parallelepiped when computing its volume. More specifically, apply the uniform probability argument (i.e., the isoperimetric inequality) to the probabilities $$\tilde{p}_i = \frac{1}{n}\text{ and } \tilde{q}_i = \frac{q_i}{p_i}\left(\sum_j \frac{q_j}{p_j}\right)^{-1}. $$
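
To see what the rescaling does on a concrete pair, here is a minimal sketch. It assumes every $p_i > 0$, and the helper name `rescale` is invented for illustration. It only checks two things: that $\tilde{q}$ is again a probability vector, and that the uniform-case bound $\prod \tilde{q}_i \leq n^{-n}$ holds for it. It does not carry out the rest of the argument.

```python
import math
import random

def rescale(p, q):
    """Return tilde_q with tilde_q_i = (q_i / p_i) / sum_j (q_j / p_j)."""
    ratios = [qi / pi for pi, qi in zip(p, q)]
    total = sum(ratios)
    return [r / total for r in ratios]

rng = random.Random(2)
n = 4
# Strictly positive raw weights, then normalized to probability vectors.
raw_p = [rng.random() + 0.1 for _ in range(n)]
raw_q = [rng.random() + 0.1 for _ in range(n)]
p = [x / sum(raw_p) for x in raw_p]
q = [x / sum(raw_q) for x in raw_q]

tilde_q = rescale(p, q)
product = math.prod(tilde_q)      # "volume" of the box with edges tilde_q_i
uniform_bound = (1.0 / n) ** n    # volume of the cube with the same total edge-length
print(sum(tilde_q), product, uniform_bound)
```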
