Minimizing cross entropy between reference and target distributions over a restricted domain.

convex-analysis, information-theory, optimization, probability, real-analysis

My question is related to this post. Suppose I am minimizing the cross entropy between target and reference distributions in a specific domain $\Xi$ as follows:

$$\max_{p \in [0,1]^n} \sum_{x \in \Xi} f(x;q) \log f(x;p),$$

where $f(x;q) = \prod_{i=1}^{n} f(x_i;q_i)$ is the joint Bernoulli law. The optimal solution is $p_i^* = E_{X \sim q} [X_i \mid X \in \Xi]$, as shown in the linked post. I would like to show that the total probability mass placed on $\Xi$ by the optimal reference distribution $f(x;p^{*})$ is greater than or equal to the total probability mass placed on $\Xi$ by the target distribution $f(x;q)$. Mathematically, I need to show the following:

$$\sum_{x \in \Xi} f(x;p^{*}) \geq \sum_{x \in \Xi} f(x;q).$$

Intuitively, cross-entropy minimization is equivalent to maximum likelihood estimation, so the statement should hold (at least under some conditions that I may be overlooking). Could you please help me prove the above statement mathematically?
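For concreteness, here is a small numerical sanity check of the claim (a sketch of my own; the choices of $n$, $q$, and the domain $\Xi$ below are arbitrary):

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
n = 4
q = rng.uniform(0.1, 0.9, size=n)  # arbitrary target parameters

def bernoulli_pmf(x, p):
    """Joint Bernoulli law f(x;p) = prod_i p_i^{x_i} (1 - p_i)^{1 - x_i}."""
    x = np.asarray(x)
    return np.prod(np.where(x == 1, p, 1 - p))

# Arbitrary restricted domain: all binary vectors with at least two ones.
Xi = [x for x in itertools.product([0, 1], repeat=n) if sum(x) >= 2]

# Optimal solution from the linked post: p*_i = E_{X~q}[X_i | X in Xi].
mass_q = sum(bernoulli_pmf(x, q) for x in Xi)            # sum_{x in Xi} f(x;q)
p_star = np.array([
    sum(x[i] * bernoulli_pmf(x, q) for x in Xi) / mass_q for i in range(n)
])

mass_p_star = sum(bernoulli_pmf(x, p_star) for x in Xi)  # sum_{x in Xi} f(x;p*)
print(mass_p_star, ">=", mass_q, ":", mass_p_star >= mass_q)
```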

Best Answer

Let $F(\Xi;p) := \sum_{x \in \Xi} f(x;p)$ and $f(x;p|\Xi) := f(x;p)/F(\Xi;p)$, where the latter is the conditional law of $x$ under $p$ given that $x$ lies in $\Xi$. You want to show that $F(\Xi;p^*) \ge F(\Xi;q)$. For this, first note that we may assume $f(x;q) > 0$ for every $x \in \Xi$, since otherwise this $x$ doesn't enter the optimisation anyway (equivalently, we can work with $\Xi' = \Xi - \{x : f(x;q) = 0\}$). Also note that $F(\Xi;p) > 0$ for any $p$ that optimises the objective, since otherwise $f(x;p) = 0$ for each $x \in \Xi$ and the objective would equal $-\infty$, whereas taking $p = q$ already achieves the finite value $\sum_{x \in \Xi} f(x;q) \log f(x;q) > -\infty$.
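In code, these quantities look like this (a sketch, reusing the hypothetical `bernoulli_pmf`, `Xi`, and `q` from the snippet in the question):

```python
def F(Xi, p):
    """Total mass F(Xi;p) = sum over x in Xi of f(x;p)."""
    return sum(bernoulli_pmf(x, p) for x in Xi)

def cond_pmf(x, p, Xi):
    """Conditional law f(x;p|Xi) = f(x;p) / F(Xi;p), defined for x in Xi."""
    return bernoulli_pmf(x, p) / F(Xi, p)

assert F(Xi, q) > 0  # needed for the conditional law of q on Xi to make sense
```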

Now, since $q$ itself is a feasible point, the optimality of $p^*$ gives $\sum_\Xi f(x;q) \log f(x;p^*) \ge \sum_\Xi f(x;q) \log f(x;q)$, i.e. \begin{align} \sum_\Xi f(x;q) \log \frac{f(x;p^*)}{f(x;q)} &\ge 0 \\ \iff \sum_{\Xi} f(x;q) \log \frac{f(x;p^*|\Xi) F(\Xi;p^*)}{f(x;q|\Xi) F(\Xi;q)} &\ge 0 \\ \iff \sum_\Xi f(x;q|\Xi) \log \frac{F(\Xi;p^*)}{F(\Xi;q)} & \ge \sum_\Xi f(x;q|\Xi) \log \frac{f(x;q|\Xi)}{f(x;p^*|\Xi)}\\ \iff \log \frac{F(\Xi;p^*)}{F(\Xi;q)} &\ge D(f(\cdot;q|\Xi) \|f(\cdot; p^*|\Xi)) \ge 0,\end{align} and the conclusion follows. Here the first step writes $f(x;p) = f(x;p|\Xi)\,F(\Xi;p)$ for $x \in \Xi$ (Bayes' law), the second step splits the $\log$ and divides through by $F(\Xi;q) > 0$, which turns each $f(x;q)$ into $f(x;q|\Xi)$, the third step uses $\sum_\Xi f(x;q|\Xi) = 1$, and the final inequality is the standard nonnegativity of the KL divergence (Gibbs' inequality).
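The final chain of inequalities can also be checked numerically (continuing the sketch above, with `p_star` computed as in the question):

```python
# Check log(F(Xi;p*)/F(Xi;q)) >= D(f(.;q|Xi) || f(.;p*|Xi)) >= 0.
kl = sum(
    cond_pmf(x, q, Xi) * np.log(cond_pmf(x, q, Xi) / cond_pmf(x, p_star, Xi))
    for x in Xi
)
log_ratio = np.log(F(Xi, p_star) / F(Xi, q))
print(log_ratio, kl, log_ratio >= kl >= 0)  # expect True for the last entry
```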

Note that this doesn't use any specific property of $f,$ so the argument is completely generic.
