[Math] Does a maximum entropy probability distribution with KL-divergence constraint not exist

calculus-of-variations, entropy, information-theory, optimization

In my earlier question I asked about a technical aspect of solving the system of equations that arises when looking for an entropy-maximizing density $p(y)$, continuous on $\mathbb{R}$, subject to a KL-divergence constraint involving a zero-mean Gaussian distribution. That is, in addition to the usual normalization and variance constraints, I have the following constraint on $p(y)$:

$$D(p_N(y)\|p(y))=\int_{-\infty}^{\infty}\frac{1}{\sqrt{2\pi}\,\sigma}e^{-y^2/2\sigma^2}\log\frac{\frac{1}{\sqrt{2\pi}\,\sigma}e^{-y^2/2\sigma^2}}{p(y)}\,dy<\epsilon,$$

where $p_N$ denotes the zero-mean Gaussian density with variance $\sigma^2$.

Thanks to user anon, the form of the function $p(y)$ was found, but it is not a density function, and now I am trying to understand why this is the case.

First, here is the system of equations (copied from the earlier question) that I derived using Calculus of Variations (and help from Gallager's "Information Theory and Reliable Communication"):

$$\begin{align}
0&=\log(p(y))+1-\lambda-\gamma y^2-\eta \left(\frac{e^{-y^2/2}}{\sqrt{2\pi}}\right)\left(\frac{1}{p(y)}\right)\\
0&=1-\int_{-\infty}^{\infty}p(y)dy\\
0&=1-\int_{-\infty}^{\infty}y^2p(y)dy\\
0&=c+\int_{-\infty}^{\infty}\frac{e^{-y^2/2}}{\sqrt{2\pi}}\log(p(y))dy
\end{align}
$$

(For simplicity I set $\sigma=1$; here $c=\epsilon+\frac{1}{2}\log(2\pi e)$.)
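
For concreteness, with these sign conventions the first equation is (up to an overall sign) the pointwise stationarity condition $\frac{\partial}{\partial p(y)}J[p]=0$ of the Lagrangian

$$J[p]=-\int_{-\infty}^{\infty}p\log p\,dy+\lambda\left(\int_{-\infty}^{\infty}p\,dy-1\right)+\gamma\left(\int_{-\infty}^{\infty}y^2p\,dy-1\right)+\eta\left(c+\int_{-\infty}^{\infty}\frac{e^{-y^2/2}}{\sqrt{2\pi}}\log p\,dy\right),$$

and the remaining three equations are the corresponding multiplier conditions.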

From anon's helpful comment, we can actually solve the first equation in terms of the Lambert W function to obtain the following:

$$p(y)=\frac{\eta e^{-y^2/2}}{\sqrt{2\pi}\,W\!\left(\frac{\eta}{\sqrt{2\pi}}e^{-(1+2\gamma)y^2/2+(1-\lambda)}\right)}$$
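
As a sanity check on the algebra, one can verify numerically that this expression satisfies the first equation of the system pointwise. Here is a minimal sketch using `scipy.special.lambertw`, with arbitrary illustrative values of $\lambda$, $\gamma$, $\eta$ (they are not solutions of the full system):

```python
import numpy as np
from scipy.special import lambertw

# Arbitrary illustrative multiplier values -- NOT solutions of the full
# system; the point is only to check the algebra of the W-form solution.
lam, gam, eta = 0.3, -0.2, 0.5

def gauss(y):
    """Standard normal density e^{-y^2/2} / sqrt(2*pi)."""
    return np.exp(-y**2 / 2) / np.sqrt(2 * np.pi)

def p(y):
    """Candidate solution p(y) = eta * gauss(y) / W(argument)."""
    arg = (eta / np.sqrt(2 * np.pi)) * np.exp((1 - lam) - (1 + 2 * gam) * y**2 / 2)
    return eta * gauss(y) / np.real(lambertw(arg))

# Residual of the first equation of the system:
# log p(y) + 1 - lambda - gamma*y^2 - eta*gauss(y)/p(y)  should vanish.
y = np.linspace(-4, 4, 9)
residual = np.log(p(y)) + 1 - lam - gam * y**2 - eta * gauss(y) / p(y)
print(np.max(np.abs(residual)))  # at machine-precision level (~1e-15)
```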

When $|y|\rightarrow\infty$, $e^{-ay^2+b}\rightarrow 0$, and since $W(0)=0$, $p(y)\rightarrow\infty$. Thus, this is obviously not a pdf!

This is entirely due to the KL-divergence constraint (a very similar situation arises when the variance constraint is removed). How does one explain this? There are obviously probability distributions that meet the KL-divergence constraint (e.g. a Gaussian with appropriately picked variance). Does this mean that an optimal distribution does not exist, and that every distribution one can try is sub-optimal? Is there a rigorous explanation for this?
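
To make the feasibility claim concrete: for zero-mean Gaussians the KL divergence has the standard closed form $D(\mathcal{N}(0,1)\,\|\,\mathcal{N}(0,s^2))=\log s+\tfrac{1}{2s^2}-\tfrac{1}{2}$, which is small for $s$ near $1$, so densities satisfying the constraint certainly exist. A minimal numerical check (the values of $s$ are illustrative only):

```python
import numpy as np
from scipy.integrate import quad

def gauss(y, s=1.0):
    """Zero-mean Gaussian density with standard deviation s."""
    return np.exp(-y**2 / (2 * s**2)) / (np.sqrt(2 * np.pi) * s)

def kl_from_standard_normal(s):
    """D( N(0,1) || N(0,s^2) ) by numerical integration."""
    integrand = lambda y: gauss(y) * np.log(gauss(y) / gauss(y, s))
    return quad(integrand, -np.inf, np.inf)[0]

for s in [1.0, 1.1, 1.5]:
    closed_form = np.log(s) + 1 / (2 * s**2) - 0.5
    print(s, kl_from_standard_normal(s), closed_form)
# The divergence is 0 at s = 1 and grows continuously in s, so for any
# epsilon > 0 there are certainly densities p with D(p_N || p) < epsilon.
```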

Perhaps I did something wrong? Is there another method I should have employed?

Best Answer

You established the indeterminate form $0/0$; from this you cannot casually conclude that $p\to\infty$. In the last line of my comment that you cite, I used the identity $e^{W(t)}=t/W(t)$ to simplify the expression; but if you want to study the asymptotics, you can also analyze the earlier form
$$p(y)=\exp\left(-\alpha+W(e^{\alpha}\beta)\right),$$
where $\alpha=1-\lambda-\gamma y^2$ and $\beta=\eta e^{-y^2/2}/\sqrt{2\pi}$.

If $\gamma\in[-1/2,0)$, the exponent takes the form $-\infty+O(1)$ as $|y|\to\infty$ (the term $-\alpha$ tends to $-\infty$ while $W(e^{\alpha}\beta)$ stays bounded), so $p\to0$.

For $\gamma<-1/2$, you have to show that
$$W\left(ae^{-(1/2+\gamma)y^2}\right)+\gamma y^2\to-\infty \quad\text{as } |y|\to\infty,$$
where $a>0$ is arbitrary (here $a=\eta e^{1-\lambda}/\sqrt{2\pi}$). Note that $W(x)>1$ for $x>e$, so $e^{W}\le We^{W}=x$ and hence $W(x)\le\log x$ for sufficiently large $x$. Applying this bound,
$$W\left(ae^{-(1/2+\gamma)y^2}\right)+\gamma y^2\le\log a-(1/2+\gamma)y^2+\gamma y^2=\log a-\frac{y^2}{2}\to-\infty,$$
which proves that $p\to0$ as $|y|\to\infty$ whenever $\gamma<0$.
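
If you want to see this numerically, here is a small sketch using `scipy.special.lambertw`; the values of $\lambda$ and $\eta$ below are arbitrary and not solutions of your full system, since only the sign of $\gamma$ matters for the limit:

```python
import numpy as np
from scipy.special import lambertw

# Arbitrary illustrative values (not solutions of the full system);
# only the sign of gamma matters for the tail behaviour discussed above.
lam, eta = 0.3, 0.5
a = eta * np.exp(1 - lam) / np.sqrt(2 * np.pi)

def p(y, gam):
    """p(y) = exp(-alpha + W(e^alpha * beta)) with alpha = 1 - lam - gam*y^2
    and beta = eta*exp(-y^2/2)/sqrt(2*pi); the W-argument is collapsed into
    a single exponential, a*exp(-(1/2+gam)*y^2), to avoid overflow."""
    alpha = 1 - lam - gam * y**2
    w_arg = a * np.exp(-(0.5 + gam) * y**2)
    return np.exp(-alpha + np.real(lambertw(w_arg)))

for gam in [-0.2, -1.0]:  # one value in (-1/2, 0), one value below -1/2
    print([p(y, gam) for y in [2.0, 5.0, 10.0, 20.0]])
# In both cases the printed values decay rapidly, consistent with
# p(y) -> 0 as |y| -> infinity for gamma < 0.
```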


This only addresses your claim that $p$ doesn't vanish at the extremes. For your broader questions concerning maximum entropy, KL divergence, or the original optimization problem, I really have no idea.