Minimizing Differential Entropy of a Gaussian Random Variable Conditioned on Sum of Gaussian and Non-Gaussian Random Variables

entropy, information theory, normal distribution, probability

Let $A$ be a Gaussian random variable with variance $\sigma_A^2$ and let $B$ be a continuous random variable with variance $\sigma_B^2$, where $A$ and $B$ are independent. Does taking $B$ to be Gaussian minimize the differential entropy $h(A|A+B)$?

Best Answer

First notice that $I(A;A+B) = h(A) - h(A|A+B)$. Here $h(A)$ is fixed (since the law of $A$ is fixed), and so minimising the conditional entropy in the question is equivalent to maximising this mutual information. In other words, you're asking a sort of dual question to the channel coding problem: "given that I'm going to feed an additive channel a Gaussian input, what noise distribution is the most benign?" The answer here is not a Gaussian $B$: in a rough sense, the Gaussian is something like a worst-case additive noise law, owing to its entropy-maximising property. This means that more concentrated noise laws should yield better performance.
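For reference, the differential entropy of a Gaussian with variance $\sigma^2$ is $$ h(\mathcal{N}(0,\sigma^2)) = \frac12 \log(2\pi e \sigma^2),$$ so in the normalised case $\sigma_A^2 = 1$ treated below, $h(A) = \frac12 \log(2\pi e)$.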

Concretely, first think of a noise distribution that is discrete, say supported on $\pm \beta$ such that (for simplicity) $\beta$ has a finite binary expansion. In this case, the noise can only corrupt the first few bits of (the binary expansion of) a real-number input $a$, and so we can transmit an arbitrary amount of information in the tail of its binary expansion. Now, this qualitative picture essentially remains true even if we smear out the discreteness over a tiny set in order to satisfy your continuity requirement. Thus, under such noise, we should attain very high mutual information.

Below I'll formalise this intuition.


For simplicity, just consider the case $\sigma_A^2 = \sigma_B^2 = 1.$ If $B$ is a Gaussian, then it's a simple matter of computation that the mutual information is $\frac12 \log(2)$.
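To spell that computation out: with $A$ and $B$ independent standard Gaussians, $A+B \sim \mathcal{N}(0,2)$, so $$ I(A;A+B) = h(A+B) - h(A+B|A) = h(A+B) - h(B) = \frac12 \log(2\pi e \cdot 2) - \frac12 \log(2\pi e) = \frac12 \log 2.$$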

Now, for $\beta, \delta \in [0,1]$ consider $p_B$ of the form $$ p_B(b;\beta,\delta) := \frac1{4\delta} ( \mathbf{1}\{ |b-\beta|\le \delta\} + \mathbf{1}\{|b+\beta| \le \delta\}).$$ This puts mass uniformly in a window of width $2\delta$ about both $+\beta$ and $-\beta$. Equivalently, you can think of $B = Z+N$, where $Z$ is uniform on $\pm \beta,$ and $N$ is uniform on $[-\delta, \delta]$. This pair satisfies the variance condition if $\beta^2 + \delta^2/3 = 1.$
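As a quick sanity check on this construction (a minimal sketch using NumPy; the value $\delta = 1/4$ below is just one admissible choice), one can sample $B = Z + N$ and verify the variance constraint numerically:

```python
import numpy as np

rng = np.random.default_rng(0)

delta = 0.25                          # half-width of each uniform window (one admissible choice)
beta = np.sqrt(1 - delta**2 / 3)      # chosen so that Var(B) = beta^2 + delta^2/3 = 1

n = 10**6
Z = rng.choice([-beta, beta], size=n)     # two-point component, uniform on {+beta, -beta}
N = rng.uniform(-delta, delta, size=n)    # thin uniform smear
B = Z + N

print(beta**2 + delta**2 / 3)   # exactly 1.0 by construction
print(B.var())                  # Monte Carlo estimate, close to 1
```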

Now, $$I(A;A+B) = h(A+B) - h(A+B|A) = h(A+B) - h(B) \ge h(A) - h(B),$$ where we have used the independence of $A$ and $B$, and the final inequality uses the fact that $0 \le I(B;A+B) = h(A+B) - h(A),$ again using independence.

This means that under the above noise distribution, we have $$ I(A; A+B) \ge \frac{1}{2} \log (2\pi e) - h(B).$$ It suffices to argue that we can choose $\beta, \delta$ such that $$ \frac12 \log (2\pi e) - h(B) \ge \frac12 \log 2 \iff h(B) \le \frac12 \log(\pi e).$$

But the differential entropy of $B$ is driven entirely by $\delta.$ Indeed, we have $$ h(B) = -\frac1{4\delta} \int_{-\beta - \delta}^{-\beta + \delta} \log \frac1{4\delta} \,\mathrm{d}b - \frac1{4\delta} \int_{\beta - \delta}^{\beta + \delta} \log \frac1{4\delta} \,\mathrm{d}b = \log 4\delta.$$ So, as long as $\delta$ is small, say $\le 1/4,$ the mutual information $I(A;A+B)$ exceeds $\frac12 \log 2,$ the value achievable via a Gaussian $B$.
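Here is a numerical illustration of this (a sketch using NumPy/SciPy; $\delta = 1/4$ is again an arbitrary admissible choice, and the density of $A+B$ is computed by convolving the standard normal density with the two uniform windows of $p_B$):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)

delta = 0.25
beta = np.sqrt(1 - delta**2 / 3)          # variance constraint: beta^2 + delta^2/3 = 1

def p_sum(y):
    """Density of A+B for A ~ N(0,1): the normal density convolved with the two uniform windows of p_B."""
    return (norm.cdf(y - beta + delta) - norm.cdf(y - beta - delta)
            + norm.cdf(y + beta + delta) - norm.cdf(y + beta - delta)) / (4 * delta)

n = 10**6
A = rng.standard_normal(n)
B = rng.choice([-beta, beta], size=n) + rng.uniform(-delta, delta, size=n)

h_sum = -np.mean(np.log(p_sum(A + B)))    # Monte Carlo estimate of h(A+B)
h_B = np.log(4 * delta)                   # exact: h(B) = log(4*delta) = 0 here

print("I(A;A+B) estimate:    ", h_sum - h_B)
print("lower bound h(A)-h(B):", 0.5 * np.log(2 * np.pi * np.e) - h_B)
print("Gaussian-noise value: ", 0.5 * np.log(2))   # ~ 0.3466 nats
```

With $\delta = 1/4$ the lower bound alone is $\frac12 \log(2\pi e) - 0 \approx 1.42$ nats, already roughly four times the Gaussian-noise value.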


Note that the above style of example (a two-point discrete law smeared by a thin continuous noise) continues to work no matter what distribution you pick for $A$: since $h(B) \to -\infty$ as $\delta \to 0,$ we can pump the mutual information $I(A;A+B)$ arbitrarily high by using a noise distribution like the above, whatever (finite) value $h(A)$ takes. However, if the noise distribution were Gaussian, then the capacity-achieving input distribution is Gaussian, so the maximum mutual information with Gaussian $B$ remains bounded. I think the more natural conjecture might be that a Gaussian $B$ minimises $I(A;A+B)$, but I don't know how difficult this is to show.
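To see the blow-up of the lower bound concretely (keeping the Gaussian $A$ of the example above, so $h(A) = \frac12\log(2\pi e)$; a minimal sketch):

```python
import numpy as np

h_A = 0.5 * np.log(2 * np.pi * np.e)          # h(A) for the unit-variance Gaussian example
for delta in [0.25, 1e-2, 1e-4, 1e-8]:
    lower_bound = h_A - np.log(4 * delta)     # I(A;A+B) >= h(A) - h(B) = h(A) - log(4*delta)
    print(f"delta = {delta:g}: I(A;A+B) >= {lower_bound:.2f} nats")
```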