Prove that the normal (Gaussian) distribution with a given variance $\sigma^2$ maximizes the differential entropy among all distributions with a well-defined, finite first moment and with variance equal to $\sigma^2$.
[Math] Differential Entropy of Random Signal
Related Solutions
This follows (modulo any minor technical details I haven't checked) from the theory of exponential families. The main result there says that the distribution which maximizes entropy subject to constraints on moments lies in the exponential family with sufficient statistics corresponding to these moments.
More concretely, let $n$ be distributed according to the distribution $\pi$ on $\{0,1,\ldots\}$. You constrain $\pi$ to satisfy $\mathbb{E}_\pi n = \mu$ and $\mathbb{E}_\pi n^2 = \sigma^2 + \mu^2$ (using the standard relation between the variance and uncentered second moment). Then $\pi$ must be of the form \[ \pi(n) = \frac{1}{Z}\exp\left(\alpha n + \beta n^2\right)\text{ for all }n, \] where $\alpha,\beta\in\mathbb{R}$ and $Z = \sum_n \exp\left(\alpha n + \beta n^2\right)$ is the constant that ensures this distribution normalizes. (Actually we can say $\beta\leq 0$ since otherwise we'd have $Z=\infty$ and the distribution wouldn't be normalizable. Similarly $\alpha<0$ if $\beta = 0$.)
So the problem reduces to one of finding $\alpha$ and $\beta$ given $\mu$ and $\sigma^2$, i.e. we've gone from needing to find infinitely many values $\pi(n)$ to only two. In this case I have a feeling that there is no closed form for $Z$ and so finding an explicit expression for $\alpha$ and $\beta$ in terms of $\mu$ and $\sigma^2$ is unlikely. However, there are optimization techniques for "moment matching" which will let you approximate these numerically.
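For what it's worth, here is a minimal numerical sketch of that moment-matching step (the truncation of the support at some large $N$, the starting point, and the use of `scipy.optimize.root` are my own choices, not part of the answer):

```python
# Sketch: find (alpha, beta) so that pi(n) ∝ exp(alpha*n + beta*n^2) on a
# truncated support {0, ..., N} has mean mu and second moment sigma2 + mu^2.
import numpy as np
from scipy.optimize import root

def pi_and_moments(alpha, beta, n):
    """Normalized pi(n) and its first two moments on the grid n."""
    logw = alpha * n + beta * n**2
    logw -= logw.max()          # stabilize before exponentiating
    p = np.exp(logw)
    p /= p.sum()                # this is pi(n), up to the truncation at N
    return p, (p * n).sum(), (p * n**2).sum()

def solve_alpha_beta(mu, sigma2, N=1000):
    """Solve the two moment equations for (alpha, beta) numerically."""
    n = np.arange(N + 1)
    def residual(params):
        _, m1, m2 = pi_and_moments(params[0], params[1], n)
        return [m1 - mu, m2 - (sigma2 + mu**2)]
    return root(residual, x0=[0.0, -0.1]).x   # mildly decaying initial guess

alpha, beta = solve_alpha_beta(mu=5.0, sigma2=4.0)
print(alpha, beta)
```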
In the simpler case when only the mean is constrained, things work out more nicely. If you go through the same sort of procedure but without the $\beta n^2$ term, you get a $Z$ which you can sum explicitly and which is finite when $\alpha < 0$. You can then relate $\mu$ and $\alpha$. The result is a geometric distribution on $\{0,1,\ldots\}$ with success parameter $\frac{1}{\mu+1}$ (hence with mean $\mu$).
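To spell out that computation (a step added here, not part of the original answer): with $\pi(n)=\frac1Z e^{\alpha n}$ and $\alpha<0$,
$$ Z=\sum_{n=0}^\infty e^{\alpha n}=\frac{1}{1-e^{\alpha}},\qquad \mu=\mathbb{E}_\pi n=\frac{e^{\alpha}}{1-e^{\alpha}}, $$
so $e^{\alpha}=\frac{\mu}{\mu+1}$ and
$$ \pi(n)=\frac{1}{\mu+1}\left(\frac{\mu}{\mu+1}\right)^{n}, $$
i.e. a geometric distribution on $\{0,1,\ldots\}$ with success parameter $\frac{1}{\mu+1}$.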
There have been several attempts to generalize the notion of entropy from discrete random variables to random variables with a general distribution function.
A straightforward way is to quantize the random variable, i.e. to use a Riemann-sum approximation of its density: we start from the resulting discrete random variable and then let the quantization intervals shrink. Denote the quantized random variable by $X_\delta$, where $\delta$ is the size of the intervals. If the probability density function $f$ is integrable, then for small $\delta$ (Cover–Thomas, p. 248): $$ H(X_\delta)\approx h(X)-\log\delta. $$ By choosing $\delta = 2^{-n}$, i.e. $n$-bit quantization, we get
$$ H(X_\delta)\approx h(X)+n, $$ which represents how many bits we need to describe $X$ with $n$-bit accuracy. This illustrates the relation between differential entropy and discrete entropy. Note that when $\delta\to 0$, $H(X_\delta)\to\infty$.
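A quick numerical illustration of this relation (my own sketch; the choice of a standard Gaussian and of $\delta$ is arbitrary):

```python
# Sketch: quantize a standard Gaussian X into bins of width delta and
# compare H(X_delta) with h(X) - log2(delta).
import numpy as np
from math import erf, sqrt

delta = 2.0 ** -6                             # 6-bit quantization step
edges = np.arange(-10, 10 + delta, delta)     # bins covering essentially all mass

cdf = lambda x: 0.5 * (1 + erf(x / sqrt(2)))  # standard normal CDF
p = np.array([cdf(b) - cdf(a) for a, b in zip(edges[:-1], edges[1:])])
p = p[p > 0]

H_quantized = -(p * np.log2(p)).sum()             # discrete entropy H(X_delta)
h_differential = 0.5 * np.log2(2 * np.pi * np.e)  # h(X) for N(0,1), in bits
print(H_quantized, h_differential - np.log2(delta))  # the two should be close
```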
Another point is that mutual information is preserved under this quantization: as $\delta\to 0$, $$ I(X_\delta;Y_\delta)\to I(X;Y). $$
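And a similar sketch for the mutual-information statement (again my own illustration; the jointly Gaussian pair, the correlation $\rho=0.8$, and the histogram-based estimate are assumptions):

```python
# Sketch: for jointly Gaussian (X, Y) with correlation rho,
# I(X;Y) = -0.5*log(1 - rho^2). Estimate I(X_delta; Y_delta) from a
# 2D histogram and watch it approach the exact value as delta shrinks.
import numpy as np

rng = np.random.default_rng(0)
rho, n = 0.8, 1_000_000
x = rng.standard_normal(n)
y = rho * x + np.sqrt(1 - rho**2) * rng.standard_normal(n)

for delta in (0.5, 0.25, 0.125):
    bins = np.arange(-6, 6 + delta, delta)
    pxy, _, _ = np.histogram2d(x, y, bins=[bins, bins])
    pxy /= pxy.sum()
    px = pxy.sum(axis=1, keepdims=True)       # marginal of X_delta
    py = pxy.sum(axis=0, keepdims=True)       # marginal of Y_delta
    mask = pxy > 0
    mi = (pxy[mask] * np.log(pxy[mask] / (px @ py)[mask])).sum()
    print(delta, mi)

print("exact:", -0.5 * np.log(1 - rho**2))    # about 0.51 nats
```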
The generalization is attributed to several people, chiefly Kolmogorov and Rényi:
A. N. Kolmogorov. On the Shannon theory of information transmission in the case of continuous signals. IRE Trans. Inf. Theory, IT-2:102–108, Sept. 1956.
J. Balatoni and A. Rényi. Remarks on entropy (in Hungarian, with English and Russian summaries). Publications of the Mathematical Institute of the Hungarian Academy of Sciences, 1:9–40, 1956.
Rényi introduced the following random variable ($[\cdot]$ denotes the integer part): $$ X_n=\frac1{n}[nX]. $$ Note that this is nothing but looking at the intervals $[\frac kn,\frac{k+1}n)$. Suppose that $H([X])$ exists (it is denoted by $H_0(X)$ in the original paper). The lower dimension of $X$ is defined as $$ \underline d(X)=\liminf_{n\to\infty}\frac{H(X_n)}{\log n} $$ and the upper dimension of $X$ as $$ \overline d(X)=\limsup_{n\to\infty}\frac{H(X_n)}{\log n}. $$ If $\overline d(X)=\underline d(X)$, we simply speak of the information dimension of $X$, denoted $d(X)$, and define $$ H_{d(X)}(X)=\lim_{n\to\infty} \bigl(H(X_n)-d(X)\log n\bigr). $$ Rényi proved that if $X$ has an absolutely continuous distribution with density $f$ and finite $H([X])$, then $$ d(X)=1,\qquad H_1(X)=h(X). $$ This is what we discussed above for $\delta=\frac 1n$: $$ H(X_n)\approx h(X)+\log n. $$ Kolmogorov instead introduced the notion of $\epsilon$-entropy, which is defined for random variables in abstract metric spaces and is therefore more general.
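Here is a small numerical sketch of the information dimension (my own illustration, using a uniform distribution on $[0,1]$ for the absolutely continuous case and a fair coin for the discrete case):

```python
# Sketch: estimate d(X) = lim H(X_n)/log(n) by computing the empirical entropy
# of X_n = [n*X]/n. For an absolutely continuous X the ratio tends to 1;
# for a discrete X it tends to 0.
import numpy as np

def H_of_Xn(samples, n):
    """Empirical entropy (in nats) of the quantized variable X_n = floor(n*X)/n."""
    _, counts = np.unique(np.floor(n * samples), return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log(p)).sum()

rng = np.random.default_rng(0)
continuous = rng.uniform(0, 1, size=1_000_000)                # d(X) should be 1
discrete = rng.integers(0, 2, size=1_000_000).astype(float)   # d(X) should be 0

for n in (2**6, 2**10, 2**14):
    print(n, H_of_Xn(continuous, n) / np.log(n), H_of_Xn(discrete, n) / np.log(n))
```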
To answer your question: we can keep the same intuition as in the discrete case for differential entropy, at least when we use it to compute mutual information or KL divergence.
For the entropy itself, we have to adjust our intuition a little. The entropy of a discrete random variable is the minimum number of bits we need to describe (compress) it. But a random variable with uncountable support can always be "compressed" onto another uncountable set of the same cardinality (any one-to-one and onto mapping does that), yet different random variables with uncountable supports can still have different differential entropies.
Best Answer
Cover and Thomas's book is indeed the right place to learn about this.
The statement basically follows by convexity, in the form of Jensen's inequality. Here is the way it is usually presented:
Let $f$ be the probability density of a real random variable. Then its (differential) Shannon entropy is given by
$$h(f) = -\int f\log f\,dx.$$
You want to prove that, among all real random variables with finite Shannon entropy and variance equal to $\sigma^2$, the entropy is maximized only by Gaussians.
Given two probability densities $f$ and $g$, since $\log$ is a concave function, Jensen's inequality tells us that
$$\int f \log (g/f)\, dx \le \log \int f\,(g/f)\, dx = \log \int g\, dx = 0.$$
Moreover, since $\log$ is strictly concave, equality holds if and only if $g = f$ (almost everywhere). If you now set $g$ equal to the probability density of a Gaussian with the same mean and variance as $f$ and plug in the explicit formula for $g$, you get what you want.
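To make that final plug-in step explicit (this computation is standard, but it is not spelled out above): rearranging the inequality gives $h(f)=-\int f\log f\,dx\le -\int f\log g\,dx$. Now take $g(x)=\frac{1}{\sqrt{2\pi\sigma^2}}e^{-(x-\mu)^2/2\sigma^2}$ with the same mean $\mu$ and variance $\sigma^2$ as $f$. Then
$$-\int f\log g\,dx=\frac12\log(2\pi\sigma^2)+\int f(x)\,\frac{(x-\mu)^2}{2\sigma^2}\,dx=\frac12\log(2\pi\sigma^2)+\frac12=h(g),$$
since the integral depends only on the mean and variance of $f$. Hence $h(f)\le h(g)$, with equality only when $f$ is that Gaussian.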