This is only an answer to your first question.
How can they replace only one entry with q and say that this is entropy of q?
In the paper $h(q)$ is not computed this way. The inequality of Lemma 4.2 is used to prove that $h(p) \le \log{n}$, and that
$h(p) \lt \log{n}$ if $p$ is not the uniform distribution with $p_1=p_2=\ldots=p_n=\frac{1}{n}$.
Lemma 4.2:
$$-\sum_{i=1}^{n}p_i \log{p_i} \le -\sum_{i=1}^{n}p_i \log{q_i} \tag{1} $$
Equality holds iff $$p_i=q_i, i=1,\ldots , n \tag{2}$$
$\square$
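As a quick numerical sanity check (my own sketch, not from the paper), one can draw random distributions $p$ and $q$ and confirm the inequality of Lemma 4.2:

```python
import numpy as np

rng = np.random.default_rng(0)

def cross_entropy(p, q):
    # right-hand side of (1): -sum_i p_i * log(q_i)
    return -np.sum(p * np.log(q))

def entropy(p):
    # left-hand side of (1): h(p) = -sum_i p_i * log(p_i)
    return cross_entropy(p, p)

for _ in range(1000):
    n = rng.integers(2, 10)
    p = rng.dirichlet(np.ones(n))   # random probability vector (strictly positive entries)
    q = rng.dirichlet(np.ones(n))
    assert entropy(p) <= cross_entropy(p, q) + 1e-12   # Lemma 4.2
```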
We know that the entropy is defined by
$$h(p)=-\sum_{i=1}^{n}p_i \log{p_i} \tag{3} $$
This can be used to rewrite the inequality of the Lemma as
$$ h(p)\le -\sum_{i=1}^{n}p_i \log{q_i} \tag{4} $$
This holds for every discrete distribution $q$, so in particular for the uniform distribution with
$$q_i=\frac{1}{n} ,i=1,\ldots,n \tag{4a} $$
Substituting $\frac{1}{n}$ for $q_i$ gives
$$ h(p)\le \sum_{i=1}^{n}p_i \log{n} = (\log{n}) \cdot \sum_{i=1}^{n}p_i = \log{n} \tag{5} $$
But $\log{n}$ is also $h(q)$ if $q$ is the uniform distribution. This can be checked simply by using the definition of the entropy:
$$h(q)=-\sum_{i=1}^{n}q_i \log{q_i}=-\sum_{i=1}^{n}\frac{1}{n} \log{\frac{1}{n}} = \log{n} \sum_{i=1}^{n}\frac{1}{n} = \log{n} \tag{6} $$
So it follows that for the uniform distribution $q$
$$h(p) \le \log{n} = h(q) \tag{7} $$
Because of $(6)$ and $(2)$, equality holds exactly if $p$ is the uniform distribution, too.
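The chain $(5)$–$(7)$ can also be checked numerically. Here is a minimal sketch (my own illustration, with an arbitrarily chosen $n$) comparing $h(p)$ for random $p$ with $\log{n} = h(q)$:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 6

def entropy(p):
    return -np.sum(p * np.log(p))

q = np.full(n, 1.0 / n)           # uniform distribution
print(entropy(q), np.log(n))      # equal, as in (6)

for _ in range(1000):
    p = rng.dirichlet(np.ones(n))
    assert entropy(p) <= np.log(n) + 1e-12   # (7): h(p) <= log(n) = h(q)
```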
Edit:
Theorem 5.1 states that the continuous probability density on $[a,b]$ with $\mu = \frac{a+b}{2}$ that maximizes entropy is the uniform distribution $q(x)=\frac{1}{b-a}, x \in [a,b]$. This complies with the principle of indifference for continuous variables found here.
On the whole real line there is no uniform probability density. On the whole real line there is also no continuous probability density with highest entropy, because there are continuous probability densities with arbitrarily high entropies, e.g. the Gaussian distribution has entropy $\frac{1}{2}(1+\log(2 \pi \sigma^2))$: if we increase $\sigma$, the entropy increases.
Because there is no maximal entropy for continuous densities over $R$, we must have additional constraints, e.g. the constraint that $\sigma$ is fixed and that $\mu$ is fixed. The fact that there is a given finite $\sigma^2$ and $\mu$ makes it intuitively clear to me that values nearer to $\mu$ must have higher probability. If you don't fix $\mu$, then you will get no unique solution: the Gaussian distribution for each real $\mu$ is a solution. This is some kind of "uniformness", since every $\mu$ can be used for a solution.
Notice that it is crucial to fix $\sigma$ and $\mu$ and to demand $p(x)>0, \forall x \in R$. If you fix other values or change the domain of the density function from $R$ to another set, e.g. $R^+$, you will get other solutions: the exponential distribution, the truncated exponential distribution, the Laplace distribution, the lognormal distribution (Theorems 3.3, 5.1, 5.2, 5.3).
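Here is a small sketch (my own check, with an arbitrarily chosen grid) illustrating that the differential entropy $\frac{1}{2}(1+\log(2\pi\sigma^2))$ of the Gaussian grows with $\sigma$, so without a variance constraint there is no maximum-entropy density on $R$:

```python
import numpy as np

def gaussian_entropy(sigma):
    # closed-form differential entropy of N(mu, sigma^2); it does not depend on mu
    return 0.5 * (1.0 + np.log(2.0 * np.pi * sigma**2))

def numerical_entropy(sigma, mu=0.0):
    # Riemann-sum approximation of -int p(x) log p(x) dx on a wide grid
    x = np.linspace(mu - 20 * sigma, mu + 20 * sigma, 200001)
    p = np.exp(-((x - mu) ** 2) / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))
    return -np.sum(p * np.log(p)) * (x[1] - x[0])

for sigma in [0.5, 1.0, 2.0, 10.0]:
    # the values are strictly increasing in sigma
    print(sigma, gaussian_entropy(sigma), numerical_entropy(sigma))
```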
In your parametrization, the random variables $X_1$ and $X_2$ (where $X_i=[X_{i1},X_{i2}]$) are not correlated, so you can treat them independently (one for each column of your matrix $X$), i.e. use the sample mean and sample variance as the estimators.
In addition:
If you change your parametrization and allow a full covariance matrix $\Sigma$, then you can use the following estimator:
$\hat{\Sigma}=\frac{1}{n-1}\sum_{i=1}^n (X_i-\bar{X})(X_i-\bar{X})^T$
where $X_i=[X_{i1},\ldots,X_{im}]^T$ is the $i$th column of matrix $X^T$ and $\bar{X}=\frac{1}{n}\sum_{i=1}^n X_i$ is your sample mean.
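A minimal sketch of this estimator (assuming an $n \times m$ data matrix named `X` with one observation per row, i.e. the rows of `X` are the columns of $X^T$), compared against `numpy.cov`:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
# simulated data: rows are observations, columns are the two variables
X = rng.multivariate_normal(mean=[1.0, -2.0], cov=[[2.0, 0.8], [0.8, 1.0]], size=n)

X_bar = X.mean(axis=0)            # sample mean, shape (m,)
D = X - X_bar                     # centered observations
Sigma_hat = D.T @ D / (n - 1)     # (1/(n-1)) * sum_i (X_i - X_bar)(X_i - X_bar)^T

print(Sigma_hat)
print(np.cov(X, rowvar=False))    # same matrix: np.cov also uses the 1/(n-1) normalization
```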
You can easily show that this results in the maximum likelihood estimates of the mean and covariance. Let us start with the likelihood function:
$f(X|\mu,\Sigma)=\frac{1}{\sqrt{|\det(2\pi\Sigma)|^{n}}}e^{-\frac{1}{2}\sum_i (X_i-\mu)^T\Sigma^{-1}(X_i-\mu)}$
Therefore:
$\log f(X|\mu,\Sigma)=\frac{-n}{2}\log(|\det(2\pi\Sigma)|)-\frac{1}{2}\sum_i (X_i-\mu)^T\Sigma^{-1}(X_i-\mu)$
$\log f(X|\mu,\Sigma)=-n\log(2\pi)-\frac{n}{2}\log(|\det(\Sigma)|)-\frac{1}{2}\sum_i (X_i-\mu)^T\Sigma^{-1}(X_i-\mu)$ (I)
(here $\det(2\pi\Sigma)=(2\pi)^{2}\det(\Sigma)$ because the $X_i$ are two-dimensional; this constant term does not affect the maximization)
$\Rightarrow \log f(X|\mu,\Sigma)=-n\log(2\pi)-\frac{n}{2}\log(|\det(\Sigma)|)-\frac{1}{2}\sum_i (X_i^T\Sigma^{-1} X_i-2\mu^T\Sigma^{-1} X_i+\mu^T\Sigma^{-1}\mu)$
$\Rightarrow \frac{\partial}{\partial \mu}\log f(X|\mu,\Sigma)=-\frac{1}{2}\sum_i (-2 \Sigma^{-1} X_i+2\Sigma^{-1}\mu)=0$
$\Rightarrow \sum_i (-\Sigma^{-1} X_i+\Sigma^{-1}\mu)=0$; multiplying both sides by $\Sigma$, you get:
$\sum_i X_i=n\mu$, therefore $\hat{\mu}_{MLE}=\frac{1}{n}\sum_i X_i$.
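As a quick numerical check of this result (my own sketch, using simulated data and scipy), the log-likelihood (I) is indeed highest at the sample mean:

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
X = rng.multivariate_normal([0.0, 0.0], Sigma, size=300)

def loglik(mu):
    # sum_i log N(X_i | mu, Sigma), i.e. (I) with Sigma held fixed
    return multivariate_normal(mean=mu, cov=Sigma).logpdf(X).sum()

mu_hat = X.mean(axis=0)                          # the MLE derived above
print(loglik(mu_hat))                            # larger than ...
print(loglik(mu_hat + np.array([0.1, 0.0])))     # ... any perturbed value
print(loglik(mu_hat - np.array([0.0, 0.2])))
```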
Edit: I now add the derivation of the MLE for $\Sigma$ here, starting from (I):
$\log f(X|\mu,\Sigma)=-n\log(2\pi)-\frac{n}{2}\log(|\det(\Sigma)|)-\frac{1}{2}\sum_i (X_i-\mu)^T\Sigma^{-1}(X_i-\mu)$
W.l.o.g. assume $\Sigma$ is PD (if it were only PSD, we would have to use the pseudo-inverse and pseudo-determinant), so $\det(\Sigma) > 0$, and therefore:
$\Rightarrow \log f(X|\mu,\Sigma)=-n\log(2\pi)-\frac{n}{2}\log(\det(\Sigma))-\frac{1}{2}\sum_i (X_i-\mu)^T\Sigma^{-1}(X_i-\mu)$
Note that for $a,b \in R^k$ and $M \in R^{k\times k}$, $a^TMb=tr(a^TMb)=tr(ba^TM)$ ($tr()$ is the trace function, and the last equality holds by circularity of the trace).
$\Rightarrow \log f(X|\mu,\Sigma)=-n\log(2\pi)-\frac{n}{2}\log(\det(\Sigma))-\frac{1}{2}\sum_i tr((X_i-\mu)(X_i-\mu)^T\Sigma^{-1})$
We have that $\frac{\partial}{\partial\Sigma}\log(\det(\Sigma))=(\Sigma^{-1})^T$:
$\Rightarrow \frac{\partial}{\partial\Sigma}\log f(X|\mu,\Sigma)=-\frac{\partial}{\partial\Sigma}\frac{n}{2}\log(\det(\Sigma))-\frac{1}{2}\sum_i \frac{\partial}{\partial\Sigma}tr((X_i-\mu)(X_i-\mu)^T\Sigma^{-1})$
$\Rightarrow \frac{\partial}{\partial\Sigma}\log f(X|\mu,\Sigma)=-\frac{n}{2}(\Sigma^{-1})^T-\frac{1}{2}\sum_i \frac{\partial}{\partial\Sigma}tr((X_i-\mu)(X_i-\mu)^T\Sigma^{-1})$
With some abuse of notation:
$\Rightarrow \frac{\partial}{\partial\Sigma}\log f(X|\mu,\Sigma)=-\frac{n}{2}(\Sigma^{-1})^T-\frac{1}{2}\sum_i \frac{1}{\partial\Sigma}tr((X_i-\mu)(X_i-\mu)^T\partial\Sigma^{-1})$
$\partial\Sigma^{-1}=-\Sigma^{-1}\partial\Sigma\Sigma^{-1}$, by substitution:
$\Rightarrow \frac{\partial}{\partial\Sigma}\log f(X|\mu,\Sigma)=-\frac{n}{2}(\Sigma^{-1})^T-\frac{1}{2}\sum_i \frac{1}{\partial\Sigma}tr((X_i-\mu)(X_i-\mu)^T(-\Sigma^{-1}\partial\Sigma\Sigma^{-1}))$
$=-\frac{n}{2}(\Sigma^{-1})^T+\frac{1}{2}\sum_i \frac{1}{\partial\Sigma}tr(\Sigma^{-1}(X_i-\mu)(X_i-\mu)^T\Sigma^{-1}\partial\Sigma)$
$=-\frac{n}{2}(\Sigma^{-1})^T+\frac{1}{2}\sum_i (\Sigma^{-1}(X_i-\mu)(X_i-\mu)^T\Sigma^{-1})^T$
$\Rightarrow \frac{\partial}{\partial\Sigma}\log f(X|\mu,\Sigma)=-\frac{n}{2}(\Sigma^{-1})^T+\frac{1}{2}\sum_i (\Sigma^{-1}(X_i-\mu)(X_i-\mu)^T\Sigma^{-1})^T=0$
$\frac{1}{2}\sum_i (\Sigma^{-1}(X_i-\mu)(X_i-\mu)^T\Sigma^{-1})^T=\frac{n}{2}(\Sigma^{-1})^T$
$\Rightarrow \sum_i (\Sigma^{-1}(X_i-\mu)(X_i-\mu)^T\Sigma^{-1})=n\Sigma^{-1}$
From left and right multiply by $\Sigma$:
$\Rightarrow \sum_i \Sigma(\Sigma^{-1}(X_i-\mu)(X_i-\mu)^T\Sigma^{-1})\Sigma=n\Sigma\Sigma^{-1}\Sigma$
$\Rightarrow \sum_i (X_i-\mu)(X_i-\mu)^T=n\Sigma$
$\Rightarrow \hat{\Sigma}_{MLE}=\frac{1}{n}\sum_i (X_i-\hat{\mu})(X_i-\hat{\mu})^T$
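A small numerical sketch (my own check, on simulated data) of this closed form:

```python
import numpy as np

rng = np.random.default_rng(0)
Sigma_true = np.array([[2.0, 0.5], [0.5, 1.0]])
X = rng.multivariate_normal([0.0, 0.0], Sigma_true, size=1000)
n = X.shape[0]

mu_hat = X.mean(axis=0)
D = X - mu_hat
Sigma_mle = D.T @ D / n                      # (1/n) * sum_i (X_i - mu_hat)(X_i - mu_hat)^T

print(Sigma_mle)
print(np.cov(X, rowvar=False, bias=True))    # same matrix: bias=True selects the 1/n normalization
```

With the default `bias=False`, `np.cov` returns the $\frac{1}{n-1}$ version mentioned next.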
This is a biased estimator and you can fix it by using:
$\Rightarrow \hat{\Sigma}=\frac{1}{n-1}\sum_i (X_i-\hat{\mu})(X_i-\hat{\mu})^T$
instead. I hope this helps.
Best Answer
Notice that $\ln(\color{blue}{\sqrt{\color{black}{x}}}) = \ln(x^{\color{blue}{\frac{1}{2}}}) = \color{blue}{\frac{1}{2}}\ln(x)$ and that $\ln(y) \color{red}{+} \ln(z) = \ln(y \color{red}{\cdot} z)$ for all $x,y,z > 0$. Using these identities, let us re-write the maximum entropy, $\frac{1}{2} + \ln(\sqrt{2\pi}\sigma)$, as follows: $$ \begin{align} \frac{1}{2} + \ln(\sqrt{2\pi}\sigma) &= \frac{1}{2} + \ln(\color{blue}{\sqrt{\color{black}{2\pi\sigma^2}}}) \\ &= \frac{1}{2} + \color{blue}{\frac{1}{2}}\ln(2\pi\sigma^2) \\ &= \frac{1}{2}(1 + \ln(2\pi\sigma^2)) \\ &= \frac{1}{2}(\ln(\mathrm{e}) \color{red}{+} \ln(2\pi\sigma^2)) = \frac{1}{2}\ln(\mathrm{e}\color{red}{\cdot}2\pi\sigma^2) \end{align} $$ So, the entropy reported in Wikipedia is correct.
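A one-line numerical confirmation of this identity (a sketch with an arbitrarily chosen $\sigma$):

```python
import math

sigma = 1.7  # any positive value
lhs = 0.5 + math.log(math.sqrt(2 * math.pi) * sigma)
rhs = 0.5 * math.log(math.e * 2 * math.pi * sigma**2)
print(lhs, rhs, math.isclose(lhs, rhs))   # True: the two forms of the maximum entropy agree
```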