Proving Maximum Entropy of Multivariate Normal Distribution

information-theory, optimization, probability, real-analysis, statistics

I'm working on a homework question. The first part was:

Given an unbounded one-dimensional continuous random variable $X\in\left(-\infty,\infty\right)$ that satisfies $\left\langle X\right\rangle =\mu,\;\left\langle \left(X-\mu\right)^{2}\right\rangle =\sigma^{2}$, show that the distribution that maximizes the entropy is the Gaussian $X\sim N\left(\mu,\sigma^{2}\right)$.

I've solved this using the method of Lagrange multipliers (a sketch of my one-dimensional solution is below). The next part is to prove that the same holds in the multivariate case.
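Roughly, the one-dimensional functional was
$$F\left[f_{X}\right]=H\left(X\right)+\lambda\left(1-\intop_{\mathbb{R}}f_{X}\left(x\right)dx\right)+\Gamma\left(\mu-\intop_{\mathbb{R}}x\,f_{X}\left(x\right)dx\right)+\Lambda\left(\sigma^{2}-\intop_{\mathbb{R}}\left(x-\mu\right)^{2}f_{X}\left(x\right)dx\right),$$
whose stationary point has the form $f_{X}\left(x\right)\propto\exp\left(-\Gamma x-\Lambda\left(x-\mu\right)^{2}\right)$; matching the constraints forces $\Gamma=0$ and $\Lambda=\frac{1}{2\sigma^{2}}$, which together with normalization gives $X\sim N\left(\mu,\sigma^{2}\right)$.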

Generalize the previous part to a $k$-dimensional variable $X$ with given expectation value $\vec{\mu}$ and covariance matrix $\Sigma$.

I started the same way, defining the functional I wish to optimize:
$$ F\left[f_{X}\left(\overline{x}\right)\right]=H\left(X\right)+\lambda\left(1-\intop_{\mathbb{R}^{k}}f_{X}\left(\overline{x}\right)d\overline{x}\right)+\sum_{i\in\left[k\right]}\varGamma_{i}\left(\mu_{i}-\intop_{\mathbb{R}^{k}}\overline{x}_{i}f_{X}\left(\overline{x}\right)d\overline{x}\right)+\sum_{i,j\in\left[k\right]}\Lambda_{ij}\left(\Sigma_{ij}-\intop_{\mathbb{R}^{k}}\left(\mu_{i}-\overline{x}_{i}\right)\left(\mu_{j}-\overline{x}_{j}\right)f_{X}\left(\overline{x}\right)d\overline{x}\right)$$
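Explicitly, with $H\left(X\right)=-\intop_{\mathbb{R}^{k}}f_{X}\left(\overline{x}\right)\log f_{X}\left(\overline{x}\right)d\overline{x}$, the stationarity condition $\frac{\delta F}{\delta f_{X}}=0$ reads (up to signs absorbed into the multipliers, which is the convention I use below)
$$\log f_{X}\left(\overline{x}\right)=\lambda-1-\Gamma\cdot\overline{x}-\left(\vec{\mu}-\overline{x}\right)^{T}\Lambda\left(\vec{\mu}-\overline{x}\right).$$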

Exponentiating and solving for the PDF, I get the following form in terms of the Lagrange multipliers:
$$f_{X}\left(\overline{x}\right)=\exp\left(\lambda-1\right)\exp\left(-\Gamma\cdot\overline{x}-\left(\vec{\mu}-\overline{x}\right)^{T}\Lambda\left(\vec{\mu}-\overline{x}\right)\right)$$
where $\Gamma$ ($k\times 1$), $\Lambda$ ($k\times k$), and $\lambda$ (a scalar) are the multipliers.
I wish to show that these multipliers must take precisely the values that make this PDF a multivariate Gaussian.

For some reason, this is where I get stuck. I've tried various algebraic manipulations, but the term $\exp{(-\Gamma\cdot\overline{x})}$ keeps messing up my calculations.
I'm leaving out the constraint equations themselves, since they already appear inside the functional above, and they lead to the same mess with $\exp{(-\Gamma\cdot\overline{x})}$ when I try to evaluate the integrals.

I feel like I'm missing something.
Any help would be really appreciated!

Edit:

  1. Unfortunately, the provided solution for this exercise just says "yeah, this looks Gaussian, so we can find the parameters so that it works out". Although this isn't strictly a math course, I have a very hard time accepting that answer, so this question is still highly relevant.

  2. There might be an easier way to solve this via the KL divergence, but since the first part of the question was required for this part, I would really like to see the Lagrange-multiplier approach through.

Another Edit:

I can carry out the calculations needed; my main issue is that the constraint $$\sum_{i\in\left[k\right]}\varGamma_{i}\left(\mu_{i}-\intop_{\mathbb{R}^{k}}\overline{x}_{i}f_{X}\left(\overline{x}\right)d\overline{x}\right)$$ does not seem to be needed, since when solving the equations I can set $\Gamma=0$ and still get the proper results. I would like a rigorous explanation of why the mean is determined solely by the third constraint.
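For concreteness, the manipulation I have in mind is completing the square (assuming $\Lambda$ is symmetric and invertible):
$$-\Gamma\cdot\overline{x}-\left(\vec{\mu}-\overline{x}\right)^{T}\Lambda\left(\vec{\mu}-\overline{x}\right)=-\left(\vec{m}-\overline{x}\right)^{T}\Lambda\left(\vec{m}-\overline{x}\right)+\frac{1}{4}\Gamma^{T}\Lambda^{-1}\Gamma-\Gamma\cdot\vec{\mu},\qquad\vec{m}=\vec{\mu}-\frac{1}{2}\Lambda^{-1}\Gamma,$$
so the stationary density is a Gaussian-shaped function centered at $\vec{m}$. Setting $\Gamma=0$ centers it at $\vec{\mu}$ straight away, which is exactly what makes the second constraint look redundant to me.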

Best Answer

The proof of this can be made very simple. You want to minimize the quantity \begin{align} H[\rho] = \int_{\mathbb{R}^d} \rho \log \rho \, \mathrm{d}x \, , \end{align} where $\rho$ is the p.d.f./law of the random variable $X$, subject to the constraint that $\rho$ has mean $\mu$ and covariance $\Sigma$ (note that this $H[\rho]$ is the negative of the differential entropy; see the remark at the end). Denote by $N_{\Sigma,\mu}$ the corresponding Gaussian p.d.f. Let $\rho$ be a p.d.f. which satisfies these constraints. We then have \begin{align} H[\rho]= & \int_{\mathbb{R}^d} \rho \log \frac{\rho}{N_{\Sigma,\mu}} \, \mathrm{d}x +\int_{\mathbb{R}^d} \rho \log N_{\Sigma,\mu} \, \mathrm{d}x \\ =&\int_{\mathbb{R}^d} \rho \log \frac{\rho}{N_{\Sigma,\mu}} \, \mathrm{d}x -\frac{d}{2} -\frac 12 \log\left((2\pi)^d|\Sigma|\right) \qquad (*) \,, \end{align} where the second equality uses the moment constraints on $\rho$. The first term in (*) is the relative entropy between $\rho$ and $N_{\Sigma,\mu}$, and the other terms are fixed constants. The relative entropy (or what someone in the comments called the KL divergence) is always non-negative, so if you can make it zero you have minimized the above quantity and you are done. You can do this by choosing $\rho =N_{\Sigma,\mu}$.
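For completeness, here is how the second integral in (*) reduces to those constants using the constraints: \begin{align} \int_{\mathbb{R}^d} \rho \log N_{\Sigma,\mu} \, \mathrm{d}x =& \int_{\mathbb{R}^d} \rho \left( -\tfrac{1}{2}\log\left((2\pi)^d|\Sigma|\right) - \tfrac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)\right) \mathrm{d}x \\ =& -\tfrac{1}{2}\log\left((2\pi)^d|\Sigma|\right) - \tfrac{1}{2}\operatorname{tr}\left(\Sigma^{-1}\int_{\mathbb{R}^d}(x-\mu)(x-\mu)^T \rho \, \mathrm{d}x\right) \\ =& -\tfrac{1}{2}\log\left((2\pi)^d|\Sigma|\right) - \tfrac{1}{2}\operatorname{tr}\left(\Sigma^{-1}\Sigma\right) = -\frac{d}{2} -\frac 12 \log\left((2\pi)^d|\Sigma|\right) \, . \end{align} In particular, this term has the same value for every admissible $\rho$, including $N_{\Sigma,\mu}$ itself.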

Furthermore, here's a short proof that the relative entropy is non-negative. Using Jensen's inequality with respect to the probability measure $N_{\Sigma,\mu}\,\mathrm{d}x$ (since $x \log x$ is convex), we have \begin{align} \int_{\mathbb{R}^d} \rho \log \frac{\rho}{N_{\Sigma,\mu}} \, \mathrm{d}x = &\int_{\mathbb{R}^d} \frac{\rho}{N_{\Sigma,\mu}} \log \left(\frac{\rho}{N_{\Sigma,\mu}} \right) N_{\Sigma,\mu} \, \mathrm{d}x \\ \geq & \left(\int_{\mathbb{R}^d} \frac{\rho}{N_{\Sigma,\mu}} N_{\Sigma,\mu} \, \mathrm{d}x\right) \log \left(\int_{\mathbb{R}^d} \frac{\rho}{N_{\Sigma,\mu}} N_{\Sigma,\mu} \, \mathrm{d}x\right) = 1 \cdot \log 1 =0 \, . \end{align}

As a reference, look at Proposition 9.12 of the following book: https://link.springer.com/book/10.1007/978-1-4612-4286-4. It contains the result you want, but in more generality. Just a note: the authors of this book are trying to maximize the entropy because they have flipped its sign (this is the physicists' definition of entropy).
