Maximum Likelihood – How to Know the MLE of a Multivariate Gaussian Distribution Is the Argmax

convex optimization, maximum likelihood, optimization, statistics

I was thinking about an optimization question that arises in parametric statistics.
Suppose that you have $(x_i)_{1\leq i \leq n}$, $n$ i.i.d. observations in $\mathbb{R}^d$ that follow a multivariate Gaussian distribution $\mathcal{N}(\mu, \Sigma)$ with unknown parameters $(\mu, \Sigma)$.

The log-likelihood of the data is

$$\ell_{\mu, \Sigma}(x_{1:n}) = \sum_{i=1}^n{\left(-\frac{d}{2}\log(2\pi) - \frac{1}{2}\log\det\Sigma - \frac{1}{2}(x_i-\mu)^T\Sigma^{-1}(x_i-\mu)\right)} $$

Now, the maximum likelihood estimator (obtained by solving $\nabla_{\mu}\ell_{\mu, \Sigma} = 0$ and $\nabla_{\Sigma}\ell_{\mu, \Sigma} = 0$) is the pair (empirical mean, empirical covariance), i.e. $\hat\mu = \frac{1}{n}\sum_{i=1}^n x_i$ and $\hat\Sigma = \frac{1}{n}\sum_{i=1}^n (x_i-\hat\mu)(x_i-\hat\mu)^T$.
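For concreteness, here is a small numpy sketch (a sanity check, not a proof; the sample size, dimension, and perturbation scale are my own choices) that evaluates the log-likelihood above at the empirical pair and at random positive definite perturbations, and observes that the closed-form estimate always scores highest:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 500, 3
X = rng.multivariate_normal(np.zeros(d), np.eye(d), size=n)

def log_likelihood(X, mu, Sigma):
    """Gaussian log-likelihood, matching the displayed formula."""
    diff = X - mu
    quad = np.einsum('ij,jk,ik->', diff, np.linalg.inv(Sigma), diff)
    logdet = np.linalg.slogdet(Sigma)[1]
    return -0.5 * (n * d * np.log(2 * np.pi) + n * logdet + quad)

mu_hat = X.mean(axis=0)
Sigma_hat = (X - mu_hat).T @ (X - mu_hat) / n    # MLE normalises by n, not n-1

best = log_likelihood(X, mu_hat, Sigma_hat)
for _ in range(1000):
    mu = mu_hat + 0.1 * rng.standard_normal(d)
    E = 0.1 * rng.standard_normal((d, d))
    Sigma = Sigma_hat + (E + E.T) / 2            # random symmetric perturbation
    if np.all(np.linalg.eigvalsh(Sigma) > 0):    # keep only positive definite draws
        assert log_likelihood(X, mu, Sigma) < best
```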

How do we know that this stationary point is actually equal to $\operatorname{argmax}_{\mu, \Sigma}{\,\ell_{\mu, \Sigma}}$? It's not obvious in my opinion.

It would be clear if $\ell_{\mu, \Sigma}$ were concave in $(\mu, \Sigma)$, but this is not the case: the log-likelihood mixes convex and concave terms. Indeed, the function $-\log\det\Sigma$ is convex in $\Sigma$ (see https://web.stanford.edu/%7Eboyd/cvxbook/bv_cvxbook.pdf, page 74), while $-(x_i-\mu)^T\Sigma^{-1}(x_i-\mu)/2$ is concave in $(\mu, \Sigma)$ (Boyd's book, page 76, example 3.4).
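To see the failure of concavity concretely, here is a minimal sketch under assumptions of my own ($d = 1$, $\mu$ fixed at $0$): along the variance slice $s \mapsto \ell(s)$, the second derivative $n/(2s^2) - S/s^3$ is positive for $s > 2S/n$, so on that region the value at a midpoint falls strictly below the average of the endpoint values, which a concave function cannot do:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(0.0, 1.0, size=100)
S = np.sum(x**2)          # fix mu = 0; S = sum of squared residuals
n = x.size

def ell(s):               # log-likelihood as a function of the variance s
    return -0.5 * n * np.log(2 * np.pi * s) - S / (2 * s)

s1, s2 = 10.0, 50.0       # both well above 2*S/n, the locally convex region
mid = ell((s1 + s2) / 2)
chord = (ell(s1) + ell(s2)) / 2
print(mid < chord)        # True: concavity fails along this slice
```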

Best Answer

I'll first get rid of a bunch of symbols to isolate the main problem. Let me write $$ m = \sum x_i/n, \qquad V = \sum (x_i - m)(x_i - m)^\top/n.$$ Normalising by $n/2$, dropping the parameter-independent constant $-d\log(2\pi)$, and writing $\nu = \mu - m$, the objective reduces to

$$ \mathcal{L}(\nu,\Sigma) = -\langle \Sigma^{-1}, V + \nu\nu^\top\rangle - \log\det\Sigma, $$ where $\langle A,B\rangle = \mathrm{Tr}(AB)$ is the Frobenius inner product, and where the domain is $\mathbb{R}^d \times \mathcal{C}_\succ,$ with $\mathcal{C}_\succ$ denoting the cone of positive definite symmetric matrices. Note that the boundary of this cone consists of the singular positive semidefinite symmetric matrices. Now the first observation is that since $\Sigma^{-1} \succ 0,$ we have $\langle \Sigma^{-1}, \nu\nu^\top\rangle = \nu^\top\Sigma^{-1}\nu > 0$ whenever $\nu \neq 0$, so any nonzero $\nu$ can only decrease the objective, and there must always be a global maximum with $\nu = 0$. Thus, we can reduce to studying the optimisation of $$\mathcal{M}(\Sigma) = - \langle \Sigma^{-1}, V\rangle - \log \det\Sigma. $$
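One can check this reduction numerically before proceeding; the following sketch (helper names are my own) verifies that $\tfrac{2}{n}\ell_{\mu,\Sigma} + d\log(2\pi) = \mathcal{L}(\mu - m, \Sigma)$ exactly, and that zeroing out $\nu$ never decreases the objective:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 200, 4
X = rng.standard_normal((n, d))
m = X.mean(axis=0)
V = (X - m).T @ (X - m) / n

def ell(mu, Sigma):
    diff = X - mu
    quad = np.einsum('ij,jk,ik->', diff, np.linalg.inv(Sigma), diff)
    return -0.5 * (n * d * np.log(2 * np.pi) + n * np.linalg.slogdet(Sigma)[1] + quad)

def L(nu, Sigma):
    Prec = np.linalg.inv(Sigma)
    return -np.trace(Prec @ (V + np.outer(nu, nu))) - np.linalg.slogdet(Sigma)[1]

mu = rng.standard_normal(d)
A = rng.standard_normal((d, d))
Sigma = A @ A.T + np.eye(d)                        # a generic positive definite matrix

lhs = (2 / n) * ell(mu, Sigma) + d * np.log(2 * np.pi)
print(np.isclose(lhs, L(mu - m, Sigma)))           # True: the reduction is exact
print(L(np.zeros(d), Sigma) >= L(mu - m, Sigma))   # True: zeroing nu never hurts
```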

We can break things into two cases:

  1. If $V$ is positive definite, then let me reparametrise in terms of the precision matrix $\Theta = \Sigma^{-1}$. Using $\log\det A = - \log \det A^{-1}$ for invertible matrices, we immediately end up with the convex program of maximising $$\mathcal{N}(\Theta) = -\langle \Theta, V\rangle + \log\det \Theta.$$ The objective is coercive (since $x \gg \log x$, it tends to $-\infty$ both as $\Theta$ blows up and as $\Theta$ approaches the boundary of the cone), and the solution of the first-order condition $\Theta^{-1} = V$ lies in the interior of the cone $\mathcal{C}_\succ,$ and so is the global optimum. Of course this first-order condition corresponds exactly to $\Sigma = V$ (a numerical sanity check appears after this list).

  2. If $V$ is not positive definite, then notice that the program in $\Theta$ doesn't have a maximiser (take $\Theta = \Theta_0 + \lambda D$ for some nonzero $D \succeq 0$ with $\langle V, D\rangle = 0$, e.g. the orthogonal projector onto the null space of $V$, and send $\lambda \to \infty$). The same consideration tells us that $\sup \mathcal{M}(\Sigma) = \infty,$ and one of the points "attaining" this is $\Sigma = V \in \partial\mathcal{C}_{\succ}.$ Indeed, without loss of generality (why? a congruence $V \mapsto AVA^\top$ only shifts the objective by the constant $2\log|\det A|$) we can just assume that $V = \begin{pmatrix} I_k & 0 \\ 0 & 0\end{pmatrix}$ for some $k < d$. Then over the domain $\varepsilon \in (0,\infty),$ $$ f(\varepsilon) := -\langle (V + \varepsilon I)^{-1} ,V\rangle - \log\det (V+ \varepsilon I) = -k/(1+\varepsilon) - k\log(1+\varepsilon) - (d-k)\log \varepsilon$$ is continuous in $\varepsilon$ and explodes to $\infty$ as $\varepsilon \to 0,$ and so $\Sigma_* = V = \lim_{\varepsilon \to 0} (V + \varepsilon I)$ is the limit of a sequence of feasible points whose value converges to the optimal value ($\infty$), and can in a sense be seen as a maximiser (see the second sketch after this list). That said, it's not the only solution: for example, if $k \le d-2,$ then $\Sigma' = \begin{pmatrix} I_{k+1} & 0 \\ 0 & 0\end{pmatrix}$ may also be considered a maximiser in the same sense. The standard solution of $\Sigma = V$ is typically chosen for simplicity, and because it agrees formulaically with the case where $V$ is invertible.
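Here is the numerical companion to case 1 promised above (a sanity check under a randomly drawn $V$ of my own choosing, not a proof): the gradient $-V + \Theta^{-1}$ vanishes at $\Theta_* = V^{-1}$, and random feasible perturbations only lower $\mathcal{N}$:

```python
import numpy as np

rng = np.random.default_rng(3)
d = 3
A = rng.standard_normal((10 * d, d))
V = A.T @ A / (10 * d)                       # positive definite with overwhelming probability

def N(Theta):
    return -np.trace(Theta @ V) + np.linalg.slogdet(Theta)[1]

Theta_star = np.linalg.inv(V)
grad = -V + np.linalg.inv(Theta_star)        # gradient of N at Theta*
print(np.allclose(grad, 0))                  # True: first-order condition holds

best = N(Theta_star)
for _ in range(1000):
    E = 0.1 * rng.standard_normal((d, d))
    Theta = Theta_star + (E + E.T) / 2       # random symmetric perturbation
    if np.all(np.linalg.eigvalsh(Theta) > 0):
        assert N(Theta) < best               # strictly concave program: Theta* wins
```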
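And the companion to case 2: for a rank-deficient $V$ in the block form above, evaluating $\mathcal{M}(V + \varepsilon I)$ directly reproduces the closed-form $f(\varepsilon)$ and increases without bound as $\varepsilon \to 0$ (dimensions here are an arbitrary choice):

```python
import numpy as np

d, k = 4, 2
V = np.diag([1.0] * k + [0.0] * (d - k))     # the WLOG block form, rank k < d

def M(Sigma):
    return -np.trace(np.linalg.inv(Sigma) @ V) - np.linalg.slogdet(Sigma)[1]

for eps in [1e-1, 1e-3, 1e-6, 1e-9]:
    direct = M(V + eps * np.eye(d))
    formula = -k / (1 + eps) - k * np.log(1 + eps) - (d - k) * np.log(eps)
    print(f"eps={eps:.0e}  M={direct:10.3f}  f={formula:10.3f}")
# both columns agree and blow up as eps shrinks, as the argument predicts
```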
