Maximum Likelihood – Estimators for Multivariate Gaussian Distributions

Tags: estimators, maximum likelihood, multivariate normal distribution, normal distribution

Context

The multivariate Gaussian appears frequently in Machine Learning, and the following results are used in many ML books and courses without derivation.

Given data in the form of a matrix $\mathbf{X}$ of dimensions $m \times p$, if we assume that the data follows a $p$-variate Gaussian distribution with mean $\mu$ ($p \times 1$) and covariance matrix $\Sigma$ ($p \times p$), the Maximum Likelihood Estimators are given by:

  • $\hat \mu = \frac{1}{m} \sum_{i=1}^m \mathbf{x}^{(i)} = \mathbf{\bar{x}}$
  • $\hat \Sigma = \frac{1}{m} \sum_{i=1}^m (\mathbf{x}^{(i)} - \hat \mu)(\mathbf{x}^{(i)} - \hat \mu)^T$
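
A quick numerical sanity check of these two formulas (a minimal sketch; assumes NumPy, and the sample size, dimension, and "true" parameters below are arbitrary illustrations):

```python
import numpy as np

rng = np.random.default_rng(0)
m, p = 1000, 3                      # m samples, p dimensions (arbitrary choices)

# Draw synthetic data from a known p-variate Gaussian.
true_mu = np.array([1.0, -2.0, 0.5])
A = rng.standard_normal((p, p))
true_sigma = A @ A.T + p * np.eye(p)            # symmetric positive-definite covariance
X = rng.multivariate_normal(true_mu, true_sigma, size=m)   # shape (m, p)

# Maximum likelihood estimators as stated above.
mu_hat = X.mean(axis=0)                         # (1/m) * sum_i x^(i)
centered = X - mu_hat
sigma_hat = (centered.T @ centered) / m         # (1/m) * sum_i (x^(i)-mu_hat)(x^(i)-mu_hat)^T

# The MLE of the covariance divides by m, so it matches np.cov with bias=True.
assert np.allclose(sigma_hat, np.cov(X, rowvar=False, bias=True))
print(mu_hat)       # close to true_mu for large m
print(sigma_hat)    # close to true_sigma for large m
```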

I understand that knowledge of the multivariate Gaussian is a prerequisite for many ML courses, but it would be helpful to have the full derivation in a self-contained answer once and for all, as I feel many self-learners are bouncing around the stats.stackexchange and math.stackexchange websites looking for answers.


Question

What is the full derivation of the Maximum Likelihood Estimators for the multivariate Gaussian?


Examples:

These lecture notes (page 11) on Linear Discriminant Analysis, as well as these ones, make use of the results and assume previous knowledge.

There are also a few related posts which are only partly answered or closed.

Best Answer

An alternate proof for $\widehat{\Sigma}$ that takes the derivative with respect to $\Sigma$ directly:
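
For reference, the log-likelihood the argument picks up from is the standard multivariate-Gaussian one, with $C$ collecting the constant term $-\frac{mp}{2}\log(2\pi)$:

$$ \ell(\mu, \Sigma) = -\frac{mp}{2}\log(2\pi) - \frac{m}{2}\log|\Sigma| - \frac{1}{2}\sum_{i=1}^m (\mathbf{x}^{(i)}-\mu)^T \Sigma^{-1} (\mathbf{x}^{(i)}-\mu) $$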

Picking up with the log-likelihood as above:
\begin{eqnarray}
\ell(\mu, \Sigma) &=& C - \frac{m}{2}\log|\Sigma| - \frac{1}{2} \sum_{i=1}^m \text{tr}\left[(\mathbf{x}^{(i)}-\mu)^T \Sigma^{-1} (\mathbf{x}^{(i)}-\mu)\right]\\
&=& C - \frac{1}{2}\left(m\log|\Sigma| + \sum_{i=1}^m \text{tr}\left[(\mathbf{x}^{(i)}-\mu)(\mathbf{x}^{(i)}-\mu)^T \Sigma^{-1}\right]\right)\\
&=& C - \frac{1}{2}\left(m\log|\Sigma| + \text{tr}\left[S_\mu \Sigma^{-1}\right]\right)
\end{eqnarray}
where $S_\mu = \sum_{i=1}^m (\mathbf{x}^{(i)}-\mu)(\mathbf{x}^{(i)}-\mu)^T$ and we have used the cyclic and linear properties of $\text{tr}$.

To compute $\partial \ell /\partial \Sigma$ we first observe that
$$ \frac{\partial}{\partial \Sigma} \log |\Sigma| = \Sigma^{-T} = \Sigma^{-1} $$
by the fourth property above. To take the derivative of the second term we will need the property that
$$ \frac{\partial}{\partial X}\text{tr}\left( A X^{-1} B\right) = -(X^{-1}BAX^{-1})^T $$
(from The Matrix Cookbook, equation 63). Applying this with $B=I$ we obtain that
$$ \frac{\partial}{\partial \Sigma}\text{tr}\left[S_\mu \Sigma^{-1}\right] = -\left( \Sigma^{-1} S_\mu \Sigma^{-1}\right)^T = -\Sigma^{-1} S_\mu \Sigma^{-1} $$
because both $\Sigma$ and $S_\mu$ are symmetric. Then
$$ \frac{\partial}{\partial \Sigma}\ell(\mu, \Sigma) \propto m \Sigma^{-1} - \Sigma^{-1} S_\mu \Sigma^{-1}. $$
Setting this to $0$ and rearranging gives
$$ \widehat{\Sigma} = \frac{1}{m}S_\mu. $$
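As a numerical sanity check of the two matrix-derivative identities used above and of the resulting stationary point, here is a minimal sketch with entrywise finite differences (assumes NumPy; the dimensions, test matrices, and tolerances are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
p, m = 4, 50                                   # arbitrary dimension and sample count

# A symmetric positive-definite Sigma and an S_mu built from centered samples.
A = rng.standard_normal((p, p))
Sigma = A @ A.T + p * np.eye(p)
X = rng.standard_normal((m, p))
Xc = X - X.mean(axis=0)
S_mu = Xc.T @ Xc                               # sum_i (x^(i)-mu_hat)(x^(i)-mu_hat)^T

def log_det(M):
    return np.linalg.slogdet(M)[1]

def num_grad(f, M, h=1e-6):
    """Entrywise central finite-difference approximation of d f / d M_ij."""
    G = np.zeros_like(M)
    for i in range(M.shape[0]):
        for j in range(M.shape[1]):
            E = np.zeros_like(M)
            E[i, j] = h
            G[i, j] = (f(M + E) - f(M - E)) / (2 * h)
    return G

Sigma_inv = np.linalg.inv(Sigma)

# d/dSigma log|Sigma| = Sigma^{-1}  (Sigma symmetric)
assert np.allclose(num_grad(log_det, Sigma), Sigma_inv, atol=1e-4)

# d/dSigma tr[S_mu Sigma^{-1}] = -Sigma^{-1} S_mu Sigma^{-1}
f = lambda M: np.trace(S_mu @ np.linalg.inv(M))
assert np.allclose(num_grad(f, Sigma), -Sigma_inv @ S_mu @ Sigma_inv, atol=1e-4)

# The gradient m*Sigma^{-1} - Sigma^{-1} S_mu Sigma^{-1} vanishes at Sigma_hat = S_mu / m.
Sigma_hat = S_mu / m
Sh_inv = np.linalg.inv(Sigma_hat)
assert np.allclose(m * Sh_inv - Sh_inv @ S_mu @ Sh_inv, 0.0, atol=1e-8)

print("all checks passed")
```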

This approach is more work than the standard one using derivatives with respect to $\Lambda = \Sigma^{-1}$, and requires a more complicated trace identity. I only found it useful because I currently need to take derivatives of a modified likelihood function for which it seems much harder to use $\partial/{\partial \Sigma^{-1}}$ than $\partial/\partial \Sigma$.
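
For comparison, the shorter route via $\Lambda = \Sigma^{-1}$ alluded to above needs only the identities $\frac{\partial}{\partial \Lambda}\log|\Lambda| = \Lambda^{-1}$ and $\frac{\partial}{\partial \Lambda}\text{tr}\left[S_\mu \Lambda\right] = S_\mu^T = S_\mu$ (a sketch in the same notation):

$$ \ell(\mu, \Lambda) = C + \frac{m}{2}\log|\Lambda| - \frac{1}{2}\text{tr}\left[S_\mu \Lambda\right], \qquad \frac{\partial \ell}{\partial \Lambda} = \frac{m}{2}\Lambda^{-1} - \frac{1}{2}S_\mu = 0 \;\Longrightarrow\; \widehat{\Sigma} = \widehat{\Lambda}^{-1} = \frac{1}{m}S_\mu. $$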