# Maximum Likelihood – Estimators for Multivariate Gaussian Distributions

Tags: estimators, maximum-likelihood, multivariate-normal-distribution, normal-distribution

### Context

The multivariate Gaussian appears frequently in Machine Learning, and the following results are used in many ML books and courses without the derivations.

Given data in the form of a matrix $\mathbf{X}$ of dimensions $m \times p$, if we assume that the data follows a $p$-variate Gaussian distribution with parameters mean $\mu$ ($p \times 1$) and covariance matrix $\Sigma$ ($p \times p$), the Maximum Likelihood Estimators are given by:

• $\hat \mu = \frac{1}{m} \sum_{i=1}^m \mathbf{x}^{(i)} = \mathbf{\bar{x}}$
• $\hat \Sigma = \frac{1}{m} \sum_{i=1}^m (\mathbf{x}^{(i)} - \hat \mu)(\mathbf{x}^{(i)} - \hat \mu)^T$
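As a quick numerical illustration (a sketch with made-up data and dimensions), the two estimators above can be computed directly from a data matrix and cross-checked against `np.cov` with `bias=True`, which also uses the $1/m$ normalization rather than the unbiased $1/(m-1)$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical example: m samples from a 3-variate Gaussian
m, p = 5000, 3
mu_true = np.array([1.0, -2.0, 0.5])
A = rng.standard_normal((p, p))
Sigma_true = A @ A.T + p * np.eye(p)  # symmetric positive definite
X = rng.multivariate_normal(mu_true, Sigma_true, size=m)  # shape (m, p)

# MLE of the mean: the sample average over rows
mu_hat = X.mean(axis=0)

# MLE of the covariance: average outer product of centered rows
# (note the 1/m factor, not the unbiased 1/(m-1))
Xc = X - mu_hat
Sigma_hat = (Xc.T @ Xc) / m

# Cross-check against numpy's covariance with bias=True (also 1/m)
assert np.allclose(Sigma_hat, np.cov(X, rowvar=False, bias=True))
```

The assertion holds exactly (up to floating point) because both expressions compute the same centered outer-product average.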

I understand that knowledge of the multivariate Gaussian is a prerequisite for many ML courses, but it would be helpful to have the full derivation in a self-contained answer once and for all, as many self-learners bounce between the stats.stackexchange and math.stackexchange websites looking for answers.

### Question

What is the full derivation of the Maximum Likelihood Estimators for the multivariate Gaussian?

### Examples

These lecture notes (page 11) on Linear Discriminant Analysis, or these ones, make use of the results and assume previous knowledge.

There are also a few related posts which are only partly answered or closed.

---

An alternate proof for $\widehat{\Sigma}$ that takes the derivative with respect to $\Sigma$ directly.

Picking up with the log-likelihood as above:

$$\begin{eqnarray} \ell(\mu, \Sigma) &=& C - \frac{m}{2}\log|\Sigma| - \frac{1}{2} \sum_{i=1}^m \text{tr}\left[(\mathbf{x}^{(i)}-\mu)^T \Sigma^{-1} (\mathbf{x}^{(i)}-\mu)\right]\\ &=& C - \frac{1}{2}\left(m\log|\Sigma| + \sum_{i=1}^m \text{tr}\left[(\mathbf{x}^{(i)}-\mu)(\mathbf{x}^{(i)}-\mu)^T \Sigma^{-1}\right]\right)\\ &=& C - \frac{1}{2}\left(m\log|\Sigma| + \text{tr}\left[S_\mu \Sigma^{-1}\right]\right) \end{eqnarray}$$

where $S_\mu = \sum_{i=1}^m (\mathbf{x}^{(i)}-\mu)(\mathbf{x}^{(i)}-\mu)^T$ and we have used the cyclic and linear properties of $\text{tr}$.

To compute $\partial \ell / \partial \Sigma$ we first observe that

$$\frac{\partial}{\partial \Sigma} \log |\Sigma| = \Sigma^{-T} = \Sigma^{-1}$$

by the fourth property above. To take the derivative of the second term we will need the identity

$$\frac{\partial}{\partial X}\text{tr}\left( A X^{-1} B\right) = -(X^{-1}BAX^{-1})^T$$

(from The Matrix Cookbook, equation 63). Applying this with $B=I$ we obtain

$$\frac{\partial}{\partial \Sigma}\text{tr}\left[S_\mu \Sigma^{-1}\right] = -\left( \Sigma^{-1} S_\mu \Sigma^{-1}\right)^T = -\Sigma^{-1} S_\mu \Sigma^{-1}$$

because both $\Sigma$ and $S_\mu$ are symmetric. Then

$$\frac{\partial}{\partial \Sigma}\ell(\mu, \Sigma) \propto m \Sigma^{-1} - \Sigma^{-1} S_\mu \Sigma^{-1}.$$

Setting this to $0$ and rearranging gives

$$\widehat{\Sigma} = \frac{1}{m}S_\mu.$$
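The closed-form gradient used in the derivation can be sanity-checked numerically (a sketch with hypothetical data; dimensions and the test point $\Sigma_0$ are made up). We verify that the gradient formula matches a central finite difference at a generic positive-definite point, and that it vanishes at $\widehat{\Sigma} = S_\mu / m$:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical small example
m, p = 200, 2
X = rng.standard_normal((m, p))
mu = X.mean(axis=0)
S_mu = (X - mu).T @ (X - mu)  # the scatter matrix S_mu

def loglik(Sigma):
    # l(mu, Sigma) up to the additive constant C, as in the derivation
    _, logdet = np.linalg.slogdet(Sigma)
    return -0.5 * (m * logdet + np.trace(S_mu @ np.linalg.inv(Sigma)))

def grad(Sigma):
    # closed form: -(1/2)(m Sigma^{-1} - Sigma^{-1} S_mu Sigma^{-1})
    Si = np.linalg.inv(Sigma)
    return -0.5 * (m * Si - Si @ S_mu @ Si)

# Finite-difference check of the gradient at a generic SPD point
Sigma0 = np.eye(p) + 0.1 * np.ones((p, p))
eps = 1e-6
fd = np.zeros((p, p))
for i in range(p):
    for j in range(p):
        E = np.zeros((p, p))
        E[i, j] = eps
        fd[i, j] = (loglik(Sigma0 + E) - loglik(Sigma0 - E)) / (2 * eps)
assert np.allclose(fd, grad(Sigma0), atol=1e-4)

# The gradient vanishes at the MLE, Sigma_hat = S_mu / m
Sigma_hat = S_mu / m
assert np.allclose(grad(Sigma_hat), 0, atol=1e-6)
```

Note that the finite difference perturbs each entry of $\Sigma$ independently (an unconstrained derivative), which is exactly what the Matrix Cookbook formulas compute; at a symmetric point the two expressions coincide.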
This approach is more work than the standard one using derivatives with respect to $\Lambda = \Sigma^{-1}$, and requires a more complicated trace identity. I only found it useful because I currently need to take derivatives of a modified likelihood function for which it seems much harder to use $\partial/\partial \Sigma^{-1}$ than $\partial/\partial \Sigma$.