[Math] MLE of the covariance matrix of a multivariate Gaussian distribution

covariance, maximum likelihood, normal distribution

I am reading through the following question:
MLE of bivariate normal distribution

But there is one step I don't understand in the derivation of the MLE for the covariance matrix:

$\Rightarrow \frac{\partial}{\partial\Sigma}\log f(X|\mu,\Sigma)=-\frac{n}{2}(\Sigma^{-1})^T-\frac{1}{2}\sum_i \frac{\partial}{\partial\Sigma}tr((X_i-\mu)(X_i-\mu)^T\Sigma^{-1})$

With some abuse of notation:
$\Rightarrow \frac{\partial}{\partial\Sigma}\log f(X|\mu,\Sigma)=-\frac{n}{2}(\Sigma^{-1})^T-\frac{1}{2}\sum_i \frac{1}{\partial\Sigma}tr((X_i-\mu)(X_i-\mu)^T\partial\Sigma^{-1})$

$\partial\Sigma^{-1}=-\Sigma^{-1}\partial\Sigma\Sigma^{-1}$, by substitution:

$\Rightarrow \frac{\partial}{\partial\Sigma}\log f(X|\mu,\Sigma)=-\frac{n}{2}(\Sigma^{-1})^T-\frac{1}{2}\sum_i \frac{1}{\partial\Sigma}tr((X_i-\mu)(X_i-\mu)^T(-\Sigma^{-1}\partial\Sigma\Sigma^{-1}))$

$=-\frac{n}{2}(\Sigma^{-1})^T+\frac{1}{2}\sum_i \frac{1}{\partial\Sigma}tr(\Sigma^{-1}(X_i-\mu)(X_i-\mu)^T\Sigma^{-1}\partial\Sigma)$

$=-\frac{n}{2}(\Sigma^{-1})^T+\frac{1}{2}\sum_i (\Sigma^{-1}(X_i-\mu)(X_i-\mu)^T\Sigma^{-1})^T$

$\Rightarrow \frac{\partial}{\partial\Sigma}\log f(X|\mu,\Sigma)=-\frac{n}{2}(\Sigma^{-1})^T+\frac{1}{2}\sum_i (\Sigma^{-1}(X_i-\mu)(X_i-\mu)^T\Sigma^{-1})^T=0$

$\frac{1}{2}\sum_i (\Sigma^{-1}(X_i-\mu)(X_i-\mu)^T\Sigma^{-1})^T=\frac{n}{2}(\Sigma^{-1})^T$

The step that I don't understand is the one where the partial derivative disappears. First $\partial$ is moved inside the trace operator, then $\partial\Sigma^{-1}$ is rewritten, and then somehow, at least that is how I see it, $\partial \Sigma$ is pulled out of the trace in order to get rid of the partial derivative. Can someone tell me why you can do this and why it is valid?

Best Answer

This whole derivation relies on two key aspects of matrix algebra.

  1. $\text{tr}(ABC) = \text{tr}(BCA) = \text{tr}(CAB)$, as long as the dimensions of the matrices are compatible with matrix multiplication.
  2. $\frac{\partial \text{tr}(AX)}{\partial X} = A^{T}$
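Both identities can be checked numerically. The following is a minimal sketch, assuming NumPy is available; the matrix shapes, the random seed, and the names `M` and `grad` are arbitrary choices for illustration, not part of the derivation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Aspect 1: the trace is invariant under cyclic permutations,
# provided every product is defined.
A = rng.normal(size=(3, 4))
B = rng.normal(size=(4, 5))
C = rng.normal(size=(5, 3))
t_abc = np.trace(A @ B @ C)
t_bca = np.trace(B @ C @ A)
t_cab = np.trace(C @ A @ B)

# Aspect 2: d tr(MX)/dX = M^T, checked entry by entry
# with central finite differences.
M = rng.normal(size=(3, 3))
X = rng.normal(size=(3, 3))
eps = 1e-6
grad = np.zeros_like(X)
for i in range(3):
    for j in range(3):
        E = np.zeros_like(X)
        E[i, j] = eps  # perturb a single entry of X
        grad[i, j] = (np.trace(M @ (X + E)) - np.trace(M @ (X - E))) / (2 * eps)

print(np.allclose([t_bca, t_cab], t_abc))
print(np.allclose(grad, M.T, atol=1e-6))
```

The finite-difference gradient treats every entry of $X$ as an independent variable, which is the same convention the second identity uses.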

Specific to the question you asked,

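\begin{align}
& \sum_i \frac{\partial}{\partial\Sigma}\text{tr}(\Sigma^{-1}(X_i-\mu)(X_i-\mu)^T) \\
= & \sum_i \frac{\partial}{\partial\Sigma^{-1}}\text{tr}(\Sigma^{-1}(X_i-\mu)(X_i-\mu)^T) \frac{\partial \Sigma^{-1}}{\partial \Sigma} \\
= & \sum_i \underbrace{((X_i-\mu)(X_i-\mu)^T)^{T}}_{\text{Aspect 2}} \frac{\partial \Sigma^{-1}}{\partial \Sigma} \\
= & -\sum_i (\Sigma^{-1}(X_i-\mu)(X_i-\mu)^T\Sigma^{-1})^{T}
\end{align}

The chain-rule notation above is shorthand. Written with differentials and $S_i = (X_i-\mu)(X_i-\mu)^T$, the last step reads $d\,\text{tr}(\Sigma^{-1}S_i) = \text{tr}(S_i\,d\Sigma^{-1}) = -\text{tr}(S_i\Sigma^{-1}\,d\Sigma\,\Sigma^{-1}) = -\text{tr}(\Sigma^{-1}S_i\Sigma^{-1}\,d\Sigma)$, using $d\Sigma^{-1}=-\Sigma^{-1}\,d\Sigma\,\Sigma^{-1}$ and then Aspect 1. The result has the form $\text{tr}(A\,d\Sigma)$ with $A=-\Sigma^{-1}S_i\Sigma^{-1}$, so by Aspect 2 the derivative is $A^T$. Nothing is pulled out of the trace: the coefficient matrix that multiplies $d\Sigma$ inside the trace is read off, and its transpose is the derivative.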
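The form that appears in the question's derivation, $\frac{\partial}{\partial\Sigma}\text{tr}(\Sigma^{-1}S) = -(\Sigma^{-1}S\Sigma^{-1})^T$ with $S=(X_i-\mu)(X_i-\mu)^T$, can be compared against a finite-difference gradient. This is a rough sketch assuming NumPy; the helper names `trace_term` and `numerical_gradient`, the dimension, and the random test data are all hypothetical and chosen only for illustration.

```python
import numpy as np

def trace_term(Sigma, S):
    """Return tr(Sigma^{-1} S)."""
    return np.trace(np.linalg.inv(Sigma) @ S)

def numerical_gradient(f, Sigma, eps=1e-6):
    """Central-difference gradient of f, perturbing each entry independently."""
    grad = np.zeros_like(Sigma)
    for i in range(Sigma.shape[0]):
        for j in range(Sigma.shape[1]):
            E = np.zeros_like(Sigma)
            E[i, j] = eps
            grad[i, j] = (f(Sigma + E) - f(Sigma - E)) / (2 * eps)
    return grad

rng = np.random.default_rng(1)
d = 3
x = rng.normal(size=d)
mu = rng.normal(size=d)
S = np.outer(x - mu, x - mu)      # (X_i - mu)(X_i - mu)^T for one sample

L = rng.normal(size=(d, d))
Sigma = L @ L.T + d * np.eye(d)   # symmetric positive definite test matrix

Sigma_inv = np.linalg.inv(Sigma)
closed_form = -(Sigma_inv @ S @ Sigma_inv).T
numeric = numerical_gradient(lambda M: trace_term(M, S), Sigma)

print(np.allclose(numeric, closed_form, atol=1e-6))
```

The check perturbs each entry of $\Sigma$ separately, so it treats $\Sigma$ as an unstructured matrix, which matches the convention used in the derivation. Because $\Sigma$ and $S$ are symmetric here, the transpose in the closed form does not change the value.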
