[Math] MLE of the covariance matrix of a multivariate Gaussian distribution

I am reading through the following question:
MLE of bivariate normal distribution

But there is one step I don't understand in the derivation of the MLE for the covariance matrix:

$\Rightarrow \frac{\partial}{\partial\Sigma}\log f(X|\mu,\Sigma)=-\frac{n}{2}(\Sigma^{-1})^T-\frac{1}{2}\sum_i \frac{\partial}{\partial\Sigma}tr((X_i-\mu)(X_i-\mu)^T\Sigma^{-1})$

With some abuse of notation:
$\Rightarrow \frac{\partial}{\partial\Sigma}\log f(X|\mu,\Sigma)=-\frac{n}{2}(\Sigma^{-1})^T-\frac{1}{2}\sum_i \frac{1}{\partial\Sigma}tr((X_i-\mu)(X_i-\mu)^T\partial\Sigma^{-1})$

$\partial\Sigma^{-1}=-\Sigma^{-1}\partial\Sigma\Sigma^{-1}$, by substitution:

$\Rightarrow \frac{\partial}{\partial\Sigma}\log f(X|\mu,\Sigma)=-\frac{n}{2}(\Sigma^{-1})^T-\frac{1}{2}\sum_i \frac{1}{\partial\Sigma}tr((X_i-\mu)(X_i-\mu)^T(-\Sigma^{-1}\partial\Sigma\Sigma^{-1}))$

$=-\frac{n}{2}(\Sigma^{-1})^T+\frac{1}{2}\sum_i \frac{1}{\partial\Sigma}tr(\Sigma^{-1}(X_i-\mu)(X_i-\mu)^T\Sigma^{-1}\partial\Sigma)$

$=-\frac{n}{2}(\Sigma^{-1})^T+\frac{1}{2}\sum_i (\Sigma^{-1}(X_i-\mu)(X_i-\mu)^T\Sigma^{-1})^T$

$\Rightarrow \frac{\partial}{\partial\Sigma}\log f(X|\mu,\Sigma)=-\frac{n}{2}(\Sigma^{-1})^T+\frac{1}{2}\sum_i (\Sigma^{-1}(X_i-\mu)(X_i-\mu)^T\Sigma^{-1})^T=0$

$\frac{1}{2}\sum_i (\Sigma^{-1}(X_i-\mu)(X_i-\mu)^T\Sigma^{-1})^T=\frac{n}{2}(\Sigma^{-1})^T$

The step I don't understand is the one where the partial derivative disappears. First $\partial$ is moved inside the trace operator, then $\Sigma^{-1}$ is manipulated, and then, at least as I see it, $\partial \Sigma$ is somehow pulled out of the trace to get rid of the partial derivative. Can someone explain why this is allowed and why it is valid?

Best Answer

This whole derivation relies on two key aspects of matrix algebra.

1. $\text{tr}(ABC) = \text{tr}(BCA) = \text{tr}(CAB)$, as long as the dimensions of the matrices are compatible with each of the products.
2. $\frac{\partial \text{tr}(AX)}{\partial X} = A^{T}$
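
Both aspects are easy to check numerically. The following is a minimal sketch, assuming NumPy is available; `numerical_gradient` is a hypothetical helper written for this illustration (a plain central finite difference), not a library function.

```python
import numpy as np

rng = np.random.default_rng(0)

# Aspect 1: the trace is invariant under cyclic permutations,
# provided every product is dimensionally valid.
A = rng.standard_normal((3, 4))
B = rng.standard_normal((4, 5))
C = rng.standard_normal((5, 3))
t1 = np.trace(A @ B @ C)
t2 = np.trace(B @ C @ A)
t3 = np.trace(C @ A @ B)
cyclic_ok = bool(np.isclose(t1, t2) and np.isclose(t2, t3))


def numerical_gradient(f, X, eps=1e-6):
    """Hypothetical helper: entrywise central finite difference of a scalar f at X."""
    grad = np.zeros_like(X)
    for i in range(X.shape[0]):
        for j in range(X.shape[1]):
            E = np.zeros_like(X)
            E[i, j] = eps
            grad[i, j] = (f(X + E) - f(X - E)) / (2 * eps)
    return grad


# Aspect 2: d tr(AX) / dX = A^T. Here A is 3x4 and X is 4x3,
# so the gradient has the shape of X, which is the shape of A^T.
X = rng.standard_normal((4, 3))
grad = numerical_gradient(lambda Z: np.trace(A @ Z), X)
trace_grad_ok = bool(np.allclose(grad, A.T, atol=1e-6))

print(cyclic_ok, trace_grad_ok)
```

Using a non-square $A$ makes it visible why the transpose appears: the gradient must have the same shape as $X$.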

Specific to the question you asked,

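\begin{align} \sum_i & \frac{\partial}{\partial\Sigma}tr(\Sigma^{-1}(X_i-\mu)(X_i-\mu)^T) \\ = & \sum_i \frac{\partial}{\partial\Sigma^{-1} }tr(\Sigma^{-1}(X_i-\mu)(X_i-\mu)^T) \frac{\partial \Sigma^{-1} }{\partial \Sigma} \\ = & \sum_i \underbrace{{((X_i-\mu)(X_i-\mu)^T)}^{T}}_{\text{Aspect # 2 }} \frac{\partial \Sigma^{-1} }{\partial \Sigma} \\ = & -\sum_i {(\Sigma^{-1}(X_i-\mu)(X_i-\mu)^T\Sigma^{-1})}^{T} \\ \end{align}

Here the product with $\frac{\partial \Sigma^{-1} }{\partial \Sigma}$ is shorthand for the chain rule, not an ordinary matrix product. The last step substitutes $\partial\Sigma^{-1}=-\Sigma^{-1}\partial\Sigma\Sigma^{-1}$ inside the trace and uses aspect 1 to move $\partial\Sigma$ to the end, which gives $-tr(\Sigma^{-1}(X_i-\mu)(X_i-\mu)^T\Sigma^{-1}\partial\Sigma)$. The trace then has the form $tr(A\,\partial\Sigma)$, so aspect 2 reads off the derivative as $A^T$, the transpose of the matrix multiplying $\partial\Sigma$. In other words, $\partial\Sigma$ is never really pulled out of the trace; the $\frac{1}{\partial\Sigma}tr(\cdot\,\partial\Sigma)$ notation in your derivation is just a way of writing aspect 2.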
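
As a sanity check, the gradient $-\frac{n}{2}(\Sigma^{-1})^T+\frac{1}{2}\sum_i (\Sigma^{-1}(X_i-\mu)(X_i-\mu)^T\Sigma^{-1})^T$ from the question can be compared against finite differences. The sketch below assumes NumPy, synthetic data, and a hypothetical `numerical_gradient` helper. It treats all entries of $\Sigma$ as independent variables, as the derivation does, and takes $\mu$ to be the sample mean purely for illustration.

```python
import numpy as np


def numerical_gradient(f, X, eps=1e-5):
    """Hypothetical helper: entrywise central finite difference of a scalar f at X."""
    grad = np.zeros_like(X)
    for i in range(X.shape[0]):
        for j in range(X.shape[1]):
            E = np.zeros_like(X)
            E[i, j] = eps
            grad[i, j] = (f(X + E) - f(X - E)) / (2 * eps)
    return grad


rng = np.random.default_rng(1)
n, d = 50, 3

# Synthetic correlated data (illustrative only).
L = np.array([[2.0, 0.0, 0.0],
              [0.5, 1.0, 0.0],
              [-0.3, 0.4, 1.5]])
X = rng.standard_normal((n, d)) @ L.T
mu = X.mean(axis=0)
R = X - mu
S = R.T @ R  # sum_i (X_i - mu)(X_i - mu)^T


def log_lik(Sigma):
    """Gaussian log-likelihood in Sigma, up to an additive constant."""
    _, logdet = np.linalg.slogdet(Sigma)
    return -0.5 * n * logdet - 0.5 * np.trace(np.linalg.inv(Sigma) @ S)


def analytic_gradient(Sigma):
    """Gradient from the derivation: -n/2 Sigma^{-T} + 1/2 (Sigma^{-1} S Sigma^{-1})^T."""
    Sinv = np.linalg.inv(Sigma)
    return -0.5 * n * Sinv.T + 0.5 * (Sinv @ S @ Sinv).T


# Compare at an arbitrary symmetric positive definite test point.
B = rng.standard_normal((d, d))
Sigma0 = B @ B.T + d * np.eye(d)
grad_matches = bool(np.allclose(numerical_gradient(log_lik, Sigma0),
                                analytic_gradient(Sigma0), atol=1e-5))

# Setting the gradient to zero gives Sigma_hat = S / n; check it is a stationary point.
Sigma_hat = S / n
mle_is_stationary = bool(np.allclose(analytic_gradient(Sigma_hat), 0.0, atol=1e-8))

print(grad_matches, mle_is_stationary)
```

The second check reflects what solving the final equation in the question yields: multiplying both sides by $\Sigma$ on the left and right gives $\hat\Sigma=\frac{1}{n}\sum_i (X_i-\mu)(X_i-\mu)^T$.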
