[Math] Second Derivative with respect to a Matrix

derivatives, fisher-information, matrices, matrix-calculus

I have a question regarding the (second-order) derivative with respect to a matrix. I ran into it while calculating Fisher information, but I suspect the context is not very relevant to the question itself.

Here is the derivative:

$$
\frac{\partial}{\partial \Sigma} \Sigma^{-1}A\Sigma^{-1}
$$

where $\Sigma$ is a covariance matrix (symmetric and positive definite, so that $\Sigma^{-1}$ exists), and $A = (x_i - \mu_0)(x_i - \mu_0)^T$, but we may simply write $A$, keeping in mind that $A$ is symmetric.

Before posting this question, I searched on Google and found several sources that are useful and relevant but do not answer my question directly:

  1. https://www.ics.uci.edu/~welling/teaching/KernelsICS273B/MatrixCookBook.pdf
  2. Second order derivative of the inverse matrix operator

Consequently, I have made a rough attempt to derive it, but I am not confident that it is correct.

====================================================================

Consider a small perturbation $\delta\Sigma$ (small enough that $\|\Sigma^{-1}\,\delta\Sigma\| < 1$, so the Neumann series below converges):

\begin{align*}
(\Sigma + \delta\Sigma)^{-1}A(\Sigma+\delta\Sigma)^{-1} &= [\Sigma(I+\Sigma^{-1}(\delta\Sigma))]^{-1}A[(I+(\delta\Sigma)\Sigma^{-1})\Sigma]^{-1}\\
&=(I+\Sigma^{-1}(\delta\Sigma))^{-1}\Sigma^{-1}A\Sigma^{-1}(I+(\delta\Sigma)\Sigma^{-1})^{-1}\\
&=\Big(\sum_{n=0}^\infty(-1)^n[\Sigma^{-1}(\delta\Sigma)]^n\Big)\Sigma^{-1}A\Sigma^{-1}\Big(\sum_{n=0}^\infty(-1)^n[(\delta\Sigma)\Sigma^{-1}]^n\Big)\\
&\approx (I-\Sigma^{-1}(\delta\Sigma))\Sigma^{-1}A\Sigma^{-1}(I-(\delta\Sigma)\Sigma^{-1})\\
&=\Sigma^{-1}A\Sigma^{-1} - \Sigma^{-1}(\delta\Sigma)\Sigma^{-1}A\Sigma^{-1}-\Sigma^{-1}A\Sigma^{-1}(\delta\Sigma)\Sigma^{-1}\\
&\quad +\Sigma^{-1}(\delta\Sigma)\Sigma^{-1}A\Sigma^{-1}(\delta\Sigma)\Sigma^{-1}
\end{align*}
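To convince myself the truncation is harmless, here is a small numerical sanity check; this is just a sketch assuming NumPy, with randomly generated $\Sigma$, $A$, and $\delta\Sigma$ standing in for the real quantities:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4

# Random symmetric positive-definite Sigma and symmetric rank-one A.
B = rng.standard_normal((n, n))
Sigma = B @ B.T + n * np.eye(n)
x = rng.standard_normal(n)
A = np.outer(x, x)                      # A = (x_i - mu_0)(x_i - mu_0)^T with mu_0 = 0

S_inv = np.linalg.inv(Sigma)
F = S_inv @ A @ S_inv                   # Sigma^{-1} A Sigma^{-1}

dS = rng.standard_normal((n, n))
dS = 1e-4 * (dS + dS.T)                 # small symmetric perturbation

exact = np.linalg.inv(Sigma + dS) @ A @ np.linalg.inv(Sigma + dS)
first_order = F - S_inv @ dS @ F - F @ dS @ S_inv

# Residual should be O(||dS||^2), i.e. roughly the square of the step size.
print(np.linalg.norm(exact - first_order))
```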

Then, keeping only the first-order terms, we may write the directional derivative along $\delta\Sigma$ as

\begin{align*}
\left(\frac{\partial}{\partial \Sigma} \Sigma^{-1}A\Sigma^{-1}\right)[\delta\Sigma] &= (\Sigma + \delta\Sigma)^{-1}A(\Sigma+\delta\Sigma)^{-1} - \Sigma^{-1}A\Sigma^{-1} + O(\|\delta\Sigma\|^2)\\
&= -\Sigma^{-1}(\delta\Sigma)\Sigma^{-1}A\Sigma^{-1}-\Sigma^{-1}A\Sigma^{-1}(\delta\Sigma)\Sigma^{-1}
\end{align*}

and then, somehow (by magic or by speculation, I guess), I collapse this to
$$
\frac{\partial}{\partial \Sigma} \Sigma^{-1}A\Sigma^{-1} = -\Sigma^{-2}A\Sigma^{-1}-\Sigma^{-1}A\Sigma^{-2}
$$
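And indeed, a quick check of this guess (the same kind of NumPy sketch with random matrices) shows why I am suspicious: the first-order change depends on where $\delta\Sigma$ sits inside the products, and the collapsed matrix above, applied as an ordinary matrix product, does not reproduce it:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4
B = rng.standard_normal((n, n))
Sigma = B @ B.T + n * np.eye(n)
x = rng.standard_normal(n)
A = np.outer(x, x)                       # symmetric A

S_inv = np.linalg.inv(Sigma)
F = S_inv @ A @ S_inv

dS = rng.standard_normal((n, n))
dS = 1e-6 * (dS + dS.T)                  # small symmetric perturbation

# First-order change from the expansion above (dS sits inside the products).
true_change = -S_inv @ dS @ F - F @ dS @ S_inv

# The collapsed guess, applied as an ordinary matrix product.
guess = -S_inv @ S_inv @ A @ S_inv - S_inv @ A @ S_inv @ S_inv
print(np.linalg.norm(true_change - guess @ dS)
      / np.linalg.norm(true_change))     # O(1) relative error, not small
```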

====================================================================

I have a feeling that I may be close, but not quite there yet. What I am really hoping to get from this question is the final form of the derivative.

Thank you so much for all of your time!

p.s.:

  1. You do not have to follow my trail of thought (which may well be wrong); you may simply show the correct way of doing this.

  2. I call this a second-order derivative because $\Sigma^{-1}A\Sigma^{-1}$ is what I obtained (up to sign) by taking the first derivative of $(x_i-\mu_0)^T\Sigma^{-1}(x_i-\mu_0)$ with respect to $\Sigma$; and yes, you smart people may have realized this comes from the multivariate normal. A quick numerical check of that first derivative is sketched below.
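For anyone who wants to verify that first step, here is a finite-difference sketch (assuming NumPy, and treating every entry of $\Sigma$ as a free variable) that recovers $-\Sigma^{-1}A\Sigma^{-1}$ as the gradient of the quadratic form:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 3
B = rng.standard_normal((n, n))
Sigma = B @ B.T + n * np.eye(n)
x = rng.standard_normal(n)
A = np.outer(x, x)                       # (x_i - mu_0)(x_i - mu_0)^T with mu_0 = 0

def f(S):
    return x @ np.linalg.inv(S) @ x      # (x_i - mu_0)^T S^{-1} (x_i - mu_0)

# Entrywise central differences, treating each entry of Sigma as free.
eps = 1e-6
G_num = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        E = np.zeros((n, n))
        E[i, j] = eps
        G_num[i, j] = (f(Sigma + E) - f(Sigma - E)) / (2 * eps)

S_inv = np.linalg.inv(Sigma)
G_analytic = -S_inv @ A @ S_inv
print(np.max(np.abs(G_num - G_analytic)))   # close to zero
```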

====================================================================

A month after posting this question, I found a great reference that I would like to share. For those who have similar questions, here is a book that will give you great insight (its approach closely resembles the method presented by @greg):

"Matrix Differential Calculus with applications in statistics" by Magnus and Neudecker.

Take a look at its Chapter 2, which offers excellent explanations (and examples) of the Kronecker product and the vec operation, two important concepts when dealing with matrix differentials. The key identity is verified numerically below.
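For instance, the identity that does most of the work in the vectorized answer below is $\mathrm{vec}(ABC) = (C^T \otimes A)\,\mathrm{vec}(B)$. A minimal NumPy sketch (note the column-major `order="F"`, since $\mathrm{vec}$ stacks columns):

```python
import numpy as np

rng = np.random.default_rng(3)
A, B, C = (rng.standard_normal((3, 3)) for _ in range(3))

def vec(M):
    # Column-stacking vec; NumPy reshapes row-major by default.
    return M.reshape(-1, order="F")

# vec(A B C) = (C^T kron A) vec(B)
lhs = vec(A @ B @ C)
rhs = np.kron(C.T, A) @ vec(B)
print(np.allclose(lhs, rhs))             # True
```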

Best Answer

For ease of typing, let's use the notation $$\eqalign{ X &= \Sigma \cr A:X &= {\rm tr}(A^TX) \quad \text{(trace/Frobenius product)} \cr }$$

Now we can write the original scalar function and find its differential and gradient:
$$\eqalign{
\phi &= A:X^{-1} \cr
d\phi &= A:dX^{-1} = -A:X^{-1}\,dX\,X^{-1} = -X^{-1}AX^{-1}:dX \cr
G = \frac{\partial\phi}{\partial X} &= -X^{-1}AX^{-1} \cr
}$$

To proceed to the Hessian, let's introduce the 4th-order tensor ${\mathcal H}$ with components
$$\eqalign{
{\mathcal H}_{ijkl} = \delta_{ik}\,\delta_{jl} \cr
}$$

Now we can calculate the differential and gradient of $G$ as
$$\eqalign{
dG &= -dX^{-1}\,AX^{-1} - X^{-1}A\,dX^{-1} \cr
&= X^{-1}\,dX\,X^{-1}AX^{-1} + X^{-1}AX^{-1}\,dX\,X^{-1} \cr
&= -(X^{-1}\,dX\,G + G\,dX\,X^{-1}) \cr
&= -(X^{-1}{\mathcal H}G + G{\mathcal H}X^{-1}):dX \cr
\frac{\partial^2\phi}{\partial X^2} = \frac{\partial G}{\partial X} &= -(X^{-1}{\mathcal H}G + G{\mathcal H}X^{-1}) \cr
}$$

If you are not comfortable with higher-order tensors, you can use vectorization instead:
$$\eqalign{
{\rm vec}(dG) &= -{\rm vec}(X^{-1}\,dX\,G + G\,dX\,X^{-1}) \cr
dg &= -(G\otimes X^{-1} + X^{-1}\otimes G)\,dx \cr
\frac{\partial g}{\partial x} &= -(G\otimes X^{-1} + X^{-1}\otimes G) \cr
}$$

NB: In some of these steps, I made use of the fact that $(X, A, G)$ are symmetric matrices.
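As a sanity check of the final formula, here is a short numerical sketch (assuming NumPy, with a random symmetric positive-definite $X$ and a symmetric $A$ standing in for the real data) comparing the finite change in $g = \mathrm{vec}(G)$ against the Kronecker-product Hessian:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 4
B = rng.standard_normal((n, n))
X = B @ B.T + n * np.eye(n)              # symmetric positive definite
x = rng.standard_normal(n)
A = np.outer(x, x)                       # symmetric A

def vec(M):
    return M.reshape(-1, order="F")      # column-stacking vec

def grad(X):
    Xi = np.linalg.inv(X)
    return -Xi @ A @ Xi                  # G = -X^{-1} A X^{-1}

Xi = np.linalg.inv(X)
G = grad(X)
H = -(np.kron(G, Xi) + np.kron(Xi, G))   # vectorized Hessian from the answer

dX = rng.standard_normal((n, n))
dX = 1e-6 * (dX + dX.T)                  # small symmetric step

dg_true = vec(grad(X + dX) - grad(X))
dg_kron = H @ vec(dX)
print(np.linalg.norm(dg_true - dg_kron)
      / np.linalg.norm(dg_true))         # small; vanishes as ||dX|| -> 0
```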