Chain Rule for Matrix-Valued Derivatives with Intermediate High-Dimensional Tensors

matrix-calculus

I have one function $f: R^{m \times m} \to R$ and another function $g: R \to R^{m \times m}$. I would like to calculate the derivative $\frac{\partial h}{\partial x}$ of the composite function $h(x) = f(g(x))$. The expression I end up with is really messy and hard to manipulate. I wonder if there is a more 'elegant' and algebra-friendly way to do this.

The problem is that $\frac{\partial f}{\partial g}$ is an $(m \times m) \times 1$ tensor and $\frac{\partial g}{\partial x}$ is a $1 \times (m \times m)$ tensor, and naively pretending that they are $m \times m$ matrices and applying the chain rule via matrix multiplication fails miserably.

$$
\frac{\partial h}{\partial x} \ne \frac{\partial f}{\partial g} \frac{\partial g}{\partial x}
$$
at least not with ordinary 2D matrix multiplication.

Indeed, let
$$
\begin{pmatrix}
\frac{\partial f}{\partial y_{11}} & \cdots & \frac{\partial f}{\partial y_{1m}} \\
\vdots & \ddots & \vdots \\
\frac{\partial f}{\partial y_{m1}} & \cdots & \frac{\partial f}{\partial y_{mm}}
\end{pmatrix}
=\frac{\partial f(y)}{\partial y}
$$

and

$$
\begin{pmatrix}
\frac{\partial g_{11}}{\partial x} & \cdots & \frac{\partial g_{1m}}{\partial x} \\
\vdots & \ddots & \vdots \\
\frac{\partial g_{m1}}{\partial x} & \cdots & \frac{\partial g_{mm}}{\partial x}
\end{pmatrix}
=\frac{\partial g(x)}{\partial x}
$$

What we want is:
$$
\frac{\partial h}{\partial x} = \sum_{i=1}^m \sum_{j=1}^m \frac{\partial f}{\partial y_{ij}} \frac{\partial g_{ij}}{\partial x}
$$
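As a quick numerical sanity check of this sum (with $f$, $g$, and the matrix $A$ below chosen arbitrarily just for illustration, not taken from the problem):

```python
import numpy as np

# Placeholder example: f(Y) = sum of squared entries, g(x) = x * A for a fixed A.
m = 3
rng = np.random.default_rng(0)
A = rng.standard_normal((m, m))

f = lambda Y: np.sum(Y**2)     # R^{m x m} -> R
g = lambda x: x * A            # R -> R^{m x m}
h = lambda x: f(g(x))

x0 = 0.7
df_dY = 2 * g(x0)              # entrywise gradient of f at Y = g(x0)
dg_dx = A                      # entrywise derivative of g w.r.t. x

# chain rule as the double sum over entries (i, j)
dh_dx = np.sum(df_dY * dg_dx)

# central finite-difference check
eps = 1e-6
dh_dx_fd = (h(x0 + eps) - h(x0 - eps)) / (2 * eps)
print(dh_dx, dh_dx_fd)         # the two numbers should agree to ~1e-6
```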

But it is really difficult for me to do any manipulation when things are written like this. My question is whether I can write this out more 'analytically.'

The end goal is to find $x$ such that $\frac{\partial h}{\partial x}=0$, but with the relationships broken up into several equations and the matrix algebra decomposed entrywise, that is really hard to do.


Below is a real example.

Let
$$
l(\Sigma^{-1})= -\frac{1}{2} y^T \Sigma^{-1} y - \frac{1}{2} \ln |\det \Sigma|
$$
where $\Sigma$ is an $m \times m$ positive definite symmetric matrix and $y$ is an $m \times 1$ vector. (This is the log-likelihood of a multivariate Gaussian distribution, but that is not important.)

$l(\Sigma^{-1})$ can be thought of as an $R^{m \times m} \to R$ function.

Let
$$
\Sigma(x) = K + x \mathbf{I}
$$

where $K$ is an $m \times m$ positive definite symmetric matrix and $\mathbf{I}$ is the $m \times m$ identity matrix.

$\Sigma(x)$ can be thought of as an $R \to R^{m \times m}$ function.

It is easy to show that:
$$
\frac{\partial l}{\partial \Sigma^{-1}} = -\frac{1}{2} y y^T + \frac{1}{2} \Sigma
$$

and
$$
\frac{\partial \Sigma^{-1}}{\partial x}= -\Sigma^{-2}
$$
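Both formulas are easy to confirm numerically with finite differences, treating $l$ as a function of $\Gamma = \Sigma^{-1}$; the $K$, $y$, and $x$ below are arbitrary placeholders, since none are specified:

```python
import numpy as np

# Placeholder values: K, y, x are not specified in the problem.
m = 3
rng = np.random.default_rng(0)
B = rng.standard_normal((m, m))
K = B @ B.T + np.eye(m)        # symmetric positive definite
y = rng.standard_normal(m)
x = 0.3
I = np.eye(m)

Sigma = K + x * I
Gamma = np.linalg.inv(Sigma)   # Gamma = Sigma^{-1}

# l as a function of Gamma; note ln det Sigma = -ln det Gamma
def l_of_Gamma(G):
    return -0.5 * y @ G @ y + 0.5 * np.log(np.linalg.det(G))

# entrywise finite differences of l w.r.t. Gamma vs. -yy^T/2 + Sigma/2
eps = 1e-6
grad_fd = np.zeros((m, m))
for i in range(m):
    for j in range(m):
        E = np.zeros((m, m))
        E[i, j] = eps
        grad_fd[i, j] = (l_of_Gamma(Gamma + E) - l_of_Gamma(Gamma - E)) / (2 * eps)
print(np.allclose(grad_fd, -0.5 * np.outer(y, y) + 0.5 * Sigma, atol=1e-5))

# d(Sigma^{-1})/dx by central differences vs. -Sigma^{-2}
dGamma_fd = (np.linalg.inv(K + (x + eps) * I) - np.linalg.inv(K + (x - eps) * I)) / (2 * eps)
print(np.allclose(dGamma_fd, -Gamma @ Gamma, atol=1e-5))
```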

To find $\frac{\partial l}{\partial x}$, we can do the following:

Let $\gamma_{ij}$ be the entry in the $i$th row and $j$th column of $\Sigma^{-1}$.

It is really:
$$
\begin{pmatrix}
\frac{\partial l}{\partial \gamma_{11}} & \cdots & \frac{\partial l}{\partial \gamma_{1m}} \\
\vdots & \ddots & \vdots \\
\frac{\partial l}{\partial \gamma_{m1}} & \cdots & \frac{\partial l}{\partial \gamma_{mm}}
\end{pmatrix} =
\frac{\partial l}{\partial \Sigma^{-1}} = -\frac{1}{2} y y^T + \frac{1}{2} \Sigma
$$

and
$$
\begin{pmatrix}
\frac{\partial \gamma_{11}}{\partial x} & \cdots & \frac{\partial \gamma_{1m}}{\partial x} \\
\vdots & \ddots & \vdots \\
\frac{\partial \gamma_{m1}}{\partial x} & \cdots & \frac{\partial \gamma_{mm}}{\partial x}
\end{pmatrix} =
\frac{\partial \Sigma^{-1}}{\partial x}= -\Sigma^{-2}
$$

What we are looking for is really:
$$
\frac{\partial l}{\partial x} = \sum_{i=1}^m \sum_{j=1}^m \frac{\partial l}{\partial \gamma_{ij}} \frac{\partial \gamma_{ij}}{\partial x}
$$
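As a numerical sanity check, this double sum agrees with a finite-difference estimate of $\frac{\partial l}{\partial x}$; the $K$, $y$, and $x$ below are arbitrary placeholders:

```python
import numpy as np

# Placeholder values for K, y, x (none are given in the problem).
m = 3
rng = np.random.default_rng(1)
B = rng.standard_normal((m, m))
K = B @ B.T + m * np.eye(m)    # symmetric positive definite
y = rng.standard_normal(m)
x = 0.5

Sigma = K + x * np.eye(m)
Sigma_inv = np.linalg.inv(Sigma)

dl_dGamma = -0.5 * np.outer(y, y) + 0.5 * Sigma   # dl / d(Sigma^{-1}), entrywise
dGamma_dx = -Sigma_inv @ Sigma_inv                # d(Sigma^{-1}) / dx = -Sigma^{-2}

# the double sum over entries (i, j)
dl_dx = np.sum(dl_dGamma * dGamma_dx)

# finite-difference check on l as a function of x
def l_of_x(t):
    S = K + t * np.eye(m)
    return -0.5 * y @ np.linalg.inv(S) @ y - 0.5 * np.log(np.linalg.det(S))

eps = 1e-6
print(dl_dx, (l_of_x(x + eps) - l_of_x(x - eps)) / (2 * eps))   # should agree
```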

It takes me three equations to express $\frac{\partial l}{\partial x}$, and the result is extremely messy. My problem is that I see no way to write it out more "analytically" so that I can solve
$$
\frac{\partial l}{\partial x} = 0
$$
for $x$ analytically.

Best Answer

Using differentials is less error-prone (for me) than the chain rule because (algebraically) differentials act like normal vectors and matrices.

Applying the differential technique to your example $\big($for ease of typing let $M=\Sigma\big)$:
$$\eqalign{
2l &= -y^TM^{-1}y - \log\det(M) \cr
2\,dl &= -y^T\,dM^{-1}\,y - M^{-1}:dM \cr
&= +y^T\big(M^{-1}\,dM\,M^{-1}\big)y - M^{-1}:dM \cr
&= \Big(M^{-1}yy^TM^{-1} - M^{-1}\Big):dM \cr
&= \Big(M^{-1}yy^TM^{-1} - M^{-1}\Big):I\,dx \cr
&= {\rm Tr}\Big(M^{-1}yy^TM^{-1} - M^{-1}\Big)\,dx \cr
\frac{dl}{dx} &= \tfrac{1}{2}{\rm Tr}\Big(M^{-1}yy^TM^{-1} - M^{-1}\Big) \cr
&= \tfrac{1}{2}\Big(y^TM^{-2}y-{\rm Tr}(M^{-1})\Big) \cr
}$$
where a colon is used to denote the matrix inner product, i.e. $\,A:B={\rm Tr}(A^TB).$
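The closed form reduces your stationarity condition to a single scalar equation in $x$; it will still usually need a one-dimensional root finder. Here is a rough sketch (assuming SciPy is available, and with placeholder $K$ and $y$, since the post does not specify them):

```python
import numpy as np
from scipy.optimize import brentq

# Placeholder K and y (not given in the post); M = K + x I.
m = 3
rng = np.random.default_rng(2)
B = rng.standard_normal((m, m))
K = B @ B.T + np.eye(m)
y = rng.standard_normal(m)

def dl_dx(x):
    M_inv = np.linalg.inv(K + x * np.eye(m))
    return 0.5 * (y @ M_inv @ M_inv @ y - np.trace(M_inv))

# scan for a sign change (keeping M positive definite), then refine with brentq
lo = -0.9 * np.linalg.eigvalsh(K)[0]        # just above -lambda_min(K)
xs = np.linspace(lo, 50.0, 2000)
vals = np.array([dl_dx(t) for t in xs])
idx = np.where(np.diff(np.sign(vals)) != 0)[0]
if idx.size:
    root = brentq(dl_dx, xs[idx[0]], xs[idx[0] + 1])
    print("dl/dx = 0 at x =", root)
else:
    print("no sign change on this grid")
```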

Due to the cyclic property of the trace, there are many ways to shuffle the terms in such a product:
$$\eqalign{
A:BC &= A^T:(BC)^T = BC:A \cr
&= B^TA:C = AC^T:B \cr
I:B &= {\rm Tr}(I^TB) = {\rm Tr}(B) \cr
}$$
A scalar is equal to its own trace, therefore
$$\eqalign{
x^TAx &= {\rm Tr}(x^TAx) = x:Ax = xx^T:A \cr
}$$
One last differential relationship comes from the defining property of the matrix inverse:
$$\eqalign{
I &= M^{-1}M \cr
dI &= M^{-1}dM + dM^{-1}M \,= 0 \cr
dM^{-1} &= -M^{-1}\,dM\,M^{-1} \cr
}$$
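These identities are easy to sanity-check numerically with random matrices, e.g.:

```python
import numpy as np

# random placeholder matrices
rng = np.random.default_rng(3)
m = 4
A, B, C = (rng.standard_normal((m, m)) for _ in range(3))
M = A @ A.T + np.eye(m)                       # an invertible test matrix
dM = 1e-6 * rng.standard_normal((m, m))       # a small perturbation standing in for dM

frob = lambda X, Y: np.trace(X.T @ Y)         # the colon (Frobenius) inner product A:B

# trace-shuffling identities: A:BC = B^T A : C = A C^T : B
print(np.allclose(frob(A, B @ C), frob(B.T @ A, C)),
      np.allclose(frob(A, B @ C), frob(A @ C.T, B)))

# differential of the inverse: (M + dM)^{-1} - M^{-1} ~ -M^{-1} dM M^{-1} to first order
lhs = np.linalg.inv(M + dM) - np.linalg.inv(M)
rhs = -np.linalg.inv(M) @ dM @ np.linalg.inv(M)
print(np.allclose(lhs, rhs, atol=1e-9))
```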
