Matrix Calculus – Matrix Derivative of f(X^T Y) with Respect to X

derivatives, matrices, matrix-calculus

Given $X\in \mathbb{R}^{m\times n}$, $Y\in \mathbb{R}^{m\times k}$, $X^\top Y=Z\in \mathbb{R}^{n\times k}$,
and $f:\mathbb{R}^{n\times k} \to \mathbb{R}$, we have the following:

\begin{equation}
f(X^\top Y)=f(Z)
\end{equation}

What is the derivative of $f$ with respect to $X$, i.e., what is $\frac{\partial f}{\partial X}$?


I tried to "guess" $\frac{\partial f}{\partial X}$ just by matching the dimensions $m, n, k$, as follows:

\begin{equation}
\begin{aligned}
\frac{\partial f}{\partial X}
=\frac{\partial f}{\partial Z}\frac{\partial Z}{\partial X}
=Y\left(\frac{\partial f}{\partial Z}\right)^\top
\end{aligned}
\end{equation}

Since $\frac{\partial Z}{\partial X}$ should be a fourth-order tensor, I tried to calculate $\frac{\partial Z}{\partial X}$ by using
$\frac{\partial (AXB)}{\partial X}=B^\top\otimes A$
like this:

\begin{equation}
\begin{aligned}
\frac{\partial Z}{\partial X}=\frac{\partial (X^\top Y)}{\partial X}
{=\left(\frac{\partial ((X^\top) Y)}{\partial (X^\top)}\right)^\top}
{=\left(Y^\top \otimes I \right)^\top}
{=(Y^\top)^\top \otimes I^\top}
{=Y \otimes I}
\end{aligned}
\end{equation}

where $\otimes$ is the Kronecker product and $I$ is the $n\times n$ identity matrix. From this, we have:

\begin{equation}
\begin{aligned}
\frac{\partial f}{\partial X}
=\frac{\partial f}{\partial Z}\frac{\partial Z}{\partial X}
=\frac{\partial f}{\partial Z}\left(Y \otimes I\right)
\end{aligned}
\end{equation}
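The identity $\frac{\partial (AXB)}{\partial X}=B^\top\otimes A$ I used above comes from the vectorization rule $\operatorname{vec}(AXB)=(B^\top\otimes A)\operatorname{vec}(X)$, which can be checked numerically (a quick NumPy sketch of my own; note that $\operatorname{vec}$ stacks columns, hence `order="F"`):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 3))   # A : 4x3
X = rng.standard_normal((3, 5))   # X : 3x5
B = rng.standard_normal((5, 2))   # B : 5x2

# vec() stacks columns, so reshape in Fortran (column-major) order
vec = lambda M: M.reshape(-1, order="F")

lhs = vec(A @ X @ B)              # vec(AXB)
rhs = np.kron(B.T, A) @ vec(X)    # (B^T  kron  A) vec(X)
print(np.allclose(lhs, rhs))      # True
```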

Combining the two equations above, the following should hold:

\begin{equation}
\begin{aligned}
\frac{\partial f}{\partial X}
=\frac{\partial f}{\partial Z}\left(Y \otimes I\right)
=Y\left(\frac{\partial f}{\partial Z}\right)^\top
\end{aligned}
\end{equation}

The last equation seems counterintuitive, and I cannot figure it out.
Did I calculate $\frac{\partial f}{\partial X}$ correctly?

If not, how should I calculate $\frac{\partial f}{\partial X}$ instead?

Best Answer

$ \def\o{{\tt1}} \def\p{\partial} \def\LR#1{\left(#1\right)} \def\op#1{\operatorname{#1}} \def\trace#1{\op{Tr}\LR{#1}} \def\qiq{\quad\implies\quad} \def\grad#1#2{\frac{\p #1}{\p #2}} \def\c#1{\color{red}{#1}} \def\gradLR#1#2{\LR{\grad{#1}{#2}}} $As you have discovered, the chain rule can be difficult to apply in Matrix Calculus because it involves higher-order tensors (i.e. matrix-by-vector, vector-by-matrix, and matrix-by-matrix gradients) which are difficult to calculate, awkward to manipulate, and don't fit into standard matrix notation.

Instead I would recommend a differential approach, because the differential of a matrix behaves like a matrix. In particular, it can be written using standard matrix notation and it obeys all of the rules of matrix algebra.

Also, the $\c{\rm Frobenius}$ product is extraordinarily useful in Matrix Calculus $$\eqalign{ A:B &= \sum_{i=1}^m\sum_{j=1}^n A_{ij}B_{ij} \;=\; \trace{A^TB} \\ A:A &= \|A\|^2_F \qquad \big\{{\rm \c{Frobenius}\;norm}\big\}\\ }$$ The properties of the underlying trace function allow the terms in such a product to be rearranged in many equivalent ways, e.g. $$\eqalign{ A:B &= B:A \\ A:B &= A^T:B^T \\ C:\LR{AB} &= \LR{CB^T}:A \;=\; \LR{A^TC}:B \\\\ }$$
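These trace-product identities are easy to verify numerically; here is a small sketch (my own, using NumPy) checking each rearrangement rule on random matrices:

```python
import numpy as np

rng = np.random.default_rng(1)
frob = lambda A, B: np.trace(A.T @ B)     # A:B = Tr(A^T B)

A = rng.standard_normal((3, 4))
B = rng.standard_normal((4, 5))
C = rng.standard_normal((3, 5))
D = rng.standard_normal((3, 4))           # same shape as A

# A:D = sum_ij A_ij D_ij, and the symmetry/transpose rules
print(np.isclose(frob(A, D), np.sum(A * D)))          # True
print(np.isclose(frob(A, D), frob(D, A)))             # A:B = B:A
print(np.isclose(frob(A, D), frob(A.T, D.T)))         # A:B = A^T:B^T

# C:(AB) = (CB^T):A = (A^T C):B
print(np.isclose(frob(C, A @ B), frob(C @ B.T, A)))   # True
print(np.isclose(frob(C, A @ B), frob(A.T @ C, B)))   # True
```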


Using the above notation, the manipulation of your particular function becomes almost mechanical $$\eqalign{ df &= \gradLR fZ:dZ \\ &= \gradLR fZ:\LR{dX^TY} \\ &= \gradLR fZ Y^T:dX^T \\ &= Y\gradLR fZ^T:dX \\ \grad fX &= Y\gradLR fZ^T \\ }$$ Note that there was no need for a tensor-valued gradient in any step, just plain old matrices.
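As a sanity check, the result $\grad fX = Y\gradLR fZ^T$ can be compared against finite differences for a concrete choice of $f$. Below I pick $f(Z)=\frac12\|Z\|_F^2$, for which $\grad fZ = Z$ (my choice for illustration, not part of the question):

```python
import numpy as np

rng = np.random.default_rng(2)
m, n, k = 5, 3, 4
X = rng.standard_normal((m, n))
Y = rng.standard_normal((m, k))

f = lambda X: 0.5 * np.sum((X.T @ Y) ** 2)   # f(Z) = ||Z||_F^2 / 2 with Z = X^T Y

# Formula derived above: df/dX = Y (df/dZ)^T, and here df/dZ = Z
Z = X.T @ Y
grad = Y @ Z.T                               # shape m x n, same as X

# Central finite differences, entry by entry
h = 1e-6
num = np.zeros_like(X)
for i in range(m):
    for j in range(n):
        E = np.zeros_like(X); E[i, j] = h
        num[i, j] = (f(X + E) - f(X - E)) / (2 * h)

print(np.allclose(grad, num, atol=1e-5))     # True
```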

Also note that your initial dimension-matching approach was correct! That's not as crazy an idea as it seems. When dealing with rectangular matrices there is often only one way to fit all the pieces together and it's a useful shortcut. But it won't help when dealing with square matrices.
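To see why dimension matching fails for square matrices, consider $m=n=k$: both $Y\gradLR fZ^T$ and, say, $\gradLR fZ Y^T$ then have the right shape, but only one is the gradient. A small illustration of my own (again with $f(Z)=\frac12\|Z\|_F^2$, so $\grad fZ = Z$):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 3                                        # m = n = k: everything is square
X = rng.standard_normal((n, n))
Y = rng.standard_normal((n, n))

f = lambda X: 0.5 * np.sum((X.T @ Y) ** 2)   # f(Z) = ||Z||_F^2 / 2
G = X.T @ Y                                  # df/dZ = Z for this choice of f

# Both candidates are n x n, so shapes alone cannot decide between them:
cand1 = Y @ G.T        # Y (df/dZ)^T  -- the correct formula
cand2 = G @ Y.T        # (df/dZ) Y^T  -- also n x n, but not the gradient

h = 1e-6
num = np.zeros_like(X)
for i in range(n):
    for j in range(n):
        E = np.zeros_like(X); E[i, j] = h
        num[i, j] = (f(X + E) - f(X - E)) / (2 * h)

print(np.allclose(cand1, num, atol=1e-5))    # True
print(np.allclose(cand2, num, atol=1e-5))    # False (for generic random data)
```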
