Clean/simple way of computing $\nabla f(U U^\mathsf{T})$ with respect to $U$

matrix-calculus, multivariable-calculus

Let $f(X)$ be a matrix function which takes an $n \times n$ matrix $X$ and returns a scalar in $\mathbb{R}$. Let $U \in \mathbb{R}^{n \times r}$ for some $r \ll n$. I was reading a paper which stated, without explanation, that the gradient of $f(U U^\mathsf{T})$ with respect to $U$ is

$$
\left(\nabla f (U U^\mathsf{T}) + \nabla f(U U^\mathsf{T})^\mathsf{T}\right) \cdot U.
$$

The gradient of $f$ with respect to $X$ is naturally defined as the matrix $\nabla f(X) \in \mathbb{R}^{n \times n}$ such that

$$
[\nabla f(X)]_{ij} = \frac{\partial f(X)}{\partial x_{ij}}.
$$
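As a quick sanity check on this definition (my own example, not from the paper): if $f(X) = \operatorname{tr}(AX)$ for a fixed matrix $A$, then

$$
f(X) = \sum_{k,\ell} a_{k\ell}\, x_{\ell k} \quad\Longrightarrow\quad [\nabla f(X)]_{ij} = \frac{\partial f(X)}{\partial x_{ij}} = a_{ji}, \quad\text{i.e.}\quad \nabla f(X) = A^\mathsf{T}.
$$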

I spent a significant amount of time trying to derive this using the chain rule for multivariate functions. That is, I thought of $f : \mathbb{R}^{n^2} \to \mathbb{R}$ as being precomposed with $g : \mathbb{R}^{n \times r} \to \mathbb{R}^{n^2}$ defined by $g(U) = U U^\mathsf{T}$. But I don't see how the stated fact follows easily from this. It could be that I'm just far too used to thinking about gradients as vectors, and I can't translate that intuition to these matrix functions.

Ultimately, I gave up on that approach and verified the identity by working with individual partial derivatives. But my question is, is there a neat way to derive this fact without resorting to working with partial derivatives?

If there is, I really want to know it, because I find that I struggle to compute derivatives of matrix functions. How would someone experienced in this area approach this problem? Thank you!
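(For anyone who wants to sanity-check the identity numerically before attempting a derivation, here is a minimal finite-difference sketch. The test function $f(X) = \operatorname{tr}(X^3)$, with $\nabla f(X) = 3(X^2)^\mathsf{T}$, is my own arbitrary choice.)

```python
import numpy as np

rng = np.random.default_rng(0)
n, r = 6, 2
U = rng.standard_normal((n, r))

def f(X):
    # arbitrary smooth test function: f(X) = tr(X^3)
    return np.trace(X @ X @ X)

def grad_f(X):
    # gradient of tr(X^3) with respect to X is 3 (X^2)^T
    return 3 * (X @ X).T

# the claimed gradient with respect to U
X = U @ U.T
G = grad_f(X)
claimed = (G + G.T) @ U

# central finite differences with respect to each entry of U
eps = 1e-6
numeric = np.zeros_like(U)
for i in range(n):
    for j in range(r):
        E = np.zeros_like(U)
        E[i, j] = eps
        numeric[i, j] = (f((U + E) @ (U + E).T) - f((U - E) @ (U - E).T)) / (2 * eps)

print(np.max(np.abs(claimed - numeric)))  # tiny: finite-difference error only
```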

EDIT: Thanks to @greg's answer, I have decided to study (matrix) differentials. Here are some resources (in order from favorite to least favorite; but all three are useful) I found which might help others who stumble upon this question:

https://tminka.github.io/papers/matrix/

http://artem.sobolev.name/posts/2017-01-29-matrix-and-vector-calculus-via-differentials.html

http://thousandfold.net/cz/2013/11/12/a-useful-trick-for-computing-gradients-w-r-t-matrix-arguments-with-some-examples/

Best Answer

$\def\p#1#2{\frac{\partial #1}{\partial #2}}$Given
$$
f = f(X), \qquad G = \p{f}{X}, \qquad X = UU^\mathsf{T},
$$
attack the problem in three steps: write the differential in terms of $X$, change the independent variable from $X$ to $U$, and extract the new gradient.
$$
\begin{aligned}
df &= G:dX \\
   &= G:\left(dU\,U^\mathsf{T} + U\,dU^\mathsf{T}\right) \\
   &= \left(G+G^\mathsf{T}\right):dU\,U^\mathsf{T} \\
   &= \left(G+G^\mathsf{T}\right)U:dU \\
\p{f}{U} &= \left(G+G^\mathsf{T}\right)U
\end{aligned}
$$
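To see the three-step recipe on a concrete example (my own choice of $f$, not part of the original answer): take $f(X) = \operatorname{tr}(AX)$, so that $G = A^\mathsf{T}$. Then
$$
df = A^\mathsf{T}:dX = A^\mathsf{T}:\left(dU\,U^\mathsf{T} + U\,dU^\mathsf{T}\right) = \left(A^\mathsf{T}+A\right)U:dU \quad\Longrightarrow\quad \frac{\partial f}{\partial U} = \left(A^\mathsf{T}+A\right)U,
$$
which agrees with differentiating $f = \operatorname{tr}(AUU^\mathsf{T})$ entrywise.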


In the steps above, a colon denotes the trace/Frobenius product, i.e.
$$
\begin{aligned}
A:B &= \operatorname{Tr}(AB^\mathsf{T}) = \sum_{i=1}^m \sum_{j=1}^n A_{ij} B_{ij} \\
A:A &= \big\|A\big\|^2_F
\end{aligned}
$$
The properties of the underlying trace function allow the terms in such products to be rearranged in many equivalent ways:
$$
\begin{aligned}
A:B &= B:A = B^\mathsf{T}:A^\mathsf{T} \\
CA:B &= C:BA^\mathsf{T} = A:C^\mathsf{T}B
\end{aligned}
$$
Note that the matrices on either side of the colon must have the same dimensions.
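(These rearrangement rules are easy to confirm numerically. A quick sketch with random matrices of compatible but otherwise arbitrary shapes:)

```python
import numpy as np

rng = np.random.default_rng(1)
frob = lambda X, Y: np.trace(X @ Y.T)  # the Frobenius product X:Y

m, n, p = 3, 4, 5
A = rng.standard_normal((p, n))
B = rng.standard_normal((m, n))
C = rng.standard_normal((m, p))
B2 = rng.standard_normal((p, n))  # same shape as A for the symmetric rules

assert np.isclose(frob(A, B2), frob(B2, A))          # A:B = B:A
assert np.isclose(frob(A, B2), frob(B2.T, A.T))      # A:B = B^T:A^T
assert np.isclose(frob(C @ A, B), frob(C, B @ A.T))  # CA:B = C:BA^T
assert np.isclose(frob(C @ A, B), frob(A, C.T @ B))  # CA:B = A:C^T B
```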

The Frobenius product also interacts nicely with the Hadamard and Kronecker products:
$$
\begin{aligned}
A:(B\odot C) &= (A\odot B):C = \sum_{i=1}^m \sum_{j=1}^n A_{ij} B_{ij} C_{ij} \\
(A\otimes B):(E\otimes F) &= (A:E)\otimes(B:F)
\end{aligned}
$$
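(And a similar numerical check of the Hadamard/Kronecker identities; note that $A:E$ and $B:F$ are scalars, so the $\otimes$ on the right-hand side reduces to an ordinary product:)

```python
import numpy as np

rng = np.random.default_rng(2)
frob = lambda X, Y: np.trace(X @ Y.T)  # the Frobenius product X:Y

A, B, C = (rng.standard_normal((3, 4)) for _ in range(3))
assert np.isclose(frob(A, B * C), frob(A * B, C))    # A:(B∘C) = (A∘B):C
assert np.isclose(frob(A, B * C), np.sum(A * B * C))

E = rng.standard_normal((3, 4))
F, K = rng.standard_normal((2, 5)), rng.standard_normal((2, 5))
# (A⊗K):(E⊗F) = (A:E)(K:F)
assert np.isclose(frob(np.kron(A, K), np.kron(E, F)), frob(A, E) * frob(K, F))
```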


As you have discovered, the chain rule can be difficult to apply in matrix calculus, because the gradient operation changes the tensorial character of a variable.

For instance, the gradient of a vector with respect to another vector is a matrix. But matrix multiplication is non-commutative, which gums up the chain rule.

Worse, the gradient of a matrix with respect to a vector is a third-order tensor. Such a tensor is difficult to calculate, awkward to work with algebraically, and doesn't fit into standard matrix notation, which forces you to resort to ugly hacks like component-wise derivatives.

The beauty of the differential approach is that the differential of a matrix acts like a matrix. In particular, it obeys all of the rules of matrix algebra, so it is much easier to handle.
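To make that last point concrete, here is a tiny numerical illustration (mine, not the answerer's) that the differential of $X = UU^\mathsf{T}$ obeys the ordinary matrix product rule:

```python
import numpy as np

rng = np.random.default_rng(3)
n, r = 5, 2
U = rng.standard_normal((n, r))
dU = 1e-7 * rng.standard_normal((n, r))  # stand-in for an infinitesimal perturbation

# product rule for differentials: d(U U^T) = dU U^T + U dU^T
lhs = (U + dU) @ (U + dU).T - U @ U.T
rhs = dU @ U.T + U @ dU.T
print(np.max(np.abs(lhs - rhs)))  # O(|dU|^2): only the second-order remainder dU dU^T
```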