Gradient of Trace (A B A’ C) – Matrix Calculus

Tags: derivatives, matrices, matrix-calculus, scalar-fields, trace

Given three matrices $A$, $B$ and $C$ such that $ABA^T C$ is a square matrix, the derivative of the trace with respect to $A$ is:

$$
\nabla_A \operatorname{trace}( ABA^{T}C ) = CAB + C^T AB^T
$$
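
A quick finite-difference sanity check of the identity itself (a throwaway sketch in NumPy, not part of the linked proof; the sizes are arbitrary, chosen only so that $ABA^{T}C$ is square) agrees with it:

```python
import numpy as np

rng = np.random.default_rng(0)

# Shapes chosen so that A B A^T C is square: A is n x m, B is m x m, C is n x n.
n, m = 4, 3
A = rng.standard_normal((n, m))
B = rng.standard_normal((m, m))
C = rng.standard_normal((n, n))

def g(A):
    return np.trace(A @ B @ A.T @ C)

# Central finite-difference gradient of g with respect to the entries of A.
eps = 1e-6
num_grad = np.zeros_like(A)
for i in range(n):
    for j in range(m):
        E = np.zeros_like(A)
        E[i, j] = eps
        num_grad[i, j] = (g(A + E) - g(A - E)) / (2 * eps)

analytic = C @ A @ B + C.T @ A @ B.T   # the claimed gradient CAB + C^T A B^T
print(np.allclose(num_grad, analytic, atol=1e-5))   # True
```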

There is a proof here, page 4 (PDF file). However, I was not able to understand the second line of the proof. It says something like this:

$$
\nabla_A \operatorname{trace}( ABA^T C )=
\nabla_\bullet \operatorname{trace}(f(\bullet)A^{T}C) + \nabla_\bullet \operatorname{trace}(f(A) \bullet^T C)
$$

where $f(A) = AB$.

I tried searching for the proof or any hint using Google, but I couldn't find anything. Could anyone please help me with this second line? I cannot sleep. I am obsessed with this proof.

Best Answer

Since those "dots" appear nowhere else in the document, and the calculation is labeled "funky," I wonder if the author himself even understands what is going on there. (If I came across this while grading, I'd assume the writer had asked a friend for an explanation and copied down only half of what they said. It's a level of detail associated more with personal notes than communicating with others.)

I don't mean to criticize: the spirit of that document is to collect a number of matrix calculations and that is fine. But when the author uses the word "prove" (in "we prove that...") he is using it in a very loose sense. Enough detail is given in the calculation so that if you understand it, you could go and construct a proof. But a chain of unjustified, unexplained equations is not a "proof" in the strict sense. (The appearance of "$f'$" in that context cannot actually be explained in a way consistent with his definitions of $f$ and $'$, and so maybe counts as a mistake--- although it is immediately replaced with the right thing.)

Anyway, it is implicit use of the chain rule. ("Implicit" in the English-language sense of "there but not mentioned", not in any sense like implicit differentiation.) Here is a scalar-calculus analogy. One way to compute the derivative of
$$
g(a) = a^a
$$
is to regard $g(a)$ as the result of plugging the pair of functions $x = a$ and $y = a$ into the multivariable function $H(x,y) = x^y$. Whenever $H(x,y)$ is a function of two variables $x$ and $y$, each of which is a function of $a$, the multivariable chain rule tells you that
$$
\frac{d}{da} H(x,y) = \frac{\partial H(x, y)}{\partial x} \cdot \frac{dx}{da} + \frac{\partial H(x,y)}{\partial y} \cdot \frac{dy}{da},
$$
and the partials of $H(x,y) = x^y$ are easy to calculate, since in computing the $x$-partial derivative you hold $y$ fixed, and vice versa: $\frac{\partial H}{\partial x} = y x^{y-1}$ (from the power rule) and $\frac{\partial H}{\partial y} = x^y \ln(x)$ (from the rule for differentiating exponential functions). So with $H(x,y) = x^y$ and $x=a$ and $y=a$ in the above, we see
\begin{align*}
\frac{d}{da} (a^a) & = \frac{\partial H(a, a)}{\partial x} \cdot 1 + \frac{\partial H(a,a)}{\partial y} \cdot 1 \\
& = a\, a^{a - 1} + a^a \ln(a) = a^a + a^a \ln(a).
\end{align*}
It is not too much trouble to follow this calculation because I introduced new variables $x$ and $y$ and explicitly defined the function $H$ I was using. If, instead of writing the new variables, I wrote dots, and, instead of defining my new function $H$, I just wrote out formulas with $a$s and dots, it might get confusing.
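
If it helps to see the scalar analogy concretely, here is a throwaway numerical check of that computation (my own sketch, not from the document):

```python
import math

a = 1.7

# Treat H(x, y) = x**y with x = a and y = a, as in the chain-rule argument above.
dH_dx = a * a ** (a - 1)         # partial in x, evaluated at (a, a)
dH_dy = a ** a * math.log(a)     # partial in y, evaluated at (a, a)
chain_rule = dH_dx * 1 + dH_dy * 1   # dx/da = dy/da = 1

# Central finite difference of a^a for comparison.
eps = 1e-6
numeric = ((a + eps) ** (a + eps) - (a - eps) ** (a - eps)) / (2 * eps)
print(abs(numeric - chain_rule) < 1e-6)   # True
```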

That's basically what's going on in that document. Fix matrices $B$ and $C$. The function
$$
g(A) = \operatorname{trace}(ABA^TC)
$$
of the matrix $A$ has the form
$$
g(A) = H(f(A),A)
$$
for the scalar-valued function $H(X,Y) = \operatorname{trace}(X Y^T C)$ of two matrix variables and the matrix-valued function $f(A) = AB$ of one variable. The chain rule implies that in this case
$$
\nabla_A H(f(A),A) = (\nabla_X H(X,Y))\, B^T + \nabla_Y H(X,Y),
$$
where you evaluate the right-hand side at $X = f(A)$ and $Y = A$ after computing the matrix derivatives. Elsewhere in that document he sketches how to compute $\nabla_X H(X,Y) = (Y^T C)^T$ and $\nabla_Y H(X,Y) = CX$. So that is how the result $(A^T C)^T B^T + CAB = C^T A B^T + CAB$ comes out. Just as in the scalar case, introducing new variables, so that you can treat some of them as constants when differentiating with respect to the others, simplifies the calculation.
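
The two building blocks used here, $\nabla_X H(X,Y) = (Y^T C)^T$ and $\nabla_Y H(X,Y) = CX$, are easy to confirm numerically as well (again a throwaway NumPy sketch of mine, not part of the linked document; $X$ and $Y$ play the roles of $AB$ and $A$):

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 4, 3
X = rng.standard_normal((n, m))   # stands in for A @ B
Y = rng.standard_normal((n, m))   # stands in for A
C = rng.standard_normal((n, n))

def H(X, Y):
    return np.trace(X @ Y.T @ C)

def fd_grad(h, M, eps=1e-6):
    """Central finite-difference gradient of the scalar function h at the matrix M."""
    G = np.zeros_like(M)
    for i in range(M.shape[0]):
        for j in range(M.shape[1]):
            E = np.zeros_like(M)
            E[i, j] = eps
            G[i, j] = (h(M + E) - h(M - E)) / (2 * eps)
    return G

# grad_X H(X, Y) = (Y^T C)^T = C^T Y   and   grad_Y H(X, Y) = C X
print(np.allclose(fd_grad(lambda Z: H(Z, Y), X), C.T @ Y, atol=1e-5))   # True
print(np.allclose(fd_grad(lambda Z: H(X, Z), Y), C @ X, atol=1e-5))     # True
```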

To return to my initial remark about $f'$: many of that document's formal particulars are sketchy. For example, the nonrigorous treatment of the Jacobian in section 4 is just to motivate calculation, and the appearance of $f'$ in the calculation you ask about is not consistent with this use. My guess is that the author wanted to remind himself that something coming from the chain rule and related to $f$ went there, and so he wrote $f'$ as a mental reminder. Since he replaced it with the right thing, $B^T$, no harm was done.

... And one is right to be cautious about what exactly goes there, since for arbitrary matrix-valued functions $X = f(A)$ and $Y = k(A)$, the expression for $\nabla_A H(f(A),k(A))$ is a complicated function of the entries of $\nabla_X H(X,Y)$ and $\nabla_Y H(X,Y)$ and the partial derivatives of the entries of $f(A)$ and $k(A)$. It is not easily written as a matrix product of $\nabla_X H(X,Y)$ with a matrix depending only on $f$, plus a matrix product of $\nabla_Y H(X,Y)$ with a matrix depending only on $k$. We got a simple formula above because our $f(A) = AB$ and $k(A) = A$ were simple functions of $A$ (e.g. any one entry of $f(A)$ depends only on one row of entries of $A$, and not on all of the entries of $A$, as might happen in general). If you were wondering why the author did not include a general matrix-derivative chain rule, this is why. It doesn't look very nice in general; it only looks nice in various special cases.
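
For the record, the general entrywise statement (my own sketch, not taken from that document) is
$$
\frac{\partial}{\partial A_{ij}} H(f(A), k(A))
= \sum_{p,q} \frac{\partial H}{\partial X_{pq}} \frac{\partial f(A)_{pq}}{\partial A_{ij}}
+ \sum_{p,q} \frac{\partial H}{\partial Y_{pq}} \frac{\partial k(A)_{pq}}{\partial A_{ij}},
$$
and only for special choices such as $f(A) = AB$, where $\partial f(A)_{pq} / \partial A_{ij} = \delta_{pi} B_{jq}$, do the sums collapse into a clean matrix product like $(\nabla_X H)\, B^T$.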