[Math] Why does the gradient of matrix product $AB$ w.r.t. $A$ equal $B^T$?

derivatives, linear algebra, matrix-calculus, multivariable-calculus

The below passage is from p. 215 of Deep Learning by Goodfellow, Bengio and Courville.

For example, we might use a matrix multiplication operation to create
a variable $C = AB$. Suppose that the gradient of a scalar $z$ with
respect to $C$ is given by $G$. The matrix multiplication operation is
responsible for defining two back-propagation rules, one for each of
its input arguments. If we call the bprop method to request the
gradient with respect to $A$ given that the gradient on the output is
$G$, then the bprop method of the matrix multiplication operation
must state that the gradient with respect to $A$ is given by $GB^T$.

They are applying the chain rule to compute the gradient of the scalar $z = f(C)$ with respect to $A$. I am unfamiliar with the idea of computing the gradient of a product of matrices with respect to a matrix. What does this mean, and why is the result transposed?

Best Answer

Here $\partial z/\partial A$ denotes the matrix of partial derivatives $\partial z/\partial A_{ij}$, which has the same shape as $A$. Given the gradient with respect to $C$,

$$\frac{\partial z}{\partial C} = G$$

use the Frobenius inner product, $X:Y = \sum_{i,j} X_{ij}Y_{ij} = {\rm tr}\big(X^TY\big)$, to write the differential. Since $C = AB$ and $B$ is held fixed, $dC = dA\,B$, so

$$\eqalign{ dz &= G:dC \cr &= G:dA\,B \cr &= GB^T:dA \cr }$$

from which the gradient with respect to $A$ can be identified as

$$\frac{\partial z}{\partial A} = GB^T$$

The last step of the differential uses the fact that Frobenius products can be rearranged in various ways:

$$\eqalign{ A:BC &= BC:A \cr &= A^T:(BC)^T \cr &= AC^T:B \cr &= B^TA:C \cr &= {\rm tr}\big(A^TBC\big) \cr }$$

all of which can be verified directly, or by considering the trace-equivalence and the cyclic property of the trace.
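For anyone who wants to sanity-check the result numerically, here is a minimal sketch in plain Python (standard library only). It is not from the book or from the answer. The shapes, the random seed, the helper names (`matmul`, `transpose`, `frobenius`, `random_matrix`) and the test function $z = G:(AB)$ are all assumptions made for illustration. That particular $z$ is chosen because its gradient with respect to $C$ is exactly $G$. The script compares $GB^T$ against central finite differences of $z$ with respect to each entry of $A$.

```python
import random


def matmul(X, Y):
    """Product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def transpose(X):
    """Transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*X)]


def frobenius(X, Y):
    """Frobenius inner product X:Y = sum_ij X_ij * Y_ij."""
    return sum(x * y for rx, ry in zip(X, Y) for x, y in zip(rx, ry))


def random_matrix(rows, cols, rng):
    """Matrix with entries drawn uniformly from [-1, 1]."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


# Illustrative shapes (an assumption): A is 3x4, B is 4x2, so C and G are 3x2.
rng = random.Random(0)
A = random_matrix(3, 4, rng)
B = random_matrix(4, 2, rng)
G = random_matrix(3, 2, rng)  # plays the role of dz/dC


def z(A_):
    # Test function z = G : (A B); its gradient with respect to C = A B is G.
    return frobenius(G, matmul(A_, B))


# Claimed gradient with respect to A; note it has the same shape as A.
analytic = matmul(G, transpose(B))

# Central finite differences, perturbing one entry of A at a time.
eps = 1e-6
numeric = [[0.0] * 4 for _ in range(3)]
for i in range(3):
    for k in range(4):
        A_plus = [row[:] for row in A]
        A_minus = [row[:] for row in A]
        A_plus[i][k] += eps
        A_minus[i][k] -= eps
        numeric[i][k] = (z(A_plus) - z(A_minus)) / (2 * eps)

# Largest entrywise disagreement between the two gradients.
max_err = max(
    abs(n - a) for rn, ra in zip(numeric, analytic) for n, a in zip(rn, ra)
)
print("max abs difference:", max_err)
```

The printed difference should be at the level of floating-point rounding, since this $z$ is linear in $A$. The shapes also show why the transpose has to appear: $G$ is $3\times 2$ and $B$ is $4\times 2$, so $GB^T$ is the only product of the two that has the $3\times 4$ shape of $A$.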