Derivative w.r.t. x of Matrix Product $A(x)B(x)$

If $A(x)$ is a row vector and $B(x)$ is a column vector, the following holds:
$$
\frac{d(A(x)B(x))}{dx}=\frac{dA(x)}{dx}B(x)+\frac{dB^t(x)}{dx}A(x)^t.
$$
If you take the derivative of a $d_1\times d_2$ matrix w.r.t. a length-$p$ column vector, you get a $p\times d_1\times d_2$ tensor.
If you take the derivative of a $d_1\times d_2$ matrix w.r.t. a length-$p$ row vector, you get a $d_1\times d_2\times p$ tensor.
Suppose $A(x)$ is $d_1\times d_2$, $B(x)$ is $d_2\times d_3$, and $x$ is a length-$p$ column vector. Then the dimensions are
$$
\begin{aligned}
\frac{d(A(x)B(x))}{dx} &\qquad p\times d_1\times d_3\\
\frac{dA(x)}{dx}B(x) &\qquad (p\times d_1\times d_2)\cdot (d_2\times d_3)\\
\frac{dB^t(x)}{dx}A(x)^t &\qquad (p\times d_3\times d_2)\cdot (d_2\times d_1).
\end{aligned}
$$
The last term, with dimension $p\times d_3\times d_1$, is not what we want, nor is it a transpose of what we want. Furthermore, whenever you differentiate $B(x)$, the dimension $p$ ends up adjacent to $d_2$ or $d_3$ and can never be adjacent to $d_1$. So the issue seems unresolvable. Is there any way to circumvent it?

Best Answer

Index notation illustrates the problem quite well.

Denoting $\frac{\partial}{\partial x_n}$ by $\partial_n$, we have $$\eqalign{ C_{ik} &= A_{ij}B_{jk} \cr \partial_nC_{ik} &= \big(\partial_n A_{ij}\big)B_{jk} + A_{ij}\big({\partial_n B_{jk}}\big) \cr }$$ The first term is fine for the contraction over the $j$-index, since that index sits at the end of the gradient $\partial_n A_{ij}$ and at the start of $B_{jk}$. In the second term, however, the $j$-index of the gradient $\partial_n B_{jk}$ is sandwiched between the $n$ and $k$ indices. Since the gradient is a third-order tensor, no matrix transpose can reorder its indices to fix this. This is unlike the following case.
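The index expression above maps directly onto `numpy.einsum`, which makes it easy to check numerically. Below is a minimal sketch, not a definitive implementation: the functions `A`, `B`, `dA`, `dB` and the weight tensors `WA`, `WB` are hypothetical examples chosen only so that the derivatives are known in closed form, and the index order `(n, i, k)` for the result is an assumption matching the $p\times d_1\times d_3$ layout in the question.

```python
import numpy as np

rng = np.random.default_rng(0)
p, d1, d2, d3 = 4, 2, 3, 5

# Hypothetical smooth matrix-valued functions of x (illustration only).
WA = rng.normal(size=(d1, d2, p))
WB = rng.normal(size=(d2, d3, p))

def A(x):
    return np.sin(WA @ x)            # shape (d1, d2), indices (i, j)

def B(x):
    return np.cos(WB @ x)            # shape (d2, d3), indices (j, k)

def dA(x):
    # d_n A_ij = cos(WA x)_ij * WA_ijn, stored with indices (n, i, j)
    return np.einsum('ij,ijn->nij', np.cos(WA @ x), WA)

def dB(x):
    # d_n B_jk = -sin(WB x)_jk * WB_jkn, stored with indices (n, j, k)
    return np.einsum('jk,jkn->njk', -np.sin(WB @ x), WB)

x = rng.normal(size=p)

# d_n C_ik = (d_n A_ij) B_jk + A_ij (d_n B_jk)
grad_C = (np.einsum('nij,jk->nik', dA(x), B(x))
          + np.einsum('ij,njk->nik', A(x), dB(x)))

# Central finite differences along each coordinate direction of x.
eps = 1e-6
fd = np.stack([
    (A(x + eps * e) @ B(x + eps * e) - A(x - eps * e) @ B(x - eps * e)) / (2 * eps)
    for e in np.eye(p)
])

max_err = np.abs(grad_C - fd).max()
print(grad_C.shape, max_err)
```

The point of writing both terms with explicit index strings is that the position of the contracted index `j` no longer matters: `einsum` contracts by label, not by adjacency, so the second term needs no special treatment.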

In the case where $(A,B)$ are vectors, simply omit the $(i,k)$ indices to obtain $$\eqalign{ C &= A_{j}B_{j} \cr \partial_nC &= \big(\partial_n A_{j}\big)B_{j} + A_{j}\big({\partial_n B_{j}}\big) \cr }$$ Here the gradients in parentheses are only second-order (matrices), so this expression can be fixed by transposing the second term.
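For the vector case, the transposition can be made concrete with Jacobians. The sketch below assumes the convention that a Jacobian has shape $d\times p$ with entries $J_{jn}=\partial_n A_j$, so that the gradient of the scalar $C$ is $J_A^T B + J_B^T A$. The functions `a`, `b`, `Ja`, `Jb` and the weights `Wa`, `Wb` are hypothetical and exist only to have closed-form derivatives to compare against.

```python
import numpy as np

rng = np.random.default_rng(1)
p, d = 4, 3

# Hypothetical vector-valued functions of x (illustration only).
Wa = rng.normal(size=(d, p))
Wb = rng.normal(size=(d, p))

def a(x):
    return np.sin(Wa @ x)                      # shape (d,)

def b(x):
    return np.cos(Wb @ x)                      # shape (d,)

def Ja(x):
    # Jacobian with entries Ja[j, n] = d a_j / d x_n, shape (d, p)
    return np.cos(Wa @ x)[:, None] * Wa

def Jb(x):
    # Jacobian with entries Jb[j, n] = d b_j / d x_n, shape (d, p)
    return -np.sin(Wb @ x)[:, None] * Wb

x = rng.normal(size=p)

# Gradient of the scalar C = a . b, using a transpose in both terms
# because of the (d, p) Jacobian layout assumed here.
grad = Ja(x).T @ b(x) + Jb(x).T @ a(x)         # shape (p,)

# Central finite differences for comparison.
eps = 1e-6
fd = np.array([
    (a(x + eps * e) @ b(x + eps * e) - a(x - eps * e) @ b(x - eps * e)) / (2 * eps)
    for e in np.eye(p)
])

max_err = np.abs(grad - fd).max()
print(grad.shape, max_err)
```

With the opposite layout ($p\times d$ gradients, as in the question), the transposes move onto the other factors, but the idea is the same: a second-order object can always be transposed into place.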

Generalizing in the other direction, if the independent variable is a matrix $X$, write $\frac{\partial}{\partial X_{nm}}=\partial_{nm}$. Appending the $m$-index then gives $$\eqalign{ \partial_{nm}C_{ik} &= \big(\partial_{nm} A_{ij}\big)B_{jk} + A_{ij}\big({\partial_{nm} B_{jk}}\big) \cr }$$ Now the gradients in parentheses are fourth-order tensors, and once again there is no simple operation that will re-order the indices.

The simplest way to avoid the problem is to use differentials instead of gradients. $$\eqalign{ C &= A\star B \cr dC &= dA\star B + A\star dB \cr }$$ where $(A,B,C)$ can be scalars, vectors, or tensors, and $(\star)$ can be any kind of product (Kronecker, Hadamard, Frobenius, tensor, matrix) which is compatible with their given dimensions.
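A quick numerical illustration of the differential rule: for any bilinear product, the exact change $(A+t\,dA)\star(B+t\,dB)-A\star B$ differs from the first-order term $t\,(dA\star B + A\star dB)$ by exactly $t^2\,dA\star dB$, so shrinking $t$ by a factor of 10 should shrink the mismatch by a factor of about 100. The sketch below checks this for three of the products mentioned; the helper `first_order_error` and the random test matrices are hypothetical names introduced for this example.

```python
import numpy as np

rng = np.random.default_rng(2)

A = rng.normal(size=(2, 3))
B = rng.normal(size=(3, 5))
A2 = rng.normal(size=(2, 3))     # same shape as A, for the Hadamard product

def first_order_error(product, P, Q, dP, dQ, t):
    """Largest entry of the gap between the exact change and the differential."""
    exact = product(P + t * dP, Q + t * dQ) - product(P, Q)
    linear = product(t * dP, Q) + product(P, t * dQ)   # dP*Q + P*dQ, scaled by t
    return np.abs(exact - linear).max()

# (product, left operand, right operand) for each kind of product.
cases = {
    'matrix':    (np.matmul,   A, B),
    'hadamard':  (np.multiply, A, A2),
    'kronecker': (np.kron,     A, B),
}

ratios = {}
for name, (product, P, Q) in cases.items():
    dP = rng.normal(size=P.shape)
    dQ = rng.normal(size=Q.shape)
    e_big = first_order_error(product, P, Q, dP, dQ, 1e-2)
    e_small = first_order_error(product, P, Q, dP, dQ, 1e-3)
    ratios[name] = e_big / e_small     # expected to be close to 100

print(ratios)
```

Note that nothing in the check depends on index ordering: $dA$ and $dB$ have the same shapes as $A$ and $B$, which is why the differential form sidesteps the dimension problem in the question.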
