Help with the differentiation chain rule for tensors in backpropagation

chain rule, derivatives, machine learning, tensors

Say we're given $N$ feature vectors $\mathbf{x}_i \in \mathbb{R}^{D \times 1}$, assembled into a matrix $X \in \mathbb{R}^{D \times N}$. We also have a matrix $W \in \mathbb{R}^{D \times D}$, $W = XX^\top$, and a predictor matrix $Y \in \mathbb{R}^{D \times N}$, $Y = WX$. Suppose we have a scalar function, e.g. $\ell = \left\lVert Y \right\rVert_F$. We need to compute the gradients of $\ell$ using back-propagation.

We know that $\frac{\partial \ell}{\partial Y}$ is a matrix. $\frac{\partial \ell}{\partial W}$ is also a matrix, and $\frac{\partial \ell}{\partial W} = \frac{\partial \ell}{\partial Y} \frac{\partial Y}{\partial W}$ (by the chain rule).

Here's the problem: $\frac{\partial Y}{\partial W}$ is a 4-tensor, and multiplying a matrix by a 4-tensor gives a 4-tensor (albeit potentially of a different shape), NOT a matrix (as $\frac{\partial \ell}{\partial W}$ should be)!
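For concreteness, here's a tiny NumPy sketch of the shapes involved (toy sizes $D=3$, $N=5$ chosen arbitrarily, everything else as defined above):

```python
import numpy as np

# Toy sizes, chosen arbitrarily; everything else follows the definitions above.
rng = np.random.default_rng(0)
D, N = 3, 5
X = rng.standard_normal((D, N))
W = X @ X.T                          # W = X X^T, shape (D, D)
Y = W @ X                            # Y = W X,   shape (D, N)
l = np.linalg.norm(Y, "fro")         # l = ||Y||_F, a scalar

dl_dY = Y / l                        # dl/dY, shape (D, N) -- a matrix, fine

# dY/dW, built entry by entry with forward differences: a 4-tensor of shape (D, N, D, D)
eps = 1e-6
dY_dW = np.zeros((D, N, D, D))
for a in range(D):
    for b in range(D):
        Wp = W.copy()
        Wp[a, b] += eps
        dY_dW[:, :, a, b] = (Wp @ X - Y) / eps

print(dY_dW.shape)                   # (3, 5, 3, 3) -- not a matrix!
```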

Obviously, I'm doing something wrong. The question is – what?

Thx

Best Answer

The kind of "multiplication" used here is the contraction $\frac{\partial\ell}{\partial W_{ab}}=\sum_{c,d}\frac{\partial\ell}{\partial Y_{cd}}\frac{\partial Y_{cd}}{\partial W_{ab}}$, where nobody bothers to write the $\sum_{c,d}$. In your example, $Y=WX$ gives $\frac{\partial Y_{cd}}{\partial W_{ab}}=\delta_{ca}X_{bd}$, so the sum collapses to $\frac{\partial\ell}{\partial W_{ab}}=\sum_{d}\frac{\partial\ell}{\partial Y_{ad}}X_{bd}$, i.e. $\frac{\partial\ell}{\partial W}=\frac{\partial\ell}{\partial Y}X^\top$, which is a matrix, as it should be.
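In case it helps, here is a minimal NumPy sketch of that contraction (toy sizes; $W$ is drawn as an independent variable here, since that is how this particular chain-rule step treats it). It checks that contracting $\frac{\partial\ell}{\partial Y}$ against the full 4-tensor agrees with the collapsed matrix product $\frac{\partial\ell}{\partial Y}X^\top$, and that both match a finite-difference estimate:

```python
import numpy as np

# Toy sizes; W is treated as an independent variable for this chain-rule step.
rng = np.random.default_rng(0)
D, N = 4, 6
X = rng.standard_normal((D, N))
W = rng.standard_normal((D, D))

Y = W @ X
l = np.linalg.norm(Y, "fro")

dl_dY = Y / l                                     # dl/dY for l = ||Y||_F
dY_dW = np.einsum("ca,bd->cdab", np.eye(D), X)    # dY[c,d]/dW[a,b] = delta(c,a) * X[b,d]

# The "matrix times 4-tensor" product is really this contraction over c, d ...
dl_dW_contracted = np.einsum("cd,cdab->ab", dl_dY, dY_dW)
# ... which collapses to an ordinary matrix product:
dl_dW_matrix = dl_dY @ X.T

# Finite-difference check on a single entry of dl/dW.
eps = 1e-6
a, b = 1, 2
Wp = W.copy(); Wp[a, b] += eps
Wm = W.copy(); Wm[a, b] -= eps
fd = (np.linalg.norm(Wp @ X, "fro") - np.linalg.norm(Wm @ X, "fro")) / (2 * eps)

print(np.allclose(dl_dW_contracted, dl_dW_matrix))    # True
print(np.isclose(dl_dW_matrix[a, b], fd, atol=1e-6))  # True
```

This contraction is exactly what back-propagation performs for you, one factor at a time, so the 4-tensor never has to be materialised in practice.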
