[Math] Partial derivative of matrix product in neural network

calculus, linear-algebra, matrices, matrix-calculus, neural-networks

I'm reading a book about neural networks. In the section about back-propagation through an affine layer of the network, the author provides a formula and omits the details.

Say $$\mathbf{X}\cdot\mathbf{W}=\mathbf{Y},$$

where $\mathbf{X}$ is of dimension $(N, 2)$ and $\mathbf{W}$ of dimension $(2, 3)$, so $\mathbf{Y}$ is of dimension $(N, 3)$. The loss function, say $L$, is then computed from $\mathbf{Y}$. The provided formula is:

$$\begin{align}
\frac{\partial L}{\partial\mathbf{X}} &= \frac{\partial L}{\partial\mathbf{Y}}\cdot\mathbf{W}^\mathrm{T},\\
\frac{\partial L}{\partial\mathbf{W}} &= \mathbf{X}^\mathrm{T}\cdot\frac{\partial L}{\partial\mathbf{Y}}.
\end{align}$$
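To keep the shapes straight, here is a small NumPy sketch (my own, not from the book; $N=4$ and the random values are arbitrary) that just checks the dimensions in these formulas:

```python
import numpy as np

# Shape check of the given formulas (illustrative sizes)
N = 4
X = np.random.randn(N, 2)      # input,   shape (N, 2)
W = np.random.randn(2, 3)      # weights, shape (2, 3)
Y = X @ W                      # output,  shape (N, 3)

dL_dY = np.random.randn(N, 3)  # stand-in for the upstream gradient dL/dY

dL_dX = dL_dY @ W.T            # shape (N, 2) -- matches X
dL_dW = X.T @ dL_dY            # shape (2, 3) -- matches W

print(dL_dX.shape, dL_dW.shape)  # (4, 2) (2, 3)
```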

What I really want to know first is: what is the dimension of $\Large\frac{\partial\mathbf{Y}}{\partial\mathbf{X}}$? Since

$$y_{ij} = \sum_{k=1}^{2}x_{ik}\cdot w_{kj},$$

each $y_{ij}$ depends on two of the $x_{ik}$, so which one do I differentiate with respect to? I'm confused at this point. What is the definition of the partial derivative of a matrix product? And where do the transposes $\mathbf{W}^\mathrm{T}$ and $\mathbf{X}^\mathrm{T}$ come from?
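To probe this numerically, I wrote a small sketch (my own experiment; the explicit loop is just for clarity). It builds every partial $\partial y_{ij}/\partial x_{nk} = \delta_{in}\,w_{kj}$, which suggests $\partial\mathbf{Y}/\partial\mathbf{X}$ is a four-index array of shape $(N, 3, N, 2)$:

```python
import numpy as np

N = 4
X = np.random.randn(N, 2)
W = np.random.randn(2, 3)
Y = X @ W

# One partial derivative per (output entry, input entry) pair:
# d y_{ij} / d x_{nk} = delta_{in} * w_{kj}, a 4-index array.
J = np.zeros((N, 3, N, 2))
for i in range(N):
    for j in range(3):
        for k in range(2):
            J[i, j, i, k] = W[k, j]   # nonzero only when n == i

# Finite-difference check of one entry, d y_{02} / d x_{01}
eps = 1e-6
Xp = X.copy()
Xp[0, 1] += eps
numeric = ((Xp @ W - Y) / eps)[0, 2]
print(np.isclose(numeric, J[0, 2, 0, 1]))  # True
```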

I drew a picture of this process (I omit the bias $\mathbf{B}$; since it's out of scope here, I assume it is the zero matrix):

[figure: diagram of the affine layer's forward pass described above]

Best Answer

Suppose we are given a loss function $L=L(Y)$ and its gradient $G=\frac{\partial L}{\partial Y}$.

Then suppose we are told that $Y$, in turn, depends on two other variables:
$$\begin{align}
Y &= XW, \\
dY &= dX\,W + X\,dW.
\end{align}$$

Let's find the differential of the loss function in terms of these two variables, writing $A:B = {\rm tr}(A^\mathrm{T}B)$ for the trace/Frobenius product:
$$\begin{align}
dL &= G:dY \\
&= G:dX\,W + G:X\,dW \\
&= GW^\mathrm{T}:dX + X^\mathrm{T}G:dW.
\end{align}$$

The last step uses the rearrangement rules of the Frobenius product, which follow from the cyclic property of the trace:
$$A:BC = AC^\mathrm{T}:B = B^\mathrm{T}A:C.$$

Holding $W$ constant means that $dW=0$, which yields the gradient
$$\frac{\partial L}{\partial X} = GW^\mathrm{T}.$$

Similarly, holding $X$ constant yields the gradient with respect to $W$:
$$\frac{\partial L}{\partial W} = X^\mathrm{T}G.$$
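As a sanity check, here is a sketch (my own, not part of the original answer; I pick the concrete loss $L(Y)=\tfrac12\|Y\|_F^2$, for which $G=Y$, and arbitrary sizes) confirming both gradient formulas against finite differences:

```python
import numpy as np

N = 4
X = np.random.randn(N, 2)
W = np.random.randn(2, 3)

def loss(X, W):
    # L(Y) = 0.5 * ||Y||_F^2, chosen so that G = dL/dY = Y
    return 0.5 * np.sum((X @ W) ** 2)

G = X @ W            # dL/dY for this particular loss
dL_dX = G @ W.T      # the formula G W^T
dL_dW = X.T @ G      # the formula X^T G

# Finite-difference check of one entry of each gradient
eps = 1e-6
Xp = X.copy(); Xp[1, 0] += eps
Wp = W.copy(); Wp[0, 2] += eps
print(np.isclose((loss(Xp, W) - loss(X, W)) / eps, dL_dX[1, 0], atol=1e-4))  # True
print(np.isclose((loss(X, Wp) - loss(X, W)) / eps, dL_dW[0, 2], atol=1e-4))  # True
```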
