Derivative of matrix product with respect to column vector

derivatives, matrices, matrix-calculus

I have the following matrix product:

$Y=W^{(2)}\sigma(W^{(1)}X+b^{(1)})+b^{(2)}$

where $W^{(1)}\in R^{m\times n}$, $W^{(2)}\in R^{(m-1)\times m}$, $X\in R^{n\times p}$, $b^{(1)} \in R^m$, and $b^{(2)} \in R^{m-1}$. Moreover, $\sigma: R^{m\times p} \rightarrow R^{m\times p}$ is any nonlinear function applied elementwise.

I am given the following result:

$\frac{\partial Y}{\partial X_j}=\sum_{i}w_{,i}^{(2)}\sigma'(I_i^{(1)})w_{ij}$

where $X_j$ is a column vector in $X$, $w_{,i}^{(2)}$ is the $i$-th column vector in $W^{(2)}$, and $I_i^{(1)}=W^{(1)}X+b^{(1)}$. I am basically describing the derivative of the output of a feedforward neural network with one hidden layer with respect to a specific column of the matrix $X$.
Can somebody please explain to me how to get to this result? It is quite confusing to me due to the fact that the derivative of a matrix product is taken with respect to one vector of one of the two initial matrices.

Best Answer

$\def\bb{\mathbb}$ Superscripts may be necessary in describing large neural networks, but here they are visual clutter and a source of confusion. So let the variables with ${\tt1}$ as a superscript keep their names with the superscript dropped, $\;\big(W^{(1)},b^{(1)}\big)\to\big(W,b\big),\;$ and rename the others: $\;\big(W^{(2)},b^{(2)}\big)\to\big(U,c\big)$.

Another source of confusion is that the equation as written is dimensionally inconsistent, since it implicitly uses broadcasting. To create a dimensionally consistent equation, an all-ones vector $({\tt1}\in{\bb R}^{p})\,$ must be explicitly included.
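For a concrete illustration, here is a minimal NumPy sketch (the sizes are arbitrary) showing that the broadcast form and the explicit rank-one form $b{\tt1}^T$ agree:

```python
import numpy as np

m, n, p = 4, 3, 5                              # arbitrary sizes
rng = np.random.default_rng(0)
W = rng.standard_normal((m, n))
X = rng.standard_normal((n, p))
b = rng.standard_normal(m)

broadcast = W @ X + b[:, None]                 # b broadcast across the p columns
explicit  = W @ X + np.outer(b, np.ones(p))    # the dimensionally consistent b 1^T term
assert np.allclose(broadcast, explicit)
```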

It will also prove convenient to define the matrices $$\eqalign{ S &= \sigma\big(WX+b{\tt1}^T\big) \\ S' &= \sigma'\big(WX+b{\tt1}^T\big) \\ D &= {\rm Diag}\Big({\rm vec}(S')\Big) \\ }$$ With these changes the equation (and its differential) can be written as $$\eqalign{ Y &= US + c{\tt1}^T \\ dY &= U\,dS \;=\; U\,\Big(S' \odot (W\,dX)\Big) \\ }$$ where $\odot$ denotes the elementwise/Hadamard product.
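To see that this differential is correct, one can compare it against a central finite difference of $Y$ along a random direction $dX$. Below is a small sketch, assuming $\sigma=\tanh$ (so that $\sigma'=1-\tanh^2$) and arbitrary sizes:

```python
import numpy as np

m, n, p, k = 4, 3, 5, 3                        # k plays the role of m-1
rng = np.random.default_rng(1)
W = rng.standard_normal((m, n))
U = rng.standard_normal((k, m))
X = rng.standard_normal((n, p))
b = rng.standard_normal((m, 1))
c = rng.standard_normal((k, 1))

Y  = lambda X: U @ np.tanh(W @ X + b) + c      # Y = U sigma(WX + b 1^T) + c 1^T
Sp = 1 - np.tanh(W @ X + b) ** 2               # S' = sigma'(WX + b 1^T)

dX = rng.standard_normal((n, p))               # arbitrary direction
dY = U @ (Sp * (W @ dX))                       # dY = U (S' o (W dX))

eps = 1e-6
fd = (Y(X + eps * dX) - Y(X - eps * dX)) / (2 * eps)
assert np.allclose(dY, fd, atol=1e-6)
```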

As a final impediment to your comprehension, the quantity $\left(\frac{\partial Y}{\partial X_j}\right)$ is a matrix-by-vector derivative, i.e. a third-order tensor. There is no way to deal with such quantities using standard matrix notation.

To work within matrix notation, the expression needs to be vectorized, after which the gradient can be calculated quite simply. $$\eqalign{ {\rm vec}(dY) &= (I_p\otimes U)\,{\rm vec}\Big(S'\odot(W\,dX)\Big) \\ &= (I_p\otimes U)\,\Big({\rm vec}(S')\odot{\rm vec}(W\,dX)\Big) \\ &= (I_p\otimes U)\,D\,(I_p\otimes W)\,{\rm vec}(dX) \\ \frac{\partial{\rm vec}(Y)}{\partial{\rm vec}(X)} &= (I_p\otimes U)\,D\,(I_p\otimes W) \\ }$$ where $\otimes$ denotes the Kronecker product and the identity ${\rm vec}(AB)=(I\otimes A)\,{\rm vec}(B)$ has been used for the factors $U$ and $W$.
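The final Kronecker-product expression can be checked numerically against a brute-force Jacobian of ${\rm vec}(Y)$ with respect to ${\rm vec}(X)$. Again a sketch under the same assumptions ($\sigma=\tanh$), with a column-major ${\rm vec}$ to match the convention above:

```python
import numpy as np

m, n, p, k = 4, 3, 5, 3
rng = np.random.default_rng(2)
W = rng.standard_normal((m, n))
U = rng.standard_normal((k, m))
X = rng.standard_normal((n, p))
b = rng.standard_normal((m, 1))
c = rng.standard_normal((k, 1))

vec = lambda A: A.reshape(-1, order="F")               # column-major vec(.)
Y   = lambda X: U @ np.tanh(W @ X + b) + c

D = np.diag(vec(1 - np.tanh(W @ X + b) ** 2))          # D = Diag(vec(S'))
J = np.kron(np.eye(p), U) @ D @ np.kron(np.eye(p), W)  # (I_p kron U) D (I_p kron W)

# finite-difference Jacobian of vec(Y) w.r.t. vec(X), column by column
eps = 1e-6
J_fd = np.zeros((k * p, n * p))
for j in range(n * p):
    e = np.zeros(n * p)
    e[j] = eps
    dX = e.reshape(n, p, order="F")
    J_fd[:, j] = vec(Y(X + dX) - Y(X - dX)) / (2 * eps)

assert np.allclose(J, J_fd, atol=1e-5)
```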

The above derivation makes use of the fact that a Hadamard product between two vectors is equivalent to converting one of the vectors into a diagonal matrix and then using normal matrix multiplication, i.e. $$a\odot b = {\rm Diag}(a)\,b$$ (a two-line numerical check of this identity appears at the end of this answer).

P.S. The $\textit{given}$ result is nearly incomprehensible. This is a sad state of affairs which is all too common in the field of Machine Learning / Neural Networks. A good book to learn this stuff properly is Matrix Differential Calculus by Magnus and Neudecker.
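And the promised check of the Hadamard/Diag identity, with two arbitrary small vectors:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])
assert np.allclose(a * b, np.diag(a) @ b)      # a o b  ==  Diag(a) b
```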
