Gradient matrix of loss function for single hidden layer neural network

gradient descent, matrix equations, neural networks, partial derivative

So I have a function $$\hat y=f(\mathbf x)=\mathbf{w}_2^\mathsf{T}\pi(\mathbf z)$$

with $$\mathbf z=\mathbf W_1^\mathsf T\mathbf x$$ and the sigmoid $$\pi(t)={1\over1+e^{-t}}$$ applied elementwise. As the squared loss we use $$l={1\over 2}(y-\hat y)^2.$$
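
To make the shapes concrete, here is a quick NumPy sketch of the forward pass and loss (the sizes $n$, $m$ and the random values are arbitrary, just for illustration):

```python
import numpy as np

# Hypothetical sizes, just to pin down the shapes.
n, m = 4, 3
rng = np.random.default_rng(0)
x  = rng.normal(size=n)          # input vector
W1 = rng.normal(size=(n, m))     # hidden-layer weights, so z = W1^T x
w2 = rng.normal(size=m)          # output weights
y  = 1.0                         # target

sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

z    = W1.T @ x                  # z = W1^T x, shape (m,)
yhat = w2 @ sigmoid(z)           # yhat = w2^T pi(z), a scalar
loss = 0.5 * (y - yhat) ** 2
```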

Now I want to find the gradient matrix of $l$ with respect to $\mathbf W_1$.

I found a good article that helped me understand how to tackle the problem. Using the chain rule, I came up with:
$${\partial l \over \partial\mathbf W_1}={\partial l \over \partial\hat y}{\partial \hat y \over \partial\pi}{\partial \pi \over \partial\mathbf z}{\partial \mathbf z \over \partial\mathbf W_1}$$

The first part should be $(\hat y-y)$,

the second part $\mathbf w_2^\mathsf T$

and the third the derivative of the sigmoid, $\pi(\mathbf z)(1-\pi(\mathbf z))$.

But I can't come up with the derivative of $\mathbf z$ with respect to $\mathbf W_1$.
Can somebody help?

Best Answer

Start by taking the differential of the loss function,
then successively substitute the various variable definitions. $$\eqalign{ {\cal L} &= \tfrac{1}{2}(\hat y-y)^T(\hat y-y) \\ d{\cal L} &= (\hat y-y)^Td\hat y \\ &= (\hat y-y)^TW_2^Td\pi \\ &= (\hat y-y)^TW_2^T(P-P^2)\,dz \\ &= (\hat y-y)^TW_2^T(P-P^2)\,dW_1\,x \\ }$$ where the matrix $P={\rm Diag}(\pi)$ and I have written $z=W_1x$; with your $z=\mathbf W_1^\mathsf T\mathbf x$ the same steps go through and the result below simply picks up a transpose.
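
As a quick numerical sanity check of this chain of differentials (a NumPy sketch, not part of the derivation; sizes and values are arbitrary and it uses the $z=W_1x$ convention above), the predicted first-order change should match the actual change in the loss for a small perturbation $dW_1$:

```python
import numpy as np

# Check that dL ≈ (yhat - y) w2^T (P - P^2) dW1 x for a small dW1.
rng = np.random.default_rng(1)
n, m = 4, 3
x, w2, y = rng.normal(size=n), rng.normal(size=m), 1.0
W1 = rng.normal(size=(m, n))                           # convention z = W1 x
sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))
loss = lambda W: 0.5 * (w2 @ sigmoid(W @ x) - y) ** 2

p    = sigmoid(W1 @ x)
yhat = w2 @ p
dW1  = 1e-6 * rng.normal(size=(m, n))                  # small perturbation

# w2^T (P - P^2) is just the vector w2 * p * (1 - p)
dL_pred = (yhat - y) * (w2 * p * (1 - p)) @ dW1 @ x
dL_true = loss(W1 + dW1) - loss(W1)
print(dL_pred, dL_true)                                # agree to first order in dW1
```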

Since a scalar expression is equal to its trace, i.e., $$y^TAx = {\rm Tr}(y^TAx) = {\rm Tr}(xy^TA)$$ we can rewrite the differential as $$\eqalign{ d{\cal L}={\rm Tr}\Big(x(\hat y-y)^TW_2^T(P-P^2)\,dW_1\Big) }$$ from which the gradient can be identified as $$\eqalign{ \frac{\partial{\cal L}}{\partial W_1} &= (P-P^2)W_2(\hat y-y)x^T \\ }$$ or the transpose of this, depending on your preferred layout convention.
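
If you want to convince yourself of the result, here is a NumPy sketch (again with arbitrary sizes and the $z=W_1x$ convention) that compares the closed-form gradient against a central finite-difference gradient, entry by entry:

```python
import numpy as np

# Compare (P - P^2) W2 (yhat - y) x^T against a finite-difference gradient.
rng = np.random.default_rng(2)
n, m = 4, 3
x, w2, y = rng.normal(size=n), rng.normal(size=m), 1.0
W1 = rng.normal(size=(m, n))                       # convention z = W1 x
sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))
loss = lambda W: 0.5 * (w2 @ sigmoid(W @ x) - y) ** 2

p    = sigmoid(W1 @ x)
yhat = w2 @ p
grad = np.outer(p * (1 - p) * w2 * (yhat - y), x)  # (P - P^2) w2 (yhat - y) x^T

num, eps = np.zeros_like(W1), 1e-6
for i in range(m):
    for j in range(n):
        E = np.zeros_like(W1); E[i, j] = eps
        num[i, j] = (loss(W1 + E) - loss(W1 - E)) / (2 * eps)

print(np.allclose(grad, num))                      # True
```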

As you've discovered, the difficulty with using the chain rule in matrix calculus is that it often requires calculating intermediate quantities that are higher-order tensors, e.g. matrix-by-matrix, matrix-by-vector, or vector-by-matrix derivatives.

The differential approach is simpler because the differential of a matrix behaves like an ordinary matrix. In particular, it obeys all of the rules of matrix algebra.
