Solved – Deriving gradient of a single layer neural network w.r.t its inputs, what is the operator in the chain rule

gradient, neural networks

Problem is:

Derive the gradient with respect to the input layer for a single
hidden layer neural network using sigmoid for input -> hidden, softmax
for hidden -> output, with a cross entropy loss.

I can get through most of the derivation using the chain rule but I am uncertain on how to actually "chain" them together.

Define some notation:

$ r = xW_1+b_1 $

$ h = \sigma\left( r \right) $, $\sigma$ is sigmoid function

$ \theta = hW_2+b_2 $,

$ \hat{y} = S \left( \theta \right) $, $S$ is softmax function

$ J\left(\hat{y}\right) = -\sum_i y_i \log\hat{y}_i $, where $\boldsymbol{y}$ is the true label as a one-hot vector
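
For concreteness, here is a minimal NumPy sketch of this forward pass with row vectors; the dimensions, variable names, and random parameters are made up purely for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())            # shift for numerical stability
    return e / e.sum()

# Toy dimensions and random parameters, purely for illustration
rng = np.random.default_rng(0)
D_x, D_h, D_y = 4, 5, 3
x = rng.normal(size=(1, D_x))          # row-vector input
W1, b1 = rng.normal(size=(D_x, D_h)), np.zeros((1, D_h))
W2, b2 = rng.normal(size=(D_h, D_y)), np.zeros((1, D_y))
y = np.eye(D_y)[[1]]                   # one-hot true label (1 x D_y)

r     = x @ W1 + b1                    # r = x W1 + b1
h     = sigmoid(r)                     # h = sigma(r)
theta = h @ W2 + b2                    # theta = h W2 + b2
y_hat = softmax(theta)                 # y_hat = S(theta)
J     = -np.sum(y * np.log(y_hat))     # cross-entropy loss
```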

Then by the chain rule,

$$
\frac{\partial J}{\partial \boldsymbol{x}}
= \frac{\partial J}{\partial \boldsymbol{\theta}} \cdot \frac{\partial \boldsymbol{\theta}}{\partial \boldsymbol{h}} \cdot \frac{\partial \boldsymbol{h}}{\partial \boldsymbol{r}} \cdot \frac{\partial \boldsymbol{r}}{\partial \boldsymbol{x}}
$$

Individual gradients are:

$$
\frac{\partial J}{\partial \boldsymbol{\theta}} = \left( \hat{\boldsymbol{y}} - \boldsymbol{y} \right)
$$
$$
\frac{\partial \boldsymbol{\theta}}{\partial \boldsymbol{h}} = \frac{\partial}{\partial \boldsymbol{h}} \left[ \boldsymbol{h}W_2 + \boldsymbol{b_2}\right] = W_2^T
$$
$$
\frac{\partial \boldsymbol{h}}{\partial \boldsymbol{r}} = h \cdot \left(1-h\right)
$$
$$
\frac{\partial \boldsymbol{r}}{\partial \boldsymbol{x}} = \frac{\partial}{\partial \boldsymbol{x}} \left[ \boldsymbol{x}W_1 + \boldsymbol{b_1}\right] = W_1^T
$$
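
(The first of these can be checked using the standard softmax Jacobian $\frac{\partial \hat{y}_i}{\partial \theta_k} = \hat{y}_i\left(\delta_{ik} - \hat{y}_k\right)$ together with $\sum_i y_i = 1$ for a one-hot label:

$$
\frac{\partial J}{\partial \theta_k}
= \sum_i \frac{\partial J}{\partial \hat{y}_i}\,\frac{\partial \hat{y}_i}{\partial \theta_k}
= \sum_i \left(-\frac{y_i}{\hat{y}_i}\right)\hat{y}_i\left(\delta_{ik}-\hat{y}_k\right)
= \hat{y}_k\sum_i y_i - y_k
= \hat{y}_k - y_k.)
$$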

Now we have to chain these pieces together. In the single-variable case this is easy: we just multiply everything together. With vectors, I'm not sure whether each step should be an element-wise multiplication or a matrix multiplication.

$$
\frac{\partial J}{\partial \boldsymbol{x}}
= \left( \hat{\boldsymbol{y}} - \boldsymbol{y} \right) * W_2^T \cdot \left[\boldsymbol{h} \cdot \left(1-\boldsymbol{h}\right)\right] * W_1^T
$$

Where $\cdot$ is element-wise multiplication of vectors, and $*$ is a matrix multiply. This combination of operations is the only way I could seem to string these together to get a $1 \times D_x$ dimensional vector, which I know $\frac{\partial J}{\partial \boldsymbol{x}}$ has to be.
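
Numerically, this combination does seem to check out against finite differences; continuing the NumPy sketch above (same made-up variables):

```python
def loss(x_in):
    h_ = sigmoid(x_in @ W1 + b1)
    return -np.sum(y * np.log(softmax(h_ @ W2 + b2)))

# Analytical gradient: matrix-multiply by W2^T, element-wise by h(1-h), matrix-multiply by W1^T
grad_x = ((y_hat - y) @ W2.T * (h * (1 - h))) @ W1.T   # shape (1, D_x)

# Central finite differences as a sanity check
eps = 1e-6
num = np.zeros_like(x)
for i in range(D_x):
    e = np.zeros_like(x)
    e[0, i] = eps
    num[0, i] = (loss(x + e) - loss(x - e)) / (2 * eps)

print(np.allclose(grad_x, num, atol=1e-6))   # True
```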

My question is: what is the principled way for me to figure out which operator to use? I'm specifically confused by the need for the element-wise one between $W_2^T$ and $h$.

Thanks!

Best Answer

I believe the key to answering this question is to point out that the element-wise multiplication is really just shorthand, so when you derive the equations you never actually use it.

The actual operation is not an element-wise multiplication but instead a standard matrix multiplication of a gradient with a Jacobian, always.

In the case of the nonlinearity, the Jacobian of its vector output with respect to its vector input happens to be a diagonal matrix. Multiplying the gradient by this matrix is therefore equivalent to element-wise multiplying the gradient of the loss with respect to the output of the nonlinearity by a vector containing all the partial derivatives of the nonlinearity with respect to its input, but this equivalence only holds because the Jacobian is diagonal. You must pass through the Jacobian step to get to the element-wise multiplication, which might explain your confusion.
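
A quick numerical illustration of that point, before the formal version (a NumPy sketch with made-up numbers):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

r = np.array([0.3, -1.2, 2.0])           # input to the nonlinearity (made up)
h = sigmoid(r)

# Jacobian dh_i/dr_j of the element-wise sigmoid: zero off the diagonal
jacobian = np.diag(h * (1 - h))

upstream = np.array([0.5, -0.1, 0.9])    # some upstream gradient dJ/dh (made up)

via_jacobian    = upstream @ jacobian        # the "real" operation: gradient times Jacobian
via_elementwise = upstream * h * (1 - h)     # the element-wise shorthand

print(np.allclose(via_jacobian, via_elementwise))   # True
```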

In math, we have some nonlinearity $s$, a loss $L$, and an input to the nonlinearity $x \in \mathbb{R}^{n \times 1}$ (this could be any tensor). The output of the nonlinearity has the same dimension, $s(x) \in \mathbb{R}^{n \times 1}$, since, as @Logan says, activation functions are applied element-wise.

We want $$\nabla_{x}L=\left({\dfrac{\partial s(x)}{\partial x}}\right)^T\nabla_{s(x)}L$$

Where $\dfrac{\partial s(x)}{\partial x}$ is the Jacobian of $s$. Expanding this Jacobian, we get
$$
\begin{bmatrix}
\dfrac{\partial{s(x_{1})}}{\partial{x_{1}}} & \dots & \dfrac{\partial{s(x_{1})}}{\partial{x_{n}}} \\
\vdots & \ddots & \vdots \\
\dfrac{\partial{s(x_{n})}}{\partial{x_{1}}} & \dots & \dfrac{\partial{s(x_{n})}}{\partial{x_{n}}}
\end{bmatrix}
$$

We see that it is zero everywhere except on the diagonal. We can make a vector of all its diagonal elements $$Diag\left(\dfrac{\partial s(x)}{\partial x}\right)$$

And then use the element-wise operator.

$$\nabla_{x}L =\left({\dfrac{\partial s(x)}{\partial x}}\right)^T\nabla_{s(x)}L =Diag\left(\dfrac{\partial s(x)}{\partial x}\right) \circ \nabla_{s(x)}L$$
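
For the network in the question, $s = \sigma$, so this diagonal vector is exactly the familiar sigmoid-derivative term:

$$
Diag\left(\dfrac{\partial \sigma(\boldsymbol{r})}{\partial \boldsymbol{r}}\right) = \sigma(\boldsymbol{r}) \circ \left(1 - \sigma(\boldsymbol{r})\right) = \boldsymbol{h} \circ \left(1 - \boldsymbol{h}\right)
$$

which is why the element-wise factor $\boldsymbol{h}\circ(1-\boldsymbol{h})$ shows up between the two matrix multiplications in your expression for $\frac{\partial J}{\partial \boldsymbol{x}}$.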
