Solved – Gradients of cross-entropy error in neural network

classification, neural networks

I have a neural network with a single hidden layer of logistic units, used for a multi-class classification problem:

\begin{align}
h &= \sigma (W^{(1)} x+b^{(1)}) \\[5pt]
\hat y &= {\rm softmax}(W^{(2)}h + b^{(2)})
\end{align}

and trained using the cross-entropy error:

$$
C(y,\hat y) = -\sum_i y_i \log \hat y_i
$$

I need to find the gradients of the error with respect to the parameters in the first layer, i.e., the layer closest to the input. The output target $y$ is a one-hot representation.
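
To fix notation, here is the forward pass and loss written out (a minimal NumPy sketch of the setup above; the helper names are mine, and I use column vectors, so $x$ is $(d,1)$ and $y$ is a $(k,1)$ one-hot vector):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def softmax(z):
    e = np.exp(z - z.max())              # shift by max(z) for numerical stability
    return e / e.sum()

def forward(x, W1, b1, W2, b2):
    a = W1 @ x + b1                      # first-layer pre-activation
    h = sigmoid(a)                       # hidden logistic units
    z = W2 @ h + b2                      # second-layer pre-activation
    y_hat = softmax(z)                   # predicted class probabilities
    return a, h, z, y_hat

def cross_entropy(y, y_hat):
    # C(y, y_hat) = -sum_i y_i * log(y_hat_i)
    return float(-(y * np.log(y_hat)).sum())
```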

I was given this additional hint:
$$
\frac{\partial C}{\partial z} = \hat y - y
$$
where
$$
z = W^{(2)}h + b^{(2)}
$$
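
(For reference, this hint follows directly from the definitions above: with $\hat y_i = e^{z_i} / \sum_k e^{z_k}$ we have $\frac{\partial \log \hat y_i}{\partial z_j} = \delta_{ij} - \hat y_j$, so
$$
\frac{\partial C}{\partial z_j} = -\sum_i y_i \left(\delta_{ij} - \hat y_j\right) = \hat y_j \sum_i y_i - y_j = \hat y_j - y_j,
$$
using $\sum_i y_i = 1$ for a one-hot $y$.)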

Best Answer

Use the chain rule,

$$ \frac{\partial C}{\partial W_1} = \frac{\partial C}{\partial z} \frac{\partial z}{\partial h} \frac{\partial h}{\partial a} \frac{\partial a}{\partial W_1} , $$ where $W_1, b_1, W_2$ denote $W^{(1)}, b^{(1)}, W^{(2)}$, and
$$ a = W_1 x + b_1 $$
is the pre-activation of the hidden layer, so that $h = \sigma(a)$. Term by term:

$$ \frac{\partial C}{\partial z} = \hat{y}-y, $$

$$ \frac{\partial z}{\partial h} = W_2, $$

$$ \frac{\partial h}{\partial a} = \sigma(a)(1 - \sigma(a)) \quad \text{(elementwise)}, $$

$$ \frac{\partial a}{\partial W_1} = x, $$ or more precisely $\frac{\partial a_i}{\partial (W_1)_{ij}} = x_j$. Collecting the factors with the dimensions made explicit ($\odot$ denotes elementwise multiplication), $$ \frac{\partial C}{\partial W_1} = \left[ W_2^{\top}(\hat{y}-y) \odot \sigma(a) \odot (1 - \sigma(a)) \right] x^{\top} = \left[ W_2^{\top}(\hat{y}-y) \odot h \odot (1 - h) \right] x^{\top}, $$ an outer product of the back-propagated hidden-layer error with the input $x$.
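
The first-layer bias works the same way: $\frac{\partial a}{\partial b_1} = I$, so
$$ \frac{\partial C}{\partial b_1} = W_2^{\top}(\hat{y}-y) \odot h \odot (1 - h). $$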

For more detail, take a look at Section 5.3 of Bishop's Pattern Recognition and Machine Learning.
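
If you want to check the result numerically, here is a minimal NumPy sketch (the shapes, seed, and helper names are arbitrary choices of mine) comparing the formula above against central finite differences:

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, k = 4, 5, 3                                  # input, hidden, output sizes (arbitrary)

W1, b1 = rng.normal(size=(m, d)), rng.normal(size=(m, 1))
W2, b2 = rng.normal(size=(k, m)), rng.normal(size=(k, 1))
x = rng.normal(size=(d, 1))
y = np.zeros((k, 1)); y[1] = 1.0                   # one-hot target

sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
softmax = lambda z: np.exp(z - z.max()) / np.exp(z - z.max()).sum()

def loss(W1_):
    """Cross-entropy as a function of the first-layer weights only."""
    h = sigmoid(W1_ @ x + b1)
    y_hat = softmax(W2 @ h + b2)
    return float(-(y * np.log(y_hat)).sum())

# Analytic gradient from the chain rule above.
h = sigmoid(W1 @ x + b1)
y_hat = softmax(W2 @ h + b2)
delta = (W2.T @ (y_hat - y)) * h * (1 - h)         # back-propagated error at the hidden layer
dW1 = delta @ x.T                                  # dC/dW1
db1 = delta                                        # dC/db1

# Central finite differences over every entry of W1.
eps = 1e-6
numeric = np.zeros_like(W1)
for i in range(m):
    for j in range(d):
        Wp, Wm = W1.copy(), W1.copy()
        Wp[i, j] += eps
        Wm[i, j] -= eps
        numeric[i, j] = (loss(Wp) - loss(Wm)) / (2 * eps)

print(np.allclose(dW1, numeric, atol=1e-6))        # True if the formula is correct
```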