As for the confusing part: the derivative of the cross-entropy loss with respect to the softmax input (the raw scores $s_i$) is simply
$$\frac{\partial L}{\partial s_i} = t_i - y_i$$
where $t_i = \operatorname{softmax}(s_i)$ is the predicted output. Now, in this case $t_i, y_i \in \Re^3$, but $y_i$ has to be in one-hot encoding, which looks like this
$$y_i = (0, \dots, \overset{\text{k'th}}{1}, \dots, 0)$$
So, for example, class 2 is represented as $(0, 1, 0)$
In your code, Y.index holds the position of that $1$ for each example. So the line
dscores[Y.index] <- dscores[Y.index] - 1
is a short and clever way of applying the derivative to the whole dataset, i.e.
$$t_i - y_i \;\equiv\; t_i[k] \leftarrow t_i[k] - 1, \quad \text{ for } y_i = (0, \dots, \overset{k\text{-th}}{1}, \dots, 0)$$
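For concreteness, a minimal sketch of that step in R. The names are assumptions mirroring the snippet: scores is the [N x 3] matrix of raw class scores for the batch, and Y.index is a two-column (row, true-class) index matrix built from the labels.
probs <- exp(scores) / rowSums(exp(scores))  # softmax row-wise: t_i for every example
dscores <- probs                             # start from t_i
dscores[Y.index] <- dscores[Y.index] - 1     # subtract 1 at each true class -> t_i - y_i
(A numerically safer variant subtracts each row's maximum score before exponentiating.)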
Backpropagation derivatives
Notation:
- $i^k$ - input vector of $k$th layer
- $o^k$ - output vector of $k$th layer
- $W^k$ - transition matrix of $k$th layer
- $b^k$ - biases of $k$th layer
- $X = o^0$ - network input
- $T$ - predicted output
- $T = o^n$ assuming we have $n$ layers
We have a transition between layers
$$i^{k} = W^ko^{k-1} + b^k$$
and $\text{ReLU}$ as the activation function
$$o^k = \max(0, i^k)$$
where we assume element-wise application.
Now the important part
$$\frac{\partial L}{\partial o^{k-1}} = (W^k)^T\frac{\partial L}{\partial i^k} \ \ \ \ \text{ (1)}$$
$$\frac{\partial L}{\partial i^k} = \frac{\partial L}{\partial o^k} \circ I[o^k > 0] \ \ \ \ \text{ (2)}$$
where $\circ$ denotes the Hadamard (element-wise) product and $I[o^k > 0]$ is the element-wise indicator ($1$ where $o^k > 0$, $0$ elsewhere)
$$\frac{\partial L}{\partial W^k} = \frac{\partial L}{\partial i^k}(o^{k-1})^T$$
$$\frac{\partial L}{\partial b^k} = \frac{\partial L}{\partial i^k} \ \ \ \ \text{ (3)}$$
The last one holds since $\frac{\partial i^k}{\partial b^k} = I$ (the identity matrix) and we apply the chain rule.
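For a single example these formulas translate almost literally into R. A minimal sketch, with hypothetical objects in the column convention used above: dL_do_k is the incoming gradient, o_k the layer output, W_k the transition matrix and o_km1 the previous layer's output.
dL_di_k   <- dL_do_k * (o_k > 0)      # (2): Hadamard product with the ReLU mask
dL_do_km1 <- t(W_k) %*% dL_di_k       # (1): propagate to the previous layer
dL_dW_k   <- dL_di_k %*% t(o_km1)     # weight gradient, the outer product above
dL_db_k   <- dL_di_k                  # (3): bias gradient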
You can check that the above equations still hold when $i, o$ are matrices, i.e. when we process a whole batch at once. We can also compute the weight gradient for a whole batch as follows
\begin{align}\frac{\partial \sum_{l=1}^d L_{(l)}}{\partial W^k}
&= \sum_{l=1}^d \frac{\partial L_{(l)}}{\partial W^k} \\
&= \sum_{l=1}^d \frac{\partial L_{(l)}}{\partial i_{(l)}^k}\left(o_{(l)}^{k-1}\right)^T \\
&= \begin{pmatrix}
\frac{\partial L_{(1)}}{\partial i_{(1)}^k} &
\dots &
\frac{\partial L_{(d)}}{\partial i_{(d)}^k}
\end{pmatrix}
\begin{pmatrix}
\left(o_{(1)}^{k-1}\right)^T \\
\vdots \\
\left(o_{(d)}^{k-1}\right)^T
\end{pmatrix} \\
&= \frac{\partial \sum_{l=1}^d L_{(l)}}{\partial I^k} \left(O^{k-1}\right)^T
\ \ \ \ \text{ (4)}
\end{align}
where the columns of $I^k$ and $O^{k-1}$ are the per-example vectors $i_{(l)}^k$ and $o_{(l)}^{k-1}$.
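A quick numerical sanity check of (4) in R (hypothetical small shapes; the columns of dI and O play the roles of $\frac{\partial L_{(l)}}{\partial i_{(l)}^k}$ and $o_{(l)}^{k-1}$):
set.seed(1)
d  <- 5                                # batch size
dI <- matrix(rnorm(3 * d), nrow = 3)   # one column per example: dL_(l)/di^k
O  <- matrix(rnorm(4 * d), nrow = 4)   # one column per example: o_(l)^(k-1)
loop.sum <- Reduce(`+`, lapply(1:d, function(l) dI[, l] %*% t(O[, l])))  # sum of outer products
batched  <- dI %*% t(O)                # right-hand side of (4)
all.equal(loop.sum, batched)           # TRUE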
dscores <- dscores / batchsize
Since we average the loss over the whole batch, and every later step of the backward pass is linear in the incoming gradient (the ReLU derivative is just a constant 0/1 mask), we can divide $T - Y$ by the batch size right at the start
dW2 <- t(hidden.layer) %*% dscores # [6 x 3] matrix
Using (4)
db2 <- colSums(dscores) # a vector of 3 numbers (sums for each species)
Applying (3) to the whole batch: the per-example bias gradients are simply summed, which is what colSums does
dhidden <- dscores %*% t(W2) # [90 x 6] matrix
# W2 is a [6 x 3] matrix of weights.
# dhidden is [90 x 6]
Applying (1)
dhidden[hidden.layer <= 0] <- 0 # zero the gradient wherever the ReLU was inactive
Using (2)
dW1 <- t(X) %*% dhidden # X is the input matrix [90 x 4]
db1 <- colSums(dhidden) # 6-element vector (sum columns across examples)
Again applying (4) and (3) to the whole batch, this time for layer 1
# update ....
dW2 <- dW2 + reg * W2 # reg is regularization rate reg = 1e-3
dW1 <- dW1 + reg * W1
W1 <- W1 - lr * dW1 # lr is the learning rate lr = 1e-2
b1 <- b1 - lr * db1 # b1 is the first bias
W2 <- W2 - lr * dW2
b2 <- b2 - lr * db2 # b2 is the second bias
Updates for a whole batch
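Putting the snippets above together, one full training step might look like the sketch below. The forward-pass lines are an assumption (they are not shown here); shapes follow the walkthrough: X is [90 x 4], W1 is [4 x 6], W2 is [6 x 3], and the biases are added row-wise with sweep.
# forward pass (assumed)
hidden.pre   <- sweep(X %*% W1, 2, b1, `+`)             # i^1, [90 x 6]
hidden.layer <- pmax(hidden.pre, 0)                     # o^1 = ReLU(i^1)
scores       <- sweep(hidden.layer %*% W2, 2, b2, `+`)  # raw class scores, [90 x 3]
probs        <- exp(scores) / rowSums(exp(scores))      # softmax output T
# backward pass, exactly as derived above
dscores <- probs
dscores[Y.index] <- dscores[Y.index] - 1
dscores <- dscores / batchsize
dW2 <- t(hidden.layer) %*% dscores   # (4), layer 2
db2 <- colSums(dscores)              # (3), summed over the batch
dhidden <- dscores %*% t(W2)         # (1)
dhidden[hidden.layer <= 0] <- 0      # (2), ReLU mask
dW1 <- t(X) %*% dhidden              # (4), layer 1
db1 <- colSums(dhidden)              # (3), summed over the batch
# regularization and parameter update
dW2 <- dW2 + reg * W2
dW1 <- dW1 + reg * W1
W1 <- W1 - lr * dW1;  b1 <- b1 - lr * db1
W2 <- W2 - lr * dW2;  b2 <- b2 - lr * db2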
Keep in mind why $\Theta_3$ is $4 \times 6$ rather than $4 \times 5$, even though the third layer has only $5$ nodes. It's because each node in the output layer takes the $5$ nodes plus an intercept as input. Remember that $\delta^{(3)}$ is the derivative of the error function with respect to each node in the third layer, prior to activation. One of your six $\delta^{(3)}$ components is the derivative with respect to the intercept, which has no dependence on any earlier part of the network and thus has no further "backpropagating" to do. It's not even a relevant value to the calculation, because all you want is the derivative with respect to the weights that travel from the intercept to the outputs.
(I know it doesn't make sense to take a derivative with respect to a constant. However, what we're doing is treating the intercept as if it was an extra variable that always happens to have an observed value of 1. It's done that way for convenience, so we can place its weights in the same matrix as the other weights, rather than considering it separately.)
Thus in the second calculation you matrix-multiply $\Theta_2^T$ with the five $\delta^{(3)}$ components that you care about: the ones corresponding to the actual nodes, which receive weighted inputs from earlier in the network.
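In symbols, the dimension bookkeeping is roughly (a sketch; the exact indexing and the omitted activation-derivative factor are assumptions, since the question's full setup is not reproduced here)
$$\underbrace{\Theta_3^{T}}_{6 \times 4}\,\underbrace{\delta^{(4)}}_{4 \times 1} \in \mathbb{R}^{6} \qquad\text{but only}\qquad \Theta_2^{T}\,\underbrace{\delta^{(3)}_{\text{nodes}}}_{5 \times 1},$$
where $\delta^{(3)}_{\text{nodes}}$ is $\delta^{(3)}$ with the intercept entry removed.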
Best Answer
Here the derivatives are taken with respect to vectors, so the rules of differentiation look slightly different. My suggestion: when in doubt, perform the derivation for each index of the vector and then extend to the general case. For example, to answer the second question you can find the derivative for each index, e.g. $\frac{\partial E} {\partial W^{j}_2}$, and then generalize; similarly for $\frac{\partial E} {\partial W^{j}_3}$.
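As a worked instance of that index-wise approach, in the notation of the derivation above (rather than the question's $W_2$, $W_3$): since $i^k_a = \sum_b W^k_{ab}\, o^{k-1}_b + b^k_a$, the chain rule for a single entry gives
$$\frac{\partial L}{\partial W^k_{ab}} = \sum_c \frac{\partial L}{\partial i^k_c}\,\frac{\partial i^k_c}{\partial W^k_{ab}} = \frac{\partial L}{\partial i^k_a}\, o^{k-1}_b,$$
and collecting these entries over all $(a, b)$ reproduces the matrix form $\frac{\partial L}{\partial W^k} = \frac{\partial L}{\partial i^k}(o^{k-1})^T$ stated earlier.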