As for the confusing part: the derivative of the cross-entropy loss with respect to the softmax input (the raw scores $s_i$) is simply
$$\frac{\partial L}{\partial s_i} = t_i - y_i$$
where $t_i = \operatorname{softmax}(s_i)$ is the predicted output. Now, in this case $t_i, y_i \in \Re^3$, but $y_i$ has to be in one-hot encoding, which looks like this
$$y_i = (0, \dots, \overset{\text{k'th}}{1}, \dots, 0)$$
So, for example, class 2 is represented as $(0, 1, 0)$
In your code, Y.index holds the position of that $1$ for each example. So the line
dscores[Y.index] <- dscores[Y.index] - 1
is a short and clever way of applying the derivative to the whole dataset, i.e.
$$t_i - y_i \;\equiv\; t_i[k] \leftarrow t_i[k] - 1, \quad \text{ for } y_i = (0, \dots, \overset{k\text{-th}}{1}, \dots, 0)$$
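For concreteness, a minimal sketch of that step in R. The names are assumptions mirroring the snippet: scores is the [N x 3] matrix of raw class scores for the batch, and Y.index is a two-column (row, true-class) index matrix built from the labels.
probs <- exp(scores) / rowSums(exp(scores))  # softmax row-wise: t_i for every example
dscores <- probs                             # start from t_i
dscores[Y.index] <- dscores[Y.index] - 1     # subtract 1 at each true class -> t_i - y_i
(A numerically safer variant subtracts each row's maximum score before exponentiating.)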
Backpropagation derivatives
Notation:
- $i^k$ - input vector of $k$th layer
- $o^k$ - output vector of $k$th layer
- $W^k$ - transition matrix of $k$th layer
- $b^k$ - biases of $k$th layer
- $X = o^0$ - network input
- $T$ - predicted output
- $T = o^n$ assuming we have $n$ layers
We have a transition between layers
$$i^{k} = W^ko^{k-1} + b^k$$
and $\text{ReLU}$ as the activation function
$$o^k = \max(0, i^k)$$
where we assume element-wise application.
Now the important part
$$\frac{\partial L}{\partial o^{k-1}} = (W^k)^T\frac{\partial L}{\partial i^k} \ \ \ \ \text{ (1)}$$
$$\frac{\partial L}{\partial i^k} = \frac{\partial L}{\partial o^k} \circ I[o^k > 0] \ \ \ \ \text{ (2)}$$
where $\circ$ denotes the Hadamard (element-wise) product and $I[o^k > 0]$ is the element-wise indicator ($1$ where $o^k > 0$, $0$ elsewhere)
$$\frac{\partial L}{\partial W^k} = \frac{\partial L}{\partial i^k}(o^{k-1})^T$$
$$\frac{\partial L}{\partial b^k} = \frac{\partial L}{\partial i^k} \ \ \ \ \text{ (3)}$$
The last one holds since $\frac{\partial i^k}{\partial b^k} = I$ (the identity matrix) and we apply the chain rule.
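For a single example these formulas translate almost literally into R. A minimal sketch, with hypothetical objects in the column convention used above: dL_do_k is the incoming gradient, o_k the layer output, W_k the transition matrix and o_km1 the previous layer's output.
dL_di_k   <- dL_do_k * (o_k > 0)      # (2): Hadamard product with the ReLU mask
dL_do_km1 <- t(W_k) %*% dL_di_k       # (1): propagate to the previous layer
dL_dW_k   <- dL_di_k %*% t(o_km1)     # weight gradient, the outer product above
dL_db_k   <- dL_di_k                  # (3): bias gradient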
You can check that the above equations still hold when $i, o$ are matrices, i.e. when we process a whole batch at once. We can also compute the weight gradient for a whole batch as follows
\begin{align}\frac{\partial \sum_{l=1}^d L_{(l)}}{\partial W^k}
&= \sum_{l=1}^d \frac{\partial L_{(l)}}{\partial W^k} \\
&= \sum_{l=1}^d \frac{\partial L_{(l)}}{\partial i_{(l)}^k}\left(o_{(l)}^{k-1}\right)^T \\
&= \begin{pmatrix}
\frac{\partial L_{(1)}}{\partial i_{(1)}^k} &
\dots &
\frac{\partial L_{(d)}}{\partial i_{(d)}^k}
\end{pmatrix}
\begin{pmatrix}
\left(o_{(1)}^{k-1}\right)^T \\
\vdots \\
\left(o_{(d)}^{k-1}\right)^T
\end{pmatrix} \\
&= \frac{\partial \sum_{l=1}^d L_{(l)}}{\partial I^k} \left(O^{k-1}\right)^T
\ \ \ \ \text{ (4)}
\end{align}
where the columns of $I^k$ and $O^{k-1}$ are the per-example vectors $i_{(l)}^k$ and $o_{(l)}^{k-1}$.
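A quick numerical sanity check of (4) in R (hypothetical small shapes; the columns of dI and O play the roles of $\frac{\partial L_{(l)}}{\partial i_{(l)}^k}$ and $o_{(l)}^{k-1}$):
set.seed(1)
d  <- 5                                # batch size
dI <- matrix(rnorm(3 * d), nrow = 3)   # one column per example: dL_(l)/di^k
O  <- matrix(rnorm(4 * d), nrow = 4)   # one column per example: o_(l)^(k-1)
loop.sum <- Reduce(`+`, lapply(1:d, function(l) dI[, l] %*% t(O[, l])))  # sum of outer products
batched  <- dI %*% t(O)                # right-hand side of (4)
all.equal(loop.sum, batched)           # TRUE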
dscores <- dscores / batchsize
Since we average the loss over the whole batch, and every later step of the backward pass is linear in the incoming gradient (the ReLU derivative is just a constant 0/1 mask), we can divide $T - Y$ by the batch size right at the start
dW2 <- t(hidden.layer) %*% dscores # [6 x 3] matrix
Using (4)
db2 <- colSums(dscores) # a vector of 3 numbers (sums for each species)
Applying (3) to the whole batch: the per-example bias gradients are simply summed, which is what colSums does
dhidden <- dscores %*% t(W2) # [90 x 6] matrix
# W2 is a [6 x 3] matrix of weights.
# dhidden is [90 x 6]
Applying (1)
dhidden[hidden.layer <= 0] <- 0 # zero the gradient wherever the ReLU was inactive
Using (2)
dW1 <- t(X) %*% dhidden # X is the input matrix [90 x 4]
db1 <- colSums(dhidden) # 6-element vector (sum columns across examples)
Again applying (4) and (3) to the whole batch, this time for layer 1
# update ....
dW2 <- dW2 + reg * W2 # reg is regularization rate reg = 1e-3
dW1 <- dW1 + reg * W1
W1 <- W1 - lr * dW1 # lr is the learning rate lr = 1e-2
b1 <- b1 - lr * db1 # b1 is the first bias
W2 <- W2 - lr * dW2
b2 <- b2 - lr * db2 # b2 is the second bias
Updates for a whole batch
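Putting the snippets above together, one full training step might look like the sketch below. The forward-pass lines are an assumption (they are not shown here); shapes follow the walkthrough: X is [90 x 4], W1 is [4 x 6], W2 is [6 x 3], and the biases are added row-wise with sweep.
# forward pass (assumed)
hidden.pre   <- sweep(X %*% W1, 2, b1, `+`)             # i^1, [90 x 6]
hidden.layer <- pmax(hidden.pre, 0)                     # o^1 = ReLU(i^1)
scores       <- sweep(hidden.layer %*% W2, 2, b2, `+`)  # raw class scores, [90 x 3]
probs        <- exp(scores) / rowSums(exp(scores))      # softmax output T
# backward pass, exactly as derived above
dscores <- probs
dscores[Y.index] <- dscores[Y.index] - 1
dscores <- dscores / batchsize
dW2 <- t(hidden.layer) %*% dscores   # (4), layer 2
db2 <- colSums(dscores)              # (3), summed over the batch
dhidden <- dscores %*% t(W2)         # (1)
dhidden[hidden.layer <= 0] <- 0      # (2), ReLU mask
dW1 <- t(X) %*% dhidden              # (4), layer 1
db1 <- colSums(dhidden)              # (3), summed over the batch
# regularization and parameter update
dW2 <- dW2 + reg * W2
dW1 <- dW1 + reg * W1
W1 <- W1 - lr * dW1;  b1 <- b1 - lr * db1
W2 <- W2 - lr * dW2;  b2 <- b2 - lr * db2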
Keep in mind why $\Theta_3$ is $4 \times 6$ rather than $4 \times 5$, even though the third layer has only $5$ nodes. It's because each node in the output layer takes the $5$ nodes plus an intercept as input. Remember that $\delta^{(3)}$ is the derivative of the error function with respect to each node in the third layer, prior to activation. One of your six $\delta^{(3)}$ components is the derivative with respect to the intercept, which has no dependence on any earlier part of the network and thus has no further "backpropagating" to do. It's not even a relevant value to the calculation, because all you want is the derivative with respect to the weights that travel from the intercept to the outputs.
(I know it doesn't make sense to take a derivative with respect to a constant. However, what we're doing is treating the intercept as if it was an extra variable that always happens to have an observed value of 1. It's done that way for convenience, so we can place its weights in the same matrix as the other weights, rather than considering it separately.)
Thus in the second calculation you matrix-multiply $\Theta_2^T$ with the five $\delta^{(3)}$ components that you care about: the ones corresponding to the actual nodes, which receive weighted inputs from earlier in the network.
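In symbols, the dimension bookkeeping is roughly (a sketch; the exact indexing and the omitted activation-derivative factor are assumptions, since the question's full setup is not reproduced here)
$$\underbrace{\Theta_3^{T}}_{6 \times 4}\,\underbrace{\delta^{(4)}}_{4 \times 1} \in \mathbb{R}^{6} \qquad\text{but only}\qquad \Theta_2^{T}\,\underbrace{\delta^{(3)}_{\text{nodes}}}_{5 \times 1},$$
where $\delta^{(3)}_{\text{nodes}}$ is $\delta^{(3)}$ with the intercept entry removed.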
Best Answer
Here the derivatives are taken with respect to vectors, so the rules of differentiation look slightly different. My suggestion: when in doubt, perform the derivation for each index of the vector and then extend to the general case. For example, to answer the second question you can find the derivative for each index, e.g. $\frac{\partial E} {\partial W^{j}_2}$, and then generalize; similarly for $\frac{\partial E} {\partial W^{j}_3}$.
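As a worked instance of that index-wise approach, in the notation of the derivation above (rather than the question's $W_2$, $W_3$): since $i^k_a = \sum_b W^k_{ab}\, o^{k-1}_b + b^k_a$, the chain rule for a single entry gives
$$\frac{\partial L}{\partial W^k_{ab}} = \sum_c \frac{\partial L}{\partial i^k_c}\,\frac{\partial i^k_c}{\partial W^k_{ab}} = \frac{\partial L}{\partial i^k_a}\, o^{k-1}_b,$$
and collecting these entries over all $(a, b)$ reproduces the matrix form $\frac{\partial L}{\partial W^k} = \frac{\partial L}{\partial i^k}(o^{k-1})^T$ stated earlier.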