Solved – Backpropagation proof and usage confusion

backpropagation, machine-learning, mathematical-statistics, neural-networks

I've been taking Andrew Ng's course on Coursera, and although it has been great so far, I loathe his lack of supplementary documents on proofs. Thankfully, I found a pretty good article on the backpropagation proof, namely this one by Sudeep Raja.

There are some notation differences and even some formula differences between the two that are really throwing me off, and I would like more insight. Here is the slide specific to Andrew's lecture.

For notation's sake, assume there are three layers of weights, $\Theta^1, \Theta^2, \Theta^3$. $a^1=x$ is the input, $a^2$ is the output of layer 1, and $a^4$ is the output of layer 3 and the final output of the network.

$\Delta^1$ is $\partial J/\partial\Theta^1$, $\Delta^2$ is $\partial J/\partial\Theta^2$, …

Note: This notation is identical to Andrew's but different from Sudeep's.
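To keep the notation straight, here is a minimal NumPy sketch of the forward pass under the setup above (the sigmoid activation, the absence of bias terms, and the layer sizes are my own placeholder assumptions, not Andrew's exact network):

```python
import numpy as np

def g(z):
    # Sigmoid activation, the g(.) in the formulas below
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)

# Hypothetical layer sizes: 3 inputs -> 5 -> 4 -> 2 outputs
Theta1 = rng.standard_normal((5, 3))
Theta2 = rng.standard_normal((4, 5))
Theta3 = rng.standard_normal((2, 4))

x = rng.standard_normal((3, 1))    # a^1 = x, the input (bias terms omitted)

a1 = x
z2 = Theta1 @ a1; a2 = g(z2)       # a^2: output of layer 1
z3 = Theta2 @ a2; a3 = g(z3)       # a^3: output of layer 2
z4 = Theta3 @ a3; a4 = g(z4)       # a^4: output of layer 3, the network's final output
```

With these shapes, $\Theta^l$ maps $a^l$ to $z^{l+1}$, which is the convention the backpropagation formulas below rely on.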

Andrew has:

  • $\delta^4 = a^4-\vec y$,
  • $\delta^3=(\Theta^3)^T\delta^4 .* g'(z^3)$,
  • $\Delta^3=\delta^4(a^3)^T$

Sudeep has:

  • $\delta^4 = (a^4-\vec y) .* g'(z^4)$,
  • $\delta^3=(\Theta^3)^T\delta^4 .* g'(z^3)$,
  • $\Delta^3=\delta^4(a^3)^T$

Notice the difference in $\delta^4$: Andrew's version has no $g'(z^4)$ factor. Why? (Update: answered in the edits below.)

Also, in Sudeep's proof, when he's deriving $\partial E/\partial W_2$, he got
$$
\frac{\partial E}{\partial W_2} = \delta_3 \frac{\partial}{\partial W_2}(W_3x_2) = W_3^T\delta_3\frac{\partial}{\partial W_2}x_2
$$

$W_3$ is a constant so it comes out, but why is it transposed? I understand that you have to in order to get the dimensions to line up, but I don't think that's a sufficient argument for why you do it.
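One way to see where the transpose comes from, independent of dimension counting, is to write the chain rule component-wise (a worked step of my own, writing $\delta_3$ for $\partial E/\partial(W_3x_2)$ as the quoted step does). Since $(W_3x_2)_i=\sum_j (W_3)_{ij}(x_2)_j$,
$$
\frac{\partial E}{\partial (x_2)_j} = \sum_i \frac{\partial E}{\partial (W_3x_2)_i}\,\frac{\partial (W_3x_2)_i}{\partial (x_2)_j} = \sum_i (\delta_3)_i\,(W_3)_{ij} = \big(W_3^T\delta_3\big)_j.
$$
Stacking these components back into a vector gives $W_3^T\delta_3$: the transpose is not a dimension-matching trick but the result of summing over the output index $i$ while keeping the input index $j$ free.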

Lastly, in Sudeep's proof, when he's deriving $\partial E/\partial W_3$, he got
$$
\frac{\partial}{\partial W_3}(W_3x_2) = x_2^T
$$
Again, why is $x_2$ transposed?
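The same component-wise argument answers this one (again my own sketch). A single entry $(W_3)_{ij}$ appears only in the $i$-th component of $W_3x_2$, so
$$
\frac{\partial E}{\partial (W_3)_{ij}} = \sum_k (\delta_3)_k\,\frac{\partial (W_3x_2)_k}{\partial (W_3)_{ij}} = (\delta_3)_i\,(x_2)_j,
$$
and the matrix whose $(i,j)$ entry is $(\delta_3)_i(x_2)_j$ is exactly the outer product $\delta_3x_2^T$. The transpose on $x_2$ is what makes the column index of the gradient run over the inputs.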

Edits

After more googling, I've learned that the reason Andrew's solution doesn't have the $g'(z^4)$ factor is that he's using the cross-entropy cost function, explained further here, which leads to the $g'(z^4)$ term being cancelled.
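To make the cancellation explicit, here is the one-line derivation for a single sigmoid output unit with cross-entropy cost (assuming $g$ is the logistic sigmoid, so $g'(z)=g(z)(1-g(z))$; for vectors everything holds elementwise):
$$
J = -\big[y\log a^4 + (1-y)\log(1-a^4)\big],\qquad
\frac{\partial J}{\partial a^4} = \frac{a^4-y}{a^4(1-a^4)},
$$
$$
\delta^4 = \frac{\partial J}{\partial z^4} = \frac{\partial J}{\partial a^4}\,g'(z^4)
= \frac{a^4-y}{a^4(1-a^4)}\cdot a^4(1-a^4) = a^4-y.
$$
With a squared-error cost such as $\tfrac12\|a^4-\vec y\|^2$ there is no such cancellation, which is why Sudeep's $\delta^4$ keeps the $.*g'(z^4)$ factor.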

Best Answer

Here the derivatives are taken with respect to vectors and matrices, so the rules of differentiation look slightly different from the scalar case. My suggestion: when in doubt, perform the derivation for each index and then extend to the general case. For example, to answer the second question you can find the derivative for each entry, e.g. $\frac{\partial E}{\partial (W_2)_{ij}}$, and then generalize. Similarly for $\frac{\partial E}{\partial (W_3)_{ij}}$.
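To illustrate the per-index approach numerically, here is a small NumPy check (my own sketch, with hypothetical layer sizes and a squared-error cost) that compares the analytic gradients $W_3^T\delta_3$ and $\delta_3 x_2^T$ against finite differences computed one entry at a time:

```python
import numpy as np

def g(z):
    # Sigmoid activation
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
W3 = rng.standard_normal((2, 4))   # hypothetical shapes
x2 = rng.standard_normal((4, 1))   # input to the last layer
y = rng.standard_normal((2, 1))    # target

def loss(W3, x2):
    # Squared-error cost on the last layer only: E = 0.5 * ||g(W3 x2) - y||^2
    a = g(W3 @ x2)
    return 0.5 * np.sum((a - y) ** 2)

# Analytic gradients: delta_3 = dE/d(W3 x2) = (a - y) .* g'(z)
z = W3 @ x2
a = g(z)
delta3 = (a - y) * a * (1 - a)
dE_dW3 = delta3 @ x2.T             # the delta_3 x_2^T formula
dE_dx2 = W3.T @ delta3             # the W_3^T delta_3 formula

# Finite-difference check, entry by entry ("perform the derivation for each index")
eps = 1e-6

num_dW3 = np.zeros_like(W3)
for i in range(W3.shape[0]):
    for j in range(W3.shape[1]):
        Wp, Wm = W3.copy(), W3.copy()
        Wp[i, j] += eps
        Wm[i, j] -= eps
        num_dW3[i, j] = (loss(Wp, x2) - loss(Wm, x2)) / (2 * eps)

num_dx2 = np.zeros_like(x2)
for j in range(x2.shape[0]):
    xp, xm = x2.copy(), x2.copy()
    xp[j] += eps
    xm[j] -= eps
    num_dx2[j] = (loss(W3, xp) - loss(W3, xm)) / (2 * eps)

print(np.max(np.abs(dE_dW3 - num_dW3)))   # both differences should be close to zero
print(np.max(np.abs(dE_dx2 - num_dx2)))
```

Both printed values should be tiny (on the order of $10^{-8}$ or smaller), confirming the $\delta_3x_2^T$ and $W_3^T\delta_3$ formulas entry by entry.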