Solved – Confusion about backpropagation – Matrix dimensions

backpropagation, machine-learning, neural-networks

Following Andrew Ng's notation.

Suppose I wanted to implement a 4-layer neural network with the following weight matrices,

$\Theta_1 \in\mathbb{M}_{5\times 4},\:\Theta_2 \in\mathbb{M}_{5\times 6},\:\Theta_3 \in\mathbb{M}_{4\times 6}$

where the input $x\in\mathbb{R}^3$ and the output $y\in\mathbb{R}^4$.
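For concreteness, here is a minimal NumPy sketch of these shapes (the random weights and one-hot target are placeholders, purely for illustration; the extra column in each $\Theta$ is the bias/intercept column):

```python
import numpy as np

rng = np.random.default_rng(0)

# Layer sizes: 3 inputs, two hidden layers of 5 units each, 4 outputs.
# Each Theta has one extra column for the bias ("intercept") unit.
Theta1 = rng.standard_normal((5, 4))   # layer 1 (3 units + bias) -> layer 2 (5 units)
Theta2 = rng.standard_normal((5, 6))   # layer 2 (5 units + bias) -> layer 3 (5 units)
Theta3 = rng.standard_normal((4, 6))   # layer 3 (5 units + bias) -> layer 4 (4 units)

x = rng.standard_normal(3)             # input in R^3
y = np.array([1.0, 0.0, 0.0, 0.0])     # target in R^4 (one-hot, just for illustration)
```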

My question is about performing backpropagation. One uses the formulae

$\delta^{(L)}=a^{(L)}-y\,,\qquad \delta^{(l)} = \big(\Theta^{(l)}\big)^{T}\delta^{(l+1)}\circ \sigma'\!\left(z^{(l)}\right)$

to calculate the "error" of layer $l$.

Following the formulae, we get

$\delta^{(4)}\in\mathbb{R}^4 , \delta^{(3)} \in\mathbb{R}^6$

So far so good, but as we calculate $\delta^{(2)}$, the dimensions of $\Theta_2$ and $\delta^{(3)}$ don't match, namely

$\Theta_2^T\in\mathbb{M}_{6\times5} , \delta^{(3)}\in\mathbb{R}^6$
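Continuing the sketch above (sigmoid activations assumed), the forward pass and the backward pass as written reproduce the mismatch:

```python
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Forward pass, prepending the bias unit a_0 = 1 at each layer.
a1 = np.concatenate(([1.0], x))             # shape (4,)
z2 = Theta1 @ a1                            # shape (5,)
a2 = np.concatenate(([1.0], sigmoid(z2)))   # shape (6,)
z3 = Theta2 @ a2                            # shape (5,)
a3 = np.concatenate(([1.0], sigmoid(z3)))   # shape (6,)
z4 = Theta3 @ a3                            # shape (4,)
a4 = sigmoid(z4)                            # shape (4,)

# Backward pass exactly as in the question (sigma'(z) written as a * (1 - a),
# evaluated on the bias-augmented activation, hence the length-6 delta3).
delta4 = a4 - y                                  # shape (4,)
delta3 = (Theta3.T @ delta4) * a3 * (1 - a3)     # shape (6,)
# Theta2.T @ delta3  ->  error: (6, 5) cannot be multiplied by a length-6 vector
```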

Is there anything I have done wrong?

Thanks!

Best Answer

Keep in mind why $\Theta_3$ is $4 \times 6$ rather than $4 \times 5$, even though the third layer has only $5$ nodes. It's because each node in the output layer takes the $5$ nodes as input plus an intercept. Remember that $\delta^{(3)}$ is the derivative of the error function with respect to each node in the third layer, prior to activation. One of your six $\delta^{(3)}$ components is the derivative with respect to the intercept, which has no dependence on any earlier part of the network, and thus has no further "backpropagating" to do. It's not even a relevant value to the calculation, because all you want is the derivative with respect to the weights that travel from the intercept to the outputs.

(I know it doesn't make sense to take a derivative with respect to a constant. However, what we're doing is treating the intercept as if it were an extra variable that always happens to have an observed value of 1. It's done that way for convenience, so we can place its weights in the same matrix as the other weights, rather than considering it separately.)

Thus in the second calculation you matrix multiply $\Theta_2^T$ with the five $\delta^{(3)}$ components that you care about: the ones corresponding to the actual nodes that receive weighted inputs from earlier in the network.
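In code, continuing the sketch from the question: drop the bias entry of $\delta^{(3)}$ before multiplying, and use only the "node" part of each $\delta$ when forming the weight gradients.

```python
# Drop the bias entry of delta3 before backpropagating to layer 2.
delta3_nodes = delta3[1:]                             # shape (5,)
delta2 = (Theta2.T @ delta3_nodes) * a2 * (1 - a2)    # shape (6,); its bias entry is likewise discarded

# The weight gradients only ever use the node part of each delta:
grad_Theta3 = np.outer(delta4, a3)           # shape (4, 6), matches Theta3
grad_Theta2 = np.outer(delta3_nodes, a2)     # shape (5, 6), matches Theta2
grad_Theta1 = np.outer(delta2[1:], a1)       # shape (5, 4), matches Theta1
```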