Backprop is used to compute the gradient of the loss function--that is, a vector containing the partial derivative of the loss function with respect to each parameter of the network. Mechanically, backprop is just a systematic application of the chain rule from calculus.
The gradient is used to update the weights according to some learning rule, whose job is to reduce the value of the loss function. Many learning rules are possible, but one of the simplest and most widely used is gradient descent. The gradient at each point in parameter space is a vector that points in the direction in which the loss function increases most steeply. At each iteration, gradient descent takes a step in the direction opposite the gradient--that is, it steps in the direction of steepest descent, thereby reducing the loss function.
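To make the update rule concrete, here's a minimal NumPy sketch of gradient descent on a toy loss whose gradient we can write by hand (the loss, learning rate, and starting point are all made up for illustration):

```python
import numpy as np

# Toy loss: f(w) = ||w||^2, whose gradient is 2w (minimum at w = 0).
def loss(w):
    return np.sum(w ** 2)

def gradient(w):
    return 2 * w

w = np.array([3.0, -4.0])  # initial parameters
lr = 0.1                   # learning rate (step size)

for _ in range(100):
    w = w - lr * gradient(w)  # step opposite the gradient

# After enough steps, w is driven toward the minimum at 0,
# so loss(w) is driven toward its minimum value of 0.
```

In a real network the hand-written `gradient` function is replaced by the gradient that backprop computes, but the update step is exactly this line.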
Let me try to tackle those questions one by one.
Before we start, let's ignore the regularization term $\lambda\Theta^{l}_{ij}$ for now. The concepts are easier to understand without it, and you can add it back after the rest feels clear.
Let's also pretend that bias terms don't exist.
Dimensions of $\Delta^{l}$: $\Delta^{l}$ is a matrix. Assuming a fully connected neural net (which is what I think the tutorial is covering), its dimensions are: $nrows$ = number of nodes in the next layer (layer $l+1$) and $ncolumns$ = number of nodes in the previous layer (layer $l$). So for the first $\Delta$, #columns = #input features, and for the last one, #rows = #output features. There's a confusing repetition of the letter $i$ in the slide - it's used both to iterate through examples $1$ to $m$ and as an index into the $\Delta$ matrix/matrices. (Note you will sometimes see this matrix defined with $nrows$ and $ncolumns$ swapped, i.e. the transpose. However, your reference material doesn't seem to do that.)
What would this look like for a 3-layer NN: I tend to think of it as 2 separate matrices, $\Delta^{0}$ and $\Delta^{1}$. For a 3x3x3 NN, $\Delta^{0}$ would be 3x3 and $\Delta^{1}$ would be 3x3. For a 3x3x1 NN, $\Delta^{0}$ would still be 3x3 but $\Delta^{1}$ would be 1x3 (I chose to index from 0, but you could index from 1). This assumes the input is a column vector.
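You can sanity-check these shapes in a couple of lines of NumPy. This is just the 3x3x1 example from above, with biases still ignored:

```python
import numpy as np

layer_sizes = [3, 3, 1]  # the 3x3x1 network from the example, biases ignored

# One accumulator per weight matrix: Delta^l has shape
# (nodes in layer l+1) x (nodes in layer l).
Deltas = [np.zeros((layer_sizes[l + 1], layer_sizes[l]))
          for l in range(len(layer_sizes) - 1)]

print([D.shape for D in Deltas])  # [(3, 3), (1, 3)]
```

Each $\Delta^{l}$ has exactly the same shape as the weight matrix $\Theta^{l}$ it is accumulating a gradient for, which is a handy consistency check.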
Why the $\Delta$ is set to all 0 at the start: It's just initialization. You haven't started calculating, or "collecting", the terms of the gradient yet, so you zero out the accumulator before you start.
Significance of the updating: Back to the confusingly repeated $i$. We pass every data point through the neural net, one per iteration of the loop going from $1$ to $m$. So on our first run through the loop, we only accumulate what we think is the gradient based on data point 1, $x^{(1)}$. But who bets the farm on 1 data point? So the next time through, we add the contribution of $x^{(2)}$... and so on, until we get to $x^{(m)}$ and exhaust our data.
But why is this $\Delta$ (after all the calculation) the gradient of the cost function with respect to the parameters?: You're taking the gradient of each sample's error with respect to the parameters (each data point = one iteration through the for loop). By adding them all up (and later dividing by $m$), you're essentially averaging them to get a better estimate of the gradient.
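Here's the accumulate-then-average pattern in isolation, with random stand-in values for the activations and errors (the forward/backward passes that would normally produce them are omitted):

```python
import numpy as np

rng = np.random.default_rng(0)
m = 5                            # number of training examples
a = rng.normal(size=(m, 3))      # stand-in activations a^(l), one row per example
delta = rng.normal(size=(m, 1))  # stand-in errors delta^(l+1), one row per example

Delta = np.zeros((1, 3))  # accumulator, initialized to 0 as in the slide
for i in range(m):        # the loop over examples 1..m
    Delta += np.outer(delta[i], a[i])  # Delta_ij += a^(l)_j * delta^(l+1)_i

D = Delta / m  # average over examples: the gradient estimate

# Same computation, vectorized over all examples at once.
assert np.allclose(D, delta.T @ a / m)
```

The vectorized form at the end is how you'd typically implement it in practice; the explicit loop just mirrors the slide's pseudocode.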
OK, but how are we deciding that adding up $a^{(l)}_j \delta^{(l+1)}_{i}$ turns into a gradient of $J(\Theta)$ (after dividing by $m$)?: This one is tough to type up. To really understand it, I recommend penciling out a baby NN and working through it (doable if you have some, even rusty, calculus background). However, at this stage in the slides, I don't think you're expected to do that. The activation function, which you need to actually do the derivation, isn't even given in the slide. You should be able to google for exercises others have blogged. For example, https://mattmazur.com/2015/03/17/a-step-by-step-backpropagation-example/ looks promising, though, full disclosure: I only leafed through it quickly.
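That said, here is the one chain-rule step that makes the product plausible, assuming (as the Coursera notes do) that $\delta^{(l+1)}_i$ is defined as the derivative of the cost with respect to the weighted input $z^{(l+1)}_i = \sum_j \Theta^{(l)}_{ij} a^{(l)}_j$:

$$\frac{\partial J}{\partial \Theta^{(l)}_{ij}} = \frac{\partial J}{\partial z^{(l+1)}_i} \cdot \frac{\partial z^{(l+1)}_i}{\partial \Theta^{(l)}_{ij}} = \delta^{(l+1)}_i \, a^{(l)}_j$$

since $z^{(l+1)}_i$ depends on $\Theta^{(l)}_{ij}$ only through the single term $\Theta^{(l)}_{ij} a^{(l)}_j$. Summing this quantity over the $m$ examples and dividing by $m$ is exactly what the $\Delta$ accumulation does.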
In summary
So after all this work, you have now done backprop once, and the gradient of the cost function with respect to the various parameters is stored in $\Delta^{0}$ through $\Delta^{L-2}$ (that's $L-1$ matrices for an L-layer fully connected NN, using my 0-based indexing).
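Putting all of the pieces together, here is a sketch of one full pass of the algorithm for the 3x3x1 example, assuming sigmoid units and the logistic cost from the Coursera notes (the random weights, inputs, and labels are placeholders, and biases are still ignored as above):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
# Weight matrices for a 3x3x1 net (biases ignored, as in the rest of the answer).
Theta = [rng.normal(size=(3, 3)), rng.normal(size=(1, 3))]
X = rng.normal(size=(5, 3))          # m = 5 made-up examples
y = rng.integers(0, 2, size=(5, 1))  # made-up binary labels

Deltas = [np.zeros_like(T) for T in Theta]  # initialize accumulators to 0
m = len(X)
for x_i, y_i in zip(X, y):           # loop over examples 1..m
    # Forward pass.
    a1 = x_i
    a2 = sigmoid(Theta[0] @ a1)
    a3 = sigmoid(Theta[1] @ a2)
    # Backward pass (logistic cost, sigmoid hidden units).
    d3 = a3 - y_i                           # output-layer error
    d2 = (Theta[1].T @ d3) * a2 * (1 - a2)  # hidden-layer error
    # Accumulate: Delta_ij += a^(l)_j * delta^(l+1)_i.
    Deltas[1] += np.outer(d3, a2)
    Deltas[0] += np.outer(d2, a1)

grads = [D / m for D in Deltas]  # gradients of J w.r.t. Theta[0], Theta[1]
```

A nice way to check a sketch like this is numerical gradient checking: perturb one weight by a tiny epsilon, recompute the cost, and confirm the finite-difference slope matches the corresponding entry of `grads`.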
Also, I did need to refer to https://www.coursera.org/learn/machine-learning/supplement/pjdBA/backpropagation-algorithm to answer this. Note they also assume a specific activation function, and get into the details on later slides.
Finally, I assumed at the start that bias terms don't exist, because the dimensions are easier to see that way. To handle them, you'll need to expand the matrices between each layer to consume the bias term as well, which is the more usual construct.
Best Answer
The difference is that Andrew Ng's video uses the logistic regression cost function, whereas the other source uses the squared-error regression cost function.
The derivation of the delta term using the Logistic regression function is as follows: