Solved – How does the second derivative inform an update step in Gradient Descent

deep learning, gradient descent, hessian, neural networks, optimization

I was reading the deep learning book by Bengio, Goodfellow and Courville, and there is one section where they explain the second derivative that I don't understand (section 4.3.1):

The second derivative tells us how the first derivative will change as we vary the input. This is important because it tells us whether a **gradient step will cause as much of an improvement as we would expect based on the gradient alone**.

The part in bold is what does not make sense to me.

I think I do understand what the second derivative means. The second derivative simply measures how much the gradient/tangent slope $f'(x)$ changes as we make small changes in $x$, i.e. how small changes in $x$ change the gradient $f'(x)$. So, for example, if we had a large second derivative and made a tiny move, then the tangent line should change a lot. Thus it makes sense that it measures how quickly the tangent line becomes steeper, and why it's sometimes referred to as a measure of curvature. That makes sense.
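For concreteness, here is a small worked example of that reading (my own numbers, not from the book):

$$f(x) = x^2, \qquad f'(x) = 2x, \qquad f''(x) = 2,$$
$$f'(1.1) - f'(1) = 2.2 - 2 = 0.2 \approx f''(1) \cdot 0.1,$$

so the change in slope over a small step is roughly the second derivative times the step size.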

What does NOT make sense to me is the part in bold. How does it inform us at all about whether a gradient step will make as much of an improvement as we would expect based on the gradient alone? I don't think I even understand what that means in English as a sentence. The main reason it seems confusing is that a gradient step is independent of the second derivative, so a gradient step will change the target function by whatever amount it has to. In fact, the second derivative is the rate of change of the derivative and doesn't seem to hold direct information about the target function we are trying to optimize, so I don't know what is meant by:

gradient step will cause as much of an improvement as we would expect based on the gradient alone.

Can someone explain to me what this means?

Best Answer

I agree with your distaste for the writing. It seems as though you already have an understanding of what is going on, but I will attempt to clarify why the second derivative is important. Consider a two-dimensional orthogonal system. Since the dimensions are orthogonal, we can look at them independently and together. This need not be the case, but I use the orthogonal system to avoid the linear algebra, which may muddy the intuition.

In the $x_1$ dimension, the objective $f$ varies roughly as $f = x_1^2$. In the $x_2$ dimension, the objective varies as $f = 0.00001x_2^2$. The minimum is $f(0,0) = 0$.
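For reference, here is a minimal Python sketch of this toy objective (the function names and the check at $(1, 1)$ are just my own illustration):

```python
# Toy objective f(x1, x2) = x1^2 + 0.00001 * x2^2 (illustrative sketch)
def f(x1, x2):
    return x1**2 + 0.00001 * x2**2

def grad(x1, x2):
    # Partial derivatives in each (orthogonal) dimension
    return 2 * x1, 0.00002 * x2

def second_derivs():
    # d^2f/dx1^2 and d^2f/dx2^2 are constant for this quadratic
    return 2.0, 0.00002

print(grad(1.0, 1.0))    # (2.0, 2e-05): the x2 component is already tiny
print(second_derivs())   # (2.0, 2e-05): the curvature mirrors this imbalance
```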

This is the gradient descent update in each dimension:

  • $x_{1,k+1} = x_{1,k} - 2\alpha x_{1,k} $

  • $x_{2,k+1} = x_{2,k} - 0.00002\alpha x_{2,k}$

where $\alpha$ is the learning rate. That is, according to the gradient descent update, if you start at about $(1,1)$, then after a few iterations you will be at $\approx (0,1)$, because the gradient in the $x_2$ direction is already very near zero. True, we might have predicted this from the fact that the gradient in the $x_2$ direction is near zero at every point, but it is still undesirable -- I think that is the point the authors were trying to make in the bolded sentence.
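To see this numerically, here is a minimal sketch of the plain gradient descent update above; the learning rate $\alpha = 0.1$ and the 50 iterations are arbitrary illustrative choices:

```python
# Plain gradient descent on the toy objective, one update per dimension
alpha = 0.1          # illustrative learning rate
x1, x2 = 1.0, 1.0    # start near (1, 1)

for k in range(50):
    x1 = x1 - alpha * 2.0 * x1        # gradient in x1 is 2*x1
    x2 = x2 - alpha * 0.00002 * x2    # gradient in x2 is 0.00002*x2

print(x1, x2)  # x1 is essentially 0; x2 has barely moved from 1
```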

Now note that $\frac{\partial^2 f}{\partial x_1^2} = 2$ and $\frac{\partial^2 f}{\partial x_2^2} = 0.00002$. Dividing by these second derivatives amounts to accounting for the curvature (or lack thereof) in each dimension. Now let's solve these two 1-D problems using second-order information. Recall that the update takes the form $x_{k+1} = x_k - \alpha \left(\frac{\partial^2 f}{\partial x^2}\right)^{-1} \frac{\partial f}{\partial x}$:

  • $x_{1,k+1} = x_{1,k} - \alpha x_{1,k} $

  • $x_{2,k+1} = x_{2,k} - \alpha x_{2,k} $

That is, they converge at the same rate, exactly as we would hope!
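Here is the same sketch with the curvature-scaled update, dividing each gradient component by the second derivative in that dimension (again just an illustration of the toy case, not a general Newton's method implementation):

```python
# Curvature-scaled (Newton-style) updates: divide each gradient component
# by the second derivative in that dimension before stepping
alpha = 0.1
x1, x2 = 1.0, 1.0

for k in range(50):
    x1 = x1 - alpha * (2.0 * x1) / 2.0          # d^2f/dx1^2 = 2
    x2 = x2 - alpha * (0.00002 * x2) / 0.00002  # d^2f/dx2^2 = 0.00002

print(x1, x2)  # both coordinates shrink by the same factor (1 - alpha) each step
```

With this scaling, both coordinates shrink by the factor $(1 - \alpha)$ per iteration, so neither dimension lags behind the other.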
