Why is $\epsilon^{*}=\frac{g^{\top} g}{g^{\top} H g}$ the step size that most decreases the second-order Taylor series approximation to the function $f(x)$?


I don't understand how we determine the optimal step size, i.e. the one that most decreases the second-order Taylor series approximation to the function $f(x)$ around the current point $x^{(0)}$.

The (directional) second derivative tells us how well we can expect a gradient descent step to perform. We can make a second-order Taylor series approximation to the function $f(x)$ around the current point $x^{(0)}$:
$$f(x) \approx f\left(x^{(0)}\right)+\left(x-x^{(0)}\right)^{\top} g+\frac{1}{2}\left(x-x^{(0)}\right)^{\top} H\left(x-x^{(0)}\right)$$
where $g$ is the gradient and $H$ is the Hessian at $x^{(0)}$. If we use a learning rate of $\epsilon$, then the new point $x$ will be given by $x^{(0)} - \epsilon g$. Substituting this into our approximation, we obtain:
$$f\left(x^{(0)}-\epsilon g\right) \approx f\left(x^{(0)}\right)-\color{blue}{\epsilon g^{\top} g}+\color{red}{\frac{1}{2} \epsilon^{2} g^{\top} H g}$$
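To see where the two $\epsilon$ terms come from, substitute the displacement $x - x^{(0)} = -\epsilon g$ into the quadratic approximation term by term:
$$f\left(x^{(0)}-\epsilon g\right) \approx f\left(x^{(0)}\right)+(-\epsilon g)^{\top} g+\frac{1}{2}(-\epsilon g)^{\top} H(-\epsilon g)=f\left(x^{(0)}\right)-\epsilon g^{\top} g+\frac{1}{2} \epsilon^{2} g^{\top} H g.$$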

There are three terms here: the original value of the function, the expected improvement due to the slope of the function, and the correction we must apply to account for the curvature of the function. When this last term is too large, the gradient descent step can actually move uphill. When $g^{\top} H g$ is zero or negative, the Taylor series approximation predicts that increasing $\epsilon$ forever will decrease $f$ forever. In practice, the Taylor series is unlikely to remain accurate for large $\epsilon$, so one must resort to more heuristic choices of $\epsilon$ in this case. When $g^{\top} H g$ is positive, solving for the optimal step size that decreases the Taylor series approximation of the function the most yields:
$$\epsilon^{*}=\frac{g^{\top} g}{g^{\top} H g}$$
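As a quick numerical sanity check (a minimal NumPy sketch; the matrix `H`, vector `g`, and seed here are made up for illustration), we can confirm that $\epsilon^{*}$ matches the minimizer of the quadratic model found by brute-force grid search:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative positive definite Hessian H and gradient g (arbitrary values).
A = rng.standard_normal((5, 5))
H = A @ A.T + 5.0 * np.eye(5)   # symmetric positive definite, so g^T H g > 0
g = rng.standard_normal(5)

gg = g @ g        # g^T g   (scalar)
gHg = g @ H @ g   # g^T H g (scalar, positive here)

def model_change(eps):
    """Predicted change of the quadratic model: -eps g^T g + (1/2) eps^2 g^T H g."""
    return -eps * gg + 0.5 * eps**2 * gHg

eps_star = gg / gHg   # epsilon* = g^T g / g^T H g

# epsilon* should agree with a dense grid search over step sizes.
grid = np.linspace(0.0, 2.0 * eps_star, 100001)
eps_grid = grid[np.argmin(model_change(grid))]

print(f"eps* (formula)     = {eps_star:.6f}")
print(f"eps  (grid search) = {eps_grid:.6f}")   # agrees up to grid resolution
```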

I don't understand why this is so. Why, when $g^{\top} H g$ is positive, is the optimal step size that most decreases the Taylor approximation of the function $f(x)$ given by:

$$\epsilon^{*}=\frac{g^{\top} g}{g^{\top} H g}$$

Indeed, I guessed that we want the correction term accounting for the curvature of the function to be as small as possible, that is to say

\begin{align}
\frac{1}{2} \epsilon^{2} g^{\top} H g &= \epsilon g^{\top} g\\
\epsilon &= \sqrt{\frac{2 g^{\top} g}{g^{\top} H g}}
\end{align}

Where am I wrong? I am a really slow learner in mathematics, so don't hesitate to explain it to me with vivid analogies or as if I were a teenager.

Best Answer

It's simpler than you think. Look at the right-hand side as a function of $\epsilon$: $$h(\epsilon) = \dfrac 12 \epsilon^2 g^{\top} H g - \epsilon g^{\top} g + f(x^{(0)}).$$ To make the gradient descent step work fastest, you want this as small as possible. But it's a quadratic in $\epsilon$, and when the coefficient $g^{\top} H g$ of the leading term is positive, this quadratic is minimized where the derivative is zero: $$0 = h'(\epsilon) = \epsilon g^{\top} H g - g^{\top} g,$$ and solving for $\epsilon$ gives you what you desired.
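Spelling out the final step of the algebra: setting $h'(\epsilon) = 0$ and dividing by $g^{\top} H g > 0$ gives
$$\epsilon^{*}=\frac{g^{\top} g}{g^{\top} H g},$$
and substituting $\epsilon^{*}$ back into $h$ shows the best decrease the quadratic model can promise:
$$h\left(\epsilon^{*}\right)=f\left(x^{(0)}\right)-\frac{\left(g^{\top} g\right)^{2}}{2\, g^{\top} H g}<f\left(x^{(0)}\right).$$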
