This uses the same basic idea as Christian Blatter's answer. Observe that
$$
f(x,\epsilon) \;=\; g_3(x)\epsilon^3 \,+\, g_4(x)\epsilon^4 \,+\, g_5(x)\epsilon^5 \,+\, \cdots
$$
where each $g_n(x)$ is the product of a polynomial with $e^{-x^2/2}$:
$$
\begin{align*}
g_3(x) \;&=\; \frac12 \bigl(-x^3 + x^2 + 5x -1\bigr)e^{-x^2/2} \\[6pt]
g_4(x) \;&=\; \frac{1}{24} \bigl(-3x^5+46x^3-12x^2-93x+12\bigr)e^{-x^2/2} \\[6pt]
g_5(x) \;&=\; \frac{1}{48}\bigl(-x^7 + 27 x^5 - 10 x^4 - 165 x^3 + 84 x^2 + 195 x - 54\bigr)e^{-x^2/2} \\[6pt]
g_6(x) \;&=\; \frac{1}{1920}\bigl(-5 x^9 + 220 x^7 - 120 x^6 - 2714 x^5 + 2280 x^4 + 10020 x^3 - 8280 x^2 - 7725 x + 3240\bigr)e^{-x^2/2}\\
&\;\vdots
\end{align*}
$$
The function $f(x,\epsilon)$ is zero along three curves in the $(x,\epsilon)$-plane.
These curves hit the $x$-axis at the three roots of the polynomial $x^3 - x^2 - 5x + 1$. It is easy to check that $\dfrac{\partial}{\partial x}[f(x,\epsilon)/\epsilon^3]\ne 0$ at these three points, so the zero set really does consist of three such curves in some neighborhood of the $x$-axis, and their intersections with the $x$-axis really are transverse.
We can write the three curves as equations of the form
$$
x \;=\; \psi_1(\epsilon),\qquad x=\psi_2(\epsilon),\qquad x=\psi_3(\epsilon)
$$
Note that the functions $\psi_1$, $\psi_2$, and $\psi_3$ are $C^\infty$, by the Implicit Function Theorem. Then
$$
\Delta(\epsilon) \;=\; \int_{-\infty}^{\psi_1(\epsilon)} \!\!f(x,\epsilon)\,dx \,-\, \int_{\psi_1(\epsilon)}^{\psi_2(\epsilon)} \!\!f(x,\epsilon)\,dx \,+\, \int_{\psi_2(\epsilon)}^{\psi_3(\epsilon)} \!\!f(x,\epsilon)\,dx \,-\, \int_{\psi_3(\epsilon)}^{\infty} \!\!f(x,\epsilon)\,dx
$$
We can use this to derive the first few terms of a Taylor expansion for $\Delta(\epsilon)$.
Specifically, we have
$$
\Delta(\epsilon) \;=\; \kappa \epsilon^3 \,+\, \lambda \epsilon^4 \,+\,\mu\epsilon^5 \,+\, o(\epsilon^5)
$$
for some constants $\kappa$, $\lambda$, $\mu$. There is a nice formula for $\kappa$:
$$
\begin{align*}
\kappa \;&=\; \int_{-\infty}^\infty |g_3(x)|\,dx \\[6pt]
&=\; e^{-\alpha^2/2}(\alpha^2-\alpha-3)\,-\,e^{-\beta^2/2}(\beta^2-\beta-3)\,+\,e^{-\gamma^2/2}(\gamma^2-\gamma-3) \\[6pt]
&\approx\; 3.5519079
\end{align*}
$$
where $\alpha<\beta<\gamma$ are the three roots of the polynomial $x^3 - x^2 - 5x + 1$.
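To see where this closed form comes from: as $\epsilon \to 0$ we have $f(x,\epsilon)/\epsilon^3 \to g_3(x)$ and $\psi_1(\epsilon),\psi_2(\epsilon),\psi_3(\epsilon) \to \alpha,\beta,\gamma$, and $g_3$ is positive on $(-\infty,\alpha)$, negative on $(\alpha,\beta)$, positive on $(\beta,\gamma)$, and negative on $(\gamma,\infty)$, so the alternating sum of integrals for $\Delta(\epsilon)/\epsilon^3$ converges to $\int_{-\infty}^\infty |g_3|$. Since $g_3$ has the elementary antiderivative $G(x) = \tfrac12 e^{-x^2/2}\bigl(x^2-x-3\bigr)$, which vanishes at $\pm\infty$,
$$
\int_{-\infty}^\infty |g_3(x)|\,dx \;=\; \int_{-\infty}^{\alpha} g_3 \,-\, \int_{\alpha}^{\beta} g_3 \,+\, \int_{\beta}^{\gamma} g_3 \,-\, \int_{\gamma}^{\infty} g_3 \;=\; 2G(\alpha) - 2G(\beta) + 2G(\gamma),
$$
which is the expression above.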
The formula for $\lambda$ is similarly nice:
$$
\begin{align*}
\lambda \;&=\; \int_{-\infty}^\infty \frac{g_3(x) g_4(x)}{|g_3(x)|}dx \\[6pt]
&=\; e^{-\alpha^2/2}p(\alpha)\,-\,e^{-\beta^2/2}p(\beta)\,+\,e^{-\gamma^2/2}p(\gamma) \\[6pt]
&\approx\; -3.307248
\end{align*}
$$
where $p(x) = \dfrac{1}{12}\bigl(3x^4-34x^2+12x+25\bigr)$.
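The integrand here is just $\operatorname{sgn}(g_3(x))\,g_4(x)$, and the closed form follows the same way as for $\kappa$, since $g_4$ also has an elementary antiderivative:
$$
\frac{d}{dx}\Bigl[\tfrac12\,e^{-x^2/2}\,p(x)\Bigr] \;=\; g_4(x).
$$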
Things get a little bit dicey after that, since the values of $\psi_1'(0)$, $\psi_2'(0)$, and $\psi_3'(0)$ come into play. In particular,
$$
\begin{align*}
\mu \;&=\; \int_{-\infty}^\infty \frac{g_3(x) g_5(x)}{|g_3(x)|}dx \,+\, \bigl(2g_4(\alpha)+g_3'(\alpha)\bigr)\psi_1'(0) \\[6pt]
&\qquad
-\, \bigl(2g_4(\beta)+g_3'(\beta)\bigr)\psi_2'(0)
+\, \bigl(2g_4(\gamma)+g_3'(\gamma)\bigr)\psi_3'(0)
\end{align*}
$$
It's possible to compute the values of $\psi_1'(0)$, $\psi_2'(0)$, and $\psi_3'(0)$ by examining the gradient of $f(x,\epsilon)/\epsilon^3$ near the points $(\alpha,0)$, $(\beta,0)$, and $(\gamma,0)$. The result is that
$$
\psi_1'(0) = q(\alpha),\qquad \psi_2'(0)=q(\beta),\qquad \psi_3'(0)=q(\gamma)
$$
where
$$
q(x) \;=\; \frac{3x^5-46x^3+12x^2+93x-12}{12(x^4-x^3-8x^2+3x+5)}.
$$
In particular,
$$
\psi_1'(0)\approx -0.832825,\qquad \psi_2'(0) \approx 0.0971987,\qquad \psi_3'(0)\approx 1.06896.
$$
Using these formulas, I'm getting that $\mu\approx 3.70537$, but this is complicated enough that I'm not very confident about this value.
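For anyone who wants to double-check these numbers, here is a quick numerical sanity check (a Python sketch, assuming NumPy is available; every formula is taken from above):

```python
import numpy as np

# roots of x^3 - x^2 - 5x + 1, in increasing order
alpha, beta, gamma = sorted(np.roots([1, -1, -5, 1]).real)

w = lambda x: np.exp(-x**2 / 2)                       # common factor e^{-x^2/2}
p = lambda x: (3*x**4 - 34*x**2 + 12*x + 25) / 12
q = lambda x: (3*x**5 - 46*x**3 + 12*x**2 + 93*x - 12) / (12*(x**4 - x**3 - 8*x**2 + 3*x + 5))

kappa = (w(alpha)*(alpha**2 - alpha - 3)
         - w(beta)*(beta**2 - beta - 3)
         + w(gamma)*(gamma**2 - gamma - 3))
lam = w(alpha)*p(alpha) - w(beta)*p(beta) + w(gamma)*p(gamma)

print(kappa)                          # ~  3.5519079
print(lam)                            # ~ -3.307248
print(q(alpha), q(beta), q(gamma))    # ~ -0.832825, 0.0971987, 1.06896
```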
Edit: In the comments, Clement asks how we know that the remainder term in the expansion
$$
\Delta(\epsilon) \;=\; \kappa \epsilon^3 \,+\, \lambda \epsilon^4 \,+\,\mu\epsilon^5 \,+\, o(\epsilon^5)
$$
is indeed $o(\epsilon^5)$. Well, observe that the function $f(x,\epsilon)$ has a simple antiderivative with respect to $x$:
$$
\begin{align*}
F(x,\epsilon) \;&=\; \epsilon\sqrt{\frac{\pi}{2}}\left(\mathrm{erf}\left(\frac{x-\epsilon}{\sqrt2}\right)-\mathrm{erf}\left(\frac{x}{\sqrt2}\right)\right) \\
&\qquad+\, (1-\epsilon)\sqrt{\frac{\pi}{2}}\left(\mathrm{erf}\left(\frac{x}{\sqrt{2(1+\epsilon)}}\right)-\mathrm{erf}\left(\frac{x-\epsilon^2}{\sqrt{2(1+\epsilon)}}\right)\right)
\end{align*}
$$
where $\mathrm{erf}$ is the error function. Since $\mathrm{erf}$ is an entire function, it is clear that $F(x,\epsilon)$ is analytic on $\mathbb{R}\times(-1,1)$. It is easy to check that $F(x,\epsilon)\to 0$ as $x\to\pm\infty$, so
$$
\Delta(\epsilon) \;=\; 2 F(\psi_1(\epsilon),\epsilon)-2F(\psi_2(\epsilon),\epsilon)+2F(\psi_3(\epsilon),\epsilon)
$$
Since $\psi_1$, $\psi_2$, and $\psi_3$ are $C^\infty$, we can compute the power series for $\Delta(\epsilon)$ in the usual way, giving the results above. The remainder is $o(\epsilon^5)$ because of Taylor's Theorem.
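If anyone wants a numerical check on this, here is a rough sketch (Python, assuming NumPy and SciPy). It evaluates $\Delta(\epsilon)$ directly from $F$ by locating the three zeros of $f(\cdot,\epsilon) = \partial F/\partial x$ near $\alpha$, $\beta$, $\gamma$ (the bracket width of $0.5$ is just a convenient choice), and compares against $\kappa\epsilon^3 + \lambda\epsilon^4 + \mu\epsilon^5$ using the values quoted above, including my not-fully-trusted value of $\mu$:

```python
import numpy as np
from math import erf, sqrt, pi
from scipy.optimize import brentq

def F(x, eps):
    c = sqrt(pi / 2)
    s = sqrt(2 * (1 + eps))
    return (eps * c * (erf((x - eps) / sqrt(2)) - erf(x / sqrt(2)))
            + (1 - eps) * c * (erf(x / s) - erf((x - eps**2) / s)))

def f(x, eps, h=1e-6):
    # f = dF/dx, approximated by a central difference since only F is written out above
    return (F(x + h, eps) - F(x - h, eps)) / (2 * h)

kappa, lam, mu = 3.5519079, -3.307248, 3.70537        # coefficients quoted above
roots = sorted(np.roots([1, -1, -5, 1]).real)         # alpha, beta, gamma

for eps in [0.1, 0.05, 0.02]:
    psi = [brentq(f, r - 0.5, r + 0.5, args=(eps,)) for r in roots]
    delta = 2 * F(psi[0], eps) - 2 * F(psi[1], eps) + 2 * F(psi[2], eps)
    model = kappa * eps**3 + lam * eps**4 + mu * eps**5
    print(eps, delta, model)   # the two columns should agree up to an O(eps^6) remainder
```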
Let's break Taylor's theorem down a bit, now in the optimization setting, with $f : \mathbb R^n \to \mathbb R$. The best linear approximation to $f$ at any given point $x$ is given by the first-order Taylor series:
$$
f(x + v) \approx f(x) + \nabla f(x)^T v,
$$
where the error is $o(\|v\|)$.
You can visualize this for $f : \mathbb R^2 \to \mathbb R$ by realizing that the graph of the linear approximation is the plane tangent to the graph of $f$ at $x$. This is true in higher dimensions, too; just replace "plane" with "hyperplane".
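For a concrete example: if $f(x,y) = x^2 + y^2$, then at the point $(1,1)$ we have $\nabla f(1,1) = (2,2)$, so $f(1+u, 1+v) \approx 2 + 2u + 2v$, and the graph of this approximation is the plane $z = 2 + 2(x-1) + 2(y-1)$, tangent to the paraboloid at $(1,1,2)$.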
Gradient descent is basically what happens when you attempt to take the best possible step towards minimizing $f$, using only this linear approximation. Since the approximation is affine, it has no local extrema and, in particular, no minimum. Hence we can't actually locally minimize --- the best that we can do is go along its steepest downward slope. But we can't just go down this slope forever --- we would never stop, because the linear approximation has no minimum. Another way to think about this is that the linear approximation contains only directional information. The "learning rate" $\epsilon$ basically says "how close to $x$ do I actually trust my linear approximation --- how far am I willing to step along this gradient?"
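As a concrete, purely illustrative sketch in code (the quadratic objective, the step size, and the iteration count below are all made up for the example), a single gradient descent step is just $x \leftarrow x - \epsilon\,\nabla f(x)$:

```python
import numpy as np

def gradient_descent_step(grad_f, x, eps):
    """One gradient descent step: move along the negative gradient with step size eps."""
    return x - eps * grad_f(x)

# hypothetical quadratic objective f(x) = 1/2 x^T A x - b^T x, so grad f(x) = A x - b
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
grad_f = lambda x: A @ x - b

x = np.zeros(2)
for _ in range(50):
    x = gradient_descent_step(grad_f, x, eps=0.1)

print(x)                      # approaches the true minimizer ...
print(np.linalg.solve(A, b))  # ... which is A^{-1} b for this objective
```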
Now on to the second-order term: Taylor's theorem says that the best quadratic approximation of $f$ at $x$ is given by the second-order Taylor series:
$$
f(x + v) \approx f(x) + \nabla f(x)^Tv + \frac{1}{2} v^TH(x)v,
$$
where the error is $o(\|v\|^2)$. You can picture the graph as something similar to a "tangent paraboloid" at $x$. Hence you can think of the second-order term as a "correction" factor to the linear approximation, adding in local curvature information on top of the $\nabla f(x)^Tv$ term. If you need convincing that the Hessian provides curvature information, recall that for $f:\mathbb R^2\to\mathbb R$, $\det H(x)$ has the same sign as the Gaussian curvature of the graph of $f$ at $x$ (this is essentially the second partial derivative test from your 3D calc class).
It should be clear from this discussion that, using only the linear approximation (gradient descent), we can't go very far at all and still have a good approximation of $f$ (unless $f$ is very close to linear), and in particular there's no reason why a sufficiently large step along the gradient wouldn't increase $f$. The same thing can happen with the second-order approximation --- step too far, and the approximation diverges from $f$ enough that we have no idea whether $f$ will increase or decrease!
Re: your question about $g^Tg$ as a magnitude, note that $\| g \|_2 = \sqrt{g^Tg}$ --- this is often in fact taken as the definition of the Euclidean norm.
Edit --- I think I misunderstood your question about the gradient magnitude; let me try again:
Let $g = \nabla f(x)$. When computing the gradient descent step, the optimal direction is parallel to $ - g$, i.e. the negative gradient. The actual step that we take (call it $v$) is a multiple of the negative gradient, with some chosen step size:
$$
v = -\varepsilon g.
$$
Hence the improvement that comes from the linear approximation is
$$
(f(x) + g^T v) - f(x) = g^T v = - \varepsilon g^T g.
$$
So I think that "improvement afforded by the magnitude of the gradient" is not quite accurate --- the "magnitude of the gradient" isn't really a variable; it's your choice of learning rate $\varepsilon$ that controls the amount of improvement.
Best Answer
It's simpler than you think. Look at the right hand side as a function of $\epsilon:$ $$h(\epsilon) = \dfrac 12 \epsilon^2g^THg - \epsilon g^Tg + f(x^{(0)}).$$ To make the gradient descent work fastest, you want this to be as small as possible. But it's a quadratic in $\epsilon$, and when $g^THg$ is positive, this quadratic is minimized where its derivative is zero: $$0 = h'(\epsilon) = \epsilon g^THg - g^Tg $$ and you will get what you desired.
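Explicitly, solving $0 = h'(\epsilon)$ gives the optimal step size
$$
\epsilon^* \;=\; \frac{g^Tg}{g^THg}.
$$
Here is a tiny numerical illustration (a Python sketch with a made-up positive definite $H$, vector $c$, and starting point, so the quadratic model is exact and $\epsilon^*$ is exactly the best step along $-g$):

```python
import numpy as np

# hypothetical quadratic objective f(x) = 1/2 x^T H x - c^T x with positive definite H
H = np.array([[3.0, 1.0], [1.0, 2.0]])
c = np.array([1.0, -1.0])
f = lambda x: 0.5 * x @ H @ x - c @ x

x0 = np.array([2.0, 2.0])
g = H @ x0 - c                       # gradient at x0
eps_star = (g @ g) / (g @ H @ g)     # optimal step size from h'(eps) = 0

for eps in [0.5 * eps_star, eps_star, 2.5 * eps_star]:
    print(eps, f(x0 - eps * g))      # f along -g is smallest at eps_star; 2.5*eps_star overshoots
```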