The Gauss-Newton method is an approximation of Newton's method, specialized to least-squares problems of the form
$$
\underset{\mathbf{x}}{\operatorname{argmin}}\;\mathbf{r}(\mathbf{x})^T\mathbf{r}(\mathbf{x})
$$
In other words, it finds a solution $\mathbf{x}$ that minimizes the squared norm $||\mathbf{r}(\mathbf{x})||_2^2$ of a nonlinear vector-valued function $\mathbf{r}$.
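For concreteness (this example is mine, not part of the original answer): in nonlinear curve fitting, where a model $m(t; \mathbf{x})$ with parameters $\mathbf{x}$ is fitted to data points $(t_i, y_i)$, each residual is the error at one data point:
$$
r_i(\mathbf{x}) = m(t_i; \mathbf{x}) - y_i, \qquad ||\mathbf{r}(\mathbf{x})||_2^2 = \sum_i \left(m(t_i; \mathbf{x}) - y_i\right)^2
$$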
If you look at the update step for gradient descent and Gauss-Newton applied to the equivalent problem $\frac{1}{2}\mathbf{r}(\mathbf{x})^T\mathbf{r}(\mathbf{x})$, the relationship becomes clear:
Gradient descent
$$
\begin{align}
\mathbf{x}_{n+1} &= \mathbf{x}_n - \mu \nabla\left(\tfrac{1}{2}\mathbf{r}(\mathbf{x}_n)^T\mathbf{r}(\mathbf{x}_n)\right) \\
&= \mathbf{x}_n - \mu\mathbf{J}_r^T\mathbf{r}(\mathbf{x}_n)
\end{align}
$$
Gauss-Newton
$$
\begin{align}
\mathbf{x}_{n+1} = \mathbf{x}_n - (\mathbf{J}_r^T\mathbf{J}_r)^{-1}\mathbf{J}_r^T\mathbf{r}(\mathbf{x}_n)
\end{align}
$$
The structure of the problem enables the approximation of the Hessian used in Newton's method as $\mathbf{H} \approx \mathbf{J}_r^T\mathbf{J}_r$. As you said, in every step the method jumps to the minimum of a second-order Taylor approximation around $\mathbf{x}_n$.
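To see where this approximation comes from (a standard derivation, spelled out here for completeness): the exact Hessian of the objective splits into the Gauss-Newton term and a second-order term that the method drops,
$$
\nabla^2\left(\tfrac{1}{2}\mathbf{r}^T\mathbf{r}\right) = \mathbf{J}_r^T\mathbf{J}_r + \sum_i r_i(\mathbf{x})\,\nabla^2 r_i(\mathbf{x}),
$$
so the approximation is good whenever the residuals $r_i$ are small near the solution or $\mathbf{r}$ is nearly linear.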
The qualitative behavior in the neighborhood of a solution is that the approximated second-order (curvature) information allows for convergence along a more direct, less "zigzaggy" path, and convergence is typically much faster than with gradient descent. Imagine how the region that is approximated as a quadratic function (the one that you "jump across" in an iteration) becomes smaller and smaller; in turn, that approximation becomes more and more accurate for a sufficiently smooth function.
However, if the initial guess is far away from a solution, the (approximated) Hessian can become ill-conditioned. The resulting correction vector is then not guaranteed to point in the general direction of descent anymore (if the angle between it and the steepest-descent direction is larger than 90°, the method actually diverges).
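To make the two update rules above concrete, here is a minimal sketch (the toy problem, function names, and step size are my own illustration, not from the answer) applying both to fitting $y \approx a\,e^{bt}$:

```python
import numpy as np

# Toy problem: fit y ~ a * exp(b * t); parameter vector x = [a, b]
t = np.linspace(0.0, 2.0, 20)
x_true = np.array([2.0, -1.0])
y = x_true[0] * np.exp(x_true[1] * t)

def residuals(x):
    # r(x): model prediction minus data, one entry per sample
    return x[0] * np.exp(x[1] * t) - y

def jacobian(x):
    # J_r with entries d r_i / d x_j
    e = np.exp(x[1] * t)
    return np.column_stack([e, x[0] * t * e])

def gradient_descent_step(x, mu=0.01):
    J, r = jacobian(x), residuals(x)
    return x - mu * (J.T @ r)            # x_{n+1} = x_n - mu * J^T r

def gauss_newton_step(x):
    J, r = jacobian(x), residuals(x)
    # Solve (J^T J) d = J^T r rather than forming the inverse explicitly
    d = np.linalg.solve(J.T @ J, J.T @ r)
    return x - d

x_gd = x_gn = np.array([1.0, 0.0])       # same initial guess for both
for _ in range(20):
    x_gd = gradient_descent_step(x_gd)
    x_gn = gauss_newton_step(x_gn)
print("gradient descent:", x_gd)         # slow progress toward [2, -1]
print("Gauss-Newton:    ", x_gn)         # converged after a few steps
```

Note that the sketch solves the normal equations with `np.linalg.solve` instead of computing $(\mathbf{J}_r^T\mathbf{J}_r)^{-1}$, which is both cheaper and numerically safer.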
It appears that there are methods for accelerated projected/proximal gradient descent, though no one seems to have worked out how to combine the state-of-the-art acceleration methods for gradient descent (e.g., Adam, RMSprop) with projected/proximal gradient descent yet -- so you can choose either projected/proximal gradient descent with a sub-par method of acceleration, or ordinary gradient descent with a method of acceleration that works better in practice.
In more detail: there are multiple methods of acceleration, e.g., simple momentum, Nesterov acceleration, Adagrad, Adadelta, RMSprop, and Adam. In practice, experience (with ordinary gradient descent) suggests that some of these perform better than others; for instance, for some machine learning tasks, Adam currently appears to be the most effective of those.
When you move from ordinary gradient descent to projected/proximal gradient descent, there has been some work on combining some of those methods of acceleration with proximal gradient descent... but the literature lags a bit. For instance, it appears that no one has worked out how to apply Adam-style momentum to projected or proximal gradient descent. Perhaps in time the literature will catch up.
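For concreteness, here is a minimal sketch of proximal gradient descent with Nesterov-style acceleration (the FISTA pattern) on $\ell_1$-regularized least squares; the problem, names, and constants are my own illustration, and soft-thresholding is the proximal operator of the $\ell_1$ term:

```python
import numpy as np

# L1-regularized least squares: min_x 0.5*||A x - b||^2 + lam*||x||_1
rng = np.random.default_rng(0)
A = rng.standard_normal((40, 10))
b = rng.standard_normal(40)
lam = 0.1

def soft_threshold(v, thresh):
    # Proximal operator of thresh * ||.||_1
    return np.sign(v) * np.maximum(np.abs(v) - thresh, 0.0)

L = np.linalg.norm(A, 2) ** 2          # Lipschitz constant of the smooth gradient
step = 1.0 / L

x = np.zeros(10)
z = x.copy()                           # extrapolated (look-ahead) point
t_k = 1.0
for _ in range(200):
    grad = A.T @ (A @ z - b)           # gradient of the smooth part at z
    x_next = soft_threshold(z - step * grad, step * lam)   # prox step
    t_next = (1 + np.sqrt(1 + 4 * t_k**2)) / 2
    z = x_next + ((t_k - 1) / t_next) * (x_next - x)       # momentum/extrapolation
    x, t_k = x_next, t_next
print(x)
```

This illustrates the point above: the acceleration available here is Nesterov-style extrapolation, not the adaptive per-coordinate scaling that Adam or RMSprop would provide.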
Best Answer
I just posted an answer on stats.stackexchange that is meant to help with gaining intuition about the difference between Classical Momentum (CM) and Nesterov's Accelerated Gradient (NAG). It also contains a visualization, which isn't taken from a paper or tutorial; I made it up myself.
Following is a copy of my answer. (This meta.stackexchange answer seems to approve of this copy-paste pattern.)
tl;dr
Just skip to the image at the end.
NAG_ball's reasoning is another important part, but I am not sure it would be easy to understand without all of the rest.
CM and NAG are both methods for choosing the next vector $\theta$ in parameter space, in order to find a minimum of a function $f(\theta)$.
In other news, lately these two wild sentient balls appeared:
It turns out (according to the observed behavior of the balls, and according to the paper On the importance of initialization and momentum in deep learning, which describes both CM and NAG in section 2) that each ball behaves exactly like one of these methods, so we will call them "CM_ball" and "NAG_ball":
(NAG_ball is smiling, because he recently watched the end of Lecture 6c - The momentum method, by Geoffrey Hinton with Nitish Srivastava and Kevin Swersky, and thus believes more than ever that his behavior leads to finding a minimum faster.)
Here is how the balls behave:
Let $\theta_t$ be a ball's $t$-th location in parameter space, and let $v_t$ be the ball's $t$-th jump. Then jumping between points in parameter space can be described by $\theta_t=\theta_{t-1}+v_t$.
A small fraction of the momentum of $v_{t-1}$ is lost due to friction with the air.
Let $\mu$ be the fraction of the momentum that is left (the balls are quite aerodynamic, so usually $0.9 \le \mu <1$). Then the Momentum Jump is equal to $\mu v_{t-1}$.
(In both CM and NAG, $\mu$ is a hyperparameter called "momentum coefficient".)
Similarly, the Slope Jump is in the direction of the steepest slope downward (the direction opposite to the gradient), and the larger the gradient, the further the jump.
The Slope Jump also depends on $\epsilon$, the level of eagerness of the ball (naturally, $\epsilon>0$): The more eager the ball, the further the Slope Jump would take it.
(In both CM and NAG, $\epsilon$ is a hyperparameter called "learning rate".)
Let $g$ be the gradient at the starting location of the Slope Jump. Then the Slope Jump is equal to $-\epsilon g$.
Thus, CM_ball's Double Jump is: $$v_{t}=\mu v_{t-1}-\epsilon\nabla f\left(\theta_{t-1}\right)$$
In contrast, NAG_ball thought about it for some time, and then decided to always start with the Momentum Jump.
Therefore, NAG_ball's Double Jump is: $$v_{t}=\mu v_{t-1}-\epsilon\nabla f\left(\theta_{t-1}+\mu v_{t-1}\right)$$
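In code, the two Double Jumps differ only in where the gradient is evaluated (a minimal sketch; the quadratic $f$ and the constants are stand-ins I made up):

```python
import numpy as np

# Toy objective: f(theta) = 0.5 * theta^T Q theta, so grad f(theta) = Q @ theta
Q = np.diag([1.0, 10.0])
grad_f = lambda theta: Q @ theta

mu, eps = 0.9, 0.05                  # momentum coefficient, learning rate

def cm_step(theta, v):
    # CM: Slope Jump uses the gradient at the current location
    v = mu * v - eps * grad_f(theta)
    return theta + v, v

def nag_step(theta, v):
    # NAG: Momentum Jump first, then the Slope Jump from the look-ahead point
    v = mu * v - eps * grad_f(theta + mu * v)
    return theta + v, v

theta_cm = theta_nag = np.array([3.0, 3.0])
v_cm = v_nag = np.zeros(2)
for _ in range(50):
    theta_cm, v_cm = cm_step(theta_cm, v_cm)
    theta_nag, v_nag = nag_step(theta_nag, v_nag)
print("CM :", theta_cm)              # both head toward the minimum at [0, 0]
print("NAG:", theta_nag)             # NAG typically overshoots less
```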
NAG_ball's reasoning
So I should consider the situation as if I have already made my Momentum Jump, and I am about to make my Slope Jump.
Finally, yesterday I was fortunate enough to observe each of the balls jumping around in a 1-dimensional parameter space.
I think that looking at their changing positions in the parameter space wouldn't help much with gaining intuition, as this parameter space is a line.
So instead, for each ball I sketched a 2-dimensional graph in which the horizontal axis is $\theta$ and the vertical axis is $f(\theta)$.
Then I drew $f(\theta)$ using a black brush, and also drew each ball in his first $7$ positions, along with numbers to show the chronological order of the positions.
Lastly, I drew green arrows to show the distance in parameter space (i.e. the horizontal distance in the graph) of each Momentum Jump and Slope Jump.
Appendix 1 - A demonstration of NAG_ball's reasoning
In this mesmerizing gif by Alec Radford, you can see NAG performing arguably better than CM ("Momentum" in the gif).
(The minimum is where the star is, and the curves are contour lines. For an explanation about contour lines and why they are perpendicular to the gradient, see videos 1 and 2 by the legendary 3Blue1Brown.)
An analysis of a specific moment demonstrates NAG_ball's reasoning:
Appendix 2 - things/terms I made up (for intuition's sake)
- CM_ball and NAG_ball
- Momentum Jump, Slope Jump, and Double Jump
- The eagerness of a ball ($\epsilon$)
- Momentum lost due to friction with the air
Appendix 3 - terms I didn't make up
- Classical Momentum (CM) and Nesterov's Accelerated Gradient (NAG)
- Momentum coefficient ($\mu$) and learning rate ($\epsilon$)
- Parameter space