The Gauss-Newton method is an approximation of Newton's method for specialized problems of the form
$$
\underset{\mathbf{x}}{\operatorname{argmin}}\;\mathbf{r}(\mathbf{x})^T\mathbf{r}(\mathbf{x})
$$
In other words, it finds a solution $\mathbf{x}$ that minimizes the squared norm $\|\mathbf{r}(\mathbf{x})\|_2^2$ of a nonlinear vector-valued function $\mathbf{r}$.
If you look at the update steps of gradient descent and Gauss-Newton applied to the equivalent problem $\frac{1}{2}\mathbf{r}(\mathbf{x})^T\mathbf{r}(\mathbf{x})$, the relationship becomes clear:
Gradient descent
$$
\begin{align}
\mathbf{x}_{n+1} &= \mathbf{x}_n - \mu \nabla\left(\frac{1}{2}\mathbf{r}(\mathbf{x}_n)^T\mathbf{r}(\mathbf{x}_n)\right) \\
&= \mathbf{x}_n - \mu\,\mathbf{J}_r^T\mathbf{r}(\mathbf{x}_n)
\end{align}
$$
Gauss-Newton
$$
\begin{align}
\mathbf{x}_{n+1} = \mathbf{x}_n - (\mathbf{J}_r^T\mathbf{J}_r)^{-1}\mathbf{J}_r^T\mathbf{r}(\mathbf{x}_n)
\end{align}
$$
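To make the comparison concrete, here is a minimal NumPy sketch of both updates on a toy exponential-fit problem (the residual model, data, and step size $\mu$ are made up for illustration):

```python
import numpy as np

# Hypothetical toy problem: fit the model y = x2 * exp(x1 * t) to data (t_i, y_i).
t = np.linspace(0.0, 1.0, 20)
y = 2.0 * np.exp(0.5 * t)                  # synthetic data, true solution (0.5, 2.0)

def residual(x):
    # r(x): one residual per data point
    return x[1] * np.exp(x[0] * t) - y

def jacobian(x):
    # J_r: row i holds the gradient of r_i with respect to x
    J = np.empty((t.size, 2))
    J[:, 0] = x[1] * t * np.exp(x[0] * t)  # dr_i/dx1
    J[:, 1] = np.exp(x[0] * t)             # dr_i/dx2
    return J

def gradient_descent_step(x, mu=0.01):
    J, r = jacobian(x), residual(x)
    return x - mu * (J.T @ r)              # x - mu * J^T r

def gauss_newton_step(x):
    J, r = jacobian(x), residual(x)
    # x - (J^T J)^{-1} J^T r, via a linear solve instead of an explicit inverse
    return x - np.linalg.solve(J.T @ J, J.T @ r)

x = np.array([0.3, 1.5])                   # initial guess near the solution
for _ in range(10):
    x = gauss_newton_step(x)
print(x)                                   # approaches (0.5, 2.0)
```

Note that the Gauss-Newton step solves the linear system $\mathbf{J}_r^T\mathbf{J}_r\,\Delta\mathbf{x} = \mathbf{J}_r^T\mathbf{r}$ rather than forming the inverse explicitly, which is the numerically preferable way to evaluate the update above.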
The structure of the problem makes it possible to approximate the Hessian used in Newton's method as $\mathbf{H} \approx \mathbf{J}_r^T\mathbf{J}_r$. As you said, the method jumps to the minimum of the second-order Taylor approximation around $\mathbf{x}_n$ in every step.
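For completeness: the exact Hessian of $\frac{1}{2}\mathbf{r}(\mathbf{x})^T\mathbf{r}(\mathbf{x})$ is
$$
\mathbf{H} = \mathbf{J}_r^T\mathbf{J}_r + \sum_i r_i(\mathbf{x})\,\nabla^2 r_i(\mathbf{x}),
$$
so Gauss-Newton amounts to dropping the second term, which is negligible when the residuals are small near the solution or when $\mathbf{r}$ is close to linear.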
Near a solution, the approximate second-order (curvature) information lets the method converge along a more direct, less "zigzaggy" path than gradient descent, and it converges faster. Intuitively, the region that is approximated by a quadratic (the one you "jump across" in an iteration) becomes smaller and smaller, so for a sufficiently smooth function the approximation becomes more and more accurate.
However, if the initial guess is far away from a solution, the (approximated) Hessian can become ill-conditioned. The resulting correction vector is then not guaranteed to point in the general direction of descent anymore (if the angle between it and the steepest-descent direction exceeds 90°, the method actually diverges).
If we increase $a$, we do not know whether this will ever result in a valid choice. There is no guarantee of finding another valley farther away from where we are.
But we can be sure that we will find a point below the line $l$ if we make $a$ small enough. This is because $c\in(0,1),$ which means that $\phi$ is steeper than $l$ at $a=0.$
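(To fix notation: this is the standard sufficient-decrease setup, where along a descent direction $p_k$ one compares
$$
\phi(a) = f(x_k + a\,p_k) \quad\text{and}\quad l(a) = f(x_k) + c\,a\,\nabla f_k^T p_k,
$$
with $\nabla f_k^T p_k < 0$.)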
EDIT: Why can we be sure that we will find a suitable $a$ with $\phi(a)<l(a)$?
Let $g(a)=l(a)-\phi(a).$ Then $g(0)=f(x_k)-f(x_k)=0$ and $g'(0)=(c-1)(\nabla f_k^T p_k)>0,$ because $c-1<0$ and $\nabla f_k^T p_k<0.$
We have
$$
g'(0)=\lim_{a\to 0}\frac{g(a)-g(0)}{a-0} = \lim_{a\to 0}\frac{g(a)}{a}
$$
From the definition of $\lim$: For each $\varepsilon>0$ there is a $\delta>0$ such that
$$
0<|a-0|<\delta \;\Rightarrow \; \left|\frac{g(a)}{a}-g'(0)\right|<\varepsilon
$$
Now set $\varepsilon=\frac12 g'(0)$ and choose $\delta$ accordingly, so that the implication above holds. Then
$$
\begin{align}
0<a<\delta \;&\Rightarrow\; g'(0)-\frac 12 g'(0) < \frac{g(a)}{a} < g'(0)+\frac 12 g'(0) \\
&\Rightarrow\; \frac 12 g'(0) < \frac{g(a)}{a} \\
&\Rightarrow\; g(a) > \frac a2 g'(0) > 0
\end{align}
$$
Hence $g(a)>0$, i.e. $\phi(a)<l(a)$, for every $a\in(0,\delta)$.
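This argument is also exactly why plain backtracking terminates in code. A minimal sketch in Python (the function names, the shrink factor $\rho=0.5$, and the quadratic test function are illustrative choices, not part of the answer above):

```python
import numpy as np

def backtracking_line_search(f, grad_f, x, p, a0=1.0, c=1e-4, rho=0.5):
    """Shrink a until the sufficient-decrease condition
    f(x + a*p) <= f(x) + c*a*grad_f(x)^T p holds; the argument above
    guarantees this loop terminates for any descent direction p."""
    a = a0
    fx = f(x)
    slope = grad_f(x) @ p                  # < 0 for a descent direction
    while f(x + a * p) > fx + c * a * slope:
        a *= rho                           # try a smaller step
    return a

# One steepest-descent step on f(x) = x1^2 + 10*x2^2:
f = lambda x: x[0]**2 + 10 * x[1]**2
grad_f = lambda x: np.array([2 * x[0], 20 * x[1]])
x = np.array([1.0, 1.0])
p = -grad_f(x)                             # steepest-descent direction
a = backtracking_line_search(f, grad_f, x, p)
print(a, f(x + a * p))                     # accepted step decreases f from 11.0
```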
Best Answer
Using backtracking, assuming the selected direction is a descent direction, guarantees a decrease of the objective in each iteration.
In many cases this is sufficient, given enough steps, to reach a local stationary point.
Without backtracking, with a constant step size for example, the iteration might diverge: the step size has to be matched to the surface being minimized, and backtracking is one of the simplest ways to do so.
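To see the divergence risk concretely (a small illustrative sketch, not from the answer above): on $f(x)=x^2$ the fixed step $\mu=1.1$ gives the update $x \leftarrow -1.2\,x$, which diverges, while backtracking can never overshoot this way.

```python
# Constant step size mu = 1.1 on f(x) = x^2 (gradient 2x):
# the update x <- x - mu*2x = -1.2*x grows in magnitude every iteration.
x = 1.0
for _ in range(20):
    x -= 1.1 * (2 * x)
print(x)        # about 38.3: the iterates diverge

# Same problem with backtracking: every accepted step decreases f.
x = 1.0
for _ in range(20):
    g = 2 * x                               # gradient of x^2
    a = 1.0
    while (x - a * g) ** 2 > x ** 2 - 1e-4 * a * g * g:
        a *= 0.5                            # shrink until sufficient decrease
    x -= a * g
print(x)        # 0.0: the iterates reach the minimizer
```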