The quick answer would be: because Newton's method is a higher-order method, and thus builds a better approximation of your function. But that is not all.
Newton's method exactly minimizes the second-order approximation of a function $f$ at each step. That is, it iteratively sets
$$ x \gets x - \left[ \nabla^2 f(x) \right]^{-1} \nabla f(x). $$
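To make this concrete, the Newton step is what you get by minimizing the quadratic model of $f$ around the current $x$ (assuming the Hessian is positive definite): setting the gradient of the model with respect to the step $p$ to zero gives
$$ m(p) = f(x) + \nabla f(x)^\top p + \tfrac{1}{2}\, p^\top \nabla^2 f(x)\, p, \qquad \nabla m(p) = 0 \;\Longrightarrow\; p = -\left[ \nabla^2 f(x) \right]^{-1} \nabla f(x). $$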
Gradient descent has access only to the first-order approximation, and makes the update
$$ x \gets x - h \nabla f(x), $$
for some step-size $h$.
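As a minimal sketch of the two updates side by side (the quadratic objective and the step size below are purely illustrative, not anything from the question):

```python
import numpy as np

# Toy objective f(x) = 0.5 x^T A x - b^T x (illustrative only):
# gradient(x) = A x - b, Hessian(x) = A.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, -1.0])

def gradient(x):
    return A @ x - b

def hessian(x):
    return A

x = np.zeros(2)

# One Newton step: solve the linear system rather than forming the inverse.
x_newton = x - np.linalg.solve(hessian(x), gradient(x))

# One gradient step with a hand-picked step size h.
h = 0.1
x_gd = x - h * gradient(x)

print(x_newton)  # for a quadratic, Newton lands on the exact minimizer in one step
print(x_gd)      # gradient descent only moves a short way toward it
```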
The practical difference is that Newton's method assumes you have much more information available, makes much better updates, and thus converges in fewer iterations.
If you don't have any further information about your function, and you are able to use Newton's method, just use it.
But the number of iterations needed is not all you want to know. The Newton update scales poorly with problem size: if $x \in \mathbb{R}^d$, then computing $\left[ \nabla^2 f(x) \right]^{-1}$ requires $\mathcal{O}(d^3)$ operations. The cost of a gradient descent update, on the other hand, is linear in $d$.
In many large-scale applications, arising very often in machine learning for example, $d$ is so large (it can be a billion) that you are far beyond being able to make even a single Newton update.
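A rough way to see the scaling gap (not a careful benchmark; the dimension and matrices below are arbitrary):

```python
import time
import numpy as np

d = 2000  # illustrative dimension
rng = np.random.default_rng(0)
M = rng.standard_normal((d, d))
H = M @ M.T + d * np.eye(d)      # stand-in for an SPD Hessian
g = rng.standard_normal(d)       # stand-in for a gradient

t0 = time.perf_counter()
newton_step = np.linalg.solve(H, g)   # roughly O(d^3) work
t1 = time.perf_counter()
gd_step = 0.1 * g                      # O(d) work
t2 = time.perf_counter()

print(f"Newton-type solve: {t1 - t0:.3f} s, gradient step: {t2 - t1:.6f} s")
```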
Best Answer
You can take a look at this slide by John Langford, assertion 2. It explains how to come up with a rank-1 matrix that takes a given vector to another given vector, which is very intuitive algebraically. That alone is not going to yield an updated Hessian close to the previous Hessian in any sense (Hessians typically have much higher rank than 1). So the idea is to add one rank-1 piece and subtract another rank-1 piece from the previous Hessian:
$B_{k+1} = B_k + U - V$, where $U (x_{k+1} - x_k) = \nabla f(x_{k+1}) - \nabla f(x_k)$ (the secant condition) and $V (x_{k+1} - x_k) = B_k (x_{k+1} - x_k)$, with $U$ and $V$ both of rank 1.
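For concreteness, writing $s_k = x_{k+1} - x_k$ and $y_k = \nabla f(x_{k+1}) - \nabla f(x_k)$, one standard pair of rank-1 pieces satisfying both conditions (this is how the BFGS update comes out) is
$$ U = \frac{y_k y_k^\top}{y_k^\top s_k}, \qquad V = \frac{B_k s_k s_k^\top B_k}{s_k^\top B_k s_k}, \qquad\text{so}\qquad B_{k+1} = B_k + \frac{y_k y_k^\top}{y_k^\top s_k} - \frac{B_k s_k s_k^\top B_k}{s_k^\top B_k s_k}. $$
One can check directly that $U s_k = y_k$ and $V s_k = B_k s_k$.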
This seems like the most intuitive way of remembering the formula, but there are also derivations based on a weighted Frobenius norm $\|B_{k+1} - B_k\|_{F,w}$, see this. Another one, based on a maximum-entropy argument similar to the Bayesian update of the Kalman filter, is here.
I kind of came up with the first derivation idea myself; it is the easiest mathematically and the most useful from an implementation point of view (since a rank-1 update means we only need dense vector products).
I assume you know how to go from $B_{k+1}$ to $B_{k+1}^{-1}$ easily using the Sherman-Morrison formula (see Wikipedia), so that part is just linear algebra. Experts in convex optimization are very welcome to correct me.
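Here is a minimal NumPy sketch of how the pieces fit together; the matrices and vectors are synthetic and the helper name is my own. It only illustrates the rank-1 update of $B$ and the matching Sherman-Morrison update of its inverse:

```python
import numpy as np

def sherman_morrison(H, u, v):
    """Given H = B^{-1}, return (B + u v^T)^{-1} (rank-1 update of an inverse)."""
    Hu = H @ u
    vH = v @ H
    return H - np.outer(Hu, vH) / (1.0 + v @ Hu)

rng = np.random.default_rng(0)
d = 5
B = rng.standard_normal((d, d))
B = B @ B.T + d * np.eye(d)               # current Hessian approximation B_k (SPD)
H = np.linalg.inv(B)                      # its inverse, maintained alongside B
s = rng.standard_normal(d)                # s = x_{k+1} - x_k
y = B @ s + 0.1 * rng.standard_normal(d)  # y = gradient difference (synthetic)

# Rank-1 pieces from above: U s = y (secant condition), V s = B s.
U = np.outer(y, y) / (y @ s)
Bs = B @ s
V = np.outer(Bs, Bs) / (s @ Bs)
B_new = B + U - V

# Update the inverse with two Sherman-Morrison steps instead of re-inverting.
H_new = sherman_morrison(H, y / (y @ s), y)          # accounts for +U
H_new = sherman_morrison(H_new, -Bs / (s @ Bs), Bs)  # accounts for -V

print(np.allclose(B_new @ s, y))                 # secant condition holds
print(np.allclose(H_new, np.linalg.inv(B_new)))  # inverse tracked correctly
```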