The quick answer would be: because Newton's method is a higher-order method, and thus builds a better approximation of your function. But that is not all.
Newton's method exactly minimizes the second-order approximation of a function $f$ at each step. That is, it iteratively sets
$$ x \gets x - \left[ \nabla^2 f(x) \right]^{-1} \nabla f(x). $$
Gradient descent only has access to the first-order approximation, and makes the update
$$ x \gets x - h \nabla f(x), $$
for some step-size $h$.
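To make the two update rules concrete, here is a minimal sketch on a toy quadratic $f(x) = \tfrac{1}{2} x^\top A x - b^\top x$; the particular matrix, step size, and iteration counts are illustrative assumptions, not anything prescribed by the answer.

```python
import numpy as np

# Toy smooth objective: f(x) = 0.5 x^T A x - b^T x with A symmetric positive definite,
# chosen only to illustrate the two update rules.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, -1.0])

def grad(x):
    return A @ x - b      # first-order information

def hess(x):
    return A              # second-order information (constant for a quadratic)

# Gradient descent: x <- x - h * grad f(x), with a hand-picked step size h.
x_gd = np.zeros(2)
h = 0.1
for _ in range(100):
    x_gd = x_gd - h * grad(x_gd)

# Newton's method: x <- x - [hess f(x)]^{-1} grad f(x).
# Solving the linear system instead of forming the inverse is the usual practice.
x_nt = np.zeros(2)
for _ in range(5):
    x_nt = x_nt - np.linalg.solve(hess(x_nt), grad(x_nt))

print(x_gd, x_nt, np.linalg.solve(A, b))  # both approach the exact minimizer A^{-1} b
```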
The practical difference is that Newton's method assumes you have much more information available, makes much better updates, and thus converges in fewer iterations.
If you don't have any further information about your function, and you are able to use Newton's method, just use it.
But the number of iterations needed is not all you want to know. The Newton update scales poorly with problem size. If $x \in \mathbb{R}^d$, then computing $\left[ \nabla^2 f(x) \right]^{-1}$ takes $\mathcal{O}(d^3)$ operations, whereas the cost of a gradient descent update is only linear in $d$.
In many large-scale applications, which arise very often in machine learning for example, $d$ is so large (it can be in the billions) that you are way beyond being able to make even a single Newton update.
$\ell_1$ optimization is a huge field, with both direct methods (simplex, interior point) and iterative methods. I have used iteratively reweighted least squares (IRLS) with conjugate gradient and approximate inner iterations. In practice, the fact that the absolute value is not differentiable at the origin is not really a problem. For example, minimizing $\sum_i |x - x_i|$ over $x$ just says that $x$ is the median of the $x_i$, since the derivatives are all $\pm 1$ (assuming $x$ is not one of the $x_i$).

Here is a reference: J. A. Scales and A. Gersztenkorn, "Robust methods in inverse theory," Inverse Problems 4(4), 1071. There is a PDF on Google Scholar. The algorithm described there can be used for $L_p$-norm optimization for any $p \geq 1$. These problems are all nonlinear except for $p = 2$, but we found that using the approximate inner iterations made the $\ell_1$ problem almost as fast as the least-squares problem.

Added: an interesting historical note is that $\ell_1$ optimization was invented by Boscovich before Gauss invented least squares.
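As a rough illustration of the reweighting idea (not the exact algorithm from the reference, which solves the inner problem approximately with conjugate gradient), here is a small sketch; the smoothing parameter `eps`, the iteration count, and the toy data are assumptions made for the example.

```python
import numpy as np

def irls_l1(A, b, n_iter=50, eps=1e-6):
    """Iteratively reweighted least squares sketch for min_x ||Ax - b||_1.

    Each outer iteration solves a weighted least-squares problem with weights
    1 / max(|residual|, eps); eps guards against the non-differentiability of
    |r| at r = 0. A dense solve is used here for brevity.
    """
    x = np.linalg.lstsq(A, b, rcond=None)[0]    # least-squares starting point
    for _ in range(n_iter):
        r = A @ x - b
        w = 1.0 / np.maximum(np.abs(r), eps)    # reweighting for p = 1
        W = np.diag(w)
        x = np.linalg.solve(A.T @ W @ A, A.T @ W @ b)
    return x

# Toy usage: fit a line through data containing one gross outlier.
t = np.linspace(0.0, 1.0, 20)
A = np.column_stack([t, np.ones_like(t)])
b = 2.0 * t + 1.0
b[3] += 10.0                                    # the outlier
print(irls_l1(A, b))                            # close to slope 2, intercept 1
```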
Best Answer
The traditional approach would be to reformulate the problem to make it smooth. In this case, it's easy:
$$ \min_x \ \|Ax - b\|_{\infty} $$
is equivalent to
$$ \min_{x,t} \ t \quad \text{subject to} \ -te \leq Ax - b \leq te, $$
where $e = (1, 1, \dots, 1)$. This is a linear program that you can solve with a standard LP solver.

You can play the same trick with the $\ell_1$ norm, which is often used in curve fitting to mitigate the influence of outliers:
$$ \min_x \ \|Ax - b\|_1 $$
is equivalent to
$$ \min_{x,s,t} \ \sum_i s_i + t_i \quad \text{subject to} \ Ax - b = s - t, \ (s, t) \geq 0. $$
Again, this is a linear program.
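For concreteness, here is one way both reformulations could be set up with `scipy.optimize.linprog`; the random toy data and the choice of solver are assumptions of the example, not part of the answer.

```python
import numpy as np
from scipy.optimize import linprog

# Toy overdetermined system; any A, b of compatible shapes would do.
rng = np.random.default_rng(1)
A = rng.standard_normal((30, 3))
b = rng.standard_normal(30)
m, n = A.shape
e = np.ones(m)

# min ||Ax - b||_inf as an LP in (x, t):
#   minimize t  subject to  Ax - t e <= b  and  -Ax - t e <= -b.
c = np.concatenate([np.zeros(n), [1.0]])
A_ub = np.block([[A, -e[:, None]],
                 [-A, -e[:, None]]])
b_ub = np.concatenate([b, -b])
bounds = [(None, None)] * n + [(0, None)]          # x free, t >= 0
res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
x_inf = res.x[:n]
print("L_inf residual:", np.max(np.abs(A @ x_inf - b)))

# min ||Ax - b||_1 as an LP in (x, s, t):
#   minimize sum(s + t)  subject to  Ax - b = s - t,  s >= 0,  t >= 0.
c = np.concatenate([np.zeros(n), np.ones(2 * m)])
A_eq = np.hstack([A, -np.eye(m), np.eye(m)])
bounds = [(None, None)] * n + [(0, None)] * (2 * m)
res = linprog(c, A_eq=A_eq, b_eq=b, bounds=bounds)
x_l1 = res.x[:n]
print("L_1 residual:", np.sum(np.abs(A @ x_l1 - b)))
```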
If you want to avoid constraints, you could apply a subgradient method to the nonsmooth objective $\|Ax-b\|_{\infty}$. It's almost identical to applying the gradient method except that instead of a gradient, you compute a subgradient. See http://www.seas.ucla.edu/~vandenbe/236C/lectures/subgradients.pdf.
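A minimal sketch of such a subgradient method might look like the following; the diminishing step size $1/(k+1)$ and the bookkeeping of the best iterate seen so far are standard but unoptimized choices.

```python
import numpy as np

def subgradient_linf(A, b, n_iter=5000):
    """Subgradient method sketch for min_x ||Ax - b||_inf.

    A subgradient of the objective at x is sign(r_i) * A[i], where i is an
    index at which |r_i| attains the maximum of the residual r = Ax - b.
    """
    x = np.zeros(A.shape[1])
    best_x, best_val = x.copy(), np.max(np.abs(A @ x - b))
    for k in range(n_iter):
        r = A @ x - b
        i = np.argmax(np.abs(r))
        g = np.sign(r[i]) * A[i]   # a subgradient of ||Ax - b||_inf at x
        x = x - g / (k + 1)        # diminishing step size
        val = np.max(np.abs(A @ x - b))
        if val < best_val:         # the objective is not monotone, so track the best iterate
            best_x, best_val = x.copy(), val
    return best_x

# Toy usage with random data.
rng = np.random.default_rng(2)
A = rng.standard_normal((40, 4))
b = rng.standard_normal(40)
print(np.max(np.abs(A @ subgradient_linf(A, b) - b)))
```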