I wouldn't agree that direct techniques always use a "formula". For instance, in solving linear systems of equations "directly", one typically doesn't use a closed-form solution, but instead something like LU or Cholesky decomposition.
Iterative solvers, on the other hand, would be things like Gauss-Seidel iteration, SOR (successive over-relaxation), Krylov subspace methods like conjugate gradients, or multilevel methods.
So what's the big difference? In the case of linear systems, I'd say the following are two of the main differences:
With a direct method, you have to do a certain amount of work, and then you obtain your solution. For instance, if you're doing LU factorization, you can't stop halfway through and say "alright, that's good enough". (There is something called incomplete LU, but it is actually used inside iterative methods, as a preconditioner!) To get a solution, you need to run the algorithm to the end. Typically, the solution you then obtain will be close to the exact one.
With iterative methods, you always update your old guess and (hopefully) get a bit closer to the true solution. This means you can decide how much work to invest depending on how accurate you need your solution to be. For instance, CG is known to converge to the exact solution in $n$ steps for a matrix of size $n$, and was historically first seen as a direct method because of this. However, after a while people figured out that it works really well if you just stop the iteration much earlier: often you will get a very good approximation after far fewer than $n$ steps. In fact, one can analyze precisely how fast CG converges. The end result is that today CG is used as an iterative method for large linear systems.
The second big difference is that, for a direct method, you generally need to have the entire matrix stored in memory. Not so in iterative methods: here, you typically only need a way to compute the application of the matrix to a vector, i.e., the matrix-vector product. So if you have a very large matrix, but you can relatively quickly compute its application to a vector (for instance, because the matrix is very sparse or has some other kind of structure), you can use this to your advantage.
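To make the second difference concrete, here is a small sketch (my own example, not from the answer; the matrix, size, and tolerances are arbitrary choices) that solves the same sparse system once directly via sparse LU and once with CG, where CG only ever sees the matrix through matrix-vector products:

```python
# Sketch (my own example): direct sparse LU vs. CG driven purely by matvecs.
import numpy as np
from scipy.sparse import diags
from scipy.sparse.linalg import splu, cg, LinearOperator

n = 200
# 1D Laplacian: symmetric positive definite and very sparse.
A = diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n), format="csc")
b = np.ones(n)

# Direct: the LU factorization must run to completion before any solve.
x_direct = splu(A).solve(b)

# Iterative: CG only touches A through matrix-vector products, so A could
# just as well be an implicitly defined operator that is never stored.
A_op = LinearOperator((n, n), matvec=lambda v: A @ v)
x_cg, info = cg(A_op, b)          # info == 0 signals convergence

rel_err = np.linalg.norm(x_cg - x_direct) / np.linalg.norm(x_direct)
print(info, rel_err)
```

Hiding the matrix behind a `LinearOperator` is exactly how one exploits structure: only the action $v \mapsto Av$ has to be cheap, not the storage of $A$.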
In general, there is no relation between the absolute value of the error $\epsilon_n$ and the number $\delta_n$!
For example, consider a function $f : \mathbb{R} \rightarrow \mathbb{R}$ which, far away from the desired root, is positive and decays like $\exp(-\lambda x)$ as $x \rightarrow \infty$, where $\lambda > 0$. If the user has picked an initial guess far from the root and inside this bad region, then Newton's method essentially reduces to $x_{n+1} = x_n + \frac{1}{\lambda}$. If $\lambda$ is sufficiently large, your routine will signal convergence, despite the fact that you are moving away from the root! A good routine should also monitor the residual, i.e., the absolute value of $f(x_n)$, but this will not save you in this case, since $f(x_n) \rightarrow 0$ as well.
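A tiny numerical illustration of this failure mode (the decay rate and starting point are assumed values of my choosing, with the pure exponential standing in for the flat tail):

```python
# Sketch (assumed numbers): Newton on f(x) = exp(-lam*x), which mimics the
# exponential tail far from any root. The Newton step is exactly 1/lam.
import math

lam = 1.0e4                      # hypothetical large decay rate
f = lambda x: math.exp(-lam * x)
df = lambda x: -lam * math.exp(-lam * x)

x = 0.01                         # start in the flat region (lam*x = 100)
for _ in range(5):
    step = -f(x) / df(x)         # equals 1/lam = 1e-4 for this f
    x += step

# A naive |step| < 1e-3 test reports "converged", and the residual f(x)
# is tiny too -- yet the iterates are marching away from any root.
print(abs(step), f(x))
```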
If you want a truly robust code, then you should strive to maintain a bracket, i.e., an interval $[a,b]$ for which you are certain that $f(a)$ and $f(b)$ have opposite signs. You should look into hybrid methods, such as a merger of the secant method and the bisection method.
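A minimal sketch of such a hybrid (my own toy helper, not a library routine; production codes like Brent's method add further safeguards): try a secant step, but fall back to bisection whenever the step would leave the bracket, so the sign change is never lost.

```python
# Sketch (hypothetical helper): secant steps safeguarded by a bracket,
# falling back to bisection when the secant proposal leaves [a, b].
def hybrid_root(f, a, b, tol=1e-12, max_iter=100):
    fa, fb = f(a), f(b)
    assert fa * fb < 0, "need f(a), f(b) of opposite signs"
    x = 0.5 * (a + b)
    for _ in range(max_iter):
        x = b - fb * (b - a) / (fb - fa)   # secant proposal
        if not (min(a, b) < x < max(a, b)):
            x = 0.5 * (a + b)              # safeguard: bisect instead
        fx = f(x)
        if fx == 0.0 or abs(b - a) < tol:
            return x
        # Shrink to the subinterval that still contains a sign change.
        if fa * fx < 0:
            b, fb = x, fx
        else:
            a, fa = x, fx
    return x

r = hybrid_root(lambda x: x**3 - 2.0, 1.0, 2.0)
print(r)   # close to 2**(1/3) ~= 1.2599
```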
If you can decisively say that you are close to a root, then you can approximate the error as $x - x_{n} \approx -\frac{f(x_n)}{f'(x_n)}$, which is just the Newton step itself. This estimate is, however, worthless unless you are close to a root, so look for a bracket.
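As a quick check of that estimate (on a test function of my own choosing):

```python
# Sketch (assumed test function): near the root sqrt(2) of f(x) = x^2 - 2,
# the Newton step -f(x_n)/f'(x_n) approximates the true error x - x_n.
import math

f = lambda x: x * x - 2.0
df = lambda x: 2.0 * x

x_n = 1.5                          # already close to the root sqrt(2)
estimate = -f(x_n) / df(x_n)       # ~ -0.083
true_err = math.sqrt(2.0) - x_n    # ~ -0.086

print(estimate, true_err)
```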
You could also call that geometric convergence. The "linear" refers to the fact that the reference error on the right-hand side appears only in the first power. With quadratic convergence you would get $$ d(x^{i+1},x^*) \le \lambda\, d(x^i,x^*)^2 \implies d(x^i,x^*) \le \lambda^{2^i-1}\, d(x^0,x^*)^{2^i}, $$ which converges to zero if $\lambda\, d(x^0,x^*) < 1$. This extends to similar formulas for higher-order convergence.
Be careful about where you start your sequence: the linear convergence formula should be $$ d(x^i,x^*) \le \lambda^{i-1}\, d(x^1,x^*), $$ as you can check by comparing with the initial condition at $i=1$.
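These rates can also be observed numerically. Below is a sketch (the iterations are examples of my own choosing, both converging to $x^* = \sqrt{2}$) where the ratio $e_{i+1}/e_i$ settles to a constant $\lambda < 1$ for a linearly convergent fixed-point iteration, while $e_{i+1}/e_i^2$ stays bounded for quadratically convergent Newton steps:

```python
# Sketch (my own example iterations): measuring linear vs. quadratic
# convergence toward x* = sqrt(2).
import math

root = math.sqrt(2.0)

def errors(g, x0, n):
    """Errors |x_i - x*| along the fixed-point iteration x <- g(x)."""
    out, x = [], x0
    for _ in range(n):
        x = g(x)
        out.append(abs(x - root))
    return out

lin = errors(lambda x: 1.0 + 1.0 / (1.0 + x), 1.0, 6)  # |g'(x*)| ~ 0.17
quad = errors(lambda x: x / 2.0 + 1.0 / x, 1.0, 4)     # Newton for x^2 = 2

lin_ratios = [e2 / e1 for e1, e2 in zip(lin, lin[1:])]        # ~0.17 each
quad_ratios = [e2 / e1**2 for e1, e2 in zip(quad, quad[1:])]  # bounded, ~0.35
print(lin_ratios)
print(quad_ratios)
```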