A good way to motivate the use of Krylov subspace methods is the following
idea, which is closely related to a derivation of the conjugate gradient
method.
$\newcommand{\dotp}[2]{\langle #1,#2 \rangle}$
Assume we have an initial guess $x_0$ for our solution $x$ of the linear
system
$$ A x = b \,, $$
where $A$ is a symmetric, positive definite matrix.
Let us denote the error by $e_0$;
$$ e_0 := x - x_0 \,. $$
Furthermore, let the residual be
$$ r_0 := b - A x_0 \,. $$
We now want to make a correction step of the form
$$
x_1 = x_0 + p \,,
$$
to improve our approximation, where $p$ is some vector that we still need to
determine. The error after the correction $e_1$ is
$$
e_1 = x - x_1 = x - (x_0 + p) = e_0 - p
\,.
$$
We take a look at the $A$-norm of the error; we have that
\begin{align*}
\| e_0 - p \|_A^2
&= \dotp{A e_0 - A p}{e_0 - p} \\
&= \dotp{A e_0}{e_0}
- \dotp{A e_0}{p}
- \dotp{A p}{e_0}
+ \dotp{A p}{p} \\
&= \dotp{A e_0}{e_0}
- 2 \dotp{Ae_0}{p}
+ \dotp{Ap}{p} \\
&= c - 2\dotp{r_0}{p} + \dotp{Ap}{p} \,,
\end{align*}
where we have used the symmetry of $A$, that
$A e_0 = A(x - x_0) = b - A x_0 = r_0$, and that
$c := \dotp{A e_0}{e_0}$ is a constant independent of $p$.
Let us define the function $\phi$ by setting
$$
\phi(p) = \dotp{Ap}{p} - 2 \dotp{r_0}{p} + c
\,.
$$
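As a quick sanity check of this identity, here is a small numerical experiment; this is only an illustrative sketch, assuming NumPy and a randomly generated symmetric positive definite matrix, and all names in it are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)          # symmetric positive definite
x = rng.standard_normal(n)           # "true" solution
b = A @ x
x0 = rng.standard_normal(n)          # initial guess

e0 = x - x0                          # error
r0 = b - A @ x0                      # residual
c = e0 @ A @ e0                      # the constant <A e0, e0>

def phi(p):
    return p @ A @ p - 2 * r0 @ p + c

p = rng.standard_normal(n)           # an arbitrary correction
print(np.isclose(phi(p), (e0 - p) @ A @ (e0 - p)))  # True: phi(p) = ||e0 - p||_A^2
```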
The function $\phi$ is just the squared $A$-norm of the error $e_1$, viewed
as a function of the correction $p$. Hence, we want to choose $p$ such that
$\phi(p)$ is small. For this purpose, we compute the gradient of $\phi$.
A small computation which uses the symmetry of $A$ gives that
$$
\nabla \phi(p) = 2(Ap - r_0)
\,.
$$
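A quick way to convince yourself of this formula is to compare it with central finite differences. The following is a rough numerical sketch under the same kind of random SPD setup as before, not part of the derivation.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)          # symmetric positive definite
b = rng.standard_normal(n)
x0 = rng.standard_normal(n)
r0 = b - A @ x0

def phi(p):                          # constant c dropped: it does not affect the gradient
    return p @ A @ p - 2 * r0 @ p

p = rng.standard_normal(n)
grad = 2 * (A @ p - r0)              # the claimed gradient

h = 1e-6                             # central finite differences along each coordinate
fd = np.array([(phi(p + h * e) - phi(p - h * e)) / (2 * h) for e in np.eye(n)])
print(np.allclose(grad, fd, atol=1e-4))  # True
```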
Recall that the negative of the gradient always points in the direction in
which the function decreases the most. At position $x_0$, where $p=0$, we have
that
$$ \nabla \phi(0) = -2r_0 \,. $$
Thus, the function $\phi$ decreases the most in the direction of the residual
$r_0$. This direction is also called the direction of steepest descent.
For this reason, we choose $p = \alpha_0 r_0$ for some scalar $\alpha_0$.
In the next step, we again choose a correction vector and add it to the
current iterate. By the same argument, we choose $\alpha_1 r_1$ for some
scalar $\alpha_1$. We continue in this manner.
Thus, we have that
$$
x_{k+1} = x_0 + \alpha_0 r_0 + \alpha_1 r_1 + \cdots + \alpha_k r_k
\,,
$$
in other words that
$$
x_{k+1} \in x_0 + \mbox{span}\{ r_0, r_1, \dots, r_k \}
\,.
$$
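The derivation above leaves the scalars $\alpha_k$ unspecified. One natural choice, and it is only an assumption here, is the exact line-search value $\alpha_k = \dotp{r_k}{r_k} / \dotp{A r_k}{r_k}$, which minimizes the $A$-norm of the error along $r_k$; with this choice the iteration is the classical method of steepest descent. A minimal NumPy sketch:

```python
import numpy as np

def steepest_descent(A, b, x0, tol=1e-10, maxiter=1000):
    """Solve A x = b for SPD A by steepest descent with exact line search."""
    x = x0.copy()
    r = b - A @ x                    # current residual
    for _ in range(maxiter):
        Ar = A @ r
        alpha = (r @ r) / (r @ Ar)   # exact line search along r
        x = x + alpha * r            # x_{k+1} = x_k + alpha_k r_k
        r = r - alpha * Ar           # residual recursion, derived just below
        if np.linalg.norm(r) < tol:
            break
    return x

rng = np.random.default_rng(2)
n = 5
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)          # symmetric positive definite
b = rng.standard_normal(n)
x = steepest_descent(A, b, np.zeros(n))
print(np.allclose(A @ x, b))         # True
```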
Now, using the relation that
$$
r_{k+1}
= b - A x_{k+1}
= b - A (x_k + \alpha_k r_k)
= r_k - \alpha_k A r_k \,, $$
we see by induction that
$$
\mbox{span}\{ r_0, r_1, \dots, r_k \}
= \mbox{span}\{ r_0, A r_0, \dots, A^{k} r_0 \}
\,,
$$
provided that all step sizes $\alpha_j$ are nonzero.
Hence, a Krylov subspace is the space spanned by the successive directions
of steepest descent.
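Numerically, one can illustrate this span identity by running a few steps of the steepest descent sketch from above and checking that each residual $r_j$ is, up to rounding, a linear combination of $r_0, A r_0, \dots, A^j r_0$; again this is only an illustrative sketch assuming NumPy:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 6
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)          # symmetric positive definite
b = rng.standard_normal(n)

x = np.zeros(n)
r = b - A @ x
residuals = [r.copy()]
for _ in range(3):                   # a few steepest descent steps
    Ar = A @ r
    alpha = (r @ r) / (r @ Ar)
    x = x + alpha * r
    r = r - alpha * Ar
    residuals.append(r.copy())

# Krylov basis vectors r0, A r0, A^2 r0, ... as columns
r0 = residuals[0]
K = np.column_stack([np.linalg.matrix_power(A, j) @ r0 for j in range(len(residuals))])

for j, rj in enumerate(residuals):   # is r_j in span{r0, A r0, ..., A^j r0}?
    coef, *_ = np.linalg.lstsq(K[:, :j + 1], rj, rcond=None)
    err = np.linalg.norm(K[:, :j + 1] @ coef - rj)
    print(j, err <= 1e-8 * np.linalg.norm(rj))  # True for every j
```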
If you want to read more about a derivation of the conjugate gradient method
which uses the minimization principle described above, see
*An Introduction to the Conjugate Gradient Method Without the Agonizing Pain*
by J. R. Shewchuk.
Relatedly, a common notion of distance between two subspaces is the following: consider the orthogonal projection $P$ onto the first subspace $V$ and the orthogonal projection $Q$ onto the other subspace $W$.
We can then define $$d(V,W) = \| P - Q \|$$ as the distance between these subspaces, where the norm used is the operator norm. For properties and applications, see Section 2.5.3 of Golub and Van Loan.
This distance is used throughout their exposition of unsymmetric eigenvalue problems (which involve Krylov methods); see Chapter 7.
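A minimal NumPy sketch of this distance, assuming each subspace is given by (not necessarily orthonormal) basis vectors in the columns of a matrix; the function name is made up for the example:

```python
import numpy as np

def subspace_distance(V, W):
    """d(V, W) = ||P - Q|| in the operator (spectral) norm."""
    QV, _ = np.linalg.qr(V)          # orthonormal basis for span(V)
    QW, _ = np.linalg.qr(W)          # orthonormal basis for span(W)
    P = QV @ QV.T                    # orthogonal projector onto span(V)
    Q = QW @ QW.T                    # orthogonal projector onto span(W)
    return np.linalg.norm(P - Q, 2)  # largest singular value of P - Q

rng = np.random.default_rng(4)
V = rng.standard_normal((5, 2))
W = rng.standard_normal((5, 2))
print(subspace_distance(V, V))       # 0.0
print(subspace_distance(V, W))       # in [0, 1] for subspaces of equal dimension
```

For subspaces of equal dimension, this quantity equals the sine of the largest principal angle between them, so it always lies in $[0, 1]$.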