They're not special. They're just convenient. It's relatively easy to tell what happens to a matrix when you apply an elementary row operation to it, and this isn't quite as true for more complicated types of operations.
In the language of group theory, elementary matrices form a set of generators for the group of invertible square matrices. You could choose a different set of generators if you wanted to, but again, the elementary matrices are convenient.
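For example (a minimal sketch with a made-up $3 \times 3$ matrix), applying an elementary row operation to a matrix $M$ is the same as left-multiplying $M$ by the corresponding elementary matrix:

```python
import numpy as np

M = np.array([[1., 2., 3.],
              [4., 5., 6.],
              [7., 8., 10.]])

# Elementary matrix that swaps rows 0 and 1: the identity with those rows swapped.
E_swap = np.eye(3)
E_swap[[0, 1]] = E_swap[[1, 0]]
assert np.allclose(E_swap @ M, M[[1, 0, 2]])   # same as swapping rows 0 and 1 of M

# Elementary matrix that adds 2 * (row 0) to row 2.
E_add = np.eye(3)
E_add[2, 0] = 2.0
print(E_add @ M)   # same as adding 2 * M[0] to M[2]
```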
In optimisation you normally work in a particular space, known as your search space. A simple example of a search space is $\mathbb{R}$, the real numbers. Another example is $\mathbb{R}_{>0}$, the positive real numbers.
The main aim of conjugate gradient methods is to solve the linear system $Ax = b$, where $x$ is a vector of unknowns, $A$ is symmetric positive definite, and $b$ is also known. For large, sparse systems the inverse $A^{-1}$ is expensive to compute, so we cannot simply write $x = A^{-1}b$. Instead we can use optimisation to find the $x$ that solves this system of equations.
So we have an optimisation problem (find the best $x$) and we know optimisation problems need a search space. But the search space is a bit unusual now, since we search over an entire vector of unknowns, not something easy like the reals. More precisely, we would like to minimise the residual error $r := b - A x$.
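To make the setup concrete, here is a minimal sketch with a made-up $2 \times 2$ symmetric positive definite system:

```python
import numpy as np

# Small, made-up symmetric positive definite system.
A = np.array([[4., 1.],
              [1., 3.]])
b = np.array([1., 2.])

x_guess = np.zeros(2)        # a candidate solution from the search space
r = b - A @ x_guess          # residual error
print(np.linalg.norm(r))     # zero exactly when x_guess solves A x = b
```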
Now the Krylov subspace is defined as follows:
\begin{equation}
{\mathcal {K}}_{r}(A,b)=\operatorname {span} \,\{b,Ab,A^{2}b,\ldots ,A^{r-1}b\}.
\end{equation}
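In code, the vectors spanning $\mathcal{K}_r(A,b)$ are just repeated matrix-vector products. A small sketch, reusing the toy system from above:

```python
import numpy as np

def krylov_basis(A, b, r):
    """Return b, Ab, A^2 b, ..., A^{r-1} b as the columns of a matrix."""
    vecs = [b]
    for _ in range(r - 1):
        vecs.append(A @ vecs[-1])
    return np.column_stack(vecs)

A = np.array([[4., 1.], [1., 3.]])
b = np.array([1., 2.])
print(krylov_basis(A, b, 2))   # columns span K_2(A, b)
```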
In the conjugate gradient optimisation scheme, the algorithm repeatedly multiplies by $A$, so the quantities it builds up look like $A^{r-1}b$.
Therefore we just say that the conjugate gradient solution involves spanning the Krylov subspace. It is more of a technical note, since, remember, optimisation needs (i) an objective and (ii) a space in which the solution must exist.
It becomes more than a technical note if you want to do research, in which case it is important to know the properties of the Krylov subspace! It can help in modelling constraints, understanding convergence rates, understanding convergence behaviour, etc.
It is similar to understanding Galerkin methods for Finite Element Analysis: sure, theoretically you need to know Galerkin theory for FEA, but if you are a practising engineer this knowledge is just a complication. If you are a researcher, however, it is important to go through the derivation and understand the inner workings of weak formulations, the functions they permit, etc.
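For the practitioner's view, here is a sketch of what calling an off-the-shelf conjugate gradient solver looks like, using SciPy's `cg` on the small dense example from before (in real applications $A$ would be large and sparse):

```python
import numpy as np
from scipy.sparse.linalg import cg

A = np.array([[4., 1.], [1., 3.]])
b = np.array([1., 2.])

# cg only ever touches A through matrix-vector products A @ v,
# which is why its iterates live in a Krylov subspace.
x, info = cg(A, b)
print(x, info)     # info == 0 signals convergence
print(b - A @ x)   # residual, numerically zero at the solution
```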
This answer goes deeper into the mathematical intuition behind what I just discussed:
https://math.stackexchange.com/a/2714985/580635
A good way to motivate the use of Krylov subspace methods is the following idea, which is closely related to a derivation of the conjugate gradient method. $\newcommand{\dotp}[2]{\langle #1,#2 \rangle}$
Assume we have an initial guess $x_0$ for our solution $x$ of the linear system $$ A x = b \,, $$ where $A$ is a symmetric, positive definite matrix. Let us denote the error by $e_0$: $$ e_0 := x - x_0 \,. $$ Furthermore, let the residual be $$ r_0 := b - A x_0 \,. $$
We now want to make a correction step of the form $$ x_1 = x_0 + p \,, $$ to improve our approximation, where $p$ is some vector that we still need to determine. The error after the correction $e_1$ is $$ e_1 = x - x_1 = x - (x_0 + p) = e_0 - p \,. $$ We take a look at the $A$-norm of the error; we have that \begin{align*} \| e_0 - p \|_A^2 &= \dotp{A e_0 - A p}{e_0 - p} \\ &= \dotp{A e_0}{e_0} - \dotp{A e_0}{p} - \dotp{A p}{e_0} + \dotp{A p}{p} \\ &= \dotp{A e_0}{e_0} - 2 \dotp{Ae_0}{p} + \dotp{Ap}{p} \\ &= c - 2\dotp{r_0}{p} + \dotp{Ap}{p} \,, \end{align*} where we have used that $A e_0 = r_0$ and where $c$ is some constant independent of $p$. Let us define the function $\phi$ by setting $$ \phi(p) = \dotp{Ap}{p} - 2 \dotp{r_0}{p} + c \,. $$ The function $\phi$ is just the error of $e_1$ in dependence of the correction $p$ measured in the $A$-norm. Hence, we want to choose $p$ such that $\phi(p)$ is small. For this purpose, we compute the gradient of $\phi$.
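Before doing so, a quick numerical sanity check of this identity may help; the following is only a sketch with made-up values for $A$, $x$, $x_0$ and $p$:

```python
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[4., 1.], [1., 3.]])   # symmetric positive definite
x = rng.standard_normal(2)           # "true" solution
x0 = rng.standard_normal(2)          # initial guess
p = rng.standard_normal(2)           # trial correction

e0 = x - x0
r0 = A @ x - A @ x0                  # r0 = b - A x0 = A e0
c = e0 @ A @ e0                      # the constant independent of p

a_norm_sq = (e0 - p) @ A @ (e0 - p)        # || e0 - p ||_A^2
phi = p @ A @ p - 2 * (r0 @ p) + c         # phi(p)
print(np.isclose(a_norm_sq, phi))          # True
```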
A small computation which uses the symmetry of $A$, namely $\nabla_p \dotp{Ap}{p} = (A + A^{\mathsf T})p = 2Ap$ and $\nabla_p \dotp{r_0}{p} = r_0$, gives that $$ \nabla \phi(p) = 2(Ap - r_0) \,. $$ Recall that the negative of the gradient always points in the direction in which the function decreases the most. At position $x_0$, where $p=0$, we have that $$ \nabla \phi(0) = -2r_0 \,. $$ Thus, the function $\phi$ decreases the most in the direction of the residual $r_0$. This direction is also called the direction of steepest descent.
For this reason, assume that we choose $p = \alpha_0 r_0$ for some scalar $\alpha_0$. In the next step, we again choose a correction vector and add it to the current iterate; using the same argument we choose $\alpha_1 r_1$ for some scalar $\alpha_1$. We continue in this manner. Thus, we have that $$ x_{k+1} = x_0 + \alpha_0 r_0 + \alpha_1 r_1 + \cdots + \alpha_k r_k \,, $$ in other words that $$ x_{k+1} \in x_0 + \mbox{span}\{ r_0, r_1, \dots, r_k \} \,. $$
Now, using the relation $$ r_{k+1} = b - A x_{k+1} = b - A (x_k + \alpha_k r_k) = r_k - \alpha_k A r_k \,, $$ we see that $$ \mbox{span}\{ r_0, r_1, \dots, r_k \} = \mbox{span}\{ r_0, A r_0, \dots, A^{k} r_0 \} \,. $$ Hence, a Krylov subspace is the space spanned by the vectors of successive directions of steepest descent.
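Putting the last two steps together, here is a small sketch of the steepest-descent iteration. The step size $\alpha_k = \dotp{r_k}{r_k} / \dotp{A r_k}{r_k}$ used below is the standard exact line-search choice (the answer above leaves $\alpha_k$ unspecified), and the final check verifies numerically that the residuals span the same space as $r_0, A r_0, \dots$:

```python
import numpy as np

def steepest_descent(A, b, x0, k):
    """Run k steepest-descent steps on A x = b; return the iterate and all residuals."""
    x, residuals = x0.copy(), []
    for _ in range(k):
        r = b - A @ x
        residuals.append(r)
        alpha = (r @ r) / (r @ A @ r)   # exact line search along the residual direction
        x = x + alpha * r
    return x, np.column_stack(residuals)

A = np.array([[4., 1.], [1., 3.]])
b = np.array([1., 2.])
x0 = np.zeros(2)

x, R = steepest_descent(A, b, x0, 2)      # columns of R are r_0, r_1
r0 = b - A @ x0
K = np.column_stack([r0, A @ r0])         # columns are r_0, A r_0

# Equal ranks of R, K and [R K] imply the two column spaces coincide.
print(np.linalg.matrix_rank(R),
      np.linalg.matrix_rank(K),
      np.linalg.matrix_rank(np.hstack([R, K])))
```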
If you want to read more about a derivation of the conjugate gradient method which uses the above-mentioned minimization principle, see An Introduction to the Conjugate Gradient Method Without the Agonizing Pain by J. R. Shewchuk.