The point is that for a function $f : \mathbb{R} \to \mathbb{R}$, $f'(a)$ defines a linear transformation, just like $Df({\bf a})$ does for a function $f : \mathbb{R}^n \to \mathbb{R}^m$.
In single variable calculus, we are taught that the derivative of $f(x)$ at a point $x = a$ is a real number $f'(a)$ representing the slope of the tangent line to the graph of $f(x)$ at $x = a$. The equation of this tangent line is $y = f'(a)(x-a) + f(a)$; this line is the best linear approximation of $f(x)$ near $x = a$, but it is not the derivative itself.
If we do the change of variables $x^* = x - a$, $y^* = y - f(a)$, the tangent line becomes $y^* = f'(a)x^*$; this is a linear function, which is just a linear transformation $\mathbb{R} \to \mathbb{R}$, and the standard matrix of this linear transformation is the $1\times 1$ matrix $[f'(a)]$.
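To see this numerically, here is a minimal sketch in Python; the choice $f(x) = \sin x$ and $a = 1$ is purely illustrative, not anything from the question:

```python
import math

# Illustrative example: f(x) = sin(x), so f'(x) = cos(x).
f = math.sin
df = math.cos

a = 1.0  # point of tangency

def tangent_line(x):
    """Best linear approximation y = f'(a)(x - a) + f(a)."""
    return df(a) * (x - a) + f(a)

for x in [1.1, 1.01, 1.001]:
    print(x, f(x), tangent_line(x))
# The gap between f(x) and the tangent line shrinks like (x - a)^2.
```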
In higher dimensions, we start with $f : \mathbb{R}^n \to \mathbb{R}^m$, and at a point ${\bf a} \in \mathbb{R}^n$ the derivative $Df({\bf a})$ is the $m\times n$ matrix $Df({\bf a}) = \left[\frac{\partial f_i}{\partial x_j}({\bf a})\right]$, sometimes called the Jacobian of $f$ at ${\bf a}$. The best linear approximation of $f({\bf x})$ near ${\bf x} = {\bf a}$ is then ${\bf y} = Df({\bf a})({\bf x}-{\bf a}) + f({\bf a})$.
If we do the change of variables ${\bf x}^* = {\bf x} - {\bf a}$, ${\bf y}^* = {\bf y} - f({\bf a})$, the approximation becomes ${\bf y}^* = Df({\bf a}){\bf x}^*$; this is a linear transformation $\mathbb{R}^n \to \mathbb{R}^m$, and the standard matrix of this linear transformation is the $m\times n$ matrix $Df({\bf a})$.
So the derivative in single variable calculus is just a special case of the derivative in multivariable calculus; just set $m = n = 1$.
As for your question, 'why can't we use a number for the best linear approximation for a function $\mathbb{R}^n \to \mathbb{R}^m$?', note that the approximating function must be $\mathbb{R}^n \to \mathbb{R}^m$, and because it is linear, it must be of the form ${\bf y} = A{\bf x} + {\bf b}$, where $A$ is an $m \times n$ matrix and ${\bf b} \in \mathbb{R}^m$. By enforcing the condition that the linear approximation must agree with the function at ${\bf x} = {\bf a}$, we find that it must be of the form ${\bf y} = A({\bf x} - {\bf a}) + f({\bf a})$. So the only thing left to determine is the $m\times n$ matrix $A$, not a single number as in single variable calculus.
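Here is the same recipe ${\bf y} = Df({\bf a})({\bf x} - {\bf a}) + f({\bf a})$ in a short NumPy sketch; the function $f(x,y) = (xy,\ x + y^2)$ and the sample points are my own illustrative choices:

```python
import numpy as np

# Illustrative choice (not from the question): f(x, y) = (x*y, x + y^2).
def f(v):
    x, y = v
    return np.array([x * y, x + y ** 2])

def Df(v):
    """Jacobian matrix [df_i/dx_j] evaluated at v."""
    x, y = v
    return np.array([[y, x],
                     [1.0, 2.0 * y]])

a = np.array([2.0, 3.0])
x = np.array([2.05, 2.95])

# Best linear approximation: y = Df(a)(x - a) + f(a)
approx = Df(a) @ (x - a) + f(a)
print(approx)  # [ 6.05   10.75  ]
print(f(x))    # [ 6.0475 10.7525] -- agreement to first order in x - a
```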
I think this comes down mainly to a conceptual issue: imagine that you have a way of approximating the behavior of something, call it a function. Now, imagine that your approximation gets more and more accurate the closer you zoom in on the function. If you were infinitely close to the function, your approximation would become infinitely more accurate (and thus cease to be an approximation—it becomes exact).
We know the slope of a line between two points, $(x_1, y_1)$ and $(x_2, y_2)$, is $\displaystyle \frac{y_2 - y_1}{x_2 - x_1}$—which can also be looked at as the average rate of change of the function between those two points (or, the approximation of the function's rate of change between those two points). Now, the closer these two points are to one another, the more accurate your approximation will be.
Let's take two points on an arbitrary function. We'll call these points $(x, f(x))$ and $(x + \delta{x}, f(x + \delta{x}))$, where $\delta{x}$ is defined as being an infinitely small quantity. (That is, if the function is differentiable, these points are infinitely close to one another because they deviate by an infinitely small quantity).
The derivative is defined as the slope of the line "between" these two infinitely close points. That is...
$$\frac{df}{dx} = \frac{f(x + \delta{x}) - f(x)}{x +\delta{x} - x} = \frac{f(x + \delta{x}) - f(x)}{\delta{x}}$$
... or, the more usual...
$$\frac{df}{dx} = \lim_{\Delta{x} \rightarrow 0}\frac{f(x + \Delta{x}) - f(x)}{\Delta{x}}$$ (That is, as $\Delta{x}$ moves infinitely close to zero, becoming infinitely small).
Since these points are infinitely close, your approximation of the rate-of-change of the function becomes infinitely more accurate—or, in other words, it becomes exact.
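Here is a minimal sketch of this limiting process, assuming the example $f(x) = x^2$ at $x = 1$ (where the exact derivative is $2$):

```python
# Secant slopes (f(x + dx) - f(x)) / dx approaching the derivative.
# Assumed example: f(x) = x**2 at x = 1, where f'(1) = 2 exactly.
f = lambda x: x ** 2
x = 1.0

for dx in [0.1, 0.01, 0.001, 0.0001]:
    slope = (f(x + dx) - f(x)) / dx
    print(dx, slope)  # 2.1, 2.01, 2.001, 2.0001 -> converging to 2
```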
Best Answer
There’s a subtle misunderstanding in your question that might be the source of some confusion. The Jacobian matrix (i.e., the differential) isn’t the best linear approximation to a function near a point $\mathbf x_0$. It’s the best linear approximation to the change in the function’s value near the point. That is, $$\Delta f_{\mathbf x_0}=f(\mathbf x_0+\Delta\mathbf x)-f(\mathbf x_0)=\operatorname{d}f_{\mathbf x_0}[\Delta\mathbf x]+o(\|\Delta \mathbf x\|).$$ Observe that $\operatorname{d}f_{\mathbf x_0}$ operates on a displacement $\Delta\mathbf x$ (technically, on a vector in the tangent space at $\mathbf x_0$), not on an element of $f$’s domain. If $f:\mathbb R^n\to\mathbb R^m$, then $\operatorname{d}f_{\mathbf x_0}:\mathbb R^n\to\mathbb R^m$, and in matrix terms this linear transformation is represented by multiplication by the $m\times n$ matrix $\mathbf{Jac}_f(\mathbf{x_0})$, i.e., $$\operatorname{d}f_{\mathbf x_0}:\mathbf h\mapsto\mathbf{Jac}_f(\mathbf{x_0})\,\mathbf h,$$ where $\mathbf h$ is the $n\times 1$ column vector corresponding to $\Delta\mathbf x$.

One way to think of $\operatorname{d}f$ is as a rule that assigns an $m\times n$ matrix, $\mathbf{Jac}_f(\mathbf{x})$, to each point $\mathbf x$ of the domain of $f$. The matrix can vary from point to point, of course.
As a concrete example, take $f:(x,y)\mapsto x^2-y^2$ as in your comments to another answer. The Jacobian at the point $(x_0,y_0)$ is the $1\times 2$ matrix $\begin{bmatrix}2x_0&-2y_0\end{bmatrix}$, so the best linear approximation to $f$ at $(x_0,y_0)$ is $$\begin{align}f(x,y)&\approx f(x_0,y_0)+\begin{bmatrix}2x_0&-2y_0\end{bmatrix}\begin{bmatrix}x-x_0\\y-y_0\end{bmatrix} \\ &= x_0^2-y_0^2+2x_0(x-x_0)-2y_0(y-y_0).\end{align}$$ To illustrate this with some concrete values, $$f(3.2,-5.1)\approx3^2-(-5)^2+\begin{bmatrix}6&10\end{bmatrix}\begin{bmatrix}0.2&-0.1\end{bmatrix}^T=9-25+0.2=-15.8.$$ The actual value is $-15.77$.
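The arithmetic above is easy to reproduce mechanically; here is a short NumPy check using only the function and points already given:

```python
import numpy as np

f = lambda x, y: x ** 2 - y ** 2
jac = lambda x, y: np.array([2 * x, -2 * y])  # 1x2 Jacobian at (x, y)

x0, y0 = 3.0, -5.0
h = np.array([0.2, -0.1])  # the displacement (x - x0, y - y0)

approx = f(x0, y0) + jac(x0, y0) @ h
print(approx)        # -15.8
print(f(3.2, -5.1))  # -15.77 (the actual value)
```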
A somewhat more interesting example is $f:\begin{bmatrix}x\\y\\z\end{bmatrix}\mapsto\begin{bmatrix}x^2-y^2+3z\\2xy\end{bmatrix}$. The Jacobian is $\begin{bmatrix}2x&-2y&3\\2y&2x&0\end{bmatrix}$, and so $$f(x,y,z)\approx\begin{bmatrix}x_0^2-y_0^2+3z_0\\2x_0y_0\end{bmatrix}+\begin{bmatrix}2x_0&-2y_0&3\\2y_0&2x_0&0\end{bmatrix}\begin{bmatrix}x-x_0\\y-y_0\\z-z_0\end{bmatrix}$$ and $$f(3.2,-5.1,1.7)\approx\begin{bmatrix}3^2-(-5)^2+3\cdot1\\2\cdot3\cdot(-5)\end{bmatrix}+\begin{bmatrix}6&10&3\\-10&6&0\end{bmatrix}\begin{bmatrix}0.2\\-0.1\\0.7\end{bmatrix}=\begin{bmatrix}-13\\-30\end{bmatrix}+\begin{bmatrix}2.3\\-2.6\end{bmatrix}=\begin{bmatrix}-10.7\\-32.6\end{bmatrix}.$$ The actual value is $\begin{bmatrix}-10.67&-32.64\end{bmatrix}^T$.
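And the same check for this vector-valued example, again just reproducing the numbers above in NumPy:

```python
import numpy as np

def f(v):
    x, y, z = v
    return np.array([x ** 2 - y ** 2 + 3 * z, 2 * x * y])

def jac(v):
    """2x3 Jacobian of f at v."""
    x, y, z = v
    return np.array([[2 * x, -2 * y, 3.0],
                     [2 * y,  2 * x, 0.0]])

a = np.array([3.0, -5.0, 1.0])
p = np.array([3.2, -5.1, 1.7])

approx = f(a) + jac(a) @ (p - a)
print(approx)  # [-10.7 -32.6]
print(f(p))    # [-10.67 -32.64] (the actual value)
```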