Let $X,Y$ and $Z$ be vector spaces, and $\mathcal{L}(A,B)$ the space of all linear maps from $A$ to $B$.
As noted above, if $F \in \mathcal{L}(X,\mathcal{L}(Y,Z))$, then we can form another map $F_{uncurry}:X \times Y \to Z$ defined by $F_{uncurry}(x,y) = F(x)(y)$. Notice that $F_{uncurry}$ is a bilinear mapping: fixing $x$, the map $F_{uncurry}(x,\cdot)$ is linear in the second slot, and fixing $y$, the map $F_{uncurry}(\cdot,y)$ is linear in the first slot. Conversely, given a bilinear mapping $G: X \times Y \to Z$, I can produce an element $G_{curry} \in \mathcal{L}(X,\mathcal{L}(Y,Z))$ in the way you would expect: $G_{curry}(x)(y) = G(x,y)$. The keyword here is "currying".
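To make the correspondence concrete, here is a minimal Python sketch (the names `curry` and `uncurry` are my own for illustration, not standard library functions):

```python
def uncurry(F):
    """Turn F: X -> (Y -> Z) into a two-argument map X x Y -> Z."""
    return lambda x, y: F(x)(y)

def curry(G):
    """Turn a two-argument map G: X x Y -> Z into X -> (Y -> Z)."""
    return lambda x: lambda y: G(x, y)

# A bilinear map on R x R: G(x, y) = 3*x*y.
G = lambda x, y: 3 * x * y
F = curry(G)            # F(x) is the linear map y -> 3*x*y
assert uncurry(F)(2, 5) == G(2, 5) == 30
```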
So $\mathcal{L}(X,\mathcal{L}(Y,Z))$ can be canonically identified with the space of bilinear mappings from $X \times Y$ to $Z$. These in turn can be identified with the linear mappings from $X \otimes Y$ to $Z$, where $X \otimes Y$ is the so-called "tensor product" of $X$ and $Y$, but I will not go into that.
You might be curious how you could work with such an object. What data do you need to write it down? For a linear map, you only have to specify its action on a basis, but a bilinear map is not a linear map. It turns out (you should check this) that specifying the action on all pairs of basis vectors is enough.
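For instance, a bilinear form on $\mathbb{R}^2 \times \mathbb{R}^2$ is pinned down by the four numbers $B(e_i,e_j)$. A small sketch, with made-up values:

```python
import numpy as np

# The values B(e_i, e_j) on all pairs of basis vectors (made-up numbers).
B = np.array([[1.0, 2.0],
              [3.0, 4.0]])

def bilinear(u, v):
    """Extend the basis values bilinearly: B(u, v) = sum_ij u_i B_ij v_j."""
    return u @ B @ v

u = np.array([1.0, 1.0])
v = np.array([2.0, 0.0])
print(bilinear(u, v))  # u^T B v = 8.0
```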
Let's get back down to earth and examine a very special case. Let $f:\mathbb{R}^2 \to \mathbb{R}$ be defined by $f(x,y) = x^2y$.
$D(f)\big|_{(x,y)}$ is the linear map given by the matrix $\left[ \begin{matrix} 2xy&x^2\end{matrix} \right]$. That is to say, $D(f)\big|_{(x,y)}(\Delta x,\Delta y) = 2xy\Delta x + x^2\Delta y \approx f(x+\Delta x,y+\Delta y) - f(x,y)$. Notice that the transpose of this matrix is the "gradient" of $f$. Only functions from $\mathbb{R}^n$ to $\mathbb{R}$ have gradients.
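As a quick numerical sanity check of this approximation (my own illustration, not part of the original argument):

```python
f = lambda x, y: x**2 * y

x, y = 2.0, 3.0
dx, dy = 0.01, 0.01

# The derivative row vector [2xy, x^2] applied to the increment:
linear_estimate = 2*x*y*dx + x**2 * dy
actual_change = f(x + dx, y + dy) - f(x, y)
print(linear_estimate, actual_change)  # 0.16 vs. 0.160701...
```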
The second derivative should now tell you how much the derivative changes from point to point. If we increment $(x,y)$ by a little bit to $(x+\Delta x,y)$ then we should expect the derivative to increase by about $\left[ \begin{matrix} 2y\Delta x&2x \Delta x\end{matrix} \right]$. Similarly, when we increase $y$ by $\Delta y$, the derivative should change by about $\left[ \begin{matrix} 2x \Delta y&0\Delta y\end{matrix} \right]$.
By linearity, if we change from $(x,y)$ to $(x+\Delta x,y+\Delta y)$, we expect the derivative to change by $$\left[ \begin{matrix} \Delta x&\Delta y\end{matrix} \right] \left[ \begin{matrix} 2y&2x\\2x&0\end{matrix} \right]$$
This gives a matrix which is the approximate change in the derivative. You can then apply this to another vector if you so wish.
Summing it up, if you wanted to see approximately how much the derivative changes from $(x,y)$ to $(x+\Delta x_2,y+\Delta y_2)$ when both are evaluated in the same direction $(\Delta x_1,\Delta y_1)$, you would perform the computation:
$$\left[ \begin{matrix} \Delta x_2&\Delta y_2\end{matrix} \right] \left[ \begin{matrix} 2y&2x\\2x&0\end{matrix} \right] \left[ \begin{matrix} \Delta x_1\\\Delta y_1\end{matrix} \right]$$
The matrix of second partials derived above is called the Hessian, but it is a bit misleading to write it as a matrix, since it is really acting as a bilinear form in the manner shown above, i.e. $H(v_1,v_2) = v_1^T H v_2$. You may remember seeing the Hessian arise in multivariable calculus when classifying critical points as maxima, minima, or saddles. In general, the signs of the eigenvalues of the Hessian matrix tell the whole story (although, if there are some zero eigenvalues, you might have to climb up the derivative ladder to trilinear forms, etc.).
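Here is a small numerical sketch of the Hessian acting as a bilinear form, for the same $f(x,y) = x^2y$ (again my own check):

```python
import numpy as np

grad = lambda x, y: np.array([2*x*y, x**2])  # derivative of f as a row vector

x, y = 2.0, 3.0
H = np.array([[2*y, 2*x],
              [2*x, 0.0]])     # Hessian of f at (x, y)

v1 = np.array([0.01, 0.0])     # direction the derivative is evaluated in
v2 = np.array([0.0, 0.01])     # displacement of the base point

# Change in the directional derivative when the base point moves by v2:
actual = (grad(x + v2[0], y + v2[1]) - grad(x, y)) @ v1
print(v2 @ H @ v1, actual)     # both come out to 0.0004
```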
Notice that I only got a Hessian "matrix" because the codomain of $f$ was one dimensional. If it had been, say, $3$-dimensional, I would have needed $3$ such matrices, and they would naturally align themselves into a $2\times2\times3$ box, which would represent a higher-order tensor.
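In numpy, such a stack of Hessians is just a $2\times2\times3$ array. A sketch for a hypothetical $g:\mathbb{R}^2 \to \mathbb{R}^3$ of my own choosing:

```python
import numpy as np

# Hypothetical g: R^2 -> R^3, g(x, y) = (x^2*y, x*y, y^2).  Its second
# derivative at a point stacks one 2x2 Hessian per output component.
def second_derivative(x, y):
    H1 = np.array([[2*y, 2*x], [2*x, 0.0]])  # Hessian of x^2*y
    H2 = np.array([[0.0, 1.0], [1.0, 0.0]])  # Hessian of x*y
    H3 = np.array([[0.0, 0.0], [0.0, 2.0]])  # Hessian of y^2
    return np.stack([H1, H2, H3], axis=-1)   # shape (2, 2, 3)

print(second_derivative(2.0, 3.0).shape)  # (2, 2, 3)
```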
Hopefully this gives at least a hint of how to continue. Buzzwords to look for are "multilinear algebra", "tensor products", "tensors", "tensor analysis", and "multivariable Taylor's theorem".
I do not have a super great reference for this because, even though I do analysis in Several Complex Variables, I have somehow never found a book that treats higher dimensional real analysis really well. I am sure there are books out there, but I have worked out most of this stuff on my own. As far as I know there was never a course offered on it at any university I went to! I guess people are supposed to sort of absorb this stuff when they learn differential geometry.
If we start out with $f:\mathbb{R}\to\mathbb{R}$, then $f'(c)$ is an approximation of how $f$ changes in a small interval around $x=c$. For example, let $f(x)=x^3$, and $c=2$. Then $f'(2)=12$. Notice that $f(2.01)=8.120601$. Then the change from $f(2)$ to $f(2.01)$ is $0.120601$. This is approximately $12(0.01)=0.12$.
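A quick script to reproduce this (my own check):

```python
f = lambda x: x**3
c, dx = 2.0, 0.01
fprime = 3 * c**2           # f'(2) = 12

print(f(c + dx) - f(c))     # 0.120601... (actual change)
print(fprime * dx)          # 0.12        (linear estimate)
```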
In higher dimensions, the derivative needs to be a linear transformation from $\mathbb{R}^n$ to $\mathbb{R}^m$. For example, take $f(x,y)=(x+y,y^2)$. Then the derivative is
$$
\begin{pmatrix}
1 & 1 \\ 0 & 2y
\end{pmatrix}.
$$
At $(x,y)=(2,1)$ this is
$$
\begin{pmatrix}
1 & 1 \\ 0 & 2
\end{pmatrix}.
$$
Moving on, $f(2,1)=(3,1)$ and $f(2.01,1.01)=(3.02,1.0201)$. The change between the function values is $(0.02,0.0201)$. Notice that
$$
\begin{pmatrix}
1 & 1 \\ 0 & 2
\end{pmatrix}
\begin{pmatrix}
0.01 \\ 0.01
\end{pmatrix}
=
\begin{pmatrix}
0.02 \\ 0.02
\end{pmatrix}.
$$
Again, very close. So the derivative is the linear transformation that most closely fits the change in the function near the point. Since linear transformations are much easier to study than functions in general, we may learn a lot about the function from its derivatives.
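The same check in code (a small numpy sketch of the computation above):

```python
import numpy as np

f = lambda x, y: np.array([x + y, y**2])
J = lambda x, y: np.array([[1.0, 1.0],
                           [0.0, 2*y]])  # the derivative (Jacobian) of f

x, y = 2.0, 1.0
step = np.array([0.01, 0.01])

print(f(x + step[0], y + step[1]) - f(x, y))  # [0.02, 0.0201] (actual change)
print(J(x, y) @ step)                         # [0.02, 0.02  ] (linear estimate)
```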
Best Answer
When we say "The derivative is linear", we most commonly mean something like "The derivative operator, which takes a function to its derivative, is a linear transformation on a vector space of functions". In other words, $$D(f+g)(a)(v) = Df(a)(v) + Dg(a)(v)$$ and $$D(k\cdot f)(a)(v) = k\cdot Df(a)(v)$$ for any differentiable functions $f, g: \Bbb R^m\to \Bbb R^n$, any point $a\in \Bbb R^m$ and any scalar $k\in \Bbb R$.
Yes, it's true that with respect to the point we are differentiating at, the derivative is not linear at all; in general we do not have $Df(x+y) = Df(x) + Df(y)$. I would call this something like "The derivative of $f$ is not a linear transformation".
It is also true that, given a function $f$ and a point $a$, $Df(a)$ can be seen as a linear map: the linear map which, together with adding the constant $f(a)$, most closely resembles the behaviour of $f$ in the immediate vicinity of $a$. I would call this something like "The derivative of $f$ at $a$ is a linear transformation".