It turns out that there are two different but related notions of differentiation for a function $f:\mathbb R^n\to\mathbb R$: the total derivative $df$ and the gradient $\nabla f$.
- The total derivative is a covector ("dual vector", "linear form") and does not depend on the choice of a metric ("measure of length").
- The gradient is an ordinary vector derived from the total derivative, but it depends on a metric. That is why it looks a bit funny in different coordinate systems.
The definition of the total derivative answers the following question: given a vector $\vec v$, what is the slope of the function $f$ in the direction of $\vec v$? The answer is, of course
$$ df_{x}(\vec v) = \lim_{t\to0} \frac{f(x+t\vec v)-f(x)}{t}$$
I.e. you start at the point $x$ and walk a teensy bit in the direction of $\vec v$ and take note of the ratio $\Delta f/\Delta t$.
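This difference quotient is easy to check numerically. The sketch below uses a hypothetical function $f(x,y,z)=x^2+3yz$ and an arbitrary point and direction; it is only an illustration of the definition, not anything from the original question.

```python
# Numerical check of the directional-derivative definition df_x(v).
# The function f, the point p, and the direction v are hypothetical choices.

def f(x, y, z):
    return x**2 + 3*y*z

def directional_derivative(f, p, v, t=1e-6):
    """Approximate df_p(v) via the difference quotient (f(p + t*v) - f(p)) / t."""
    shifted = [pi + t*vi for pi, vi in zip(p, v)]
    return (f(*shifted) - f(*p)) / t

p = (1.0, 2.0, 1.0)
v = (1.0, 0.0, 2.0)

# Analytic value: df = 2x dx + 3z dy + 3y dz, so
# df_p(v) = 2*1*1 + 3*1*0 + 3*2*2 = 14
print(directional_derivative(f, p, v))  # ≈ 14
```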
Note that the total derivative is a linear map $\mathbb R^n \to \mathbb R$, not a vector in $\mathbb R^n$. Given a vector, it tells you some number. In coordinates, this is usually written as
$$ df = \frac{\partial f}{\partial x}dx + \frac{\partial f}{\partial y}dy + \frac{\partial f}{\partial z}dz $$
where $dx,dy,dz$ are the total derivatives of the coordinate functions, for instance $dx(v_x,v_y,v_z) := v_x$. This formula looks the same in any coordinate system.
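As a concrete (hypothetical) example, take $f(x,y,z) = x^2 + 3yz$. Then
$$ df = 2x\,dx + 3z\,dy + 3y\,dz, $$
and since $dx(\vec v)=v_x$, $dy(\vec v)=v_y$, $dz(\vec v)=v_z$, feeding a vector into this linear map gives
$$ df_{(x,y,z)}(\vec v) = 2x\,v_x + 3z\,v_y + 3y\,v_z. $$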
In contrast, the gradient answers the following question: what is the direction of the steepest ascent of the function? Which vector $\vec v$ of unit length maximizes $df(\vec v)$? As you can see, this definition crucially depends on the fact that you can measure the length of a vector. The gradient is then defined as
$$ \nabla f = df(\vec v_{max})\cdot\vec v_{max} $$
i.e. it gives both the direction and the magnitude of the steepest change.
This can also be expressed as
$$ \langle \nabla f, \vec v \rangle = df(\vec v) \quad\forall \vec v\in\mathbb R^n.$$
In other words, the scalar product $\langle\cdot,\cdot\rangle$ is used to convert the covector $df$ into the vector $\nabla f$. This also means that the formula for the gradient looks very different in coordinate systems other than cartesian. If the scalar product is changed (say, to $\langle\vec a,\vec b\rangle := a_xb_x + a_yb_y + 4a_zb_z$), then the direction of steepest ascent also changes. (Exercise: Why?)
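The metric-dependence can be made concrete. If $\langle\vec a,\vec b\rangle = \vec a^{\,T} M \vec b$ for a symmetric positive-definite matrix $M$, then the defining relation $\langle\nabla f,\vec v\rangle = df(\vec v)$ becomes $M\nabla f = g$, where $g$ is the column of partial derivatives. A minimal sketch, with an illustrative (hypothetical) choice of $g$:

```python
# Converting the covector df into the gradient with respect to a metric M.
# The relation <grad f, v> = df(v) with <a,b> = a^T M b gives M * grad = g,
# where g holds the partial derivatives. The value of g below is hypothetical.
import numpy as np

g = np.array([1.0, 0.0, 2.0])         # partials of some f at some point

M_euclid = np.eye(3)                  # standard scalar product
M_scaled = np.diag([1.0, 1.0, 4.0])   # <a,b> = a_x b_x + a_y b_y + 4 a_z b_z

grad_euclid = np.linalg.solve(M_euclid, g)   # = (1, 0, 2)
grad_scaled = np.linalg.solve(M_scaled, g)   # = (1, 0, 0.5)

# The direction of steepest ascent changes with the metric:
print(grad_euclid / np.linalg.norm(grad_euclid))
print(grad_scaled / np.linalg.norm(grad_scaled))
```

Note that the covector $df$ (the partials $g$) never changed; only its conversion into a vector did.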
The line "Then $D[\nabla f(x)]$ must be a $1\times n$ matrix" is where your confusion lies.
The derivative operator $D$ applied to a vector-valued function tells us how each component changes in each direction. Being more explicit with the notation, we have
$$\begin{align}\nabla f(\mathbf x) &= D[f (\mathbf x)]\\
&= \left(\frac{\partial f}{\partial x_1}, \ldots, \frac{\partial f}{\partial x_n}\right)\end{align}$$
Now think of applying $D$ to each element of this vector individually;
$$\begin{align}D[\nabla f(\mathbf x)] &= D[D[f(\mathbf x)]]\\
&=\left(D\left[\frac{\partial f}{\partial x_1}\right]^T, \ldots, D\left[\frac{\partial f}{\partial x_n}\right]^T\right)\end{align}$$
This expands to give us the Hessian matrix
$$D^2[f(\mathbf x)]=\left(\begin{matrix}\frac{\partial^2 f}{\partial x_1^2} & \ldots & \frac{\partial^2 f}{\partial x_1\partial x_n}\\
\vdots & \ddots & \vdots \\
\frac{\partial^2 f}{\partial x_n\partial x_1}& \ldots & \frac{\partial^2 f}{\partial x_n^2}\end{matrix}\right)$$
which is indeed $n\times n$.
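The $n\times n$ shape is easy to verify numerically. Below is a sketch that approximates $D^2[f(\mathbf x)]$ by central differences for a hypothetical test function $f(x_1,x_2) = x_1^2 + x_1 x_2$, whose Hessian is $\left(\begin{smallmatrix}2&1\\1&0\end{smallmatrix}\right)$:

```python
# Finite-difference approximation of the Hessian D^2[f(x)].
# The test function f is a hypothetical choice for illustration.
import numpy as np

def f(x):
    return x[0]**2 + x[0]*x[1]

def hessian(f, x, h=1e-5):
    """Approximate the n x n Hessian at x by central differences."""
    n = len(x)
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            e_i = np.zeros(n); e_i[i] = h
            e_j = np.zeros(n); e_j[j] = h
            H[i, j] = (f(x + e_i + e_j) - f(x + e_i - e_j)
                       - f(x - e_i + e_j) + f(x - e_i - e_j)) / (4*h*h)
    return H

x0 = np.array([1.0, 2.0])
print(hessian(f, x0))  # ≈ [[2, 1], [1, 0]], an n x n (here 2 x 2) matrix
```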
The matrix derivative is just a particular case of the Fréchet derivative between two Banach spaces, which, by the way, is very similar in terms of its definition to the derivative of a function $f : \mathbb R^n \to \mathbb R$ provided in the question.
Applied to matrix derivatives, you just have to consider a map $f : V \to M$ where $M$ is a linear space of matrices endowed with the norm of your choice and $V$ is a Banach space that may or may not be finite-dimensional.
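As a standard worked example (not from the original question), take $f(X) = X^2$ on a space of square matrices. Expanding,
$$ f(X+H) = (X+H)^2 = X^2 + XH + HX + H^2, $$
and since $\|H^2\| \le \|H\|^2 = o(\|H\|)$, the Fréchet derivative is the linear map
$$ Df(X)[H] = XH + HX. $$
Note that $XH + HX \ne 2XH$ in general, because matrices do not commute; this is exactly the kind of phenomenon the Fréchet framework handles cleanly.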