It turns out that there are two different but related notions of differentiation for a function $f:\mathbb R^n\to\mathbb R$: the total derivative $df$ and the gradient $\nabla f$.
- The total derivative is a covector ("dual vector", "linear form") and does not depend on the choice of a metric ("measure of length").
- The gradient is an ordinary vector derived from the total derivative, but it depends on a metric. That is why it looks a bit funny in different coordinate systems.
The definition of the total derivative answers the following question: given a vector $\vec v$, what is the slope of the function $f$ in the direction of $\vec v$? The answer is, of course
$$ df_{x}(\vec v) = \lim_{t\to0} \frac{f(x+t\vec v)-f(x)}{t}$$
I.e. you start at the point $x$ and walk a teensy bit in the direction of $\vec v$ and take note of the ratio $\Delta f/\Delta t$.
Note that the total derivative is a linear map $\mathbb R^n \to \mathbb R$, not a vector in $\mathbb R^n$. Given a vector, it tells you some number. In coordinates, this is usually written as
$$ df = \frac{\partial f}{\partial x}dx + \frac{\partial f}{\partial y}dy + \frac{\partial f}{\partial z}dz $$
where $dx,dy,dz$ are the total derivatives of the coordinate functions, for instance $dx(v_x,v_y,v_z) := v_x$. This formula looks the same in any coordinate system.
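As a quick sanity check, the limit definition and the coordinate formula can be compared numerically. The sketch below is only an illustration: the function `f`, the point, the direction and the step size `t` are all made-up assumptions, and the limit is approximated by a small but finite `t`.

```python
def f(x, y, z):
    # Hypothetical example function, chosen only for illustration.
    return x * x * y + 3.0 * z


def slope_along(func, point, v, t=1e-6):
    """Approximate df_x(v) by the difference quotient (f(x + t v) - f(x)) / t."""
    shifted = [p + t * c for p, c in zip(point, v)]
    return (func(*shifted) - func(*point)) / t


def df_in_coordinates(point, v):
    """Evaluate df(v) = f_x dx(v) + f_y dy(v) + f_z dz(v) for the f above."""
    x, y, z = point
    partials = (2.0 * x * y, x * x, 3.0)  # hand-computed partial derivatives of f
    # dx, dy, dz simply pick out the components of v.
    return sum(p * c for p, c in zip(partials, v))


point = (1.0, 2.0, 0.5)
v = (0.3, -1.0, 2.0)

numeric = slope_along(f, point, v)
exact = df_in_coordinates(point, v)
print(numeric, exact)  # the two values should agree to several digits
```

Note that the input is a vector and the output is a single number, which is exactly what "covector" means here.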
In contrast, the gradient answers the following question: what is the direction of the steepest ascent of the function? Which vector $\vec v$ of unit length maximizes $df(\vec v)$? As you can see, this definition crucially depends on the fact that you can measure the length of a vector. The gradient is then defined as
$$ \nabla f = df(\vec v_{max})\cdot\vec v_{max} $$
i.e. it gives both the direction and the magnitude of the steepest change.
This can also be expressed as
$$ \langle \nabla f, \vec v \rangle = df(\vec v) \quad\forall \vec v\in\mathbb R^n.$$
In other words, the scalar product $\langle,\rangle$ is used to convert a covector $df$ into a vector $\nabla f$. This also means that the formula for the gradient looks very different in coordinate systems other than Cartesian. If the scalar product is changed (say, to $\langle\vec a,\vec b\rangle := a_xb_x + a_yb_y + 4a_zb_z$), then the direction of steepest ascent also changes. (Exercise: Why?)
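To see the dependence on the scalar product concretely, here is a small sketch that solves $\langle \nabla f, \vec v \rangle = df(\vec v)$ for the modified scalar product above. The function $f(x,y,z) = x + y + z$ and all helper names are assumptions made for this illustration; since the scalar product is diagonal, solving for the gradient amounts to dividing each partial derivative by its weight.

```python
import math

STANDARD = (1.0, 1.0, 1.0)
MODIFIED = (1.0, 1.0, 4.0)  # weights of <a, b> = a_x b_x + a_y b_y + 4 a_z b_z


def inner(a, b, weights):
    """Diagonal scalar product with the given weights."""
    return sum(w * ai * bi for w, ai, bi in zip(weights, a, b))


def gradient(partials, weights):
    """Vector g with <g, v> = df(v) for all v (diagonal scalar product only)."""
    return tuple(p / w for p, w in zip(partials, weights))


def df(partials, v):
    """The covector df applied to v; no scalar product is involved."""
    return sum(p * c for p, c in zip(partials, v))


# Hypothetical example: f(x, y, z) = x + y + z, so all partials equal 1.
partials = (1.0, 1.0, 1.0)
grad_std = gradient(partials, STANDARD)  # (1.0, 1.0, 1.0)
grad_mod = gradient(partials, MODIFIED)  # (1.0, 1.0, 0.25)

# The gradient has length df(v_max), measured with the same scalar product.
length_mod = math.sqrt(inner(grad_mod, grad_mod, MODIFIED))
v_max = tuple(c / length_mod for c in grad_mod)
print(grad_std, grad_mod, df(partials, v_max), length_mod)
```

The covector `partials` is the same in both cases; only the vector it is converted into changes, and with it the direction of steepest ascent.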
If $\vec{g}:\mathbb R^n\to\mathbb R^m$ has component functions $\vec{g}=\left[\begin{array}{c}f_1\\\vdots\\f_m\end{array}\right]$, then the derivative of $\vec{g}$ is the matrix
$$J\vec{g}=\left[\begin{array}{c}\nabla f_1\\\vdots\\\nabla f_m\end{array}\right],$$
which is an $m\times n$ rectangular array.
In components, you would see it as
$$J\vec{g}=\left[\dfrac{\partial f_i}{\partial x_j}\right],$$
where $i$ is for rows and $j$ is for columns, and where $x_1,...,x_n$ are the standard coordinate functions of $\Bbb R^n$.
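The row and column convention can be checked with a small numerical sketch. The map `g` from $\Bbb R^2$ to $\Bbb R^3$ below is a made-up example, and the partial derivatives are approximated by central differences rather than computed exactly, so treat it as an illustration only.

```python
def g(x):
    # Hypothetical map R^2 -> R^3 with components f_1, f_2, f_3.
    x1, x2 = x
    return [x1 * x2, x1 + 2.0 * x2, x1 * x1]


def jacobian(func, x, h=1e-6):
    """Central-difference approximation of J[i][j] = d f_i / d x_j."""
    m = len(func(x))  # number of component functions (rows)
    n = len(x)        # number of variables (columns)
    J = [[0.0] * n for _ in range(m)]
    for j in range(n):
        plus = list(x)
        minus = list(x)
        plus[j] += h
        minus[j] -= h
        f_plus, f_minus = func(plus), func(minus)
        for i in range(m):
            J[i][j] = (f_plus[i] - f_minus[i]) / (2.0 * h)
    return J


J = jacobian(g, [2.0, 3.0])
for row in J:
    print(row)  # row i approximates the partials of f_i
```

For this `g` the exact Jacobian is $\left[\begin{array}{cc}x_2 & x_1\\ 1 & 2\\ 2x_1 & 0\end{array}\right]$, a $3\times 2$ array, and each row holds the partial derivatives of one component function.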
Here $\psi=\phi\circ f$, where $f:\mathbb R^n\to\mathbb R$ and $\phi:\mathbb R\to\mathbb R$ are differentiable. Note that $d\phi(y)$ is just multiplication by the scalar $\phi'(y)$. The chain rule $$d\psi(x)=d\phi\bigl(f(x)\bigr)\circ df(x)$$ therefore implies $$\nabla\psi(x)\cdot X=d\psi(x).X=d\phi\bigl(f(x)\bigr).\bigl(df(x).X\bigr) =\phi'\bigl(f(x)\bigr) \bigl(\nabla f(x)\cdot X\bigr)\ .$$ Since this is true for all $X\in{\mathbb R}^n$, it follows that $$\nabla\psi(x)=\phi'\bigl(f(x)\bigr)\nabla f(x)\ .$$
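The resulting formula is easy to test numerically. In the sketch below, the choices $\phi=\sin$ and $f(x)=x_1^2+3x_2$ are assumptions made purely for illustration, and the left-hand side is approximated by central differences with the standard scalar product.

```python
import math


def f(x):
    # Hypothetical scalar function on R^2.
    return x[0] ** 2 + 3.0 * x[1]


def grad_f(x):
    # Hand-computed gradient of f (standard scalar product).
    return [2.0 * x[0], 3.0]


phi = math.sin        # hypothetical outer function
phi_prime = math.cos  # its derivative


def psi(x):
    return phi(f(x))


def numeric_gradient(func, x, h=1e-6):
    """Central-difference approximation of the gradient of func at x."""
    grad = []
    for j in range(len(x)):
        plus = list(x)
        minus = list(x)
        plus[j] += h
        minus[j] -= h
        grad.append((func(plus) - func(minus)) / (2.0 * h))
    return grad


x = [0.5, -0.2]
by_chain_rule = [phi_prime(f(x)) * c for c in grad_f(x)]
by_differences = numeric_gradient(psi, x)
print(by_chain_rule, by_differences)  # should agree to several digits
```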