It turns out that there are two different but related notions of differentiation for a function $f:\mathbb R^n\to\mathbb R$: the total derivative $df$ and the gradient $\nabla f$.
- The total derivative is a covector ("dual vector", "linear form") and does not depend on the choice of a metric ("measure of length").
- The gradient is an ordinary vector derived from the total derivative, but it depends on a metric. That is why it looks a bit funny in different coordinate systems.
The definition of the total derivative answers the following question: given a vector $\vec v$, what is the slope of the function $f$ in the direction of $\vec v$? The answer is, of course
$$ df_{x}(\vec v) = \lim_{t\to0} \frac{f(x+t\vec v)-f(x)}{t}$$
I.e. you start at the point $x$ and walk a teensy bit in the direction of $\vec v$ and take note of the ratio $\Delta f/\Delta t$.
Note that the total derivative is a linear map $\mathbb R^n \to \mathbb R$, not a vector in $\mathbb R^n$. Given a vector, it tells you some number. In coordinates, this is usually written as
$$ df = \frac{\partial f}{\partial x}dx + \frac{\partial f}{\partial y}dy + \frac{\partial f}{\partial z}dz $$
where $dx,dy,dz$ are the total derivatives of the coordinate functions, for instance $dx(v_x,v_y,v_z) := v_x$. This formula looks the same in any coordinate system.
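To make the definition concrete, here is a minimal numerical sketch. The function `f`, the point, and the vector are hypothetical choices made only for illustration, and the limit is approximated by a central difference. It compares the slope $df_x(\vec v)$ computed directly from the limit with the coordinate formula $\frac{\partial f}{\partial x}v_x + \frac{\partial f}{\partial y}v_y + \frac{\partial f}{\partial z}v_z$.

```python
import math

def f(x, y, z):
    # Hypothetical example function, chosen only for illustration.
    return x * x * y + math.sin(z)

def directional_derivative(func, point, v, t=1e-6):
    # Central-difference approximation of the limit that defines df_x(v).
    forward = [p + t * vi for p, vi in zip(point, v)]
    backward = [p - t * vi for p, vi in zip(point, v)]
    return (func(*forward) - func(*backward)) / (2 * t)

def partials(func, point, h=1e-6):
    # The partial derivatives are df applied to the coordinate basis vectors.
    n = len(point)
    basis = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    return [directional_derivative(func, point, e, h) for e in basis]

point = (1.0, 2.0, 0.5)
v = (0.3, -1.0, 2.0)

# Slope of f at `point` in the direction v, straight from the definition.
slope = directional_derivative(f, point, v)

# The same number from df = f_x dx + f_y dy + f_z dz, using dx(v) = v_x etc.
from_partials = sum(p * vi for p, vi in zip(partials(f, point), v))

print(slope, from_partials)
```

The two printed numbers should agree up to finite-difference error, which illustrates that $df_x$ is linear in $\vec v$. No scalar product was needed anywhere.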
In contrast, the gradient answers the following question: what is the direction of the steepest ascent of the function? Which vector $\vec v$ of unit length maximizes the function $df(\vec v)$? As you can see, this definition crucially depends on the fact that you can measure the length of a vector. The gradient is then defined as
$$ \nabla f = df(\vec v_{max})\cdot\vec v_{max} $$
i.e. it gives both the direction and the magnitude of the steepest change.
This can also be expressed as
$$ \langle \nabla f, \vec v \rangle = df(\vec v) \quad\forall \vec v\in\mathbb R^n.$$
In other words, the scalar product $\langle,\rangle$ is used to convert a covector $df$ into a vector $\nabla f$. This also means that the formula for the gradient looks very different in coordinate systems other than Cartesian. If the scalar product is changed (say, to $\langle\vec a,\vec b\rangle := a_xb_x + a_yb_y + 4a_zb_z$), then the direction of steepest ascent also changes. (Exercise: Why?)
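As a small sketch of how the scalar product enters, the snippet below solves $\langle \nabla f, \vec v\rangle = df(\vec v)$ for the gradient. It assumes a diagonal scalar product $\langle\vec a,\vec b\rangle = \sum_i w_i a_i b_i$, for which solving reduces to dividing each partial derivative by its weight. The values in `partials` are hypothetical numbers standing in for $df$ at some point, and the helper names are made up for this example.

```python
def inner(a, b, weights):
    # Diagonal scalar product <a, b> = sum_i w_i * a_i * b_i.
    return sum(w * ai * bi for w, ai, bi in zip(weights, a, b))

def df(partials, v):
    # The covector df applied to v; no scalar product is involved here.
    return sum(p * vi for p, vi in zip(partials, v))

def gradient(partials, weights):
    # Solve <grad, v> = df(v) for all v. For a diagonal scalar product
    # this is component-wise division by the weights.
    return [p / w for p, w in zip(partials, weights)]

partials = [1.0, 2.0, 2.0]    # hypothetical values of the partial derivatives
euclidean = [1.0, 1.0, 1.0]   # the usual scalar product
stretched = [1.0, 1.0, 4.0]   # a_x b_x + a_y b_y + 4 a_z b_z from the text

grad_e = gradient(partials, euclidean)
grad_s = gradient(partials, stretched)

# Check the defining property for an arbitrary test vector.
v = [0.5, -1.0, 3.0]
print(inner(grad_e, v, euclidean), inner(grad_s, v, stretched), df(partials, v))
print(grad_e, grad_s)
```

The covector `partials` is the same in both cases, but the two gradient vectors are not parallel, so the direction of steepest ascent depends on which scalar product is used.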
Best Answer
This is a classic example of why treating something like $\frac{dy}{dx}$ as a literal fraction rather than as shorthand notation for a limit is bad. If you want to derive it from the differentials, you should compute the square of the line element $ds^2 .$ Start with $$ds^2 = dx^2 + dy^2 + dz^2$$ in Cartesian coordinates and then show
$$ds^2 = dr^2 + r^2 d\theta^2 + r^2 \sin^2 (\theta) d\varphi^2 \; .$$ The coefficients on the components for the gradient in this spherical coordinate system will be 1 over the square root of the corresponding coefficients of the line element. In other words
$$\nabla f = \begin{bmatrix} \frac{1}{\sqrt{1}}\frac{\partial f}{\partial r} & \frac{1}{\sqrt{r^2}}\frac{\partial f}{\partial \theta} & \frac{1}{\sqrt{r^2\sin^2\theta}}\frac{\partial f}{\partial \varphi} \end{bmatrix} \; .$$ Keep in mind that these components are taken with respect to normalized basis vectors.
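The formula can be sanity-checked numerically. The sketch below assumes the convention implied by the line element above ($\theta$ is the polar angle, $\varphi$ the azimuth) and uses a hypothetical test function `f_cart`. It computes the spherical components from the scale factors and compares them with the Cartesian gradient projected onto the unit vectors $\hat r, \hat\theta, \hat\varphi$.

```python
import math

def to_cartesian(r, theta, phi):
    # theta is the polar angle, phi the azimuth.
    return (r * math.sin(theta) * math.cos(phi),
            r * math.sin(theta) * math.sin(phi),
            r * math.cos(theta))

def f_cart(x, y, z):
    # Hypothetical test function; its Cartesian gradient is (y, x, 2z).
    return x * y + z * z

def f_sph(r, theta, phi):
    return f_cart(*to_cartesian(r, theta, phi))

def partial(func, args, i, h=1e-6):
    # Central-difference partial derivative with respect to argument i.
    up, down = list(args), list(args)
    up[i] += h
    down[i] -= h
    return (func(*up) - func(*down)) / (2 * h)

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

r, theta, phi = 2.0, 0.7, 1.1

# Square roots of the coefficients of dr^2, dtheta^2, dphi^2 in ds^2.
scale = (1.0, r, r * math.sin(theta))
grad_sph = [partial(f_sph, (r, theta, phi), i) / scale[i] for i in range(3)]

# Reference: Cartesian gradient projected onto the normalized spherical basis.
x, y, z = to_cartesian(r, theta, phi)
grad_cart = (y, x, 2 * z)
r_hat = (math.sin(theta) * math.cos(phi), math.sin(theta) * math.sin(phi), math.cos(theta))
theta_hat = (math.cos(theta) * math.cos(phi), math.cos(theta) * math.sin(phi), -math.sin(theta))
phi_hat = (-math.sin(phi), math.cos(phi), 0.0)
projected = [dot(grad_cart, e) for e in (r_hat, theta_hat, phi_hat)]

print(grad_sph)
print(projected)
```

Both lists should agree up to finite-difference error.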
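For a general coordinate system (which doesn't necessarily have an orthonormal basis), we organize the line element into a symmetric "matrix" with two indices $g_{ij}$, so that $ds^2 = \sum_i\sum_j g_{ij}\,dx_i\,dx_j$. If the line element contains a squared term $h(\mathbf x)\,dx_k^2$, then $g_{kk} = h(\mathbf x)$; a cross term $h(\mathbf x)\,dx_k\,dx_\ell$ with $k\neq\ell$ is split evenly, $g_{k\ell} = g_{\ell k} = h(\mathbf x)/2$. The gradient is then expressed as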
$$\nabla f = \sum_i \sum_j \frac{\partial f}{\partial x_i}g^{ij}\mathbf e_j$$ where $\mathbf e_j$ is not necessarily a normalized vector and $g^{ij}$ is the matrix inverse of $g_{ij}$.
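As a worked check that this agrees with the spherical formula above, take $(x_1,x_2,x_3) = (r,\theta,\varphi)$. The line element gives a diagonal metric, so its inverse is obtained by inverting each entry:
$$ g_{ij} = \begin{bmatrix} 1 & 0 & 0 \\ 0 & r^2 & 0 \\ 0 & 0 & r^2\sin^2\theta \end{bmatrix}, \qquad g^{ij} = \begin{bmatrix} 1 & 0 & 0 \\ 0 & \frac{1}{r^2} & 0 \\ 0 & 0 & \frac{1}{r^2\sin^2\theta} \end{bmatrix} $$
and therefore
$$ \nabla f = \frac{\partial f}{\partial r}\mathbf e_r + \frac{1}{r^2}\frac{\partial f}{\partial \theta}\mathbf e_\theta + \frac{1}{r^2\sin^2\theta}\frac{\partial f}{\partial \varphi}\mathbf e_\varphi . $$
The coordinate basis vectors have lengths $|\mathbf e_i| = \sqrt{g_{ii}}$, i.e. $|\mathbf e_r| = 1$, $|\mathbf e_\theta| = r$, $|\mathbf e_\varphi| = r\sin\theta$. Writing $\mathbf e_\theta = r\,\hat{\boldsymbol\theta}$ and $\mathbf e_\varphi = r\sin\theta\,\hat{\boldsymbol\varphi}$ in terms of unit vectors gives
$$ \nabla f = \frac{\partial f}{\partial r}\hat{\mathbf r} + \frac{1}{r}\frac{\partial f}{\partial \theta}\hat{\boldsymbol\theta} + \frac{1}{r\sin\theta}\frac{\partial f}{\partial \varphi}\hat{\boldsymbol\varphi} , $$
which is the normalized-basis expression with the factors $1/\sqrt{g_{ii}}$.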