The total derivative is really just a piece of notation, introduced to overcome some difficulties that arise when dealing with Leibniz notation.
Consider the function $f(x,y)=x^2+y$. If you agree that $x$ and $y$ in the definition of $f$ are just placeholders you should agree that $f(y,x)=y^2+x$ and that $f(t,t^2)=2t^2$.
Now the question is: what does $\partial f / \partial x$ mean? Written like that, one would interpret $\partial/\partial x$ as the derivative of $f$ with respect to its first variable. Notice that here the variable $x$ is no longer a simple placeholder but has a conventional meaning.
For example consider the following:
$$
\frac{\partial f(y,x)}{\partial x}.
$$
Now the interpretation is not clear... do you mean the derivative with respect to the first or the second variable?
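To see the ambiguity concretely with the $f$ defined above (just an illustration, writing $\partial_1 f$ for the derivative of $f$ with respect to its first variable): differentiating the expression $f(y,x)=y^2+x$ with respect to the letter $x$ gives one answer, while differentiating $f$ with respect to its first variable and then plugging in $(y,x)$ gives another:
$$
\frac{\partial}{\partial x}\bigl(y^2+x\bigr)=1,
\qquad\text{whereas}\qquad
(\partial_1 f)(y,x)=2y.
$$
The two readings genuinely disagree.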
Mixing variables like that is not good... but sometimes one should be prepared to resolve the ambiguity. Consider a function $f(x,t)$ which represents a quantity depending on space $x$ and time $t$. It is then understood that $\partial f/\partial t$ is the derivative with respect to the second component (which is time). Suppose now that you have a particle which moves with the law $x=t^2$. If you evaluate the function $f$ on the particle you get
$$
f(x(t),t)
$$
and if you want to compute the derivative of this function you can use the chain rule and obtain:
$$
\frac{d}{dt} f(x(t),t) = \frac{\partial}{\partial x} f(x(t),t)\cdot x'(t) + \frac{\partial }{\partial t} f(x(t),t).
$$
Now the point is that it is often useful to shorten the notation by writing $x$ in place of $x(t)$ and $f$ in place of $f(x,t)$, so that the previous formula can be written as
$$
\frac{d}{dt} f(x,t) = \frac{\partial}{\partial x} f(x,t) \cdot x'(t) + \frac{\partial }{\partial t} f(x,t)
$$
or
$$
\frac{df}{dt} = \frac{\partial f}{\partial x} \cdot x' + \frac{\partial f}{\partial t}.
$$
Now you see that $d/dt$ and $\partial/\partial t$ assume different meanings...
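As a quick check with the function from the beginning (reading its second argument as time), take $f(x,t)=x^2+t$ and the motion $x(t)=t^2$. Then
$$
\frac{\partial f}{\partial t}=1
\qquad\text{while}\qquad
\frac{df}{dt}=\frac{\partial f}{\partial x}\cdot x'+\frac{\partial f}{\partial t}=2x\cdot 2t+1=4t^3+1,
$$
which is exactly what you get by differentiating $f(x(t),t)=t^4+t$ directly.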
Let $X,Y$ and $Z$ be vector spaces, and $\mathcal{L}(A,B)$ the space of all linear maps from $A$ to $B$.
As noted above, if $F \in \mathcal{L}(X,\mathcal{L}(Y,Z))$, then we can form another map $F_{curry}:X \times Y \to Z$ defined by $F_{curry}(x,y) = F(x)(y)$. Notice that $F_{curry}$ is a bilinear mapping: fixing $x$, $F_{curry}(x,\cdot)$ is linear in the second slot, and $F_{curry}(\cdot,y)$ is linear in the first slot. Conversely, given a bilinear mapping $G: X \times Y \to Z$, I can produce an element $G_{uncurry}\in\mathcal{L}(X,\mathcal{L}(Y,Z))$ in the way you would expect: $G_{uncurry}(x)(y) = G(x,y)$. The keyword here is "currying".
So $\mathcal{L}(X,\mathcal{L}(Y,Z))$ can be canonically identified with the space of bilinear mappings $X \times Y \to Z$. These in turn can be identified with linear mappings $X \otimes Y \to Z$, where $X \otimes Y$ is the so-called "tensor product" of $X$ and $Y$, but I will not go into that.
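As a small illustration (the matrix $A$ here is just an example): take $X=Y=\mathbb{R}^n$ and $Z=\mathbb{R}$. Every bilinear map is then of the form $G(x,y)=x^TAy$ for some $n\times n$ matrix $A$, and its curried version $$G_{uncurry}(x) = \bigl(y \mapsto x^TAy\bigr)$$ sends $x$ to the linear functional given by the row vector $x^TA$; the assignment $x \mapsto x^TA$ is itself linear, so $G_{uncurry}$ really does land in $\mathcal{L}(X,\mathcal{L}(Y,Z))$.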
You might be curious how you could work with such an object. What data do you need to write down? For a linear map, you only have to specify its action on a basis, but a bilinear map is not a linear map. It turns out (you should check) that specifying the action on all pairs of basis vectors is enough.
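Filling in that check: by bilinearity, once the values on pairs of basis vectors $e_i$ of $X$ and $\tilde e_j$ of $Y$ are known, everything else is determined, since $$G\Bigl(\sum_i x_ie_i,\ \sum_j y_j\tilde e_j\Bigr)=\sum_{i,j}x_iy_j\,G(e_i,\tilde e_j).$$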
Let's get back down to earth and examine a very special case. Let $f:\mathbb{R}^2 \to \mathbb{R}$ be defined by $f(x,y) = x^2y$.
$D(f)\big|_{(x,y)}$ is the linear map given by the matrix $\left[ \begin{matrix} 2xy&x^2\end{matrix} \right]$. That is to say, $D(f)\big|_{(x,y)}(\Delta x,\Delta y) = 2xy\Delta x + x^2\Delta y \approx f(x+\Delta x,y+\Delta y) - f(x,y)$. Notice that the transpose of this matrix is the "gradient" of $f$. Only functions from $\mathbb{R^n} \to \mathbb{R}$ have gradients.
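As a quick numerical sanity check (the point and increments are chosen just for illustration): at $(x,y)=(1,2)$ the matrix is $\left[ \begin{matrix} 4&1\end{matrix} \right]$, and with $(\Delta x,\Delta y)=(0.1,\,0.05)$ we get $$D(f)\big|_{(1,2)}(0.1,0.05)=4(0.1)+1(0.05)=0.45, \qquad f(1.1,2.05)-f(1,2)=2.4805-2=0.4805,$$ so the linear map does indeed approximate the change in $f$.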
The second derivative should now tell you how much the derivative changes from point to point. If we increment $(x,y)$ by a little bit to $(x+\Delta x,y)$ then we should expect the derivative to increase by about $\left[ \begin{matrix} 2y\Delta x&2x \Delta x\end{matrix} \right]$. Similarly, when we increase $y$ by $\Delta y$, the derivative should change by about $\left[ \begin{matrix} 2x \Delta y&0\Delta y\end{matrix} \right]$.
By linearity, if we change from $(x,y)$ to $(x+\Delta x,y+\Delta y)$, we expect the derivative to change by $$\left[ \begin{matrix} \Delta x&\Delta y\end{matrix} \right] \left[ \begin{matrix} 2y&2x\\2x&0\end{matrix} \right]$$
This gives a matrix which is the approximate change in the derivative. You can then apply this to another vector if you so wish.
Summing it up, if you wanted to see approximately how much the derivative changes from $(x,y)$ to $(x+\Delta x_2,y+\Delta y_2)$ when both are evaluated in the same direction $(\Delta x_1,\Delta y_1)$, you would perform the computation:
$$\left[ \begin{matrix} \Delta x_2&\Delta y_2\end{matrix} \right] \left[ \begin{matrix} 2y&2x\\2x&0\end{matrix} \right] \left[ \begin{matrix} \Delta x_1\\\Delta y_1\end{matrix} \right]$$
The matrix of second partials derived above is called the Hessian, but it is a bit misleading to write it as a matrix, since it is really acting as a bilinear form in the manner shown above, i.e. $H(v_1,v_2) = v_1^T H v_2$. You may remember seeing the Hessian arise in multivariable calculus when classifying critical points as maxima, minima, or saddles. In general, the signs of the eigenvalues of the Hessian matrix tell the whole story (although, if there are some zero eigenvalues, you might have to climb up the derivative ladder to trilinear forms, etc.).
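A standard place this bilinear form shows up is as the quadratic term of the multivariable Taylor expansion; for our $f(x,y)=x^2y$ it reads $$f(x+\Delta x,y+\Delta y)\approx f(x,y)+\left[ \begin{matrix} 2xy&x^2\end{matrix} \right]\left[ \begin{matrix} \Delta x\\\Delta y\end{matrix} \right]+\frac12\left[ \begin{matrix} \Delta x&\Delta y\end{matrix} \right]\left[ \begin{matrix} 2y&2x\\2x&0\end{matrix} \right]\left[ \begin{matrix} \Delta x\\\Delta y\end{matrix} \right],$$ as you can verify by expanding $(x+\Delta x)^2(y+\Delta y)$.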
Notice that I only got a Hessian "matrix" because the codomain of $f$ was one-dimensional. If it had been, say, $3$-dimensional, I would have needed $3$ such matrices, and they would naturally align themselves into a $2\times2\times3$ box, which would represent a higher order tensor.
Hopefully this gives at least a hint of how to continue. Buzzwords to look for are "multilinear algebra", "tensor products", "tensors", "tensor analysis", and "multivariable Taylor's theorem".
I do not have a super great reference for this because, even though I do analysis in Several Complex Variables, I have somehow never found a book that treats higher dimensional real analysis really well. I am sure there are books out there, but I have worked out most of this stuff on my own. As far as I know there was never a course offered on it at any university I went to! I guess people are supposed to sort of absorb this stuff when they learn differential geometry.
Best Answer
Assume for the moment that $x$, $y$, $z$ are independent variables such that $(x,y,z)$ ranges in an open convex set of ${\mathbb R}^3$.
${\bf 1.\ }$ If you are given a function $(x,y,z)\mapsto f(x,y,z)$ then the indefinite integral $$\int f(x,y,z)\>dx\tag{1}$$ denotes the set of all functions $(x,y,z)\mapsto F(x,y,z)$ satisfying the condition (or PDE) $${\partial F\over\partial x}(x,y,z)=f(x,y,z)\ .\tag{2}$$ If $F_0$ is a solution of $(2)$, found by guessing or by formally integrating $(1)$ with respect to $x$, then the general solution of $(2)$ is given by $$F(x,y,z)=F_0(x,y,z)+G(y,z)\ ,$$ whereby $G$ is an arbitrary (sufficiently smooth) function of its variables $y$ and $z$.
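A small worked instance of ${\bf 1}$ (the integrand is chosen only for illustration): with $f(x,y,z)=2xy+z$, formally integrating with respect to $x$ gives $F_0(x,y,z)=x^2y+xz$, hence $$\int(2xy+z)\>dx=x^2y+xz+G(y,z)\ ,$$ with $G$ an arbitrary (sufficiently smooth) function of $y$ and $z$ alone.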
${\bf 2.\ }$ If $(x,y,z)\mapsto F(x,y,z)$ is a scalar function then ${d\over dx}F(x,y,z)$ makes no sense. There is the derivative or differential $dF(x,y,z)$ of $F$, which is for each ${\bf p}=(x,y,z)$ in the domain of $F$ a linear functional on the tangent space $T_{\bf p}$. The geometric representation of this functional is the gradient vector $$\nabla F({\bf p})=\left({\partial F\over\partial x},{\partial F\over\partial y},{\partial F\over\partial z}\right)_{\bf p}\ .$$ Note that $(x,y,z)\mapsto \nabla F(x,y,z)$ constitutes a vector field on the domain in question.
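For instance, with the $F_0$ from the little example above, $$\nabla F_0(x,y,z)=\bigl(2xy+z,\;x^2,\;x\bigr)\ ,$$ which is such a vector field on ${\mathbb R}^3$.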
${\bf 3.}\ $Reversing the operation of taking the gradient vector means finding for a given vector field $${\bf f}(x,y,z)=\bigl(u(x,y,z),v(x,y,z),w(x,y,z)\bigr)$$ a scalar function $F$ such that $$\nabla F(x,y,z)={\bf f}(x,y,z)\ .\tag{3}$$ This is not always possible. A necessary condition is that ${\rm curl}({\bf f})\equiv{\bf 0}$. If this condition is fulfilled then you can find an $F$ satisfying $(3)$ either by a recursive scheme ("nested integrals") involving the problem described in ${\bf 1}$, or by computing line integrals as follows: $$F({\bf p})=F({\bf 0})+\int_{\bf 0}^{\bf p}{\bf f}({\bf x})\cdot d{\bf x}\ ,$$ whereby $F({\bf 0})$ is arbitrary.
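Continuing the same toy example: the field ${\bf f}(x,y,z)=\bigl(2xy+z,\;x^2,\;x\bigr)$ satisfies ${\rm curl}({\bf f})\equiv{\bf 0}$ (for instance $\partial_y(2xy+z)=2x=\partial_x(x^2)$ and $\partial_z(2xy+z)=1=\partial_x(x)$), and indeed $$F(x,y,z)=x^2y+xz+C\ ,\qquad \nabla F={\bf f}\ ,$$ where the constant $C$ plays the role of the arbitrary value $F({\bf 0})$.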
${\bf 4.\ }$ A "function $f\bigl(x,y(x)\bigr)$", where $x\mapsto y(x)$ is in principle given is a function $$\phi(x):=f\bigl(x,y(x)\bigr)$$ of one variable $x$, and the usual rules of Calculus 101 plus the multivariable chain rule apply. E.g., $$\phi'(x)=f_{.1}\bigl(x,y(x)\bigr)\cdot 1+f_{.2}\bigl(x,y(x)\bigr)\cdot y'(x)\ ,$$ and $$\int f\bigl(x,y(x)\bigr)\>dx=\Phi(x)+C\ ,$$ whereby $\Phi'(x)=\phi(x)$.