I'd like to point out that if at any point you think I'm too picky/pedantic about the terminology, then just ignore it and move on, because ultimately the ideas/concepts of differential calculus are what's truly important. Before you read my answer, I'd highly recommend you read the preface to chapter 3 in Loomis and Sternberg's Advanced Calculus (page 116 of the book), where the choice of words has been made very consciously, in order to orient the reader into thinking about differential calculus from a very specific point of view.
After reading the preface, I hope you noticed how much it emphasises the role of differential calculus as the theory of approximation by linear functions. As such, for any kind of derivative $\dfrac{\partial f}{\partial \boldsymbol{r}}$ or $\dfrac{\partial \boldsymbol{f}}{\partial \boldsymbol{r}}$, or $\dfrac{\partial f}{\partial x_i}$ or $\dfrac{\partial \boldsymbol{f}}{\partial x_i}$, to be meaningful, it has to somehow convey the idea of linearly approximating a certain function; otherwise, it's just a collection of numbers arranged in a matrix in a funny way for no apparent reason.
Let's start with the definition of differentiability in higher dimensions: let $V$ and $W$ be (finite-dimensional) normed vector spaces, and let $F: V \to W$ be a function. We define $F$ to be differentiable at a point $a \in V$ if there is a linear transformation $T: V \to W$, and a function $\varphi: V \to W$ such that for all $h \in V$, we have
\begin{equation}
F(a+h) - F(a) = T(h) + \varphi(h),
\end{equation}
and $\varphi$ satisfies $\lim \limits_{h \to 0} \dfrac{\varphi(h)}{\lVert h\rVert} = 0$. If such a $T$ exists, one can prove it is unique, and we denote it by the symbol $dF_a$, or $DF(a)$, or sometimes even $J_F(a)$. Also, rather than $\varphi$, we write $o(h)$ (little oh of $h$). The equation above can now be written as
\begin{align}
\Delta F_a(h) = dF_a(h) + o(h) \tag{*}
\end{align}
So, equation (*) says that $dF_a$ is a linear function which approximates the (possibly non-linear) function $\Delta F_a$, up to an accuracy of little-oh. In this sense, the first-order Taylor approximation of a function is literally true by definition of "differentiable". I think $dF_a$ is the only thing which should be called the total derivative of $F$ at $a$: first, because by definition it best approximates changes in $F$ near $a$, and second, because we can show that $dF_a$ contains all the necessary information. For example, in the special case $V= \Bbb{R^n}$ and $W= \Bbb{R}$, if we evaluate this linear transformation on the vector $e_i = (0, \dots, 1, \dots, 0)$, with $1$ in the $i^{th}$ place, we get the partial derivative everyone is familiar with: $\dfrac{\partial F}{\partial x_i}(a)$. Once again: $dF_a$ contains, as special cases, all the information of the partial derivatives.
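(To make the definition above concrete, here is a minimal one-dimensional instance of my own, with $V = W = \Bbb{R}$: take $F(x) = x^2$. Then for any $a$ and $h$,
\begin{equation}
F(a+h) - F(a) = 2ah + h^2,
\end{equation}
so we can take $T(h) = 2ah$ and $\varphi(h) = h^2$; indeed $\dfrac{\varphi(h)}{|h|} = |h| \to 0$, so $dF_a$ is just multiplication by the ordinary derivative $F'(a) = 2a$.)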
In linear algebra one often learns that since $dF_a: V \to W$ is a linear transformation, by choosing a basis $\beta$ for $V$ and a basis $\gamma$ for $W$, we can encode ALL the relevant information about $dF_a$ in a matrix, which we denote by $[dF_a]_{\beta}^{\gamma}$. In the special case where $V$ and $W$ are Cartesian spaces, i.e. $V = \Bbb{R}^n$ and $W = \Bbb{R}^m$, we often choose $\beta$ and $\gamma$ to be the standard bases. In this special case (and only then), $[dF_a]_{\beta}^{\gamma}$ will be the $m \times n$ matrix whose $(i,j)$ entry is the partial derivative $\dfrac{\partial F_i}{\partial x_j}(a)$.
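If it helps, this linear-approximation property is easy to check numerically. Here is a small Python sketch (my own illustration, not from the book), using an arbitrarily chosen $F: \Bbb{R}^3 \to \Bbb{R}^2$ and central finite differences to build the matrix of partials:

```python
import numpy as np

# An arbitrarily chosen F: R^3 -> R^2, just for illustration
def F(v):
    x, y, z = v
    return np.array([x * y + z, np.sin(x) + y * z])

def jacobian_fd(F, a, eps=1e-6):
    """m x n matrix whose (i, j) entry approximates dF_i/dx_j at a."""
    a = np.asarray(a, dtype=float)
    m, n = F(a).size, a.size
    J = np.zeros((m, n))
    for j in range(n):
        e = np.zeros(n)
        e[j] = eps
        J[:, j] = (F(a + e) - F(a - e)) / (2 * eps)  # central difference
    return J

a = np.array([1.0, 2.0, 0.5])
J = jacobian_fd(F, a)

# Defining property: Delta F_a(h) = J h + o(h)
h = np.array([1e-4, -2e-4, 0.5e-4])
delta = F(a + h) - F(a)
linear = J @ h
ratio = np.linalg.norm(delta - linear) / np.linalg.norm(h)
print(J)
print(ratio)  # tiny compared to 1, as the little-oh condition demands
```

The printed ratio $\lVert \Delta F_a(h) - Jh \rVert / \lVert h \rVert$ shrinks as $h \to 0$, which is exactly the little-oh condition.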
- Yes, and yes. For example, with all the preliminary stuff above, you can now say $\dfrac{\partial f}{\partial w}(a) = df_a(1,0,0,0)$; i.e. if you are at the point $a$ and you are displaced by an amount $(1,0,0,0)$, then the function will change by an amount $\Delta f_a(1,0,0,0) \approx df_a(1,0,0,0)$ (the $\approx$ of course means that we have a little-oh error term in the displacement).
- Now, with all of the work we have done above, we can be more precise and state the following:
\begin{equation}
\dfrac{\partial f}{\partial \boldsymbol{r}}(a) =
\begin{bmatrix}
\dfrac{\partial f}{\partial w}(a) & \dfrac{\partial f}{\partial x}(a) & \dfrac{\partial f}{\partial y}(a) & \dfrac{\partial f}{\partial z}(a)
\end{bmatrix}
\end{equation}
is the matrix representation of the differential/total derivative/Fréchet derivative (these are all just names for this linear transformation) $df_a: \Bbb{R^4} \to \Bbb{R}$, relative to the standard bases. Now, here the notation $\dfrac{\partial }{\partial \boldsymbol{r}}$ makes some sense, because $\boldsymbol{r} = (w,x,y,z)$... so if we play around with the symbols, things work out nicely, and you have partial derivatives inside the $1 \times 4$ matrix. However, $\dfrac{d}{d \boldsymbol{r}}$ would be bad, because it doesn't "nicely distribute" inside the matrix (the entries of the matrix use partial derivatives, not straight $d$).
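To see this row matrix at work numerically, here is a quick Python sketch with a hypothetical $f: \Bbb{R}^4 \to \Bbb{R}$ of my own choosing (the specific formula is just for illustration):

```python
import numpy as np

# Hypothetical f: R^4 -> R, with r = (w, x, y, z)
def f(r):
    w, x, y, z = r
    return w * x + y * z ** 2

a = np.array([1.0, 2.0, 3.0, 0.5])
eps = 1e-6

# The 1 x 4 matrix [df/dw, df/dx, df/dy, df/dz] at a, via central differences
row = np.array([(f(a + eps * e) - f(a - eps * e)) / (2 * eps) for e in np.eye(4)])

# df_a(h) = row @ h approximates the actual change f(a + h) - f(a)
h = np.array([1e-4, 2e-4, -1e-4, 3e-4])
print(row)                       # the four partial derivatives at a
print(f(a + h) - f(a), row @ h)  # agree up to a little-oh error
```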
So, once again to emphasise: $df_a$ IS the total derivative, while $\dfrac{\partial f}{\partial \boldsymbol{r}}(a) = [df_a]_{\beta}^{\gamma}$ is the matrix representation relative to the standard bases. Many people wouldn't bother with the distinction between a matrix and a linear transformation when dealing with Cartesian spaces, because we have a "standard basis" $\{e_1, \dots, e_n\}$. But of course, these are clearly different objects... one is a linear transformation, the other is a collection of numbers arranged in a funny way. So, what usually happens is that people who know this difference, and how the two are related, are sometimes too lazy to explain it, and hence simply say the two are the same thing. This is why I said "depending on who you ask" in the comments.
- Now, you should be able to guess my answer: $\dfrac{\partial \boldsymbol{f}}{\partial \boldsymbol{r}}(a)$ is the matrix representation of $d \boldsymbol{f}_a$ relative to the standard bases of the domain ($\mathbb{R^n}$) and target space ($\Bbb{R^m}$) of $\boldsymbol{f}$. Once again, I wouldn't use the $\dfrac{d}{d \boldsymbol{r}}$ notation, because the elements of the matrix are partial derivatives, so using straight $d$ is just bad. The most accurate thing to write is
\begin{equation}
\dfrac{\partial \boldsymbol{f}}{\partial \boldsymbol{r}}(a) = [d \boldsymbol{f}_a]_{\beta}^{\gamma}
\end{equation}
You could also call this "the matrix representation of the differential $d \boldsymbol {f}_a$", or "the Jacobian matrix of $f$ at $a$" for short. However, the term "pushforward" is one that's used in differential geometry to describe a certain linear transformation between tangent spaces of a manifold. And it is the matrix representation of this linear map, relative to a choice of bases for the tangent spaces, that equals the Jacobian matrix of partial derivatives. If that made sense, great; otherwise, just take it to mean "pushforward" isn't appropriate in this context.
You are right that the rows of the matrix are the gradients of the component functions, but I don't think there is a gradient-like interpretation for the whole matrix.
Your last sentence under (3.) seems to suggest you aren't satisfied by the "changes in output due to small input changes"/"first-order linear approximation" interpretation. However, this is really one of the most powerful ideas in differential calculus, and more generally in calculus on manifolds! If you're not convinced, once again re-read the preface to Chapter 3 of the book I referenced, and also read the preface to Chapter 9; you may appreciate/believe the importance of this point of view more after doing so.
The answer to (4) is yes, provided you interpret it correctly. Note that $\dfrac{\partial f}{\partial \boldsymbol{p}}(a)$ is the $1 \times 2$ submatrix of $\dfrac{\partial f}{\partial \boldsymbol{r}}(a)$ obtained by selecting the first two columns (here, I am thinking of a row vector as a $1 \times n$ matrix). But also note that I've been constantly emphasising that $\dfrac{\partial f}{\partial \boldsymbol{r}}(a)$ is the matrix representation of $df_a$ relative to the standard bases. So, one might expect that this submatrix is actually the matrix representation of a certain linear transformation from $\Bbb{R^2}$ into $\Bbb{R}$. This is indeed the case, and this linear transformation is what approximates the change in $f$ when a certain group of vector variables is held constant while another group is varied. Hopefully this is satisfactory enough, but if not, you can read the construction of the so-called "partial differential" below, which makes these statements precise. (See also section $3.8$ of the book.)
Construction of the "partial differential" of a function.
Let $V_1, \dots, V_k, W$ be normed vector spaces. Define $V = V_1 \times \dots \times V_k$ (the Cartesian product). Let $F: V \to W$ be a function, fix a point $a = (a_1, \dots, a_k) \in V$, and fix an index $i \in \{1, \dots, k\}$. Consider the new function $g: V_i \to W$ defined by
\begin{equation}
g(\xi) = F(a_1, \dots, a_{i-1}, \xi, a_{i+1}, \dots, a_k)
\end{equation}
If $g$ is differentiable at the point $a_i$ (in the sense defined in the beginning), then we say $F$ has an $i^{th}$ partial differential at $a$, and define
\begin{equation}
\partial^i F_a = dg_{a_i}
\end{equation}
So $\partial^iF_a$ is a linear transformation from $V_i$ into $W$ (because $dg_{a_i}$ is). (I only put the $i$ in superscript because the subscript position is already very cluttered... so there's no deep math here.)
As far as definitions go, that is perfectly rigorous, but it may be hard to digest. What we are doing is very similar to how partial derivatives are normally defined. We have a function $F$ of $k$ vector variables, and we have a point $a = (a_1, \dots, a_k)$ of interest. What we are doing is keeping all the entries fixed except the $i^{th}$ one; so $g$ only changes the $i^{th}$ variable of $F$. Then, we consider the differentiability of $g$ at $a_i$. If you unwind the definitions carefully, you'll see that the equation
\begin{equation}
\Delta g_{a_i}(h) = dg_{a_i}(h) + o(h)
\end{equation}
is equivalent to
\begin{equation}
\Delta F_a(0, \dots, h, \dots, 0) = (\partial^i F_a)(h) + o(h),
\end{equation}
where $h$ is in the $i^{th}$ slot. Hence, $\partial^i F_a$ linearly approximates changes in $F$ up to an accuracy of little-oh, when you fix all the vector variables at $a_j$ for $j \neq i$, and you only vary the $i^{th}$ variable around $a_i$.
So, for your particular example, you had $f: \mathbb{R^4} \to \Bbb{R}$, which, as you said, can be thought of as a function of two vector variables, i.e. $f: \Bbb{R^2} \times \Bbb{R^2} \to \Bbb{R}$, with $\boldsymbol{p}, \boldsymbol{q} \in \Bbb{R^2}$ (we are choosing $V_1=V_2 = \Bbb{R^2}$ and $W = \Bbb{R}$ if we follow the notation of the construction above). So, if we now fix a point $a = ((a_1,a_2), (a_3,a_4)) \in \Bbb{R^2} \times \Bbb{R^2}$, then the $1 \times 2$ matrix $\dfrac{\partial f}{\partial \boldsymbol{p}}(a)$ is the matrix representation of $\partial^1f_a : \Bbb{R^2} \to \Bbb{R}$ relative to the standard bases, while $\dfrac{\partial f}{\partial \boldsymbol{q}}(a)$ is the matrix representation of $\partial^2f_a$ relative to the standard bases.
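To illustrate the claim numerically, here is a Python sketch with a hypothetical $f$ of my own choosing: freezing $\boldsymbol{q}$ and differentiating in $\boldsymbol{p}$ alone reproduces exactly the first two columns of the full $1 \times 4$ matrix.

```python
import numpy as np

# Hypothetical f: R^2 x R^2 -> R, written as f(p, q)
def f(p, q):
    return p[0] * q[0] + p[1] * q[1] ** 2

a_p = np.array([1.0, 2.0])  # the point a = (a_p, a_q)
a_q = np.array([3.0, 0.5])
eps = 1e-6

# Full 1 x 4 matrix df/dr(a): partials in the p-slots, then the q-slots
full = np.array(
    [(f(a_p + eps * e, a_q) - f(a_p - eps * e, a_q)) / (2 * eps) for e in np.eye(2)]
    + [(f(a_p, a_q + eps * e) - f(a_p, a_q - eps * e)) / (2 * eps) for e in np.eye(2)]
)

# Partial differential: differentiate g(xi) = f(xi, a_q) at a_p, with q frozen
g = lambda xi: f(xi, a_q)
sub = np.array([(g(a_p + eps * e) - g(a_p - eps * e)) / (2 * eps) for e in np.eye(2)])

print(full)           # all four partials at a
print(full[:2], sub)  # the first two columns are exactly df/dp(a)
```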
Similarly, you can group the vector variables in any other way you like.
This was long, but hopefully it is helpful!
Best Answer
The problem is that compositions are not being written properly. If $f$ is a function of $x$ and $y$, and those are functions of $t$ and $u$, then $(t,u) \mapsto f(x(t,u),y(t,u))$ is a function of $t$ and $u$, but $f$ itself is not a function of $t$ and $u$; this is an abuse of notation! Rigorously, one defines $\widetilde{f}(t,u)=f(x(t,u),y(t,u))$, and the chain rule says that $$\frac{\partial \widetilde{f}}{\partial t}(t,u) = \frac{\partial f}{\partial x}(x(t,u),y(t,u)) \frac{\partial x}{\partial t}(t,u)+\frac{\partial f}{\partial y}(x(t,u),y(t,u)) \frac{\partial y}{\partial t}(t,u).$$Introducing this extra letter $\widetilde{f}$ just for the sake of being formal is something that people usually don't bother to do, so what you're denoting by $\partial f/\partial u$ is actually $\partial \widetilde{f}/\partial u$ for $\widetilde{f}$ defined as above.
If $x$ (for instance) is a function of $t$ and $u$, ${\rm d}x/{\rm d}t$ doesn't make sense, only $\partial x/\partial t$ does. One can make sense of this using the same mechanism above. Namely, define $\widetilde{x}(t) = x(t,u)$ where $u$ is fixed. So $$\frac{{\rm d}\widetilde{x}}{{\rm d}t}(t) = \frac{\partial x}{\partial t}(t,u),$$and $u$ is fixed everywhere throughout this process. One could even denote $\widetilde{x}$ by $\widetilde{x}_u$ (not a partial derivative) to make explicit that a given $u$ has been chosen.
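The $\widetilde{f}$ bookkeeping can also be verified numerically. Here is a small Python sketch (with arbitrarily chosen $f$, $x$, $y$) checking that the chain rule formula above holds, with every factor evaluated at the correct point:

```python
import numpy as np

# Arbitrarily chosen f, x, y for illustration
f = lambda x, y: x ** 2 * y
x = lambda t, u: t + np.sin(u)
y = lambda t, u: t * u

# The composite: this is f-tilde, the thing actually being differentiated in t
f_tilde = lambda t, u: f(x(t, u), y(t, u))

t0, u0, eps = 1.0, 2.0, 1e-6

# Left side: partial of f-tilde with respect to t (central difference)
lhs = (f_tilde(t0 + eps, u0) - f_tilde(t0 - eps, u0)) / (2 * eps)

# Right side: chain rule, with df/dx, df/dy evaluated at (x(t0,u0), y(t0,u0))
x0, y0 = x(t0, u0), y(t0, u0)
df_dx = (f(x0 + eps, y0) - f(x0 - eps, y0)) / (2 * eps)
df_dy = (f(x0, y0 + eps) - f(x0, y0 - eps)) / (2 * eps)
dx_dt = (x(t0 + eps, u0) - x(t0 - eps, u0)) / (2 * eps)
dy_dt = (y(t0 + eps, u0) - y(t0 - eps, u0)) / (2 * eps)
rhs = df_dx * dx_dt + df_dy * dy_dt

print(lhs, rhs)  # the two sides agree up to finite-difference error
```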