[Math] Total and partial derivative terminology in scalar- and vector-valued functions of scalars/vectors

Tags: intuition, multivariable-calculus, partial derivative, terminology

I have the following terminology questions which are often not well addressed in an undergraduate multivariable calculus course. While the questions are long, I expect their answers will be short.

As background, say I have two functions of the scalar variables $w$, $x$, $y$, and $z$, one scalar-valued and one vector-valued – $f(w, x, y, z)$ and $\boldsymbol{f}(w, x, y, z)$. I know that $\frac{\partial f}{\partial w}$, $\frac{\partial f}{\partial x}$, $\frac{\partial f}{\partial y}$, and $\frac{\partial f}{\partial z}$ are all called partial derivatives of $f$, and they represent the change in $f$ when varying the variable in the denominator while keeping the others constant. I also know that I can group all 4 independent variables into a vector $\boldsymbol{r} = [w\, x\, y\, z]^T$ so that I can write $f(\boldsymbol{r})$ and $\boldsymbol{f}(\boldsymbol{r})$. Alternatively, I can divide this grouping into several vectors whose dimensions sum to the dimension of $\boldsymbol{r}$. For example, $f(\boldsymbol{p}, \boldsymbol{q})$ or $\boldsymbol{f}(\boldsymbol{p}, \boldsymbol{q})$ where $\boldsymbol{p} = [w\, x]^T$ and $\boldsymbol{q} = [y\, z]^T$.
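For concreteness, here is a small Python sketch of this setup; the particular formulas for $f$ and $\boldsymbol{f}$ are made up purely for illustration.

```python
import numpy as np

# Hypothetical example functions, chosen only to make the setup concrete.
def f(w, x, y, z):                  # scalar-valued
    return w * x + np.sin(y) * z

def f_vec(w, x, y, z):              # vector-valued (here with 3 components)
    return np.array([w + x * y, x * z, np.cos(w) - y])

# The same scalar function viewed through grouped arguments.
def f_of_r(r):                      # r = [w, x, y, z]
    return f(*r)

def f_of_pq(p, q):                  # p = [w, x], q = [y, z]
    return f(p[0], p[1], q[0], q[1])
```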

  1. Are $\frac{\partial \boldsymbol{f}}{\partial w}$, $\frac{\partial \boldsymbol{f}}{\partial x}$, $\frac{\partial \boldsymbol{f}}{\partial y}$, and $\frac{\partial \boldsymbol{f}}{\partial z}$ (which are vectors) all called partial derivatives of $\boldsymbol{f}$? Do they similarly represent change in $\boldsymbol{f}$ as a result of varying the variable in the denominator while keeping the others constant?

  2. Would $\frac{\partial f}{\partial \boldsymbol{r}}$ (which is a vector and often called $\nabla f$ or $\nabla f^T$ depending on convention) be called the total derivative of $f$, and therefore would it be correct to use the notation $\frac{df}{d\boldsymbol{r}}$? I know this is a vector that points in the direction which most increases the value of $f$ and is not a number like the total derivative would be in single-variable calculus. However, I recognize that the number you get in single-variable calculus is essentially a 1D gradient and plays exactly the same role.

  3. Would $\frac{\partial \boldsymbol{f}}{\partial \boldsymbol{r}}$ (which is a matrix) be called the total derivative of $\boldsymbol{f}$? I have heard this called the pushforward, differential, or Jacobian, but is it also correct to think of it as a total derivative and therefore use the notation $\frac{d\boldsymbol{f}}{d\boldsymbol{r}}$? I know that this matrix basically contains the gradients of the components of $\boldsymbol f$, which are scalar functions of $\boldsymbol r$. Therefore, invoking the idea of the gradient, I might think of this matrix as some transformation which will most increase the magnitude of $\boldsymbol f$, but given that the gradient of each component could point in an arbitrary direction, I would not be able to choose an $\boldsymbol r$ which aligns with all of these directions simultaneously. Is there an intuitive gradient-like interpretation of this matrix, or should I only think of it as a mapping between small changes in input and output, or as a first-order approximation of $\boldsymbol{f}$?

  4. Would $\frac{\partial f}{\partial \boldsymbol{p}}$, $\frac{\partial f}{\partial \boldsymbol{q}}$ (which are vectors) and $\frac{\partial \boldsymbol{f}}{\partial \boldsymbol{p}}$, $\frac{\partial \boldsymbol{f}}{\partial \boldsymbol{q}}$ (which are matrices) also be called partial derivatives of $f$ and $\boldsymbol f$? I assume these describe the change in $f$ and $\boldsymbol f$ due to changes in one independent vector while keeping the other(s) constant. Is it also correct to say that they capture what the change in $f$ and $\boldsymbol f$ would be if you varied some group of the underlying scalar variables (in this case $w$, $x$ or $y$, $z$) while keeping the others constant? I recognize that some sort of union of these derivatives constitutes the derivatives asked about in the previous two questions. Can the gradient-like notion of pointing in the direction of greatest increase of the function be invoked here (at least in the case of $f$)?

Best Answer

I'd like to point out that if at any point you think I'm too picky/pedantic about the terminology, then just ignore it and move on, because ultimately the ideas/concepts of differential calculus are what's truly important. Before you read my answer, I'd highly recommend you read the preface to Chapter 3 in Loomis and Sternberg's Advanced Calculus (page 116 of the book), where the choice of words has been made very consciously, in order to orient the reader into thinking about differential calculus from a very specific point of view.

After reading the preface, I hope you noticed how much it emphasises the role of differential calculus as the theory of approximations by linear functions. As such, any kind of derivative $\dfrac{\partial f}{\partial \boldsymbol{r}}$ or $\dfrac{\partial \boldsymbol{f}}{\partial \boldsymbol{r}}$, or $\dfrac{\partial f}{\partial x_i}$ or $\dfrac{\partial \boldsymbol{f}}{\partial x_i}$, to be meaningful has to somehow convey the idea of linearly approximating a certain function; otherwise, it's just a collection of numbers arranged in a matrix in a funny way for no apparent reason.

Let's start with the definition of differentiability in higher dimensions: let $V$ and $W$ be (finite-dimensional) normed vector spaces, and let $F: V \to W$ be a function. We define $F$ to be differentiable at a point $a \in V$ if there is a linear transformation $T: V \to W$ and a function $\varphi: V \to W$ such that for all $h \in V$, we have \begin{equation} F(a+h) - F(a) = T(h) + \varphi(h), \end{equation} where $\varphi$ satisfies $\lim \limits_{h \to 0} \dfrac{\varphi(h)}{\lVert h\rVert} = 0$. If such a $T$ exists, one can prove it is unique, and we denote it by the symbol $dF_a$, or $DF(a)$, or sometimes even $J_F(a)$. Also, rather than $\varphi(h)$, we write $o(h)$ (little oh of $h$). The equation above can now be written as \begin{align} \Delta F_a(h) = dF_a(h) + o(h) \tag{*} \end{align} So, equation (*) says that $dF_a$ is a linear function which approximates the (possibly non-linear) function $\Delta F_a$, up to an accuracy of little oh. In this sense, the first-order Taylor approximation of a function is literally true by definition of "differentiable". I think $dF_a$ is the only thing which should be called the total derivative of $F$ at $a$: first, simply by definition it best approximates changes in $F$ near $a$, and second, we can show that $dF_a$ contains all the necessary information. For example, in the special case $V = \Bbb{R^n}$ and $W = \Bbb{R}$, if we evaluate this linear transformation on the vector $e_i = (0, \dots, 1, \dots, 0)$, with $1$ in the $i^{th}$ place, we get the partial derivative everyone is familiar with: $\dfrac{\partial F}{\partial x_i}(a)$. Once again: $dF_a$ contains, as a special case, all the information of the partial derivatives.
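To see equation (*) in action, here is a small numerical check in Python (the function $F$ and the point $a$ are made up for illustration): the remainder $\Delta F_a(h) - dF_a(h)$ should shrink faster than $\lVert h \rVert$.

```python
import numpy as np

# Hypothetical F : R^2 -> R^2 and its differential, worked out by hand.
def F(v):
    x, y = v
    return np.array([x**2 * y, np.sin(x) + y])

def dF(a):
    x, y = a
    # matrix of dF_a in the standard bases
    return np.array([[2 * x * y, x**2],
                     [np.cos(x), 1.0]])

a = np.array([1.0, 2.0])
direction = np.array([0.3, -0.7])
for t in [1e-1, 1e-2, 1e-3, 1e-4]:
    h = t * direction
    remainder = F(a + h) - F(a) - dF(a) @ h           # this is o(h)
    print(np.linalg.norm(remainder) / np.linalg.norm(h))  # -> 0 as h -> 0
```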

In linear algebra one often learns that since $dF_a: V \to W$ is a linear transformation, by choosing a basis $\beta$ for $V$ and a basis $\gamma$ for $W$, we can encode ALL the relevant information about $dF_a$ in a matrix, which we denote by $[dF_a]_{\beta}^{\gamma}$. In the special case where $V$ and $W$ are Cartesian spaces, i.e. $V = \Bbb{R}^n$ and $W = \Bbb{R}^m$, we often choose $\beta$ and $\gamma$ to be the standard bases. Only in this special case is $[dF_a]_{\beta}^{\gamma}$ the $m \times n$ matrix whose $(i,j)$ entry is the partial derivative $\dfrac{\partial F_i}{\partial x_j}(a)$.
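Here is a sketch of how that $m \times n$ matrix can be assembled numerically, one column at a time: column $j$ is approximately $dF_a(e_j)$, i.e. the column of partial derivatives with respect to $x_j$. The helper `numerical_jacobian` is purely illustrative, not a standard library routine.

```python
import numpy as np

def numerical_jacobian(F, a, eps=1e-6):
    """Estimate [dF_a] in the standard bases by forward differences."""
    Fa = np.atleast_1d(F(a))
    m, n = Fa.size, a.size
    J = np.zeros((m, n))
    for j in range(n):
        e_j = np.zeros(n)
        e_j[j] = 1.0
        # column j  ~  dF_a(e_j)  =  (dF_1/dx_j, ..., dF_m/dx_j) at a
        J[:, j] = (np.atleast_1d(F(a + eps * e_j)) - Fa) / eps
    return J
```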


  1. Yes, and yes. For example, with all the preliminary stuff above, you can now say $\dfrac{\partial f}{\partial w}(a) = df_a(1,0,0,0)$; i.e. if you are at the point $a$ and you are displaced by an amount $(1,0,0,0)$, then the function will change by an amount $\Delta f_a(1,0,0,0) \approx df_a(1,0,0,0)$ (the $\approx$ of course means that we have an error term of $o(1,0,0,0)$).
  2. Now, with all of the work we have done above, we can be more precise and state the following: \begin{equation} \dfrac{\partial f}{\partial \boldsymbol{r}}(a) = \begin{bmatrix} \dfrac{\partial f}{\partial w}(a) & \dfrac{\partial f}{\partial x}(a) & \dfrac{\partial f}{\partial y}(a) & \dfrac{\partial f}{\partial z}(a) \end{bmatrix} \end{equation} is the matrix representation of the differential/total derivative/Fréchet derivative (these are all just names for the same linear transformation) $df_a: \Bbb{R^4} \to \Bbb{R}$, relative to the standard bases. Here the notation $\dfrac{\partial }{\partial \boldsymbol{r}}$ makes some sense, because $\boldsymbol{r} = (w,x,y,z)$... so if we play around with the symbols, things work out nicely, and you have partial derivatives inside the $1 \times 4$ matrix. However, $\dfrac{d}{d \boldsymbol{r}}$ would be bad, because it doesn't "nicely distribute" inside the matrix (the entries of the matrix are partial derivatives, not straight $d$'s).

So, once again to emphasise: $df_a$ IS the total derivative, while $\dfrac{\partial f}{\partial \boldsymbol{r}}(a) = [df_a]_{\beta}^{\gamma}$ is its matrix representation relative to the standard bases. Many people wouldn't bother with the distinction between a matrix and a linear transformation when dealing with Cartesian spaces, because we have a "standard basis" $\{e_1, \dots, e_n\}$. But of course, these are clearly different objects... one is a linear transformation, the other is a collection of numbers arranged in a funny way. So, what usually happens is that people who know this difference, and how the two are related, are sometimes too lazy to explain it, and hence simply say they are the same thing. This is why I said "depending on who you ask" in the comments.
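As a quick numerical illustration of this $1 \times 4$ matrix at work (the formula for $f$ and the point $a$ below are made up purely for illustration): applying the row of partials to a small displacement approximates the change in $f$.

```python
import numpy as np

def f(w, x, y, z):                      # hypothetical scalar function
    return w * x + np.sin(y) * z

def grad_f(w, x, y, z):                 # the 1x4 matrix [df_a], worked out by hand
    return np.array([x, w, np.cos(y) * z, np.sin(y)])

a = np.array([1.0, 2.0, 0.5, -1.0])     # a = (w, x, y, z)
h = np.array([1e-3, -2e-3, 5e-4, 1e-3]) # small displacement

delta_f = f(*(a + h)) - f(*a)           # Delta f_a(h)
linear  = grad_f(*a) @ h                # df_a(h) via the matrix representation
print(delta_f, linear)                  # nearly equal; the difference is o(h)
```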

  3. Now, you should be able to guess my answer: $\dfrac{\partial \boldsymbol{f}}{\partial \boldsymbol{r}}(a)$ is the matrix representation of $d \boldsymbol{f}_a$ relative to the standard bases of the domain ($\mathbb{R^n}$) and target space ($\Bbb{R^m}$) of $\boldsymbol{f}$. Once again, I wouldn't use the $\dfrac{d}{d \boldsymbol{r}}$ notation, because the elements of the matrix are partial derivatives, so using straight $d$ is just bad. The most accurate thing to write is \begin{equation} \dfrac{\partial \boldsymbol{f}}{\partial \boldsymbol{r}}(a) = [d \boldsymbol{f}_a]_{\beta}^{\gamma} \end{equation} You could also call this "the matrix representation of the differential $d \boldsymbol {f}_a$", or "the Jacobian matrix of $\boldsymbol{f}$ at $a$" for short. However, the term "pushforward" is one that's used in differential geometry to describe a certain linear transformation between tangent spaces of a manifold; it is the matrix representation of this linear map, relative to a choice of bases for the tangent spaces, that equals the Jacobian matrix of partial derivatives. If that made sense, great; otherwise, just take that to mean "pushforward" isn't appropriate in this context.

You are right that the rows of the matrix are the gradients of the component functions, but I don't think there is a gradient-like interpretation for the whole matrix.

Your last sentence under (3) seems to suggest you aren't satisfied with the "changes in output due to small input changes" / "first-order linear approximation" interpretation. However, this is really one of the most powerful ideas in differential calculus, and more generally in calculus on manifolds! If you're not convinced, re-read the preface to Chapter 3 of the book I referenced, and also read the preface to Chapter 9; you may appreciate the importance of this point of view more after doing so.

The answer to (4) is yes, provided you interpret it correctly. Note that $\dfrac{\partial f}{\partial \boldsymbol{p}}(a)$ is the $1 \times 2$ submatrix of $\dfrac{\partial f}{\partial \boldsymbol{r}}(a)$ obtained by selecting the first two columns (here, I am thinking of a row vector as a $1 \times n$ matrix). But also note that I've been constantly emphasising that $\dfrac{\partial f}{\partial \boldsymbol{r}}(a)$ is the matrix representation of $df_a$ relative to the standard bases. So, one might expect that this submatrix is actually the matrix representation of a certain linear transformation from $\Bbb{R^2}$ into $\Bbb{R}$. This is indeed the case, and this linear transformation is what approximates the change in $f$ when one group of vector variables is held constant while the other is varied. Hopefully this is satisfactory enough, but if not, you can read the construction of the so-called "partial differential" below, which makes these statements precise (see also Section 3.8 of the book).


Construction of the "partial differential" of a function.

Let $V_1, \dots, V_k, W$ be normed vector spaces. Define $V = V_1 \times \dots \times V_k$ (Cartesian product). Let $F: V \to W$ be a function, fix a point $a = (a_1, \dots, a_k) \in V$, and fix an index $i \in \{1, \dots, k\}$. Consider the new function $g: V_i \to W$ defined by \begin{equation} g(\xi) = F(a_1, \dots, a_{i-1}, \xi, a_{i+1}, \dots, a_k) \end{equation} If $g$ is differentiable at the point $a_i$ (in the sense defined at the beginning), then we say $F$ has an $i^{th}$ partial differential at $a$, and define \begin{equation} \partial^i F_a = dg_{a_i} \end{equation} So $\partial^i F_a$ is a linear transformation from $V_i$ into $W$ (because $dg_{a_i}$ is). (I only put the $i$ in superscript because the subscript position is already very cluttered... so there's no deep math here.)

As far as definitions go, that is perfectly rigorous, but it may be hard to digest. What we are doing is very similar to how partial derivatives are normally defined. We have a function $F$ of $k$ vector variables, and a point $a = (a_1, \dots, a_k)$ of interest. We keep all the entries fixed except the $i^{th}$ one, so $g$ only varies the $i^{th}$ variable of $F$. Then we consider the differentiability of $g$ at $a_i$. If you unwind the definitions carefully, you'll see that the equation \begin{equation} \Delta g_{a_i}(h) = dg_{a_i}(h) + o(h) \end{equation} is equivalent to \begin{equation} \Delta F_a(0, \dots, h, \dots, 0) = (\partial^i F_a)(h) + o(h), \end{equation} where $h$ is in the $i^{th}$ slot. Hence, $\partial^i F_a$ linearly approximates changes in $F$, up to an accuracy of little-oh, when you fix all the vector variables at $a_j$ for $j \neq i$ and only vary the $i^{th}$ variable around $a_i$.

So, for your particular example, you had $f: \mathbb{R^4} \to \Bbb{R}$, which, as you said, can be thought of as a function of two vector variables, i.e. $f: \Bbb{R^2} \times \Bbb{R^2} \to \Bbb{R}$, with $\boldsymbol{p}, \boldsymbol{q} \in \Bbb{R^2}$ (we are choosing $V_1 = V_2 = \Bbb{R^2}$ and $W = \Bbb{R}$ if we follow the notation of the construction above). So, if we now fix a point $a = ((a_1,a_2), (a_3,a_4)) \in \Bbb{R^2} \times \Bbb{R^2}$, then the $1 \times 2$ matrix $\dfrac{\partial f}{\partial \boldsymbol{p}}(a)$ is the matrix representation of $\partial^1 f_a : \Bbb{R^2} \to \Bbb{R}$ relative to the standard bases, while $\dfrac{\partial f}{\partial \boldsymbol{q}}(a)$ is the matrix representation of $\partial^2 f_a$ relative to the standard bases.
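Here is a sketch of this construction in Python (the formula for $f$ is again made up): freeze $\boldsymbol{q}$ at $(a_3, a_4)$, form $g(\boldsymbol{\xi}) = f(\xi_1, \xi_2, a_3, a_4)$, and check that the derivative of $g$ matches the $1 \times 2$ submatrix made of the first two entries of the full $1 \times 4$ matrix of partials.

```python
import numpy as np

def f(w, x, y, z):                           # hypothetical scalar function
    return w * x + np.sin(y) * z

def grad_f(w, x, y, z):                      # full 1x4 matrix of partials, by hand
    return np.array([x, w, np.cos(y) * z, np.sin(y)])

a = np.array([1.0, 2.0, 0.5, -1.0])          # a = ((a1, a2), (a3, a4))

def g(xi):                                   # freeze q = (a3, a4), vary only p
    return f(xi[0], xi[1], a[2], a[3])

# forward-difference estimate of dg at (a1, a2): this is  partial^1 f_a  in matrix form
eps = 1e-6
p0 = a[:2]
dg = np.array([(g(p0 + eps * e) - g(p0)) / eps for e in np.eye(2)])

print(dg)               # ~ first two entries of the full gradient
print(grad_f(*a)[:2])   # the 1x2 submatrix  df/dp (a)
```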

Similarly, you can group the vector variables in any other way you like.

This was long, but hopefully it is helpful.
