[Math] Derivatives of multivariable functions

derivatives, multivariable-calculus, partial-derivative, real-analysis

I would like to make a few statements about a simple object – the derivative of a univariate function – and then apply and relate its features, and my understanding of them, to multivariate functions.

Univariate functions. The derivative of a real function $f: {\mathrm R} \to {\mathrm R}$ at the point $a \in {\mathrm R}$ is the slope of the function at this point; that is, how much the function value changes with respect to a change in the variable, or

$$f'(a) = \lim_{h\to0} \frac{f(a + h) - f(a)}{h}.$$

The derivative of a real function $f: {\mathrm R} \to {\mathrm R}$ is the function $$f': a \mapsto f'(a)$$ that maps a point to the slope of the function $f$ at that point.
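To make the definition concrete, here is a minimal numerical sketch of mine in Python (the example function and step size are my own choices, not part of the question): the difference quotient with a small fixed $h$ approximates $f'(a)$.

```python
# Approximate f'(a) by the difference quotient with a small step h.
def diff_quotient(f, a, h=1e-6):
    return (f(a + h) - f(a)) / h

f = lambda x: x**3                # example function, f'(x) = 3x^2
print(diff_quotient(f, 2.0))      # ~ 12.0
```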

The derivative at a point is not itself the tangent to the function graph at that point, but it is closely related to it. The tangent at the point $a$ can be expressed as

$$t(x) = f(a) + f'(a)(x-a),$$

which happens to be the best linear approximation of the function $f$ around $a$, or the first-degree Taylor polynomial $T^{f,a}_1$.

The term $f'(a)(x-a)$ is linear in the increment $x-a$; that is, the map $h \mapsto f'(a)h$ is linear.
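A quick numerical illustration of the "best linear approximation" claim (a sketch of mine; the function $\sin$ and the point $a = 1$ are arbitrary choices): the error of the tangent is $o(x-a)$, so the ratio error$/(x-a)$ shrinks to $0$ as $x \to a$.

```python
import math

# Tangent t(x) = f(a) + f'(a)(x - a); the ratio error/(x - a) -> 0.
f, fprime, a = math.sin, math.cos, 1.0
t = lambda x: f(a) + fprime(a) * (x - a)
for dx in (1e-1, 1e-2, 1e-3):
    print(dx, (f(a + dx) - t(a + dx)) / dx)   # ratios shrink toward 0
```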

Multivariate functions. Let $f: {\mathrm R}^n \to {\mathrm R}^m$, where $m,\ n \in {\mathrm N}$. We can consider partial derivatives of $f$ at $a \in {\mathrm R}^n$, defined for example as

$${\partial f\over\partial x_i}(a) = \lim_{h \to 0} \frac{f(a + h{\bf e^i}) - f(a)}{h},$$

that is, the derivative of the function at $a$ with respect to $x_i$, with the other variables held constant, where ${\bf e^i} = (0, \dots, 0, 1, 0, \dots, 0)$ (the $1$ in the $i$-th position).

These are derivatives of single-variable partial functions, and therefore what I wrote in the first section applies to them as well.
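In code this is exactly the same difference quotient as before, just taken along a standard basis vector (a sketch of mine, assuming Python with numpy; the example function anticipates the one used in the answer below).

```python
import numpy as np

# Partial derivative of f at a with respect to x_i, along e_i.
def partial(f, a, i, h=1e-6):
    e = np.zeros_like(a)
    e[i] = 1.0
    return (f(a + h * e) - f(a)) / h

f = lambda v: v[0]**2 * v[1]       # f(x, y) = x^2 y
a = np.array([1.0, 2.0])
print(partial(f, a, 0), partial(f, a, 1))   # ~ 4.0 and ~ 1.0
```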

The gradient of the function at a point (for $m = 1$) is the vector of partial derivatives at that point, i.e.

$$\nabla f(a) = \Big({\partial f\over\partial x_1}(a), \dots, {\partial f\over\partial x_n}(a)\Big).$$

Its geometric meaning is that it points in the direction of steepest ascent, while its norm is the rate of growth in that direction.
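This can be checked numerically (a sketch of mine, assuming numpy; the gradient value is that of $x^2y$ at $(1,2)$): among unit directions $u$, the directional derivative $\nabla f(a) \cdot u$ is maximized when $u$ points along the gradient.

```python
import numpy as np

# Scan unit directions; the best one aligns with the gradient.
grad = np.array([4.0, 1.0])        # gradient of x^2 y at (1, 2)
angles = np.linspace(0.0, 2.0 * np.pi, 3600, endpoint=False)
dirs = np.stack([np.cos(angles), np.sin(angles)], axis=1)
best = dirs[np.argmax(dirs @ grad)]
print(best, grad / np.linalg.norm(grad))   # nearly the same direction
```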

The equivalent of the derivative at a point seems to be what is called the total differential at that point. If the linear map $L$ is the total differential of $f$ at $a \in {\mathrm R}^n$, then

$$\lim_{\bf h \to 0} \frac{||f(a + h) - f(a) - L(h)||}{||h||} = 0,$$

where $||\cdot||$ is the Euclidean norm, which means that $L$ has the "approximative property" – it approximates the difference $f(a + h) - f(a)$ locally. If the total differential exists, it can be expressed (for $m = 1$) as $L(h) = \nabla{f}(a) \cdot h$, where $\cdot$ is the dot product.
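The defining limit can also be observed numerically (again a sketch of mine, assuming numpy; $f(x,y) = x^2y$ at $a = (1,2)$ is an arbitrary example): the error ratio shrinks with $||h||$.

```python
import numpy as np

# Check ||f(a + h) - f(a) - L(h)|| / ||h|| -> 0 for L(h) = grad . h.
f = lambda v: v[0]**2 * v[1]
a = np.array([1.0, 2.0])
grad = np.array([4.0, 1.0])        # (2xy, x^2) at (1, 2)

direction = np.random.default_rng(0).standard_normal(2)
for scale in (1e-1, 1e-2, 1e-3):
    h = scale * direction
    print(scale, abs(f(a + h) - f(a) - grad @ h) / np.linalg.norm(h))
```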

(I think: the total differential does not approximate the function itself, only its increment – in this it resembles the "derivative of a function at a point" for univariate functions.)

I have come to understand that the derivative of a multivariate function at a point can be defined in exactly the same way as the total differential at a point. (For some reason we have only defined the total differential for functions ${\mathrm R}^n \to {\mathrm R}$. Is this a mathematical restriction, or a problem of terminology?)

When I try to look at the derivative $f'$ of the function $f$, I should see that:

  • $f': {\mathrm R}^n \to {\mathscr L}(\mathrm{R}^n, \mathrm{R}^m)$ – and I do; this simply states that the derivative at a point is the linear map that locally approximates the original function $f$; but also that
  • $f'': {\mathrm R}^n \to {\mathscr L}\Big({\mathrm R}^n, {\mathscr L}({\mathrm R}^n, {\mathrm R}^m)\Big)$, which is driving me crazy.

I would like to ask:

  1. Why is the total differential not called simply the derivative?
  2. Why is it true that $f'': {\mathrm R}^n \to {\mathscr L}\Big({\mathrm R}^n, {\mathscr L}({\mathrm R}^n, {\mathrm R}^m)\Big)$? I need an intuitive way of understanding what ${\mathscr L}\Big({\mathrm R}^n, {\mathscr L}({\mathrm R}^n, {\mathrm R}^m)\Big)$ is.

Thanks!

Best Answer

Let $X,Y$ and $Z$ be vector spaces, and $\mathcal{L}(A,B)$ the space of all linear maps from $A$ to $B$.

As noted above, if $F \in \mathcal{L}(X,\mathcal{L}(Y,Z))$, then we can form another map $F_{curry}:X \times Y \to Z$ defined by $F_{curry}(x,y) = F(x)(y)$. Notice that $F_{curry}$ is a bilinear mapping: fixing $x$, $F_{curry}(x,\cdot)$ is linear in the second slot, and fixing $y$, $F_{curry}(\cdot,y)$ is linear in the first slot. Conversely, given a bilinear mapping $G: X \times Y \to Z$, I can produce an element $G_{uncurry} \in \mathcal{L}(X,\mathcal{L}(Y,Z))$ in the way you would expect: $G_{uncurry}(x)(y) = G(x,y)$. The keyword here is "currying".
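The correspondence is easy to see in code (a sketch of mine in plain Python, with $X = Y = \mathbb{R}^2$, $Z = \mathbb{R}$, and the dot product as the bilinear map – all arbitrary choices):

```python
# F in L(X, L(Y, Z))  <->  bilinear map X x Y -> Z.
def to_bilinear(F):
    return lambda x, y: F(x)(y)

def to_nested(G):
    return lambda x: (lambda y: G(x, y))

dot = lambda x, y: sum(a * b for a, b in zip(x, y))   # a bilinear map
F = to_nested(dot)
print(F([1, 2])([3, 4]), to_bilinear(F)([1, 2], [3, 4]))   # both 11
```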

So $\mathcal{L}(X,\mathcal{L}(Y,Z))$ can be canonically identified with the space of bilinear mappings $X \times Y \to Z$. These in turn can be identified with linear mappings from $X \otimes Y$ to $Z$, where $X \otimes Y$ is the so-called "tensor product" of $X$ and $Y$, but I will not go into that.

You might be curious how you could work with such an object. What data do you need to write down? For a linear map, you only have to specify its action on a basis, but a bilinear map is not a linear map. It turns out (you should check) that specifying its action on all pairs of basis vectors is enough.
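Concretely (a sketch of mine, assuming numpy; the matrix entries are arbitrary): collecting the values $G(e_i, e_j)$ in a matrix $B$ pins $G$ down completely, since bilinearity gives $G(x,y) = \sum_{i,j} x_i y_j \, G(e_i, e_j)$.

```python
import numpy as np

# B[i, j] = G(e_i, e_j) determines the bilinear map G on R^2 x R^2.
B = np.array([[1.0, 2.0],
              [3.0, 4.0]])         # arbitrary values on basis pairs
x, y = np.array([1.0, -1.0]), np.array([2.0, 0.5])
direct = x @ B @ y                 # G(x, y) via the matrix
expanded = sum(x[i] * y[j] * B[i, j] for i in range(2) for j in range(2))
print(direct, expanded)            # equal
```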

Let's get back down to earth and examine a very special case. Let $f:\mathbb{R}^2 \to \mathbb{R}$ be defined by $f(x,y) = x^2y$.

$D(f)\big|_{(x,y)}$ is the linear map given by the matrix $\left[ \begin{matrix} 2xy&x^2\end{matrix} \right]$. That is to say, $D(f)\big|_{(x,y)}(\Delta x,\Delta y) = 2xy\Delta x + x^2\Delta y \approx f(x+\Delta x,y+\Delta y) - f(x,y)$. Notice that the transpose of this matrix is the "gradient" of $f$. Only functions from $\mathbb{R}^n \to \mathbb{R}$ have gradients.
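You can verify the approximation numerically (a sketch of mine, assuming numpy; the point and increments are arbitrary):

```python
import numpy as np

# Df|_(x,y) = [2xy, x^2] approximates increments of f(x, y) = x^2 y.
f = lambda x, y: x**2 * y
x, y = 1.0, 2.0
Df = np.array([2 * x * y, x**2])            # [4, 1] at (1, 2)
dx, dy = 1e-3, -2e-3
print(f(x + dx, y + dy) - f(x, y))          # actual change
print(Df @ np.array([dx, dy]))              # linear prediction: 0.002
```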

The second derivative should now tell you how much the derivative changes from point to point. If we increment $(x,y)$ by a little bit to $(x+\Delta x,y)$ then we should expect the derivative to increase by about $\left[ \begin{matrix} 2y\Delta x&2x \Delta x\end{matrix} \right]$. Similarly, when we increase $y$ by $\Delta y$, the derivative should change by about $\left[ \begin{matrix} 2x \Delta y&0\Delta y\end{matrix} \right]$.

By linearity, if we change from $(x,y)$ to $(x+\Delta x,y+\Delta y)$, we expect the derivative to change by $$\left[ \begin{matrix} \Delta x&\Delta y\end{matrix} \right] \left[ \begin{matrix} 2y&2x\\2x&0\end{matrix} \right]$$

This product is a row matrix giving the approximate change in the derivative. You can then apply it to another vector if you so wish.
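Numerically (a sketch of mine, assuming numpy; point and increments arbitrary), the derivative row at the shifted point is indeed approximated by the old row plus $[\Delta x \ \Delta y]$ times the matrix above:

```python
import numpy as np

# Derivative row of x^2 y at a shifted point vs. first-order prediction.
Df = lambda x, y: np.array([2 * x * y, x**2])
x, y, dx, dy = 1.0, 2.0, 1e-3, 5e-4
H = np.array([[2 * y, 2 * x],
              [2 * x, 0.0]])
print(Df(x + dx, y + dy))                   # exact new derivative
print(Df(x, y) + np.array([dx, dy]) @ H)    # nearly equal
```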

Summing it up, if you wanted to see approximately how much the derivative changes from $(x,y)$ to $(x+\Delta x_2,y+\Delta y_2)$ when both are evaluated in the same direction $(\Delta x_1,\Delta y_1)$, you would perform the computation:

$$\left[ \begin{matrix} \Delta x_2&\Delta y_2\end{matrix} \right] \left[ \begin{matrix} 2y&2x\\2x&0\end{matrix} \right] \left[ \begin{matrix} \Delta x_1\\\Delta y_1\end{matrix} \right]$$

The matrix of second partials derived above is called the Hessian, but it is a bit misleading to write it as a matrix, since it is really acting as a bilinear form in the manner shown above, i.e. $H(v_1,v_2) = v_1^T H v_2$. You may remember seeing the Hessian arise in multivariable calculus when classifying critical points as maxima, minima, or saddles. In general, the signs of the eigenvalues of the Hessian matrix tell the whole story (although, if there are some zero eigenvalues, you might have to climb up the derivative ladder to trilinear forms, etc.).
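As a small worked check (a sketch of mine, assuming numpy; note $(1,2)$ is not a critical point of $x^2y$, this only illustrates the bilinear form and the eigenvalue signs):

```python
import numpy as np

# Hessian of x^2 y at (1, 2) as a bilinear form, plus eigenvalue signs.
H = np.array([[4.0, 2.0],
              [2.0, 0.0]])         # [[2y, 2x], [2x, 0]] at (1, 2)
v1, v2 = np.array([1.0, 1.0]), np.array([1.0, -1.0])
print(v1 @ H @ v2)                 # H(v1, v2) = 4.0
print(np.linalg.eigvalsh(H))       # ~ [-0.83, 4.83]: indefinite form
```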

Notice that I only got a Hessian "matrix" because the codomain of $f$ was one-dimensional. If it had been, say, $3$-dimensional, I would have needed $3$ such matrices, and they would naturally align themselves into a $2\times2\times3$ box, which would represent a higher-order tensor.
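Such a box is handled the same way in code (a sketch of mine, assuming numpy; the entries are arbitrary placeholders, not a real second derivative): contracting a $2\times2\times3$ array against two direction vectors leaves a vector in $\mathbb{R}^3$.

```python
import numpy as np

# Contract a 2 x 2 x 3 "Hessian box" against two directions -> R^3.
H = np.arange(12.0).reshape(2, 2, 3)        # placeholder entries
v1, v2 = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(np.einsum('i,ijk,j->k', v1, H, v2))
```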

Hopefully this gives at least a hint of how to continue. Buzzwords to look for are "multilinear algebra", "tensor products", "tensors", "tensor analysis", and "multivariable Taylor's theorem".

I do not have a super great reference for this because, even though I do analysis in Several Complex Variables, I have somehow never found a book that treats higher dimensional real analysis really well. I am sure there are books out there, but I have worked out most of this stuff on my own. As far as I know there was never a course offered on it at any university I went to! I guess people are supposed to sort of absorb this stuff when they learn differential geometry.