Gradient of multivariate vector-valued function

multivariable-calculus, partial-derivative, vector-analysis, vector-spaces

How do you generally define the gradient of a multivariate vector-valued function with respect to two different vectors of different sizes?

My attempt has been (using notation from the Wikipedia page):

Given a vector function $z=f(x,y)$ where $x \in \mathbb R^{m \times 1}$, $y \in \mathbb R^{n \times 1}$, and $z \in \mathbb R^{p \times 1}$ are vectors with $m \neq n$, $n \neq p$, and $p \neq m$,
\begin{equation}
\nabla f(x,y) = \begin{bmatrix} \frac{\partial f}{\partial x}(x,y) \\ \frac{\partial f}{\partial y}(x,y) \end{bmatrix}
\end{equation}

However, $\frac{\partial f}{\partial x}(x,y)$ and $\frac{\partial f}{\partial y}(x,y)$ have sizes $(p \times m)$ and $(p \times n)$ respectively, and since $m \neq n$ they cannot be stacked vertically into a $(2 \times 1)$ block vector like the one shown above. Thus, this definition must be invalid.

What is the correct way of defining the gradient of a function like this? I have only been able to find one other question/source relating to this online, but it does not give a general answer for functions of vectors of different sizes. Any help would be very much appreciated.

Best Answer

I always advocate introducing derivatives (after calculus 101) via normed vector spaces, since it makes every other case a particular instance.

Let $\mathrm{U}, \mathrm{V}$ be two normed vector spaces and let $f: \mathrm{U} \to \mathrm{V}$ be any function. We say that $f$ is differentiable at a point $u \in \mathrm{U}$ if $f$ possesses a first-order expansion around $u,$ namely, if there exists a continuous linear function $L:\mathrm{U} \to \mathrm{V}$ such that for all $h$ in a neighbourhood of zero in $\mathrm{U},$ $$ f(u + h) = f(u) + L(h) + o(h), $$ where the "little-oh" notation $o(h)$ stands for a function such that $\lim\limits_{\substack{h \to 0 \\ h \neq 0}} \dfrac{o(h)}{\|h\|} = 0.$
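As a numerical sanity check of this definition (a sketch using a hypothetical map $f:\mathbf{R}^3 \to \mathbf{R}^2$ and its Jacobian, chosen only for illustration), one can verify that the remainder $f(u+h) - f(u) - L(h)$ is indeed small compared to $\|h\|$:

```python
import numpy as np

# A hypothetical smooth map f: R^3 -> R^2, for illustration only.
def f(u):
    return np.array([u[0] * u[1], np.sin(u[2])])

# Its derivative at u, identified with the canonical 2x3 Jacobian matrix.
def J(u):
    return np.array([[u[1], u[0], 0.0],
                     [0.0, 0.0, np.cos(u[2])]])

u = np.array([1.0, 2.0, 0.5])
h = 1e-6 * np.array([1.0, -1.0, 2.0])

# f(u + h) - f(u) - L(h) should be o(h): its norm divided by ||h||
# is much smaller than 1 and shrinks further as h shrinks.
remainder = f(u + h) - f(u) - J(u) @ h
print(np.linalg.norm(remainder) / np.linalg.norm(h))
```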

It can be shown that $L$ depends only on $u,$ $f$ and the topologies of the normed spaces $\mathrm{U}$ and $\mathrm{V},$ as such it is convenient to write it as $L = f'(u).$

You are wondering about the case when $\mathrm{U} = \mathrm{U}_1 \times \mathrm{U}_2$ is the product of two normed spaces. In this case, we need to talk about partial derivatives. For a given point $(u_1, u_2),$ introduce the partial functions $f(u_1, \cdot):\mathrm{U}_2 \to \mathrm{V}$ and $f(\cdot, u_2): \mathrm{U}_1 \to \mathrm{V}$ as follows: $$ f(u_1, \cdot):v_2 \mapsto f(u_1, v_2), \quad f(\cdot, u_2): v_1 \mapsto f(v_1, u_2). $$ We also introduce the canonical injections based at $(u_1, u_2)$ by $j_1:v_1 \mapsto (v_1, u_2)$ and $j_2:v_2 \mapsto (u_1, v_2).$ Then, we can write $$ f(u_1, \cdot) = f \circ j_2, \quad f(\cdot, u_2) = f \circ j_1. $$ The chain rule shows that if $f$ is differentiable at $(u_1, u_2),$ then the partial functions based at $(u_1, u_2)$ are also differentiable. Furthermore, since $j_1 = (0, u_2) + i_1,$ where $i_1$ is the linear function $i_1(v_1) = (v_1, 0),$ the derivative of $j_1$ is $j_1'(h_1) = i_1(h_1) = (h_1, 0),$ and the chain rule gives $(f \circ j_1)'(u_1)(h_1) = f'(j_1(u_1)) \cdot j_1'(h_1) = f'(u_1, u_2) \cdot (h_1, 0).$ The continuous linear function $h_1 \mapsto f'(u_1, u_2) \cdot (h_1, 0)$ is known as the first partial derivative of $f$ at $(u_1, u_2),$ written $\partial_{u_1} f;$ the second partial derivative is defined mutatis mutandis. This allows writing the fundamental relation between the "total" and "partial" derivatives: $$ f'(u_1, u_2)\cdot (h_1, h_2) = f'(u_1, u_2) \cdot (h_1, 0) + f'(u_1, u_2) \cdot (0, h_2) = \partial_{u_1} f(h_1) + \partial_{u_2} f(h_2). $$
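The fundamental relation can be checked numerically. Below is a sketch (a hypothetical map $f:\mathbf{R}^2 \times \mathbf{R}^3 \to \mathbf{R}^2$ and a home-made finite-difference Jacobian, none of which appears in the answer itself) that compares the total derivative acting on $(h_1, h_2)$ against the sum of the two partial derivatives:

```python
import numpy as np

# Hypothetical f: R^2 x R^3 -> R^2, for illustration only.
def f(u1, u2):
    return np.array([u1[0] * u2[0] + u2[1], u1[1] * np.exp(u2[2])])

def num_jacobian(g, x, eps=1e-6):
    """Forward-difference Jacobian of g at x (a rough numerical sketch)."""
    gx = g(x)
    J = np.zeros((gx.size, x.size))
    for k in range(x.size):
        step = np.zeros_like(x)
        step[k] = eps
        J[:, k] = (g(x + step) - gx) / eps
    return J

u1 = np.array([1.0, 2.0])
u2 = np.array([0.3, -1.0, 0.5])

# Total derivative via the stacked variable; partials via the partial maps
# f(., u2) = f o j1 and f(u1, .) = f o j2.
J_total = num_jacobian(lambda u: f(u[:2], u[2:]), np.concatenate([u1, u2]))
J1 = num_jacobian(lambda v1: f(v1, u2), u1)   # derivative of f(., u2)
J2 = num_jacobian(lambda v2: f(u1, v2), u2)   # derivative of f(u1, .)

h1 = np.array([0.1, -0.2])
h2 = np.array([0.05, 0.0, 0.1])
lhs = J_total @ np.concatenate([h1, h2])      # f'(u1,u2)(h1,h2)
rhs = J1 @ h1 + J2 @ h2                       # d1 f(h1) + d2 f(h2)
print(np.allclose(lhs, rhs, atol=1e-4))       # True
```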

When all the normed spaces are some Euclidean space (a.k.a. some $\mathbf{R}^n$), then we can identify every linear function with its canonical matrix. Suppose $\mathrm{U}_1 = \mathbf{R}^{p}, \mathrm{U}_2 = \mathbf{R}^q$ and $\mathrm{V} = \mathbf{R}^r.$ Then $\mathrm{U} = \mathbf{R}^{p+q}$ and so $f'(u_1, u_2)$ must be a linear function from $\mathbf{R}^{p+q}$ into $\mathbf{R}^r,$ namely a matrix of type $(r, p + q)$ ($r$ "rows" and $p+q$ "columns"). The above rule states that the first partial derivative corresponds to the first $p$ columns of the total derivative (since $(h_1, 0) \in \mathbf{R}^p \times \{0\}$), and the second partial derivative corresponds to the last $q$ columns ($(0, h_2) \in \{0\} \times \mathbf{R}^q$). Thus, $$ \nabla f(x,y) = \left[ \dfrac{\partial f}{\partial x}, \dfrac{\partial f}{\partial y} \right] $$ where the partial-derivative blocks are matrices of types $(r, p)$ and $(r, q)$ respectively.
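In code, this block structure is just horizontal concatenation. A minimal sketch (the matrices here are arbitrary stand-ins for the two partial-derivative blocks, not derived from any particular $f$):

```python
import numpy as np

# Shapes from the answer: x in R^p, y in R^q, f(x, y) in R^r.
p, q, r = 2, 3, 4
A = np.arange(r * p, dtype=float).reshape(r, p)   # stand-in for df/dx, type (r, p)
B = np.ones((r, q))                               # stand-in for df/dy, type (r, q)

# The total derivative is the horizontal concatenation [df/dx, df/dy],
# a single (r, p + q) matrix acting on the stacked vector (h1, h2).
J = np.hstack([A, B])
print(J.shape)  # (4, 5)

h1 = np.array([1.0, 2.0])
h2 = np.array([0.5, -1.0, 3.0])
# Acting on the stacked vector splits into the two partial contributions.
assert np.allclose(J @ np.concatenate([h1, h2]), A @ h1 + B @ h2)
```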

Note. Often authors do the following without ever mentioning it. Suppose $f:\mathbf{R}^n \to \mathbf{R}.$ From what I said above, we must have $$ \nabla f = \left[ \partial_{x_1} f, \ldots, \partial_{x_n} f \right] $$ since the matrix representing the derivative of $f$ must represent a linear function from $\mathbf{R}^n$ into $\mathbf{R},$ so it is of type $(1, n).$ However, there is a strong convention that this object ought to be a vector, and as such, people write its transpose; hence the confusion you had.
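The two conventions can be made concrete in a few lines (a sketch with a hypothetical scalar field, for illustration only): the derivative is a $(1, n)$ row matrix, while the "gradient vector" is its transpose, and both produce the same number when applied to a direction $h$.

```python
import numpy as np

# Hypothetical scalar field f: R^3 -> R, for illustration only.
def f(x):
    return x[0]**2 + 3.0 * x[1] * x[2]

x = np.array([1.0, 2.0, 0.5])

# The derivative f'(x) is a linear map R^3 -> R: a (1, 3) row matrix.
Df = np.array([[2.0 * x[0], 3.0 * x[2], 3.0 * x[1]]])
print(Df.shape)  # (1, 3)

# The "gradient vector" convention transposes it into a (3, 1) column,
# so f'(x)h equals the inner product <grad f, h>.
grad = Df.T
h = np.array([0.1, -0.1, 0.2])
assert np.isclose(Df @ h, grad[:, 0] @ h)  # same number, two notations
```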
