The motivation/application of dual spaces and transposes


I've always been baffled as to where transposes come from. I found this question, but the answer isn't satisfying to me – the idea seems to be "dual spaces are important, and you can define transposes using those". This leaves two questions:

  1. Why are dual spaces important?
  2. Whatever it is that we want to do with dual spaces, how does the transpose help us accomplish that?

For point (1), my linear algebra teacher told me something else that I didn't find quite satisfying, which is that if you're interested in linear transformations, then the functionals making up the dual space are the "simplest kind" of linear transformation. This is quite vague though… what actual problems might we want to solve in which the concept of the dual space would arise naturally? And how would the concept of a transpose arise naturally from those?

Best Answer

One way to motivate dual spaces and transposes is to consider the differentiation of scalar-valued functions of several variables. The basic point is that functionals are the easiest functions to deal with short of constant functions, so differentiation essentially amounts to approximation by a unique functional, with an error that is sufficiently well behaved. Moreover, transposes arise naturally when differentiating, say, the composition of a scalar-valued function with a change of coordinates.

Let $f : (a,b) \to \mathbb{R}$. Conventionally, one defines $f$ to be differentiable at $x \in (a,b)$ if the limit $$ \lim_{h \to 0} \frac{f(x+h)-f(x)}{h} $$ exists, in which case the value of that limit is defined to be the derivative $f^\prime(x)$ of $f$ at $x$. Observe, however, that this definition amounts to saying that $$ f(x+h)-f(x) = f^\prime(x)h + R_x(h), $$ where $h \mapsto f^\prime(x)h$ defines a linear transformation $df_x :\mathbb{R} \to \mathbb{R}$ approximating $f$ near $x$, and where the error term $R_x(h)$ satisfies $$ \lim_{h \to 0} \frac{R_x(h)}{h} = 0. $$ In fact, $f$ is differentiable at $x$ if and only if there exists a linear transformation $T : \mathbb{R} \to \mathbb{R}$ such that $$ \lim_{h \to 0} \frac{\lvert f(x+h) - f(x) - T(h) \rvert}{\lvert h \rvert} = 0, $$ in which case $df_x := T$ is unique and is given by multiplication by the scalar $f^\prime(x) = T(1)$.
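As a sanity check, here is a minimal numerical sketch of this characterisation; the choice $f = \sin$ and the point $x = 0.7$ are mine, purely for illustration. The quotient $R_x(h)/h$ should tend to $0$ as $h \to 0$:

```python
import numpy as np

# A toy check of the "best linear approximation" characterisation, with
# f = sin and the point x = 0.7 chosen arbitrarily for illustration.
f = np.sin
x = 0.7
df_x = np.cos(x)  # the scalar f'(x), so df_x(h) = cos(x) * h

# The error R_x(h) = f(x+h) - f(x) - f'(x) h should vanish faster than h.
for h in [1e-1, 1e-2, 1e-3, 1e-4]:
    R = f(x + h) - f(x) - df_x * h
    print(f"h = {h:.0e},  R_x(h)/h = {R / h:.3e}")
```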

Now, let $f : U \to \mathbb{R}^m$, where $U$ is an open subset of $\mathbb{R}^n$. Then we can define, exactly as before, $f$ to be differentiable at $x \in U$ if and only if there exists a linear transformation $T : \mathbb{R}^n \to \mathbb{R}^m$ such that $$ \lim_{h \to 0} \frac{\| f(x+h) - f(x) - T(h) \|}{\|h\|} = 0, $$ in which case $df_x := T$ is unique; in particular, $$ f(x+h) - f(x) = df_x(h) + R_x(h), $$ where $df_x$ gives a linear approximation of $f$ near $x$, and where the error term $R_x(h)$ satisfies $$ \lim_{h \to 0} \frac{R_x(h)}{\|h\|} = 0. $$
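The same characterisation can be tested numerically in several variables. In the sketch below, the map $f : \mathbb{R}^2 \to \mathbb{R}^2$ and its hand-computed Jacobian (the matrix of $df_x$ in the standard bases) are illustrative choices of mine, not anything canonical:

```python
import numpy as np

# An illustrative f : R^2 -> R^2 together with its hand-computed Jacobian.
def f(v):
    x, y = v
    return np.array([x**2 * y, np.sin(x) + y])

def jacobian(v):  # the matrix of df_v in the standard bases
    x, y = v
    return np.array([[2 * x * y, x**2],
                     [np.cos(x), 1.0]])

x0 = np.array([0.5, -1.2])
u = np.array([0.3, 0.9])  # a fixed direction, scaled toward 0

# The quotient ||f(x+h) - f(x) - df_x(h)|| / ||h|| should tend to 0.
for t in [1e-1, 1e-2, 1e-3, 1e-4]:
    h = t * u
    q = np.linalg.norm(f(x0 + h) - f(x0) - jacobian(x0) @ h) / np.linalg.norm(h)
    print(f"||h|| = {np.linalg.norm(h):.1e},  quotient = {q:.3e}")
```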

Finally, let's specialise to the case where $f : U \to \mathbb{R}$, i.e., where $m=1$. If $f$ is differentiable at $x$, then $df_x : \mathbb{R}^n \to \mathbb{R}$ is linear, and hence $df_x \in (\mathbb{R}^n)^\ast$ by definition. In particular, for any $v \in \mathbb{R}^n$, the directional derivative $$ \nabla_v f(x) := \lim_{\epsilon \to 0} \frac{f(x+\epsilon v) - f(x)}{\epsilon} $$ exists and is given by $$ \nabla_v f(x) = df_x(v). $$ Moreover, the gradient of $f$ at $x$ is exactly the unique vector $\nabla f(x) \in \mathbb{R}^n$ such that $$ \forall v \in \mathbb{R}^n, \quad df_x(v) = \langle \nabla f(x), v \rangle. $$ In any event, the derivative of a scalar-valued function of $n$ variables at a point is most naturally understood as a functional on $\mathbb{R}^n$, i.e., as an element of $(\mathbb{R}^n)^\ast$.
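Here is a small sketch of this identity, with a scalar field $f : \mathbb{R}^3 \to \mathbb{R}$, a point $x$, and a direction $v$ all chosen arbitrarily: the difference quotient defining $\nabla_v f(x)$ should match the functional $df_x$ applied to $v$, i.e., $\langle \nabla f(x), v \rangle$:

```python
import numpy as np

# An illustrative scalar field f : R^3 -> R with a hand-computed gradient.
def f(v):
    return v[0]**2 + np.sin(v[1] * v[2])

def grad_f(v):
    c = np.cos(v[1] * v[2])
    return np.array([2 * v[0], v[2] * c, v[1] * c])

x = np.array([1.0, 0.3, -0.8])
v = np.array([0.2, -1.0, 0.5])  # an arbitrary direction
eps = 1e-6

# Directional derivative via a symmetric difference quotient...
numeric = (f(x + eps * v) - f(x - eps * v)) / (2 * eps)
# ...should match the functional applied to v: df_x(v) = <grad f(x), v>.
print(numeric, grad_f(x) @ v)
```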

Now, suppose, for simplicity, that $f : \mathbb{R}^n \to \mathbb{R}$ is everywhere differentiable, and let $S : \mathbb{R}^p \to \mathbb{R}^n$ be a linear transformation, e.g., a coordinate change $\mathbb{R}^n \to \mathbb{R}^n$. Then $f \circ S$ is also everywhere differentiable, with derivative $$d(f \circ S)_y = df_{Sy} \circ S = S^t\, df_{Sy}$$ at $y \in \mathbb{R}^p$, where the transpose $S^t : (\mathbb{R}^n)^\ast \to (\mathbb{R}^p)^\ast$ is precisely the map $\varphi \mapsto \varphi \circ S$. On the one hand, if $S = 0$, then $f \circ S$ is constant with value $f(0)$, so that $d(f \circ S)_y = 0 = S^t\, df_{Sy}$, as required. On the other hand, if $S \neq 0$, so that $$ \|S\| := \sup_{k \neq 0} \frac{\|Sk\|}{\|k\|} > 0, $$ then for $k$ with $Sk \neq 0$, $$ \frac{\lvert (f \circ S)(y+k) - (f \circ S)(y) - (df_{Sy} \circ S)(k) \rvert}{\|k\|} = \frac{\lvert f(Sy + Sk) - f(Sy) - df_{Sy}(Sk) \rvert}{\|k\|} \leq \|S\|\frac{\lvert f(Sy + Sk) - f(Sy) - df_{Sy}(Sk) \rvert}{\|Sk\|} \to 0, \quad k \to 0, $$ since $Sk \to 0$ as $k \to 0$ and $f$ is differentiable at $Sy$; and when $Sk = 0$, the numerator on the left vanishes outright, so the limit holds over all $k \to 0$.

More concretely, once you know that $f \circ S$ is differentiable everywhere, then for each $k \in \mathbb{R}^p$, by linearity of $S$, $$ (f \circ S)(y + \epsilon k) = f(Sy + \epsilon Sk), $$ so that, indeed, $$ \left(d(f \circ S)_y\right)(k) = \nabla_k(f \circ S)(y) = \nabla_{Sk}f(Sy) = df_{Sy}(Sk) = (S^t\, df_{Sy})(k). $$ In general, if $S : \mathbb{R}^p \to \mathbb{R}^n$ is everywhere differentiable (again, for simplicity) but not necessarily linear, then $$ d(f \circ S)_y = df_{S(y)} \circ dS_y = (dS_y)^t\, df_{S(y)}, $$ which is none other than the relevant case of the chain rule.
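To see the transpose doing its work numerically: identifying the functional $df_x$ with the gradient vector $\nabla f(x)$ as above, the identity $d(f \circ S)_y = S^t\, df_{Sy}$ becomes $\nabla(f \circ S)(y) = S^T \nabla f(Sy)$, with $S^T$ now the ordinary matrix transpose. The sketch below checks this against finite differences; $f$, $S$, and $y$ are again arbitrary choices of mine:

```python
import numpy as np

# Same illustrative f and grad_f as above; S : R^2 -> R^3 is an arbitrary
# linear map. Identifying functionals with gradient vectors, the transpose
# S^t on functionals becomes the matrix transpose S.T.
def f(v):
    return v[0]**2 + np.sin(v[1] * v[2])

def grad_f(v):
    c = np.cos(v[1] * v[2])
    return np.array([2 * v[0], v[2] * c, v[1] * c])

S = np.array([[1.0, 2.0],
              [0.5, -1.0],
              [3.0, 0.0]])  # linear S : R^2 -> R^3
y = np.array([0.4, -0.7])

# Left: the transpose formula grad(f o S)(y) = S.T @ grad f(Sy).
lhs = S.T @ grad_f(S @ y)

# Right: finite differences on the composite f o S, coordinate by coordinate.
eps = 1e-6
rhs = np.array([(f(S @ (y + eps * e)) - f(S @ (y - eps * e))) / (2 * eps)
                for e in np.eye(2)])
print(lhs)
print(rhs)
```

Note how the matrix transpose shows up only because we insist on representing the functional $d(f \circ S)_y$, which naturally lives in $(\mathbb{R}^p)^\ast$, as a gradient vector back in the domain of $S$.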