Deriving the Hessian as the best linear approximation

derivatives, hessian-matrix, matrices, optimization, vector-analysis

I'm having trouble understanding how to derive the Hessian based on the following definition of the derivative of a multivariable function (the derivation is from "An Introduction to Optimization", 4th edition, by Edwin Chong and Stanislaw Zak).

Any linear transformation from $R^n$ to $R^m$, and in particular the derivative $\mathcal{L}$ of $f: R^n \rightarrow R^m$, can be represented by an $m \times n$ matrix. To find the matrix representation $L$ of the derivative $\mathcal{L}$ of a differentiable function $f: R^n \rightarrow R^m$, we use the natural basis $\{e_1, \dots, e_n\}$ for $R^n$. Consider the vectors $$ x_j = x_0 + te_j, \quad j = 1, \dots, n. $$ By the definition of the derivative we have
$$ \lim_{t \to 0} \frac{f(x_j) - (tLe_j + f(x_0))}{t} = 0 $$
for $j = 1, \dots, n$. This means that
$$ \lim_{t \to 0} \frac{f(x_j) - f(x_0)}{t} = Le_j $$
But $Le_j$ is the $j$th column of the matrix $L$. On the other hand, the vector $x_j$ differs from $x_0$ only in the $j$th coordinate, and in that coordinate the difference is just the number $t$. Therefore, the left side of the preceding equation is the partial derivative
$$ \frac{\partial f}{\partial x_j}(x_0)$$
Because vector limits are computed by taking the limit of each coordinate function, it follows that if
$$
f(x) =
\begin{bmatrix}
f_1(x) \\
\vdots \\
f_m(x) \\
\end{bmatrix}
$$

then
$$
\frac{\partial f}{\partial x_j}(x_0) =
\begin{bmatrix}
\frac{\partial f_1}{\partial x_j}(x_0) \\
\vdots \\
\frac{\partial f_m}{\partial x_j}(x_0) \\
\end{bmatrix}
$$

And the matrix $L$ has the form
$$
\begin{bmatrix}
\frac{\partial f_1}{\partial x_1}(x_0) & \cdots & \frac{\partial f_1}{\partial x_n}(x_0) \\
\vdots & \ddots & \vdots \\
\frac{\partial f_m}{\partial x_1}(x_0) & \cdots & \frac{\partial f_m}{\partial x_n}(x_0) \\
\end{bmatrix}
$$

All of that makes sense, and I understand that if we assume that $f: R^n \rightarrow R$ then we get the row vector
$$ L =
\begin{bmatrix}
\frac{\partial f}{\partial x_1}(x_0) & \cdots & \frac{\partial f}{\partial x_n}(x_0)
\end{bmatrix}
$$

which is the best linear approximation for $f$ around $x_0$ in the sense that

$$ f(x) \approx f(x_0) + L(x_0)(x - x_0)$$

which gives the first two terms of the Taylor series centered at $x_0$.
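(As a quick self-check that I've understood this much, here is a small numerical experiment with numpy, using a toy $f: R^2 \rightarrow R$ of my own choosing; the difference quotients do reproduce the row vector $L$ and the first-order approximation above.)

```python
import numpy as np

# Toy example of my own: f(x, y) = x^2 * y + sin(y), whose row vector of
# partials at a point is Df(x) = [2*x*y, x^2 + cos(y)].
def f(x):
    return x[0]**2 * x[1] + np.sin(x[1])

def Df(x):
    return np.array([2 * x[0] * x[1], x[0]**2 + np.cos(x[1])])

x0 = np.array([1.0, 2.0])
t = 1e-6

# j-th entry of L via the quotient (f(x0 + t*e_j) - f(x0)) / t.
L_numeric = np.array([(f(x0 + t * e_j) - f(x0)) / t for e_j in np.eye(2)])
print(L_numeric)   # roughly [4.0, 0.5838...]
print(Df(x0))      # exact: [4.0, 1 + cos(2)]

# First-order approximation f(x) ~ f(x0) + L (x - x0) for x near x0.
x = x0 + np.array([0.01, -0.02])
print(f(x), f(x0) + Df(x0) @ (x - x0))   # the two values nearly agree
```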

However, the book then goes on to define the gradient as $\nabla f = Df(x)^T$ and then defines the Hessian as the derivative of the gradient

$$ D^2f = D(\nabla f) $$

It's this "defining the Hessian as the derivative of the gradient" that I don't understand.
Given the definition of the derivative they presented earlier as the best linear map (let's call this new map $K$), I would assume that $$ D^2f = \lim_{t \to 0} \frac{Df(x_0 + te_j) - Df(x_0)}{t} = Ke_j$$

But the left-hand side of this equation is a row vector, and the right-hand side is a column vector, which doesn't really make any sense. So my question is how do I use the definition of "best linear approximation" from above to differentiate $Df(x)$ and arrive at the Hessian?

I think I might also have a misunderstanding about which part of $Df(x)$ is the "function". I usually think of functions as mapping from some input space to some output space, and derivatives are also just functions from the same input space to the same output space, i.e.

$$ f(x) = x^2$$
$$ f'(x) = 2x $$

Where the derivative itself is a function that maps inputs to outputs. But if I assume that
$$ Df(x) = L =
\begin{bmatrix}
\frac{\partial f}{\partial x_1}(x_0) & \cdots & \frac{\partial f}{\partial x_n}(x_0)
\end{bmatrix}
$$
then this isn't a function as I normally understand it, since $f: R^n \rightarrow R$ but $L$ takes in some vector $x \in R^n$ and spits out a vector $L_0 \in R^n$. So if I'm wrong in understanding what the derivative of a multivariable function is, I would also appreciate clearing that up.

As for research I've done:

This question derives the Hessian using the limit definition of the gradient, but as stated above I'm not sure why we need to switch to the gradient to define the Hessian in the first place.

This question seems to insert a transpose into the derivative to make the numbers match the Hessian, which I don't understand the reasoning for.

I've also done some reading on Fréchet derivatives, which seem similar to (or the same as) the "best linear map" definition presented above, but I wasn't able to find anything that answered my question with that line of research either.

Best Answer

$ \newcommand\R{\mathbb R} \newcommand\Hom{\mathrm{Hom}} $I think you are getting confused by the use of matrices/coordinates here, which are completely unnecessary, so let us do away with them. There is also no reason to restrict our discussion to $\R^n$, since virtually nothing changes if we just use a general vector space, and in fact we will need this generality.

So let $V, W$ be finite-dimensional (real) normed vector spaces with norms $||\cdot||_V$ and $||\cdot||_W$. The choice of norms is actually irrelevant, since all norms on a given finite-dimensional vector space generate the same topology, meaning in particular that the notion of convergence is independent of the norm chosen. Now let $f : V \to W$ and choose $x \in V$. The differential of $f$ at $x$ (specifically called the Fréchet derivative, though I would personally avoid the term "derivative" here) is the unique linear transformation $Df_x : V \to W$ which "best approximates $f$"; specifically $$ \lim_{||h||_V \to 0}\frac{||f(x+h) - f(x) - Df_x(h)||_W}{||h||_V} = 0, $$ or in other words $[f(x+h)-f(x)]/||h||_V$ converges to $Df_x(h)/||h||_V$ under the norm $||\cdot||_W$ as $||h||_V \to 0$. For $\epsilon \in \R$ and fixed $h$ it follows that $$ Df_x(h) = \lim_{\epsilon\to0}\frac{f(x+\epsilon h)-f(x)}\epsilon. $$ Another way to state this is that if $||h||_V$ is "small enough" then $$ f(x + h) \approx f(x) + Df_x(h). $$ In this sense $Df_x(h)$ is the "change of $f$ in the direction $h$".
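(If a concrete check helps: here is a minimal numerical sketch with numpy, using a toy $f:\R^2\to\R^2$ of my own choosing, showing the quotient $[f(x+\epsilon h)-f(x)]/\epsilon$ converging to $Df_x(h)$, i.e. the Jacobian applied to $h$.)

```python
import numpy as np

# Toy f : R^2 -> R^2 of my own choosing, with its Jacobian for comparison.
def f(x):
    return np.array([x[0] * x[1], np.exp(x[0]) + x[1]**2])

def Jf(x):  # Jacobian matrix; Jf(x) @ h is the differential applied to h
    return np.array([[x[1], x[0]],
                     [np.exp(x[0]), 2 * x[1]]])

x = np.array([0.5, -1.0])
h = np.array([0.3, 0.7])

for eps in [1e-2, 1e-4, 1e-6]:
    quotient = (f(x + eps * h) - f(x)) / eps
    print(eps, quotient, Jf(x) @ h)   # quotient -> Df_x(h) as eps -> 0
```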

If we were to choose bases for $V, W$, then we could represent our vectors as column vectors and $Df_x$ as a matrix. If $V = \R^m$ and $W = \R^n$ and we choose the standard bases, the matrix of $Df_x$ is the Jacobian matrix.

The space $\Hom(V, W)$ of linear transformations is itself a finite-dimensional vector space, so we can also differentiate maps taking values in it. In particular $x \mapsto Df_x$ is a function $V \to \Hom(V, W)$; thus the second differential $D^2f_x$ of $f$ at $x$ is a multilinear function $V\times V \to W$ defined by $$ D^2f_x(h, k) = D[y \mapsto Df_y]_x(h)(k). $$ Note that $D[y \mapsto Df_y]_x : V \to \Hom(V, W)$, so first we evaluate on $h$ to get $D[y \mapsto Df_y]_x(h) : V \to W$, and then on $k$ to finally get $D^2f_x(h, k) \in W$. This is the best second-order, bilinear approximation to $f$, in the sense that if $h, k \in V$ with $||h||_V, ||k||_V$ "small" then simply using the definition of the differential gives $$ f(x + k + h) \approx f(x + k) + Df_{x+k}(h) \approx f(x) + Df_x(k) + Df_x(h) + D^2f_x(h,k). $$ We can also unravel the definition of $D^2f_x$ to get a limit expression for $D^2f_x(h,k)$ with fixed $h, k$: $$\begin{aligned} D^2f_x(h,k) &= \lim_{\eta\to0}\frac{Df_{x+\eta k}(h) - Df_x(h)}\eta \\ &= \lim_{\eta\to0}\lim_{\epsilon\to0} \frac{f(x+\eta k+\epsilon h) - f(x+\eta k) - f(x+\epsilon h) + f(x)} {\epsilon\eta}. \end{aligned}$$ In the particular case that $k = h$, we take $\eta = \epsilon$ to get $$ D^2f_x(h,h) = \lim_{\epsilon\to0} \frac{f(x+2\epsilon h) - 2f(x+\epsilon h) + f(x)} {\epsilon^2}. $$
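(Purely as a numerical sanity check of that last single-limit expression: a short sketch with numpy, using a toy scalar $f$ of my own choosing, comparing the difference quotient against $h^\top\mathbf H h$, where $\mathbf H$ is the matrix of second partials of this particular $f$ computed by hand.)

```python
import numpy as np

# Toy scalar field f : R^2 -> R of my own choosing.
def f(x):
    return x[0]**3 * x[1] + np.cos(x[1])

# Exact value of D^2 f_x(h, h) = h^T H(x) h, where H(x) is the matrix of
# second partials of this toy f, computed by hand.
def d2f(x, h):
    H = np.array([[6 * x[0] * x[1], 3 * x[0]**2],
                  [3 * x[0]**2,     -np.cos(x[1])]])
    return h @ H @ h

x = np.array([1.0, 2.0])
h = np.array([0.4, -0.3])

for eps in [1e-2, 1e-3, 1e-4]:
    quotient = (f(x + 2 * eps * h) - 2 * f(x + eps * h) + f(x)) / eps**2
    print(eps, quotient, d2f(x, h))   # quotient -> D^2 f_x(h, h) as eps -> 0
```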

It is uninteresting to differentiate $y \mapsto Df_x(y)$ because the differential of any linear function is itself; that is, if $f$ is linear then $$ Df_x(h) = f(h), $$ so in particular $$ D[y \mapsto Df_x(y)]_z(k) = Df_x(k). $$
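(A tiny numerical illustration of this point, with an arbitrary matrix $A$ of my own: for the linear map $f(x) = Ax$ the difference quotient is $Ah$ no matter how large or small $\epsilon$ is, up to floating-point rounding.)

```python
import numpy as np

# For a linear map f(x) = A x, the difference quotient equals A h exactly:
# (A(x + eps*h) - A x) / eps = A h, independent of eps.
A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
f = lambda x: A @ x

x = np.array([0.5, -1.5])
h = np.array([2.0, 1.0])

for eps in [1.0, 1e-3, 1e-9]:
    print((f(x + eps * h) - f(x)) / eps, A @ h)   # equal up to rounding
```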

Another name for a multilinear function is a tensor. When $V = W$, $D^2f_x$ is a type $(1,2)$-tensor (it takes in two vectors and spits out one); when $W = \R$, it is a type $(0,2)$-tensor; and generally when $W = V^{\otimes k}$ it is a type $(k,2)$-tensor.

Linear functions $V \to V$ are type $(1,1)$-tensors, so typically this is how matrices are thought of as well. However, given a $(0,2)$-tensor $T$, we can form a matrix $\mathbf T$ whose entries are $T(e_i, e_j)$, where $\{e_i\}_{i=1}^m$ is a basis for $V$; given vectors $v_1, v_2 \in V$ expressed as column vectors in this basis, $$ T(v_1, v_2) = v_1^{\top}\mathbf Tv_2. $$
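(A minimal concrete illustration with numpy, using an arbitrary bilinear $T$ on $\R^3$ of my own choosing: build $\mathbf T$ from the values $T(e_i, e_j)$ and check that $T(v_1, v_2) = v_1^\top\mathbf T v_2$.)

```python
import numpy as np

# An arbitrary (0,2)-tensor on R^3 of my own choosing, given directly as a
# bilinear function of two vectors.
def T(u, v):
    return 2 * u[0] * v[1] - u[2] * v[0] + 5 * u[1] * v[1]

# Its matrix in the standard basis has entries T(e_i, e_j).
E = np.eye(3)
T_mat = np.array([[T(E[i], E[j]) for j in range(3)] for i in range(3)])

v1 = np.array([1.0, -2.0, 0.5])
v2 = np.array([3.0, 0.25, -1.0])
print(T(v1, v2))          # direct evaluation of the bilinear function
print(v1 @ T_mat @ v2)    # same value via the matrix
```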

Now take $V = \R^m$ and $W = \R$ so that $f : \R^m \to \R$. Then $D^2f_x : \R^m\times\R^m \to \R$; the matrix corresponding to this $(0,2)$-tensor expressed in the standard basis of $\R^m$ is the Hessian.


The moral of the above story is this: the differential of some $f$ gives us a function $x \mapsto Df_x$, so we can also differentiate this to get $D^2f_x$. Both applications of $D$ also need a direction to differentiate along, so $D^2f_x$ takes two arguments, giving us $D^2f_x(h, k)$. If $f$ is a function taking vectors to scalars, then it turns out that there is a matrix $\mathbf H_x$ (dependent on $x$) such that $$ D^2f_x(h, k) = h^\top\mathbf H_x k, $$ and we call $\mathbf H_x$ the Hessian of $f$ at $x$.
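(Finally, to tie the coordinate-free picture back to the matrix: a short numerical sketch with numpy, using a toy $f:\R^3\to\R$ of my own choosing, checking that the iterated difference quotient for $D^2f_x(h,k)$ from above (taking $\eta = \epsilon$, which is fine for a smooth $f$) matches $h^\top\mathbf H_x k$, where $\mathbf H_x$ is the matrix of second partials.)

```python
import numpy as np

# Toy f : R^3 -> R of my own choosing.
def f(x):
    return x[0]**2 * x[1] + x[1] * x[2]**3

def hessian(x):  # matrix of second partials of this toy f, computed by hand
    return np.array([[2 * x[1],    2 * x[0],       0.0],
                     [2 * x[0],    0.0,            3 * x[2]**2],
                     [0.0,         3 * x[2]**2,    6 * x[1] * x[2]]])

x = np.array([1.0, -2.0, 0.5])
h = np.array([0.3, 0.1, -0.2])
k = np.array([-0.5, 0.4, 0.7])

eps = 1e-4
# Mixed difference quotient with eta = eps:
# [f(x + eps*k + eps*h) - f(x + eps*k) - f(x + eps*h) + f(x)] / eps^2
quotient = (f(x + eps * (h + k)) - f(x + eps * k) - f(x + eps * h) + f(x)) / eps**2
print(quotient)            # approximately h^T H_x k
print(h @ hessian(x) @ k)  # exact bilinear value
```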