It turns out that twice-differentiability implies that the Hessian is symmetric even without convexity and with no reference to whether the second-order partial derivatives are continuous! The proof below is based on Theorem 8.12.2 in the book Foundations of Modern Analysis by Dieudonné (1969, p. 180).
Claim: Let $U\subseteq\mathbb R^n$ be an open set and $f:U\to\mathbb R$ a function. Suppose that $f$ is (Fréchet) differentiable on $U$ and that it is twice (Fréchet) differentiable at $\mathbf x_0\in U$. Then, the Hessian matrix $\mathbf H(\mathbf x_0)$ at $\mathbf x_0$ is symmetric.
Proof: Let $\mathbf D:U\to\mathbb R^n$ denote the gradient function of $f$. Fix $\varepsilon>0$. Since $\mathbf D$ is Fréchet differentiable at $\mathbf x_0$ by assumption, it follows that there exists some $\delta>0$ such that $\|\mathbf v\|<4\delta$ implies that
$$\left\|\mathbf D(\mathbf x_0+\mathbf v)-\mathbf D(\mathbf x_0)-\mathbf H(\mathbf x_0)\cdot\mathbf v\right\|\leq\varepsilon\|\mathbf v\|.$$
There is no loss of generality in taking $\delta$ to be so small that the open ball $B(4\delta,\mathbf x_0)$ is contained in the open set $U$.
For any $i,j\in\{1,\ldots,n\}$, let $\mathbf e_i$ and $\mathbf e_j$ be the corresponding standard basis vectors of unit length. Let $\mathbf s\equiv\delta\mathbf e_i$ and $\mathbf t\equiv\delta\mathbf e_j$. It is clear that $\mathbf x_0+\xi\mathbf s+\mathbf t$ and $\mathbf x_0+\xi\mathbf s$ are both in $U$ whenever $\xi\in[0,1]$; this is because $\|\xi\mathbf s+\mathbf t\|<4\delta$ and $\|\xi\mathbf s\|<4\delta$. Define the following function $g:[0,1]\to\mathbb R$:
$$g(\xi)\equiv f(\mathbf x_0+\xi\mathbf s+\mathbf t)-f(\mathbf x_0+\xi\mathbf s)\quad\forall\xi\in[0,1].$$
Clearly, $g$ is continuous on $[0,1]$ and differentiable on $(0,1)$. Lagrange's mean-value theorem, in turn, implies that there exists some $\xi\in(0,1)$ such that
$$g(1)-g(0)=g'(\xi)=\mathbf s\cdot\left[\mathbf D(\mathbf x_0+\xi\mathbf s+\mathbf t)-\mathbf D(\mathbf x_0+\xi\mathbf s)\right],$$
where the second equality follows from the chain rule.
Next, one can derive the following chain of inequalities (the first one uses the Cauchy–Schwarz inequality):
\begin{align*}
&\left|g(1)-g(0)-\mathbf s\cdot\mathbf H(\mathbf x_0)\cdot\mathbf t\right|\\
&\qquad\leq\underbrace{\|\mathbf s\|}_{=\delta}\left\|[\mathbf D(\mathbf x_0+\xi\mathbf s+\mathbf t)-\mathbf D(\mathbf x_0)]-[\mathbf D(\mathbf x_0+\xi\mathbf s)-\mathbf D(\mathbf x_0)]-\mathbf H(\mathbf x_0)\cdot\mathbf t\right\|\\
&\qquad=\delta\left\|[\mathbf D(\mathbf x_0+\xi\mathbf s+\mathbf t)-\mathbf D(\mathbf x_0)-\mathbf H(\mathbf x_0)\cdot(\xi\mathbf s+\mathbf t)]-[\mathbf D(\mathbf x_0+\xi\mathbf s)-\mathbf D(\mathbf x_0)-\mathbf H(\mathbf x_0)\cdot(\xi\mathbf s)]\right\|\\
&\qquad\leq\delta\varepsilon\left(\|\xi\mathbf s+\mathbf t\|+\|\xi\mathbf s\|\right)<8\delta^2\varepsilon.
\end{align*}
That is, one has that
$$|f(\mathbf x_0+\mathbf s+\mathbf t)-f(\mathbf x_0+\mathbf s)-f(\mathbf x_0+\mathbf t)+f(\mathbf x_0)-\delta^2\mathbf e_i\cdot\mathbf H(\mathbf x_0)\cdot\mathbf e_j|<8\delta^2\varepsilon,$$
and, by a completely analogous and symmetric reasoning in which $\mathbf s$ and $\mathbf t$ are interchanged,
$$|f(\mathbf x_0+\mathbf s+\mathbf t)-f(\mathbf x_0+\mathbf s)-f(\mathbf x_0+\mathbf t)+f(\mathbf x_0)-\delta^2\mathbf e_j\cdot\mathbf H(\mathbf x_0)\cdot\mathbf e_i|<8\delta^2\varepsilon.$$
Given that $\mathbf e_i\cdot\mathbf H(\mathbf x_0)\cdot\mathbf e_j=h_{ij}(\mathbf x_0)\equiv\partial^2 f/(\partial x_i\partial x_j)(\mathbf x_0)$, the preceding two inequalities imply that
$$\left|h_{ij}(\mathbf x_0)-h_{ji}(\mathbf x_0)\right|<16\varepsilon.$$
Taking $\varepsilon$ to be arbitrarily small, one sees that $h_{ij}(\mathbf x_0)=h_{ji}(\mathbf x_0)$. $\blacksquare$
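To see the key quantity of the proof in action, here is a small numerical sketch (not part of the argument; it assumes NumPy and an arbitrarily chosen smooth $f$ on $\mathbb R^2$): the second difference $f(\mathbf x_0+\mathbf s+\mathbf t)-f(\mathbf x_0+\mathbf s)-f(\mathbf x_0+\mathbf t)+f(\mathbf x_0)$ is symmetric in $\mathbf s$ and $\mathbf t$ by inspection, and dividing it by $\delta^2$ approximates $h_{ij}(\mathbf x_0)$.

```python
import numpy as np

# Example function f : R^2 -> R (any twice-differentiable f would do).
def f(x):
    return np.exp(x[0]) * np.sin(x[1]) + x[0] ** 3 * x[1]

x0 = np.array([0.3, -0.7])
delta = 1e-4
i, j = 0, 1
s = delta * np.eye(2)[i]  # s = delta * e_i
t = delta * np.eye(2)[j]  # t = delta * e_j

# The second difference from the proof; note it is symmetric in s and t.
second_diff = f(x0 + s + t) - f(x0 + s) - f(x0 + t) + f(x0)
approx_h_ij = second_diff / delta ** 2

# Analytic mixed partial of the f above at x0, for comparison.
exact_h_ij = np.exp(x0[0]) * np.cos(x0[1]) + 3 * x0[0] ** 2
print(approx_h_ij, exact_h_ij)  # the two numbers agree to several digits
```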
$\newcommand\R{\mathbb R}\newcommand\Hom{\mathrm{Hom}}$
I think you are getting confused by the use of matrices and coordinates here,
which are completely unnecessary,
so let us do away with them.
There is also no reason to restrict our discussion to $\R^n$,
since virtually nothing changes if we just use a general vector space,
and in fact we will need this generality below.
So let $V, W$ be finite-dimensional (real) normed vector spaces
with norms $||\cdot||_V$ and $||\cdot||_W$.
The choice of norms is actually irrelevant
since all norms on a given finite dimensional vector space
generate the same topology, meaning in particular that the notion of
convergence is independent of the norm chosen.
Now let $f : V \to W$ and choose $x \in V$.
The differential of $f$ at $x$
(this is what is called the Fréchet derivative,
though I personally prefer to avoid the term "derivative" here)
is the unique linear transformation $Df_x : V \to W$
which "best approximates $f$" near $x$; specifically,
$$
\lim_{||h||_V \to 0}\frac{||f(x+h) - f(x) - Df_x(h)||_W}{||h||_V} = 0,
$$
or in other words $[f(x+h)-f(x)]/||h||_V$ converges to $Df_x(h)/||h||_V$
under the norm $||\cdot||_W$ as $||h||_V \to 0$.
For fixed $h$, taking the increment to be $\epsilon h$ with $\epsilon \in \R$, it follows that
$$
Df_x(h) = \lim_{\epsilon\to0}\frac{f(x+\epsilon h)-f(x)}\epsilon.
$$
Another way to state this is that if $||h||_V$ is "small enough" then
$$
f(x + h) \approx f(x) + Df_x(h).
$$
In this sense $Df_x(h)$ is the "change of $f$ in the direction $h$".
If we were to choose bases for $V, W$,
then we could represent our vectors as column vectors and $Df_x$ as a matrix.
If $V = \R^m$ and $W = \R^n$ and we choose the standard bases,
the matrix of $Df_x$ is the Jacobian matrix.
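As a concrete illustration (a sketch only, assuming NumPy; the $f$ below is chosen purely for demonstration), the matrix of $Df_x$ can be approximated column by column from the limit above, since column $j$ is the change of $f$ in the direction $e_j$:

```python
import numpy as np

# Illustrative f : R^2 -> R^2.
def f(x):
    return np.array([x[0] ** 2 * x[1], np.sin(x[0] * x[1])])

def jacobian(f, x, eps=1e-6):
    """Approximate the matrix of Df_x in the standard bases: column j is
    the directional difference quotient of f at x along e_j."""
    n = len(x)
    cols = [(f(x + eps * e) - f(x)) / eps for e in np.eye(n)]
    return np.column_stack(cols)

x = np.array([0.5, 1.2])
print(jacobian(f, x))
# Analytic Jacobian of the f above, for comparison:
# [[2*x*y,        x**2      ],
#  [y*cos(x*y),   x*cos(x*y)]]
```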
The space $\Hom(V, W)$ of linear transformations
is itself a finite-dimensional vector space,
so we can also differentiate functions taking values in it.
In particular, $x \mapsto Df_x$ is a function $V \to \Hom(V, W)$;
thus the second differential $D^2f_x$ of $f$ at $x$
is a bilinear function $V\times V \to W$ defined by
$$
D^2f_x(h, k) = D[y \mapsto Df_y]_x(k)(h).
$$
Note that $D[y \mapsto Df_y]_x : V \to \Hom(V, W)$,
so first we evaluate on $k$ to get $D[y \mapsto Df_y]_x(k) : V \to W$,
and then on $h$ to finally get $D^2f_x(h, k) \in W$.
This is the best second-order, bilinear approximation to $f$,
in the sense that if $h, k \in V$ with $||h||_V, ||k||_V$ "small"
then simply using the definition of the differential gives
$$
f(x + k + h)
\approx f(x + k) + Df_{x+k}(h)
\approx f(x) + Df_x(k) + Df_x(h) + D^2f_x(h,k).
$$
We can also unravel the definition of $D^2f_x$ to get a limit expression
for $D^2f_x(h,k)$ with fixed $h, k$:
$$\begin{aligned}
D^2f_x(h,k)
&= \lim_{\eta\to0}\frac{Df_{x+\eta k}(h) - Df_x(h)}\eta
\\
&= \lim_{\eta\to0}\lim_{\epsilon\to0}
\frac{f(x+\eta k+\epsilon h) - f(x+\eta k) - f(x+\epsilon h) + f(x)}
{\epsilon\eta}.
\end{aligned}$$
In the particular case that $k = h$, we can take $\eta = \epsilon$ (collapsing the iterated limit into a single one, which is justified when $f$ is twice differentiable at $x$) to get
$$
D^2f_x(h,h)
= \lim_{\epsilon\to0}
\frac{f(x+2\epsilon h) - 2f(x+\epsilon h) + f(x)}
{\epsilon^2}.
$$
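As a sanity check on this last formula, here is a quick numerical sketch (assuming NumPy; the particular $f$ below is just an example whose second derivative along $h$ is easy to write in closed form):

```python
import numpy as np

# Illustrative f : R^2 -> R^2.
def f(x):
    return np.array([x[0] ** 2 * x[1], x[0] * x[1] ** 2])

x = np.array([0.4, -1.1])
h = np.array([1.0, 2.0])
eps = 1e-4

# Second-difference quotient for D^2 f_x(h, h).
approx = (f(x + 2 * eps * h) - 2 * f(x + eps * h) + f(x)) / eps ** 2

# Analytic value: d^2/dt^2 f(x + t h) at t = 0, computed component-wise.
exact = np.array([2 * h[0] ** 2 * x[1] + 4 * h[0] * h[1] * x[0],
                  2 * h[1] ** 2 * x[0] + 4 * h[0] * h[1] * x[1]])
print(approx, exact)  # agree to a few digits
```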
It is uninteresting to differentiate $y \mapsto Df_x(y)$ (with $x$ held fixed)
because the differential of a linear function is the function itself;
that is, if $f$ is linear then
$$
Df_x(h) = f(h),
$$
so in particular
$$
D[y \mapsto Df_x(y)]_z(k) = Df_x(k).
$$
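A tiny numerical illustration of this point (assuming NumPy; the matrix $M$ below is arbitrary): for a linear $f$ the difference quotient does not depend on $x$ or on the step size at all.

```python
import numpy as np

# A linear map f : R^3 -> R^2, given by a fixed matrix M.
M = np.array([[1.0, -2.0, 0.5],
              [0.0,  3.0, 1.0]])

def f(x):
    return M @ x

x = np.array([0.3, -1.0, 2.0])
h = np.array([1.0, 1.0, -0.5])

for eps in (1.0, 0.25, 1e-3):
    # For linear f the quotient equals f(h) for every eps (up to round-off),
    # so the limit, i.e. Df_x(h), is just f(h).
    print((f(x + eps * h) - f(x)) / eps, f(h))
```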
Another name for a multilinear function is a tensor.
When $V = W$,
$D^2f_x$ is a type $(1,2)$-tensor (it takes in two vectors and spits out one);
when $W = \R$, it is a type $(0,2)$-tensor;
and generally when $W = V^{\otimes k}$ it is a type $(k,2)$-tensor.
Linear functions $V \to V$ are type $(1,1)$-tensors,
so typically this is how matrices are thought of as well.
However, given a $(0,2)$-tensor $T$,
we can form a matrix $\mathbf T$ whose entries are $T(e_i, e_j)$
where $\{e_i\}_{i=1}^m$ is a basis for $V$;
given vectors $v_1, v_2 \in V$ expressed as column vectors in this basis,
$$
T(v_1, v_2) = v_1^{\top}\mathbf Tv_2.
$$
Now take $V = \R^m$ and $W = \R$ so that $f : \R^m \to \R$.
Then $D^2f_x : \R^m\times\R^m \to \R$;
the matrix corresponding to this $(0,2)$-tensor
expressed in the standard basis of $\R^m$ is the Hessian.
The moral of the above story is this: the differential of some $f$ gives us a function $x \mapsto Df_x$, so we can also differentiate this to get $D^2f_x$. Both applications of $D$ also need a direction to differentiate along, so $D^2f_x$ has two arguments giving us $D^2f_x(h, k)$. If $f$ is a function taking vectors to scalars, then it turns out that there is a matrix $\mathbf H_x$ (dependent on $x$) such that
$$
D^2f_x(h, k) = h^\top\mathbf H_x k,
$$
and we call $\mathbf H_x$ the Hessian of $f$ at $x$.
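Here is a short numerical sketch tying the two pictures together (assuming NumPy; the scalar $f$ below and its Hessian are just an illustrative example): the double-difference quotient for $D^2f_x(h,k)$ matches $h^\top\mathbf H_x k$.

```python
import numpy as np

# Scalar example f : R^3 -> R.
def f(x):
    return x[0] * x[1] ** 2 + np.exp(x[2]) * x[0]

def hessian(x):
    """Analytic Hessian of the f above at x."""
    return np.array([[0.0,          2 * x[1],  np.exp(x[2])        ],
                     [2 * x[1],     2 * x[0],  0.0                 ],
                     [np.exp(x[2]), 0.0,       x[0] * np.exp(x[2]) ]])

x = np.array([0.7, -0.2, 0.1])
h = np.array([1.0, 0.5, -1.0])
k = np.array([0.3, 2.0, 1.5])
eps = 1e-4

# Double-difference approximation of D^2 f_x(h, k).
approx = (f(x + eps * k + eps * h) - f(x + eps * k)
          - f(x + eps * h) + f(x)) / eps ** 2

print(approx, h @ hessian(x) @ k)  # the two values agree to a few digits
```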
Best Answer
If $A=\left[\begin{smallmatrix}a&b\\c&d\end{smallmatrix}\right]$, then expanding
$f(x,y)=\begin{pmatrix}x-x_0&y-y_0\end{pmatrix}A\begin{pmatrix}x-x_0\\y-y_0\end{pmatrix}$
term by term gives
\begin{multline}
f(x,y)=ax^2-2axx_0+ax_0^{\,2}+bxy+cxy-bx_0y-cx_0y+\\
+dy^2-bxy_0-cxy_0+bx_0y_0+cx_0y_0-2dyy_0+dy_0^{\,2},
\end{multline}
and the Hessian of this expression is
$$\begin{bmatrix}2a&b+c\\b+c&2d\end{bmatrix}.$$
And this is equal to
$$\begin{bmatrix}a&c\\b&d\end{bmatrix}+\begin{bmatrix}a&b\\c&d\end{bmatrix}=A^\top+A.$$
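A quick symbolic check of this computation (a sketch, assuming SymPy is available):

```python
import sympy as sp

x, y, x0, y0, a, b, c, d = sp.symbols('x y x_0 y_0 a b c d')

A = sp.Matrix([[a, b], [c, d]])
v = sp.Matrix([x - x0, y - y0])

# f(x, y) = (x - x0, y - y0) A (x - x0, y - y0)^T
f = (v.T * A * v)[0, 0]

H = sp.hessian(f, (x, y))
print(sp.simplify(H - (A + A.T)))  # prints the zero matrix
```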