[Math] Clarification of Textbook Explanation of Hessian Matrix, Directional Second Derivative, and Eigenvalues/Eigenvectors

Tags: eigenvalues-eigenvectors, hessian-matrix, machine-learning, multivariable-calculus, vector-analysis

My machine learning textbook has the following section on the Hessian matrix:

When our function has multiple input dimensions, there are many second derivatives. These derivatives can be collected together into a matrix called the Hessian matrix. The Hessian matrix $\mathbf{H}(f)(\mathbf{x})$ is defined such that

$$\mathbf{H}(f)(\mathbf{x})_{i, j} = \dfrac{\partial^2}{\partial{x_i}\partial{x_j}} f(\mathbf{x}).$$

Equivalently, the Hessian is the Jacobian of the gradient.

Anywhere that the second partial derivatives are continuous, the differential operators are commutative; that is, their order can be swapped:

$$\dfrac{\partial^2}{\partial{x_i}\partial{x_j}} f(\mathbf{x}) = \dfrac{\partial^2}{\partial{x_j}\partial{x_i}} f(\mathbf{x}) $$

This implies that $\mathbf{H}_{i, j} = \mathbf{H}_{j, i}$, so the Hessian matrix is symmetric at such points. Most of the functions we encounter in the context of deep learning have a symmetric Hessian almost everywhere. Because the Hessian matrix is real and symmetric, we can decompose it into a set of real eigenvalues and an orthogonal basis of eigenvectors. The second derivative in a specific direction represented by a unit vector $\mathbf{d}$ is given by $\mathbf{d}^T \mathbf{H} \mathbf{d}$. When $\mathbf{d}$ is an eigenvector of $\mathbf{H}$, the second derivative in that direction is given by the corresponding eigenvalue. For other directions of $\mathbf{d}$, the directional second derivative is a weighted average of all the eigenvalues, with weights between $0$ and $1$, and eigenvectors that have a smaller angle with $\mathbf{d}$ receiving more weight. The maximum eigenvalue determines the maximum second derivative, and the minimum eigenvalue determines the minimum second derivative.

Goodfellow, Ian; Bengio, Yoshua; Courville, Aaron. Deep Learning (Page 84).
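For concreteness, here is how I picture these objects for a simple quadratic $f(\mathbf{x}) = \frac{1}{2}\mathbf{x}^T A \mathbf{x}$, whose Hessian is just the constant symmetric matrix $A$ (my own NumPy sketch with made-up numbers, not from the book):

```python
import numpy as np

# Hessian of f(x) = 0.5 * x^T A x is the constant symmetric matrix A.
A = np.array([[4.0, 1.0],
              [1.0, 2.0]])

# Real symmetric matrix: eigh returns real eigenvalues (ascending) and
# an orthonormal set of eigenvectors (the columns of E).
eigvals, E = np.linalg.eigh(A)

# Directional second derivative along a unit vector d is d^T H d.
d = E[:, 0]                                # an eigenvector
print(d @ A @ d, eigvals[0])               # equal: the corresponding eigenvalue

d = np.array([1.0, 1.0]) / np.sqrt(2.0)    # some other unit direction
print(d @ A @ d)                           # lies between the two eigenvalues
```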

I understood everything until this part:

For other directions of $\mathbf{d}$, the directional second derivative is a weighted average of all the eigenvalues, with weights between $0$ and $1$, and eigenvectors that have a smaller angle with $\mathbf{d}$ receiving more weight. The maximum eigenvalue determines the maximum second derivative, and the minimum eigenvalue determines the minimum second derivative.

None of this makes any sense to me.

For instance, what is meant by "other directions of $\mathbf{d}$"? $\mathbf{d}$ is a unit vector and therefore already has a direction, so the phrase "other directions of $\mathbf{d}$" makes no sense to me.

Also, why does the maximum eigenvalue determine the maximum second derivative, and the minimum eigenvalue the minimum second derivative? I've studied elementary linear algebra (including an introduction to eigenvalues and eigenvectors), but this part is not clear to me.

I would greatly appreciate it if people could please take the time to clarify this section.

Best Answer

If you have studied elementary linear algebra, you may have learned that for a real symmetric matrix there is a choice of orthonormal basis $e_1, \ldots, e_n$ consisting of eigenvectors such that the matrix, with respect to this basis, is diagonal with the eigenvalues as the diagonal entries. After ordering the basis vectors appropriately, we may assume the eigenvalues are sorted by size, i.e.

$$H = \begin{pmatrix} \lambda_1 & 0 & \dots & 0 \\ 0 & \lambda_2 & \dots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & \dots & 0 & \lambda_n \end{pmatrix}$$

with $\lambda_1 \ge \lambda_2 \ge \dots\ge \lambda_n$.

If you then choose $d$ among the $e_i$, say $d = e_i$, then since $He_i = \lambda_i e_i$ and $e_i^T e_i = 1$, $$e_i^T H e_i = e_i^T(\lambda_i e_i) = \lambda_i.$$

If you choose $d$ as a general linear combination of the $e_i$, say $d = \sum_i t_i e_i$ (with $\sum_i t_i^2 = 1$, since otherwise $d$ would not be a unit vector), it is also easy to see that $$ d^T H d = \sum_i \lambda_i t_i^2, $$

which is the weighted average your authors are referring to: the weights are the $t_i^2$, which lie between $0$ and $1$ and sum to $1$, and since $t_i = e_i^T d$ is the cosine of the angle between $d$ and $e_i$, eigenvectors that make a smaller angle with $d$ receive more weight.
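If it helps, here is a quick numerical sanity check of this identity (a NumPy sketch with a random symmetric matrix standing in for $H$; the setup is mine, not from the book):

```python
import numpy as np

rng = np.random.default_rng(0)

# A random real symmetric matrix standing in for H.
M = rng.standard_normal((5, 5))
H = (M + M.T) / 2

eigvals, E = np.linalg.eigh(H)    # columns of E are the orthonormal e_i

# A random unit vector d and its coordinates t_i = e_i^T d in the eigenbasis.
d = rng.standard_normal(5)
d /= np.linalg.norm(d)
t = E.T @ d

print(d @ H @ d)                  # directional second derivative d^T H d
print(np.sum(eigvals * t**2))     # the weighted average  sum_i lambda_i t_i^2
print(np.sum(t**2))               # the weights t_i^2 are >= 0 and sum to 1
```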

From the diagonal representation you can also easily derive the statement about the maximum and minimum values of $d \mapsto d^T H d$.
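Explicitly: since the weights $t_i^2$ are nonnegative and sum to $1$ for a unit vector $d$,

$$\lambda_n = \lambda_n \sum_i t_i^2 \;\le\; \sum_i \lambda_i t_i^2 = d^T H d \;\le\; \lambda_1 \sum_i t_i^2 = \lambda_1,$$

so over all unit vectors the maximum of $d^T H d$ is $\lambda_1$, attained at $d = e_1$, and the minimum is $\lambda_n$, attained at $d = e_n$.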
