Solved – Why do saddle points become “attractive” in Newtonian dynamics

deep learning, gradient descent, machine learning, neural networks, optimization

I am reading Identifying and attacking the saddle point problem in high-dimensional non-convex optimization by Dauphin et al., and the first paragraph on the second page states the following:

A typical problem for both local minima and saddle-points is that they are often surrounded by plateaus of small curvature in the error. While gradient descent dynamics are repelled away from a saddle point to lower error by following directions of negative curvature, this repulsion can occur slowly due to the plateau. Second order methods, like the Newton method, are designed to rapidly descend plateaus surrounding local minima by rescaling gradient steps by the inverse eigenvalues of the Hessian matrix. However, the Newton method does not treat saddle points appropriately; as argued below, saddle-points instead become attractive under the Newton dynamics.

I am unable to understand the last two sentences of this paragraph, more specifically:

  1. I understand that second-order methods like Newton's method take the curvature of the loss surface into account while searching for minima. I am lost when the authors mention "…inverse eigenvalues of the Hessian matrix." Is there a more approachable way to understand this?
  2. Why, in general, do saddle points become attractive under the Newton dynamics?

I would highly appreciate it if someone could offer intuitive explanations for the above questions.

Best Answer

  1. There's not really a more intuitive way to think about this. Suppose you have the eigendecomposition of the Hessian, with $P$ an orthogonal matrix of eigenvectors and $D$ a diagonal matrix of eigenvalues: $$ \begin{align} \nabla^2 f(x) &= PDP^\top \\ \left[\nabla^2 f(x)\right]^{-1} &= PD^{-1}P^\top \end{align} $$ This is relevant to Newton's method because the update is given by $$ \begin{align} x^{(t+1)} &= x^{(t)}-\left[\nabla^2 f(x^{(t)})\right]^{-1}\nabla f(x^{(t)}) \\ &= x^{(t)}-PD^{-1}P^\top\nabla f(x^{(t)}) \end{align} $$ In other words, the gradient is expressed in the eigenbasis of the Hessian and each component is rescaled by the reciprocal of the corresponding eigenvalue, so directions of small curvature (the plateaus) receive large steps while directions of large curvature receive small ones (see the first sketch below the list).

  2. Saddle points have gradient $0$, and Newton's method seeks points with gradient $0$. Note also that dividing by a negative eigenvalue flips the sign of the step along that direction of negative curvature, so instead of being repelled from the saddle as in gradient descent, the iterate is pulled toward it. If the problem is non-convex, then depending on the starting point, you may find yourself in the "basin of attraction" of a saddle point (see the second sketch below).
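
Here is a minimal numerical sketch of point 1 (NumPy assumed; the $2\times 2$ Hessian and the point $x$ are hypothetical, chosen only for illustration). It forms the Newton step as $-PD^{-1}P^\top\nabla f(x)$ and shows that the gradient component along the flat direction is amplified by $1/\lambda$ while the steep direction is damped:

```python
import numpy as np

# Hypothetical quadratic f(x) = 0.5 * x^T H x with one steep and one flat direction.
H = np.array([[10.0, 0.0],
              [0.0,  0.1]])
x = np.array([1.0, 1.0])
grad = H @ x                          # gradient of the quadratic at x

# Eigendecomposition H = P D P^T (P orthogonal, D = diag(eigvals)).
eigvals, P = np.linalg.eigh(H)

# Newton step: -P D^{-1} P^T grad, i.e. each gradient component in the
# eigenbasis is rescaled by the reciprocal of its eigenvalue.
newton_step = -P @ ((P.T @ grad) / eigvals)

print("plain gradient step:", -grad)        # [-10.  -0.1] -> dominated by the steep direction
print("Newton step:        ", newton_step)  # [-1. -1.]    -> flat direction scaled up, steep one down
```

The plain gradient step barely moves along the flat ($\lambda = 0.1$) direction, whereas the Newton step treats both directions on an equal footing; this is the rapid descent of plateaus the paper refers to.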
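And a toy sketch of point 2 (again NumPy, on the hypothetical saddle $f(x, y) = \tfrac{1}{2}(x^2 - y^2)$, whose Hessian has eigenvalues $+1$ and $-1$). Gradient descent drifts away from the saddle along the negative-curvature $y$-direction, while the Newton update divides that component by the negative eigenvalue, flipping its sign and landing exactly on the saddle:

```python
import numpy as np

H = np.diag([1.0, -1.0])                     # Hessian: one positive, one negative eigenvalue
x = np.array([0.5, 0.5])                     # start near, but not at, the saddle at the origin
grad = H @ x                                 # gradient of the quadratic at x

gd_next = x - 0.1 * grad                     # small gradient-descent step
newton_next = x - np.linalg.solve(H, grad)   # Newton step

print("gradient descent:", gd_next)      # [0.45 0.55] -> the y-coordinate grows, escaping the saddle
print("Newton:          ", newton_next)  # [0. 0.]     -> lands exactly on the saddle point
```

For any quadratic, the Newton step jumps straight to the stationary point, whether it is a minimum, a maximum, or a saddle; that is precisely why saddle points become attractive under the Newton dynamics.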
