Question on Hessian Matrix and Gradient Descent Step

gradient-descent, hessian-matrix, numerical-methods, numerical-optimization, optimization

Some context before my question

I'm trying to make some object tracking software where the tracking is achieved by minimizing an energy function. The energy is a function of the pose parameters of the object. That means I'm trying to find the parameters that give me a transformation matrix (rotation and translation) that minimizes the energy. I express my transformations using twist coordinates (from screw theory) as the parameters.


Minimizing that function led me to some numerical analysis/optimization methods which I'm not familiar with. Basically, I first came across gradient descent. I get that gradient descent gives me a vector $V$ (the negative gradient) that points in a direction of descent towards a local minimum. The tricky part is the step. I wanted to find a coefficient to multiply $V$ by so that convergence is fast and without overstepping. That's where Newton's method came into play, and I realized that by multiplying $V$ by the inverse of the Hessian of the function, $H^{-1}$, I make a good step towards the minimum. However, this is giving me weird results, so I'm trying to debug my code and see what's wrong.

Of course, the debugging is my job. But while trying to find what's wrong I came up with a question. Instead of using a coefficient to limit my step, I now multiply $V$ by the matrix $H^{-1}$. Doesn't the result have a different direction from the initial $V$? That means we no longer travel along the initial direction. So, if $H^{-1}$ doesn't just scale the step, what does it do and how does it help us in this case? How does it correct the initial vector $V$?

I think that after applying gradient descent for one iteration, I would end up with another $V$ in the second iteration. Is the Hessian kind of "predicting" the next vectors of gradient descent and trying to give us one vector that combines them?
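To make the direction question concrete, here is a tiny numerical check on a toy quadratic (not my actual energy), $f(x, y) = x^2 + 10y^2$. The names `H`, `grad`, and `newton_step` are just illustrative; the point is that $H^{-1}\nabla f$ points in a different direction than $\nabla f$, and that single Newton step lands exactly on the minimum:

```python
import numpy as np

# Toy anisotropic quadratic f(x, y) = x^2 + 10*y^2, minimum at the origin.
H = np.array([[2.0, 0.0],
              [0.0, 20.0]])              # constant Hessian of f

x = np.array([1.0, 1.0])                 # current point
grad = H @ x                             # gradient of f at x: (2x, 20y)

newton_step = np.linalg.solve(H, grad)   # H^{-1} * grad, computed without forming H^{-1}

print("gradient direction:   ", grad / np.linalg.norm(grad))
print("Newton step direction:", newton_step / np.linalg.norm(newton_step))
print("x - Newton step:      ", x - newton_step)   # lands exactly at the minimum (0, 0)
```

Here the raw gradient is dominated by the steep $y$ direction, while the Newton step rescales each direction by its curvature and heads straight for $(0, 0)$.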

I hope I explained it well enough.


TLDR

Basic gradient descent does this: $x_{k+1}=x_k - a\,f'(x_k)$, where $a$ is the step. A better iteration is Newton's method, $x_{k+1}=x_k-\frac{f'(x_k)}{f''(x_k)}$, which for multivariable functions becomes $\vec X_{k+1}=\vec X_k-H^{-1}\nabla f$. Multiplying by $H^{-1}$ is not as simple as multiplying by a coefficient $a$: it doesn't just limit the step, it changes its direction. So, how exactly does $H^{-1}$ turn $\nabla f$ into a better step towards the local minimum?
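As a quick one-dimensional sanity check of the two updates (a toy example, not my actual energy), take $f(x) = x^2$, so $f'(x) = 2x$ and $f''(x) = 2$:

$x_{k+1} = x_k - a\,f'(x_k) = (1 - 2a)\,x_k \quad\text{versus}\quad x_{k+1} = x_k - \frac{f'(x_k)}{f''(x_k)} = x_k - \frac{2x_k}{2} = 0$

The fixed-step update only shrinks $x_k$ by a factor that depends on the choice of $a$ (and diverges if $a > 1$), while the Newton update uses the curvature $f''$ and jumps to the minimum in a single step.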

Best Answer

Using a quadratic approximation to the function near the minimum $\mathbf{x}_0$, we can write:

$f(\mathbf{x}_k) \approx f(\mathbf{x}_0) + \frac{1}{2} (\mathbf{x}_k - \mathbf{x}_0)^T H (\mathbf{x}_k - \mathbf{x}_0)$

where the Hessian matrix $H$ is assumed positive definite (so that we have a minimum at $\mathbf{x}_0$); there is no linear term because the gradient vanishes at the minimum.

Take the gradient of the above,

$ \nabla f = H (\mathbf{x}_k - \mathbf{x}_0) $

From which it follows that,

$ \mathbf{x}_0 = \mathbf{x}_k - H^{-1} \nabla f $

And this is the best approximation of the minimum given a quadratic approximation to the function.

So we'll take,

$ \mathbf{x}_{k+1} = \mathbf{x}_k - H^{-1} \nabla f $

then repeat the process again and again with a new $H$ and a new $\nabla f$.
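Here is a minimal sketch of that iteration in Python (the test function, its gradient, and its Hessian below are toy stand-ins, not the energy from the question):

```python
import numpy as np

def newton_minimize(grad, hess, x0, tol=1e-10, max_iter=50):
    """Iterate x_{k+1} = x_k - H(x_k)^{-1} grad f(x_k) until the gradient is small."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        # Solve H * step = g instead of explicitly forming H^{-1}.
        step = np.linalg.solve(hess(x), g)
        x = x - step
    return x

# Toy smooth convex test function: f(x, y) = exp(x) + exp(y) + x^2 + y^2,
# whose Hessian is positive definite everywhere.
grad = lambda v: np.array([np.exp(v[0]) + 2 * v[0],
                           np.exp(v[1]) + 2 * v[1]])
hess = lambda v: np.diag([np.exp(v[0]) + 2,
                          np.exp(v[1]) + 2])

print(newton_minimize(grad, hess, x0=[2.0, -3.0]))  # both components converge to about -0.3517
```

One caveat for a real, non-convex energy: far from the minimum, $H$ may fail to be positive definite, in which case $-H^{-1}\nabla f$ need not even be a descent direction; a damped step or a positive definite approximation of $H$ is the usual remedy.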