The quick answer would be: because Newton's method is a higher-order method, and thus builds a better approximation of your function. But that is not all.
Newton's method exactly minimizes the second-order approximation of a function $f$ at each step. That is, it iteratively sets
$$ x \gets x - \left[ \nabla^2 f(x) \right]^{-1} \nabla f(x). $$
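To make this concrete, the Newton step is what you get by minimizing the quadratic model of $f$ around the current $x$ (assuming the Hessian is positive definite): setting the gradient of the model with respect to the step $p$ to zero gives
$$ m(p) = f(x) + \nabla f(x)^\top p + \tfrac{1}{2}\, p^\top \nabla^2 f(x)\, p, \qquad \nabla m(p) = 0 \;\Longrightarrow\; p = -\left[ \nabla^2 f(x) \right]^{-1} \nabla f(x). $$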
Gradient descent has access only to the first-order approximation, and makes the update
$$ x \gets x - h \nabla f(x), $$
for some step-size $h$.
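As a minimal sketch of the two updates side by side (the quadratic objective and the step size below are purely illustrative, not anything from the question):

```python
import numpy as np

# Toy objective f(x) = 0.5 x^T A x - b^T x (illustrative only):
# gradient(x) = A x - b, Hessian(x) = A.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, -1.0])

def gradient(x):
    return A @ x - b

def hessian(x):
    return A

x = np.zeros(2)

# One Newton step: solve the linear system rather than forming the inverse.
x_newton = x - np.linalg.solve(hessian(x), gradient(x))

# One gradient step with a hand-picked step size h.
h = 0.1
x_gd = x - h * gradient(x)

print(x_newton)  # for a quadratic, Newton lands on the exact minimizer in one step
print(x_gd)      # gradient descent only moves a short way toward it
```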
The practical difference is that Newton's method assumes you have much more information available, makes much better updates, and thus converges in fewer iterations.
If you don't have any further information about your function, and you are able to use Newton's method, just use it.
But the number of iterations needed is not all you want to know. The Newton update scales poorly with problem size: if $x \in \mathbb{R}^d$, then computing $\left[ \nabla^2 f(x) \right]^{-1}$ requires $\mathcal{O}(d^3)$ operations. The cost of a gradient descent update, on the other hand, is linear in $d$.
In many large-scale applications, arising very often in machine learning for example, $d$ is so large (it can be a billion) that you are far beyond being able to make even a single Newton update.
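A rough way to see the scaling gap (not a careful benchmark; the dimension and matrices below are arbitrary):

```python
import time
import numpy as np

d = 2000  # illustrative dimension
rng = np.random.default_rng(0)
M = rng.standard_normal((d, d))
H = M @ M.T + d * np.eye(d)      # stand-in for an SPD Hessian
g = rng.standard_normal(d)       # stand-in for a gradient

t0 = time.perf_counter()
newton_step = np.linalg.solve(H, g)   # roughly O(d^3) work
t1 = time.perf_counter()
gd_step = 0.1 * g                      # O(d) work
t2 = time.perf_counter()

print(f"Newton-type solve: {t1 - t0:.3f} s, gradient step: {t2 - t1:.6f} s")
```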
Best Answer
You can take a look at this slide by John Langford, assertion 2. It explains how to come up with a rank-1 matrix that takes a given vector to another given vector, which is very intuitive algebraically. That alone is not going to yield an updated Hessian close to the previous Hessian in any sense (Hessians typically have much higher rank than 1). So the idea is to add one rank-1 piece and subtract another rank-1 piece from the previous Hessian:
$B_{k+1} = B_k + U - V$, where $U (x_{k+1} - x_k) = \nabla f(x_{k+1}) - \nabla f(x_k)$ (the secant condition) and $V (x_{k+1} - x_k) = B_k (x_{k+1} - x_k)$, with $U$ and $V$ both of rank 1.
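For concreteness, writing $s_k = x_{k+1} - x_k$ and $y_k = \nabla f(x_{k+1}) - \nabla f(x_k)$, one standard pair of rank-1 pieces satisfying both conditions (this is how the BFGS update comes out) is
$$ U = \frac{y_k y_k^\top}{y_k^\top s_k}, \qquad V = \frac{B_k s_k s_k^\top B_k}{s_k^\top B_k s_k}, \qquad\text{so}\qquad B_{k+1} = B_k + \frac{y_k y_k^\top}{y_k^\top s_k} - \frac{B_k s_k s_k^\top B_k}{s_k^\top B_k s_k}. $$
One can check directly that $U s_k = y_k$ and $V s_k = B_k s_k$.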
This seems like the most intuitive way of remembering the formula, but there are also derivations based on a weighted Frobenius norm $\|B_{k+1} - B_k\|_{F,w}$, see this. Another one, based on a maximum-entropy argument similar to the Bayesian update of the Kalman filter, is here.
I kind of came up with the first derivation idea myself; it is the easiest mathematically and the most useful from an implementation point of view (since a rank-1 update means we only need dense vector products).
I assume you know how to go from $B_{k+1}$ to $B_{k+1}^{-1}$ easily using the Sherman-Morrison formula (see Wikipedia), so that part is just linear algebra. Experts in convex optimization are very welcome to correct me.
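Here is a minimal NumPy sketch of how the pieces fit together; the matrices and vectors are synthetic and the helper name is my own. It only illustrates the rank-1 update of $B$ and the matching Sherman-Morrison update of its inverse:

```python
import numpy as np

def sherman_morrison(H, u, v):
    """Given H = B^{-1}, return (B + u v^T)^{-1} (rank-1 update of an inverse)."""
    Hu = H @ u
    vH = v @ H
    return H - np.outer(Hu, vH) / (1.0 + v @ Hu)

rng = np.random.default_rng(0)
d = 5
B = rng.standard_normal((d, d))
B = B @ B.T + d * np.eye(d)               # current Hessian approximation B_k (SPD)
H = np.linalg.inv(B)                      # its inverse, maintained alongside B
s = rng.standard_normal(d)                # s = x_{k+1} - x_k
y = B @ s + 0.1 * rng.standard_normal(d)  # y = gradient difference (synthetic)

# Rank-1 pieces from above: U s = y (secant condition), V s = B s.
U = np.outer(y, y) / (y @ s)
Bs = B @ s
V = np.outer(Bs, Bs) / (s @ Bs)
B_new = B + U - V

# Update the inverse with two Sherman-Morrison steps instead of re-inverting.
H_new = sherman_morrison(H, y / (y @ s), y)          # accounts for +U
H_new = sherman_morrison(H_new, -Bs / (s @ Bs), Bs)  # accounts for -V

print(np.allclose(B_new @ s, y))                 # secant condition holds
print(np.allclose(H_new, np.linalg.inv(B_new)))  # inverse tracked correctly
```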