[Math] How to solve the matrix minimization for BFGS update in Quasi-Newton optimization

algorithms, calculus, numerical methods, optimization, real-analysis

I am interested in deriving the unique solution to the following Quasi-Newton minimization problem
$$
\min_{H}\;\|H-H_k\|\quad\text{subject to}\quad H=H^\top,\quad Hy_k=s_k
$$ The norm is the weighted Frobenius norm
$$
||A||_W \equiv ||W^{1/2}A W^{1/2}||_F
$$where the weight matrix $W$ is any positive definite matrix that satisfies $Ws_k=y_k$, and $||\cdot||_F$ is defined by $||C||^2_F= \sum_{i,j=1}^n c^2_{ij}$. The quantity $H$ is the inverse Hessian, which is symmetric, positive definite, and satisfies the secant equation above, $Hy_k=s_k$. We can assume that $W=G^{-1}_k$, where the average Hessian $G_k$ is given by
$$
G_k= \int_0^1 \nabla^2 f(x_k+\tau \alpha_k p_k)d\tau
$$
The unique solution is given by
$$
H_{k+1} = (I-\rho_k s_k y^\top_k)H_k (I-\rho_k y_k s^\top_k)+ \rho_k s_k s^\top_k
$$ where $\rho_k = (y^\top_k s_k)^{-1}$. Note that this is an iterative scheme in which $k$ denotes the current iteration and $H_{k+1}$ is the updated approximation to the inverse Hessian.
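For concreteness, the update is easy to exercise numerically. Below is a minimal numpy sketch of my own (not from Nocedal & Wright): it applies the formula once to random test data with $y_k^\top s_k>0$ and checks that $H_{k+1}$ is symmetric and satisfies the secant equation.

```python
# Hypothetical data: random SPD H_k and a curvature pair with y_k^T s_k > 0,
# so that rho_k = 1 / (y_k^T s_k) is well defined.
import numpy as np

rng = np.random.default_rng(0)
n = 5
A = rng.standard_normal((n, n))
H_k = A @ A.T + n * np.eye(n)          # symmetric positive definite
s_k = rng.standard_normal(n)
y_k = s_k + 0.1 * rng.standard_normal(n)
assert y_k @ s_k > 0                   # curvature condition

rho = 1.0 / (y_k @ s_k)
I = np.eye(n)
V = I - rho * np.outer(y_k, s_k)       # so V^T = I - rho * s_k y_k^T
H_next = V.T @ H_k @ V + rho * np.outer(s_k, s_k)

print(np.allclose(H_next, H_next.T))   # True: symmetry is preserved
print(np.allclose(H_next @ y_k, s_k))  # True: secant equation H_{k+1} y_k = s_k
```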

The notation I am using is taken directly from the book Nocedal & Wright, Numerical Optimization. I have not been able to find a full derivation of this anywhere; everything written above is all that Nocedal & Wright have on this topic. For reference, this is in Chapter 6, Quasi-Newton Methods, of their newest (2nd) edition.

All the links I have found through Google, and the other books I have checked, also lack a full derivation. I have found nothing more thorough than Nocedal & Wright, and even they do not give a derivation. Thanks

Best Answer

I'll outline the major steps; see the detailed explanations below.

  1. Introduce some notation: $$ \hat H=W^{1/2}HW^{1/2},\quad \hat H_k=W^{1/2}H_kW^{1/2},\quad \hat s_k=W^{1/2}s_k,\quad \hat y_k=W^{-1/2}y_k.\tag{1} $$ Then the problem becomes $$ \min\|\hat H_k-\hat H\|_F\quad\text{subject to }\hat H=\hat H^\top,\ \hat H\hat y_k=\hat y_k\ (=\hat s_k). $$
  2. To use the fact that $\hat y_k$ is an eigenvector of $\hat H$, let us introduce the new orthonormal basis $$ U=[u\ |\ u_\bot] $$ where $u$ is the normalized $\hat y_k$, i.e. $$u=\frac{\hat y_k}{\|\hat y_k\|}\tag{2},$$ and $u_\bot$ is any orthonormal complement of $u$. Since the Frobenius norm is unitarily invariant (as it is the sum of squares of the singular values) we have $$ \|\hat H_k-\hat H\|_F=\|U^T\hat H_kU-U^T\hat HU\|_F= \left\|\begin{bmatrix}\color{blue}* & \color{blue}*\\\color{blue}* & \color{red}*\end{bmatrix}-\begin{bmatrix}\color{blue}1 & \color{blue}0\\\color{blue}0 & \color{red}*\end{bmatrix}\right\|_F. $$ The blue part cannot be affected by the optimization, and to minimize the Frobenius norm it is clear that we should make the red part zero, that is, the optimal solution satisfies $$ \color{red}{u_\bot^T\hat Hu_\bot}=\color{red}{u_\bot^T\hat H_ku_\bot}. $$ This gives the optimal solution $$ \hat H=U\begin{bmatrix}\color{blue}1 & \color{blue}0\\\color{blue}0 & \color{red}{u_\bot^T\hat H_ku_\bot}\end{bmatrix}U^T. $$
  3. To write it more explicitly, $$ \hat H=\begin{bmatrix}u & u_\bot\end{bmatrix}\begin{bmatrix}1 & 0\\0 & u_\bot^T\hat H_ku_\bot\end{bmatrix}\begin{bmatrix}u^T \\ u_\bot^T\end{bmatrix}=uu^T+u_\bot u_\bot^T\hat H_ku_\bot u_\bot^T= uu^T+(I-uu^T)\hat H_k(I-uu^T) $$ where we used the following representation of the projection operator onto the orthogonal complement of $u$: $$ u_\bot u_\bot^T=I-uu^T. $$
  4. Changing variables back to the original ones via (1), (2) gives the stated update. Since $\hat y_k=W^{-1/2}y_k$ and $W^{-1}y_k=s_k$, we have $\|\hat y_k\|^2=y_k^TW^{-1}y_k=y_k^Ts_k=\rho_k^{-1}$, hence $uu^T=\rho_k W^{-1/2}y_ky_k^TW^{-1/2}$ and $$ W^{-1/2}uu^TW^{-1/2}=\rho_k s_ks_k^T,\qquad W^{-1/2}(I-uu^T)W^{1/2}=I-\rho_k s_ky_k^T. $$ Therefore $$ H_{k+1}=W^{-1/2}\hat HW^{-1/2}=(I-\rho_k s_ky_k^T)H_k(I-\rho_k y_ks_k^T)+\rho_k s_ks_k^T, $$ which is exactly the BFGS formula from the question. A numerical check of the whole chain is sketched right after this list.
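Here is a hedged end-to-end check of steps 1-4 (the random SPD $W$ and $H_k$ are my own test data, with $y_k:=Ws_k$ enforced so that the weight condition holds): it builds the hatted quantities, forms the optimal $\hat H$ from step 3, changes variables back, and compares the result against the BFGS formula from the question.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4
B = rng.standard_normal((n, n))
W = B @ B.T + n * np.eye(n)            # positive definite weight
s = rng.standard_normal(n)
y = W @ s                              # enforce W s_k = y_k
C = rng.standard_normal((n, n))
H_k = C @ C.T + n * np.eye(n)          # current inverse-Hessian approximation

# Matrix square roots of W via its eigendecomposition.
lam, Q = np.linalg.eigh(W)
Wh  = Q @ np.diag(np.sqrt(lam)) @ Q.T        # W^{1/2}
Wmh = Q @ np.diag(1.0 / np.sqrt(lam)) @ Q.T  # W^{-1/2}

H_k_hat = Wh @ H_k @ Wh                # step 1: hatted quantities
y_hat = Wmh @ y
u = y_hat / np.linalg.norm(y_hat)      # step 2: normalized eigenvector
P = np.eye(n) - np.outer(u, u)         # projector onto complement of u
H_hat = np.outer(u, u) + P @ H_k_hat @ P     # step 3: optimal solution
H_new = Wmh @ H_hat @ Wmh              # step 4: change variables back

rho = 1.0 / (y @ s)
V = np.eye(n) - rho * np.outer(y, s)
H_bfgs = V.T @ H_k @ V + rho * np.outer(s, s)
print(np.allclose(H_new, H_bfgs))      # True: both routes agree
```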

Explanations:

Step 1. The definitions of $\hat H$ and $\hat H_k$ come directly from the objective $$ \min\|\underbrace{W^{1/2}H_kW^{1/2}}_{\hat H_k}-\underbrace{W^{1/2}HW^{1/2}}_{\hat H}\|_F. $$ Then we have to rewrite the constraints too: \begin{align} Hy_k=s_k\quad&\Leftrightarrow\quad \underbrace{\color{blue}{W^{1/2}}H\color{red}{W^{1/2}}}_{\hat H}\underbrace{\color{red}{W^{-1/2}}y_k}_{\hat y_k}=\underbrace{\color{blue}{W^{1/2}}s_k}_{\hat s_k},\\ Ws_k=y_k\quad&\Leftrightarrow\quad \underbrace{\color{red}{W^{-1/2}}Ws_k}_{\hat s_k}=\underbrace{\color{red}{W^{-1/2}}y_k}_{\hat y_k}. \end{align} Thus $\hat H\hat y_k=\hat s_k=\hat y_k$: the secant equation becomes an eigenvector condition.
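A quick numeric sanity check of this step (same kind of hypothetical random setup as above): whenever $Ws_k=y_k$, the two hatted vectors coincide, so the secant constraint really does turn into the eigenvector condition $\hat H\hat y_k=\hat y_k$.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 4
B = rng.standard_normal((n, n))
W = B @ B.T + n * np.eye(n)            # positive definite weight
s = rng.standard_normal(n)
y = W @ s                              # W s_k = y_k

lam, Q = np.linalg.eigh(W)
Wh = Q @ np.diag(np.sqrt(lam)) @ Q.T   # W^{1/2}
s_hat = Wh @ s                         # hat{s}_k = W^{1/2} s_k
y_hat = np.linalg.solve(Wh, y)         # hat{y}_k = W^{-1/2} y_k
print(np.allclose(s_hat, y_hat))       # True: hat{s}_k = hat{y}_k
```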

Step 2. Since $\hat Hu=u$ we know that $u^T\hat Hu=u^Tu=1$ and $u_\bot^T\hat Hu=u_\bot^Tu=0$; by the symmetry constraint $\hat H=\hat H^T$ we also have $u^T\hat Hu_\bot=(\hat Hu)^Tu_\bot=u^Tu_\bot=0$. So we can represent the optimizing variable as $$ U^T\hat HU=\begin{bmatrix}u^T\\ u_\bot^T\end{bmatrix}\hat H\begin{bmatrix}u & u_\bot\end{bmatrix}= \begin{bmatrix}u^T\hat Hu & u^T\hat Hu_\bot\\u_\bot^T\hat Hu & u_\bot^T\hat Hu_\bot\end{bmatrix}= \begin{bmatrix}1 & 0\\0 & u_\bot^T\hat Hu_\bot\end{bmatrix}. $$ It gives the following: \begin{align} U^T\hat H_kU-U^T\hat HU&=\begin{bmatrix}u^T\\ u_\bot^T\end{bmatrix}\hat H_k\begin{bmatrix}u & u_\bot\end{bmatrix}-\begin{bmatrix}u^T\\ u_\bot^T\end{bmatrix}\hat H\begin{bmatrix}u & u_\bot\end{bmatrix}=\\ &=\begin{bmatrix}\color{blue}{u^T\hat H_ku} & \color{blue}{u^T\hat H_ku_\bot}\\\color{blue}{u_\bot^T\hat H_ku} & \color{red}{u_\bot^T\hat H_ku_\bot}\end{bmatrix}-\begin{bmatrix}\color{blue}{1} & \color{blue}{0}\\\color{blue}{0} & \color{red}{u_\bot^T\hat Hu_\bot}\end{bmatrix}. \end{align} This particular structure of the optimizing variable was the whole reason to switch to the new basis. Because $\hat H$ has no freedom to vary in the blue part, we cannot change the corresponding blue part of $\hat H_k$; it is fixed for all feasible $\hat H$. The red part, though, can be changed as we wish, and the smallest Frobenius norm \begin{align} \|U^T(\hat H_k-\hat H)U\|_F^2&= \left\|\begin{bmatrix}\color{blue}{u^T\hat H_ku-1} & \color{blue}{u^T\hat H_ku_\bot}\\\color{blue}{u_\bot^T\hat H_ku} & \color{red}{u_\bot^T\hat H_ku_\bot-u_\bot^T\hat Hu_\bot}\end{bmatrix}\right\|_F^2=\\ &=\color{blue}{(u^T\hat H_ku-1)^2+\|u^T\hat H_ku_\bot\|_F^2+\|u_\bot^T\hat H_ku\|_F^2}+\color{red}{\|u_\bot^T\hat H_ku_\bot-u_\bot^T\hat Hu_\bot\|_F^2} \end{align} is obtained when the red part is zero.
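The unitary invariance used here is also easy to confirm numerically; this small sketch (random test matrix and random orthogonal basis of my own) checks that an orthogonal change of basis leaves the Frobenius norm unchanged.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5
M = rng.standard_normal((n, n))                     # arbitrary matrix
U, _ = np.linalg.qr(rng.standard_normal((n, n)))    # random orthogonal basis
print(np.isclose(np.linalg.norm(M, 'fro'),
                 np.linalg.norm(U.T @ M @ U, 'fro')))  # True
```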

Step 3. The matrix $U$ is orthogonal, hence, $$ I=UU^T=\begin{bmatrix}u & u_\bot\end{bmatrix}\begin{bmatrix}u^T \\ u_\bot^T\end{bmatrix}=uu^T+u_\bot u_\bot^T\quad\Leftrightarrow\quad u_\bot u_\bot^T=I-uu^T. $$
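The projector identity can be checked the same way: complete a unit vector $u$ to an orthonormal basis (here via a QR factorization, my own construction) and compare $u_\bot u_\bot^T$ with $I-uu^T$.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 5
v = rng.standard_normal(n)
u = v / np.linalg.norm(v)              # unit vector

# Complete u to an orthonormal basis: QR of [u | random columns].
A = np.column_stack([u, rng.standard_normal((n, n - 1))])
Q, _ = np.linalg.qr(A)                 # Q = [±u | u_perp]
u_perp = Q[:, 1:]                      # orthonormal complement of u

print(np.allclose(u_perp @ u_perp.T, np.eye(n) - np.outer(u, u)))  # True
```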