BFGS Formula from Kullback-Leibler Divergence

Tags: linear programming, machine learning, newton raphson, optimization, statistics

On page 411 in this book, the authors give the following BFGS formula $$ \boxed{\boldsymbol C_{\textrm{BFGS}} = \boldsymbol C + \underbrace{\frac{\boldsymbol g^\top\boldsymbol\delta+\boldsymbol g^\top\boldsymbol C\boldsymbol g}{(\boldsymbol g^\top\boldsymbol\delta)^2}\boldsymbol\delta\boldsymbol\delta^\top-\frac{1}{\boldsymbol g^\top\boldsymbol\delta}\left(\boldsymbol\delta\boldsymbol g^\top\boldsymbol C + (\boldsymbol\delta\boldsymbol g^\top\boldsymbol C)^\top\right)}_{\textrm{BFGS update}}}\tag{1} $$
by considering the constrained optimisation problem:
$$\begin{array}{rl}
\min_{\boldsymbol A} & \mathcal D(\boldsymbol 0, \boldsymbol C \mid \boldsymbol 0, \boldsymbol A) \\
\textrm{subject to:} & \boldsymbol A\boldsymbol g = \boldsymbol\delta, \\
& \boldsymbol A = \boldsymbol A^\top,
\end{array}\tag{2}$$

where $$ \mathcal D(\boldsymbol 0, \boldsymbol C \mid \boldsymbol 0, \boldsymbol A) := \frac12\left(\mathrm{tr}\left(\boldsymbol A^{-1}\boldsymbol C\right) - \log\left(\det(\boldsymbol A^{-1}\boldsymbol C)\right) - n\right)$$
is the Kullback-Leibler divergence between the normal distributions $\mathcal N(\boldsymbol 0, \boldsymbol C)$ and $\mathcal N(\boldsymbol 0, \boldsymbol A)$.
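
As a quick sanity check (my own, not from the book), the NumPy snippet below builds $\boldsymbol C_{\textrm{BFGS}}$ from formula (1) for random data and verifies that it is at least a feasible point of problem (2), i.e. symmetric and satisfying $\boldsymbol A\boldsymbol g = \boldsymbol\delta$:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5

M = rng.standard_normal((n, n))
C = M @ M.T + n * np.eye(n)        # random symmetric positive definite C
g = rng.standard_normal(n)
delta = rng.standard_normal(n)
if g @ delta < 0:                  # enforce the curvature condition g^T delta > 0
    delta = -delta

gd = g @ delta                     # scalar g^T delta
Cg = C @ g                         # C g (so delta g^T C = outer(delta, Cg))
C_bfgs = (C
          + ((gd + g @ Cg) / gd**2) * np.outer(delta, delta)
          - (np.outer(delta, Cg) + np.outer(Cg, delta)) / gd)

def kl(C, A):
    """KL divergence between N(0, C) and N(0, A), as defined above."""
    AinvC = np.linalg.solve(A, C)
    return 0.5 * (np.trace(AinvC) - np.log(np.linalg.det(AinvC)) - len(C))

print(np.allclose(C_bfgs, C_bfgs.T))    # symmetry: A = A^T          -> True
print(np.allclose(C_bfgs @ g, delta))   # secant constraint: A g = delta -> True
print(kl(C, C_bfgs))                    # the objective value of (2) attained by the update
```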

Using Lagrange multipliers with the Lagrangian $$ \mathcal L(\boldsymbol A, \boldsymbol \beta) := \mathcal D(\boldsymbol 0, \boldsymbol C \mid \boldsymbol 0, \boldsymbol A) + \boldsymbol \beta^\top(\boldsymbol A\boldsymbol g - \boldsymbol \delta)$$
gives (also using symmetry of $\boldsymbol A$ and $\boldsymbol C$) $$ \frac{\partial}{\partial\boldsymbol A}\mathcal L(\boldsymbol A, \boldsymbol \beta) = \frac12\left(-\boldsymbol A^{-1}\boldsymbol C\boldsymbol A^{-1} + \boldsymbol A^{-1}\right) + \boldsymbol \beta\boldsymbol g^\top, \quad \boldsymbol{\nabla}_{\boldsymbol \beta}\mathcal L(\boldsymbol A, \boldsymbol \beta) = \boldsymbol A\boldsymbol g - \boldsymbol \delta,$$
and setting both of these to zero gives $\boldsymbol A\boldsymbol g = \boldsymbol \delta$ and $\boldsymbol A^{-1}(\boldsymbol I - \boldsymbol C\boldsymbol A^{-1}) = -2\boldsymbol \beta \boldsymbol g^\top$. It is not obvious to me how this last expression could be used to identify the corresponding values of $\boldsymbol \beta$ and $\boldsymbol A$. I tried multiplying through by $\boldsymbol \delta$ to use the $\boldsymbol A\boldsymbol g = \boldsymbol \delta$ constraint, yielding $\boldsymbol{\delta}\boldsymbol \beta^\top\boldsymbol \delta = \frac12\left(\boldsymbol C\boldsymbol g - \boldsymbol \delta\right)$, but it is still unclear from this (i) how $\boldsymbol \beta$ is specified and (ii) how $\boldsymbol{A}$ would be recovered.
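
For what it is worth, continuing the NumPy snippet above (reusing `C`, `g`, `delta`, `C_bfgs` and `kl`), I can at least confirm numerically that the KL objective is stationary at $\boldsymbol C_{\textrm{BFGS}}$ along every feasible perturbation, i.e. every symmetric $\boldsymbol E$ with $\boldsymbol E\boldsymbol g = \boldsymbol 0$, which is what the first-order conditions should amount to; I just cannot see how to derive $\boldsymbol A$ and $\boldsymbol \beta$ from them:

```python
# Feasible perturbations of problem (2) are the symmetric matrices E with E @ g = 0:
# they preserve both the symmetry of A and the constraint A @ g = delta.
P = np.eye(n) - np.outer(g, g) / (g @ g)   # orthogonal projector onto {x : g^T x = 0}
S = rng.standard_normal((n, n))
E = P @ (S + S.T) @ P                      # symmetric, and E @ g = 0

t = 1e-5
slope_feasible = (kl(C, C_bfgs + t * E) - kl(C, C_bfgs - t * E)) / (2 * t)
slope_generic = (kl(C, C_bfgs + t * (S + S.T))
                 - kl(C, C_bfgs - t * (S + S.T))) / (2 * t)
print(slope_feasible)   # numerically ~0: stationary along feasible directions
print(slope_generic)    # generically nonzero along directions violating A g = delta
```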

What is the correct way to solve the constrained optimisation problem (2) with the method of Lagrange multipliers, so that the resulting values of $\boldsymbol A$ and $\boldsymbol \beta$ yield the BFGS formula (1)?

Best Answer

The history of Quasi-Newton methods is replete with examples of optimizing the step-wise matrix update with respect to various merit functions.

In the 1970s, Broyden (and the other initials of BFGS) used the Frobenius norm: $$\big\|C_{k+1}-C_k\big\|_F$$

In the 1980s, Michael Todd used the condition number: $$\kappa(C)=\frac{\lambda_{\max}}{\lambda_{\min}}$$

In the 1990s, Byrd & Nocedal introduced the merit function: $$\psi(C) = {\rm Tr}(C) - \log(\det(C))$$

and Wolkowicz introduced his "sizing" metric: $$\omega(C) = \frac{\frac 1n{\rm Tr}(C)}{(\det C)^{1/n}} = \frac{\lambda_{\text{arithmetic mean}}}{\lambda_{\text{geometric mean}}}$$

This new KL-metric is obviously inspired by recent developments in Machine Learning, but it's not anything radically new, or even all that interesting.
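
In fact, the KL metric quoted in the question is essentially $\psi$ in disguise: by the definition given there, $\mathcal D(\boldsymbol 0, \boldsymbol C \mid \boldsymbol 0, \boldsymbol A) = \frac12\left(\psi(\boldsymbol A^{-1}\boldsymbol C) - n\right)$. A short NumPy illustration of that identity (my own sketch, using the question's definitions):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4
MC, MA = rng.standard_normal((2, n, n))
C = MC @ MC.T + n * np.eye(n)     # arbitrary symmetric positive definite C
A = MA @ MA.T + n * np.eye(n)     # arbitrary symmetric positive definite A

def psi(X):
    """Byrd & Nocedal's merit function psi(X) = Tr(X) - log det(X)."""
    return np.trace(X) - np.log(np.linalg.det(X))

def kl(C, A):
    """KL divergence between N(0, C) and N(0, A), as defined in the question."""
    AinvC = np.linalg.solve(A, C)
    return 0.5 * (np.trace(AinvC) - np.log(np.linalg.det(AinvC)) - len(C))

# The KL metric is just psi applied to A^{-1}C, shifted by n and halved.
print(np.isclose(kl(C, A), 0.5 * (psi(np.linalg.solve(A, C)) - n)))   # True
```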

If you wish to learn how to optimize with respect to some novel en vogue merit function, go back and read those classic papers from the 70s, 80s, and 90s.
