Is this understanding of the derivation of the Gauss-Newton algorithm correct?


Given a loss function $S$ (built from some data and a function $f$ which we want to fit to that data), the Gauss-Newton algorithm finds the parameters of $f$ (packed in a vector $\vec\beta$) that minimise the loss $S$ by iterating $\vec\beta_{n+1}=\vec\beta_n-\mathbf{H^{-1}g}$, where $\mathbf{H}$ is the Hessian matrix of $S$ and $\mathbf{g}$ is the gradient vector of $S$, both w.r.t. $\vec\beta$.
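In code terms, I picture the update as roughly the following (a minimal sketch, assuming hypothetical `grad_fn` and `hess_fn` callables that return $\mathbf{g}$ and $\mathbf{H}$ of $S$ at a given $\vec\beta$):

```python
import numpy as np

def newton_type_step(beta, grad_fn, hess_fn):
    """One update beta_{n+1} = beta_n - H^{-1} g for a loss S(beta).

    grad_fn and hess_fn are hypothetical callables returning the
    gradient vector g and Hessian matrix H of S at beta.
    """
    g = grad_fn(beta)
    H = hess_fn(beta)
    # "Dividing by the Hessian": solve H * delta = g rather than
    # explicitly forming H^{-1}.
    delta = np.linalg.solve(H, g)
    return beta - delta
```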

The first-order Householder's method for root-finding (i.e. Newton's method) would suggest iterating with the correction $S(\vec\beta)/S'(\vec\beta)$ to find a root of $S$: but instead we look for a root of the gradient of $S$, because it is where the gradient of $S$ is zero that we find a stationary point. I assume this is more practical, since a genuine root of $S$ may well not exist, whereas a stationary point would imply that the point is optimal, even if it is not a root.

— But surely minimising $S$ is a better idea anyway, since a stationary point is not guaranteed to be the global minimum?

So we want to find $\nabla S(\vec\beta)$ "divided by" its derivative, the Hessian $\nabla^2 S(\vec\beta)$, which I assume is $\mathbf{H^{-1}g}$, if we take multiplying by $\mathbf{H^{-1}}$ to be "dividing" by the Hessian.

Is this a correct justification for the Gauss-Newton algorithm, or is it way off the mark? I don't want my notes to have lousy justifications…

Many thanks.

Best Answer

What you wrote is intuitively how I think about the Gauss-Newton method. That is, for a scalar loss function $S(\beta)$ of a single parameter, Newton's root-finding iteration moves forward as:

$$\beta_{n+1} = \beta_{n} - S(\beta_n) / S^{\prime}(\beta_n)$$

to approach a root of $S$. Applying the same iteration to the root-finding problem for $S^{\prime}(\beta)$, we move forward as:

$$\beta_{n+1} = \beta_{n} - S^{\prime}(\beta_n) / S^{\prime \prime}(\beta_n)$$

to approach a root of the derivative, i.e. a stationary point (a minimum/maximum).
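To make the comparison concrete, here is a tiny one-dimensional sketch in Python (the quadratic loss is my own toy example, not from your post):

```python
# 1-D illustration: applying Newton's iteration to S'(beta) lands on the
# stationary point of S. Toy loss (my own choice):
# S(beta) = (beta - 2)**2 + 1, so S'(beta) = 2*(beta - 2) and S''(beta) = 2.
def S_prime(beta):
    return 2.0 * (beta - 2.0)

def S_double_prime(beta):
    return 2.0

beta = 10.0
for _ in range(5):
    beta = beta - S_prime(beta) / S_double_prime(beta)

print(beta)  # 2.0 -- the minimiser of S, reached in a single step
             # because this S is exactly quadratic
```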


This comparison is usually enough for my colleagues. However, to be more rigorous, I would start with the multivariate Taylor expansion of the derivative around $\beta_n$:

$$S^{\prime}(\beta_{n+1}) = S^{\prime}(\beta_n + \delta_n) = S^{\prime}(\beta_n) + S^{\prime \prime}(\beta_n) \delta_n + \dots, \qquad \text{where } \delta_n = \beta_{n+1} - \beta_n$$

Stopping at the first-order term and writing this in vector notation:

$$\nabla S(\beta_{n+1}) = \nabla S (\beta_n) + \big( \nabla \nabla^T S(\beta_n) \big) \delta_n$$

Assume that $\beta_{n+1}$ is the solution so $\nabla S(\beta_{n+1}) = \mathbf{0}$ and, as in your notation, $\nabla S(\beta_{n}) = \mathbf{g}$ and $\big( \nabla \nabla^T S(\beta_n) \big) = \mathbf{H}$. Thus:

$$\mathbf{0} = \mathbf{g} + \mathbf{H} (\beta_{n+1}-\beta_n)\\ \implies \beta_{n+1} = \beta_{n} - \mathbf{H}^{-1} \mathbf{g}$$
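Here is a minimal numerical sketch of that multivariate update, using numpy and a toy quadratic loss of my own choosing (so the gradient and Hessian are easy to write down; it is not from the original post):

```python
import numpy as np

# Toy quadratic loss: S(beta) = 0.5 * beta^T A beta - b^T beta,
# so g = grad S = A beta - b and H = A. The exact minimiser is A^{-1} b.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, 1.0])

def grad(beta):
    return A @ beta - b

def hess(beta):
    return A

beta = np.zeros(2)
for _ in range(5):
    g = grad(beta)
    H = hess(beta)
    # beta_{n+1} = beta_n - H^{-1} g, computed via a linear solve
    beta = beta - np.linalg.solve(H, g)

print(beta)                    # ~ [0.2, 0.4]
print(np.linalg.solve(A, b))   # exact minimiser A^{-1} b, for comparison
```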

If $\mathbf{H}$ is not invertible or is ill-conditioned, there are tactics to still solve for $\beta_{n+1}$. If it is singular, you can use a pseudo-inverse. If its columns are nearly linearly dependent (i.e. it is ill-conditioned), regularization can be applied. And, just to note, I dropped the vector notation ($\vec{\cdot}$) for $\beta$ above, but $\beta$ is still treated as a vector.
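Sketched in the same numpy setting (the damping parameter `lam` and the helper names are my own, purely illustrative):

```python
import numpy as np

def damped_newton_step(beta, g, H, lam=1e-3):
    """Step that still works when H is singular or ill-conditioned:
    add a small multiple of the identity before solving
    (Levenberg-Marquardt-style damping)."""
    return beta - np.linalg.solve(H + lam * np.eye(len(g)), g)

def pinv_newton_step(beta, g, H):
    """Alternative: use the Moore-Penrose pseudo-inverse of H."""
    return beta - np.linalg.pinv(H) @ g
```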

Sections 1.2 and 1.4 of these notes give a nice summary: https://web.mit.edu/18.06/www/Spring17/Multidimensional-Newton.pdf


Note 1: You are right: a stationary point will not necessarily be the global minimum. For a well-behaved loss function $S$, the global minimum will be a stationary point, and minimizing $S$ in practice means finding a local minimum (a stationary point). The only tricky part is determining whether that local minimum is the global one...

Note 2: I think 'Householder's methods' for root-finding are a family of iterations of increasing order that use higher derivatives of the function to converge faster; the first-order member of the family is exactly Newton's method, so the derivative-based iteration above is the same for both.
