Expanding $f$ to second order about $x_0$,
$$
f(x) = f(x_0)+\frac{\partial f(x_0)}{\partial x}(x-x_0) + \frac 12(x-x_0)^{\dagger}\frac{\partial^2 f(x_0)}{\partial x^2}(x-x_0) + O(|x-x_0|^3)
$$
differentiating once more and setting the gradient to zero (a root of the system minimises $f(x) = g_1(x)^2+g_2(x)^2+g_3(x)^2$, so we look for a stationary point) gives
$$
\nabla f(x)\approx\nabla f(x_0) + \frac{\partial^2 f(x_0)}{\partial x^2}(x-x_0)\approx 0
$$
Solving for $x$ and iterating gives
$$
x_k = x_{k-1}-H_{k-1}^{-1}\nabla f(x_{k-1})
$$
with (the note below defines $G$ and $J$ and justifies the approximation)
$$
\nabla f(x) = J^{\dagger}(x)\cdot G(x)\\
H = \frac{\partial^2 f(x)}{\partial x^2}\approx J^{\dagger}\cdot J
$$
NOTE
Calling $G(x) = \{g_1(x),g_2(x),\cdots,g_m(x)\}^{\dagger}$ and $f(x) = \frac 12G^{\dagger}(x)\cdot G(x)$ (the factor $\frac 12$ is only a convenient rescaling: it changes neither the minimiser nor the step $H^{-1}\nabla f$) we have
$$
\nabla f(x) = J(x)^{\dagger}\cdot G(x)
$$
and
$$
\nabla^2 f(x) = H(x) = \nabla J(x)^{\dagger}\cdot G(x) + J(x)^{\dagger}\cdot \nabla G(x) = J(x)^{\dagger}\cdot J(x) +\sum_kg_k(x)\nabla^2 g_k(x)
$$
So, near a solution where the residuals $g_k(x)$ are small, the second term can be dropped and we can use the approximation
$$
H = \frac{\partial^2 f(x)}{\partial x^2}\approx J(x)^{\dagger}\cdot J(x)
$$
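Note that $H^{-1}$ never has to be formed explicitly: each iteration can instead solve the normal equations for the step $p$,
$$
\big(J^{\dagger}(x_{k-1})\cdot J(x_{k-1})\big)\,p = -J^{\dagger}(x_{k-1})\cdot G(x_{k-1}),\qquad x_k = x_{k-1}+p,
$$
which is exactly what the Mathematica script below does with LinearSolve.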
Attached is a Mathematica script implementing the Gauss-Newton iterative procedure:
X = {x, y, z};
f[x_, y_, z_] := {{2 x - y z - 1}, {1 - x + y - Exp[x - z]}, {-x - 2 y + 3 z}}  (* residual vector *)
g[x_, y_, z_] := D[f[x, y, z], {X}]  (* its Jacobian *)
x0 = RandomReal[{-2, 2}];
y0 = RandomReal[{-2, 2}];
z0 = RandomReal[{-2, 2}];
X0 = {x0, y0, z0};
f0 = Flatten[f[x0, y0, z0], 1];
n = 20;  (* maximum number of iterations *)
error = 0.0001;  (* stop when the largest step component is below this *)
path = {{0, x0, y0, z0, f0.f0}};
For[k = 1, k <= n, k++,
 Jf = Flatten[Transpose[g[x, y, z] /. Thread[X -> X0]], 1];  (* Jacobian J at X0 *)
 JR = Transpose[Transpose[f[x, y, z]].Jf] /. Thread[X -> X0];  (* gradient J^T.f at X0 *)
 Hf = Transpose[Jf].Jf;  (* Gauss-Newton Hessian J^T.J *)
 p = LinearSolve[Hf, -JR];  (* step: solve (J^T.J) p = -J^T.f *)
 If[Max[Abs[p]] < error, Break[]];
 X0 = Flatten[X0 + p];
 f0 = Flatten[f[x, y, z], 1] /. Thread[X -> X0];
 AppendTo[path, {k, x, y, z, f0.f0} /. Thread[X -> X0]]
 ]
with a typical outcome
$$
\left[
\begin{array}{ccccc}
k & x & y & z & f^{\dagger}\cdot f\\
0 & -0.18083 & -0.360527 & -1.54181 & 27.0262 \\
1 & 0.419264 & -0.249572 & -0.0266268 & 2.23994 \\
2 & 0.441999 & 0.376084 & 0.398056 & 0.101378 \\
3 & 0.693323 & 0.692737 & 0.692933 & 0.00877097 \\
4 & 0.846417 & 0.846417 & 0.846417 & 0.000556373 \\
5 & 0.923209 & 0.923209 & 0.923209 & 0.0000347735 \\
6 & 0.961604 & 0.961604 & 0.961604 & 2.1733\cdot 10^{-6} \\
7 & 0.980802 & 0.980802 & 0.980802 & 1.3583\cdot 10^{-7} \\
8 & 0.990401 & 0.990401 & 0.990401 & 8.4896\cdot 10^{-9} \\
9 & 0.995201 & 0.995201 & 0.995201 & 5.3060\cdot 10^{-10} \\
10 & 0.9976 & 0.9976 & 0.9976 & 3.3162\cdot 10^{-11} \\
11 & 0.9988 & 0.9988 & 0.9988 & 2.0726\cdot 10^{-12} \\
12 & 0.9994 & 0.9994 & 0.9994 & 1.2954\cdot 10^{-13} \\
13 & 0.9997 & 0.9997 & 0.9997 & 8.0963\cdot 10^{-15} \\
14 & 0.99985 & 0.99985 & 0.99985 & 5.0603\cdot 10^{-16} \\
\end{array}
\right]
$$
Attached is a script in Octave:
function [X0,k] = GaussNewton(x0,y0,z0,n,error)
  format long
  % example defaults:
  %n = 20;
  %error = 0.0001;
  %x0 = 0;
  %y0 = 0;
  %z0 = 0;
  X0 = [x0;y0;z0];
  for k = 1:n
    x0 = X0(1);
    y0 = X0(2);
    z0 = X0(3);
    Jf = g(x0,y0,z0);           % Jacobian J at the current iterate
    JR = (f(x0,y0,z0)'*Jf)';    % gradient J'*f
    Hf = Jf'*Jf;                % Gauss-Newton Hessian J'*J
    p  = -Hf\JR;                % step: solve (J'*J) p = -J'*f (no explicit inverse)
    if max(abs(p)) < error
      break;
    end
    X0 = X0 + p;
  end
end

function fx = f(x,y,z)
  % residual vector
  fx = [2.*x-y*z-1.; 1.-x+y-exp(x-z); -x-2.*y+3.*z];
end

function dfx = g(x,y,z)
  % Jacobian of f
  dfx = [2.,-z,-y; -exp(x-z)-1.,1.,exp(x-z); -1.,-2.,3.];
end
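As a usage sketch (the starting point, iteration cap, and tolerance below are arbitrary choices, not values from the text), calling the function from the Octave prompt returns the final estimate and the number of iterations performed:
% example call: arbitrary start (0.5, -0.5, 1.0), at most 20 iterations, tolerance 1e-4
[Xsol, iters] = GaussNewton(0.5, -0.5, 1.0, 20, 1.0e-4);
disp(Xsol)    % final estimate of (x, y, z)
disp(iters)   % number of iterations actually performed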
What you wrote is intuitively how I think about the Gauss-Newton method. That is, for a (single-valued) loss function, $S(\beta)$, we iteratively move forward as:
$$\beta_{n+1} = \beta_{n} - S(\beta_n) / S^{\prime}(\beta_n)$$
to approach a root. And so for the multivariate root-finding problem for $S^{\prime}(\beta)$, we iteratively move forward as:
$$\beta_{n+1} = \beta_{n} - S^{\prime}(\beta_n) / S^{\prime \prime}(\beta_n)$$
to approach the derivative's root (a minimum/maximum).
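A minimal one-dimensional sketch of that idea in Octave (the loss $S(b)=b^4-4b^2$ and its derivatives here are placeholders chosen purely for illustration):
% Newton iteration on S'(b) to find a stationary point of a placeholder loss
S1 = @(b) 4*b.^3 - 8*b;     % S'(b)  for S(b) = b^4 - 4 b^2
S2 = @(b) 12*b.^2 - 8;      % S''(b)
b = 3;                      % arbitrary starting guess
for it = 1:20
  step = -S1(b)/S2(b);      % b_{n+1} = b_n - S'(b_n)/S''(b_n)
  b = b + step;
  if abs(step) < 1e-10, break; end
end
disp(b)                     % approaches the stationary point b = sqrt(2)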
This comparison is usually enough for my colleagues. However, to be more rigorous, I would start with the multivariate Taylor's expansion around $\beta_n$:
$$S^{\prime}(\beta_{n+1}) = S^{\prime}(\beta_n + \delta_n) = S^{\prime}(\beta_n) + S^{\prime \prime}(\beta_n) \delta_n + ...$$
Truncating after the first-order term and writing everything in vector notation:
$$\nabla S(\beta_{n+1}) = \nabla S (\beta_n) + \big( \nabla \nabla^T S(\beta_n) \big) \delta_n$$
Assume that $\beta_{n+1}$ is the solution so $\nabla S(\beta_{n+1}) = \mathbf{0}$ and, as in your notation, $\nabla S(\beta_{n}) = \mathbf{g}$ and $\nabla \nabla^T S(\beta_n) = \mathbf{H}$ (the Hessian). Thus:
$$\mathbf{0} = \mathbf{g} + \mathbf{H} (\beta_{n+1}-\beta_n)\\
\implies \beta_{n+1} = \beta_{n} - \mathbf{H}^{-1} \mathbf{g}$$
If $\mathbf{H}$ is not invertible or is ill-conditioned, there are tactics to solve for $\beta_{n+1}$. If it is not square, you can define a pseudo-inverse. If it is nearly singular (columns nearly linearly dependent), regularization can be applied; a damped step is sketched below. And, just to note, I removed the vector notation ( $\vec{\cdot}$ ) for $\beta$ above, but I still treated it as a vector.
Sections 1.2 and 1.4 in this are a nice summary: https://web.mit.edu/18.06/www/Spring17/Multidimensional-Newton.pdf
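As an illustration of the regularization tactic (this is essentially a Levenberg-Marquardt-style damped step; lambda is an arbitrary placeholder value, and Hf, JR are the approximate Hessian and gradient computed in the Octave script above):
lambda = 1.0e-3;                           % damping parameter (placeholder value)
p = -(Hf + lambda*eye(size(Hf,1))) \ JR;   % damped step: solve (H + lambda*I) p = -g
% lambda -> 0 recovers the plain (Gauss-)Newton step;
% a large lambda turns it into a small gradient-descent-like step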
Note 1: You are right, a stationary point will not necessarily be the global minimum. For a well-behaved loss function $S$, the global minimum will be a stationary point. Minimizing $S$ involves finding a local minimum (a stationary point). The only tricky thing is determining whether that is the global minimum...
Note 2: I think 'Householder's method' for root-finding involves transforming the loss function $S$ so that a root is found more efficiently; the actual iteration based on the derivative is the same as in Newton's method.
Best Answer
Let's comment on the algorithm in general, with the notation of your Wikipedia link. Since $J_{ij}=\frac{\partial r_i}{\partial\beta_j}$,
$$\frac{\partial}{\partial\beta_j}\sum_ir_i^2=2\sum_iJ_{ij}r_i=2(J^Tr)_j,$$
or in vector notation $\nabla\sum_ir_i^2=2J^Tr$. So minimising this sum of squares gives $J^Tr=0$ at the stationary point, but it does not require the entries of $J$ themselves to be especially small. (In the special case where all $r_i$ have the same nonzero value, $\sum_iJ_{ij}=0$ for each $j$, i.e. each column of $J$ sums to $0$.) So if you want a sanity check, compute $J^Tr$ at the algorithm's estimate of $\beta$.
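A minimal sketch of that sanity check in Octave, reusing the f (residuals) and g (Jacobian) functions defined above and assuming Xsol holds the estimate returned by GaussNewton:
% J'*r should be close to zero at a stationary point,
% even though the residuals r themselves need not be small
r  = f(Xsol(1), Xsol(2), Xsol(3));
Jr = g(Xsol(1), Xsol(2), Xsol(3))' * r;
disp(norm(Jr))   % expect a small value (limited by the stopping tolerance)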