Expanding $f$ to second order about $x_0$,
$$
f(x) = f(x_0)+\frac{\partial f(x_0)}{\partial x}(x-x_0) + \frac 12(x-x_0)^{\dagger}\frac{\partial^2 f(x_0)}{\partial x^2}(x-x_0) + O(|x-x_0|^3)
$$
differentiating once more and setting the gradient to zero (a root of the system minimises $f(x) = g_1(x)^2+g_2(x)^2+g_3(x)^2$, so we look for a stationary point) gives
$$
\nabla f(x)\approx\nabla f(x_0) + \frac{\partial^2 f(x_0)}{\partial x^2}(x-x_0)\approx 0
$$
Solving for $x$ and iterating gives
$$
x_k = x_{k-1}-H_{k-1}^{-1}\nabla f(x_{k-1})
$$
with (the note below defines $G$ and $J$ and justifies the approximation)
$$
\nabla f(x) = J^{\dagger}(x)\cdot G(x)\\
H = \frac{\partial^2 f(x)}{\partial x^2}\approx J^{\dagger}\cdot J
$$
NOTE
Calling $G(x) = \{g_1(x),g_2(x),\cdots,g_m(x)\}^{\dagger}$ and $f(x) = \frac 12G^{\dagger}(x)\cdot G(x)$ (the factor $\frac 12$ is only a convenient rescaling: it changes neither the minimiser nor the step $H^{-1}\nabla f$) we have
$$
\nabla f(x) = J(x)^{\dagger}\cdot G(x)
$$
and
$$
\nabla^2 f(x) = H(x) = \nabla J(x)^{\dagger}\cdot G(x) + J(x)^{\dagger}\cdot \nabla G(x) = J(x)^{\dagger}\cdot J(x) +\sum_kg_k(x)\nabla^2 g_k(x)
$$
So, near a solution where the residuals $g_k(x)$ are small, the second term can be dropped and we can use the approximation
$$
H = \frac{\partial^2 f(x)}{\partial x^2}\approx J(x)^{\dagger}\cdot J(x)
$$
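Note that $H^{-1}$ never has to be formed explicitly: each iteration can instead solve the normal equations for the step $p$,
$$
\big(J^{\dagger}(x_{k-1})\cdot J(x_{k-1})\big)\,p = -J^{\dagger}(x_{k-1})\cdot G(x_{k-1}),\qquad x_k = x_{k-1}+p,
$$
which is exactly what the Mathematica script below does with LinearSolve.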
Attached is a Mathematica script implementing the Gauss-Newton iterative procedure:
X = {x, y, z};
f[x_, y_, z_] := {{2 x - y z - 1}, {1 - x + y - Exp[x - z]}, {-x - 2 y + 3 z}}  (* residual vector *)
g[x_, y_, z_] := D[f[x, y, z], {X}]  (* its Jacobian *)
x0 = RandomReal[{-2, 2}];
y0 = RandomReal[{-2, 2}];
z0 = RandomReal[{-2, 2}];
X0 = {x0, y0, z0};
f0 = Flatten[f[x0, y0, z0], 1];
n = 20;  (* maximum number of iterations *)
error = 0.0001;  (* stop when the largest step component is below this *)
path = {{0, x0, y0, z0, f0.f0}};
For[k = 1, k <= n, k++,
 Jf = Flatten[Transpose[g[x, y, z] /. Thread[X -> X0]], 1];  (* Jacobian J at X0 *)
 JR = Transpose[Transpose[f[x, y, z]].Jf] /. Thread[X -> X0];  (* gradient J^T.f at X0 *)
 Hf = Transpose[Jf].Jf;  (* Gauss-Newton Hessian J^T.J *)
 p = LinearSolve[Hf, -JR];  (* step: solve (J^T.J) p = -J^T.f *)
 If[Max[Abs[p]] < error, Break[]];
 X0 = Flatten[X0 + p];
 f0 = Flatten[f[x, y, z], 1] /. Thread[X -> X0];
 AppendTo[path, {k, x, y, z, f0.f0} /. Thread[X -> X0]]
 ]
with a typical outcome
$$
\left[
\begin{array}{ccccc}
k & x & y & z & f^{\dagger}\cdot f\\
0 & -0.18083 & -0.360527 & -1.54181 & 27.0262 \\
1 & 0.419264 & -0.249572 & -0.0266268 & 2.23994 \\
2 & 0.441999 & 0.376084 & 0.398056 & 0.101378 \\
3 & 0.693323 & 0.692737 & 0.692933 & 0.00877097 \\
4 & 0.846417 & 0.846417 & 0.846417 & 0.000556373 \\
5 & 0.923209 & 0.923209 & 0.923209 & 0.0000347735 \\
6 & 0.961604 & 0.961604 & 0.961604 & 2.1733\cdot 10^{-6} \\
7 & 0.980802 & 0.980802 & 0.980802 & 1.3583\cdot 10^{-7} \\
8 & 0.990401 & 0.990401 & 0.990401 & 8.4896\cdot 10^{-9} \\
9 & 0.995201 & 0.995201 & 0.995201 & 5.3060\cdot 10^{-10} \\
10 & 0.9976 & 0.9976 & 0.9976 & 3.3162\cdot 10^{-11} \\
11 & 0.9988 & 0.9988 & 0.9988 & 2.0726\cdot 10^{-12} \\
12 & 0.9994 & 0.9994 & 0.9994 & 1.2954\cdot 10^{-13} \\
13 & 0.9997 & 0.9997 & 0.9997 & 8.0963\cdot 10^{-15} \\
14 & 0.99985 & 0.99985 & 0.99985 & 5.0603\cdot 10^{-16} \\
\end{array}
\right]
$$
Attached is a script in Octave:
function [X0,k] = GaussNewton(x0,y0,z0,n,error)
  format long
  % example defaults:
  %n = 20;
  %error = 0.0001;
  %x0 = 0;
  %y0 = 0;
  %z0 = 0;
  X0 = [x0;y0;z0];
  for k = 1:n
    x0 = X0(1);
    y0 = X0(2);
    z0 = X0(3);
    Jf = g(x0,y0,z0);           % Jacobian J at the current iterate
    JR = (f(x0,y0,z0)'*Jf)';    % gradient J'*f
    Hf = Jf'*Jf;                % Gauss-Newton Hessian J'*J
    p  = -Hf\JR;                % step: solve (J'*J) p = -J'*f (no explicit inverse)
    if max(abs(p)) < error
      break;
    end
    X0 = X0 + p;
  end
end

function fx = f(x,y,z)
  % residual vector
  fx = [2.*x-y*z-1.; 1.-x+y-exp(x-z); -x-2.*y+3.*z];
end

function dfx = g(x,y,z)
  % Jacobian of f
  dfx = [2.,-z,-y; -exp(x-z)-1.,1.,exp(x-z); -1.,-2.,3.];
end
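As a usage sketch (the starting point, iteration cap, and tolerance below are arbitrary choices, not values from the text), calling the function from the Octave prompt returns the final estimate and the number of iterations performed:
% example call: arbitrary start (0.5, -0.5, 1.0), at most 20 iterations, tolerance 1e-4
[Xsol, iters] = GaussNewton(0.5, -0.5, 1.0, 20, 1.0e-4);
disp(Xsol)    % final estimate of (x, y, z)
disp(iters)   % number of iterations actually performed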
What you wrote is intuitively how I think about the Gauss-Newton method. That is, for a (single-valued) loss function, $S(\beta)$, we iteratively move forward as:
$$\beta_{n+1} = \beta_{n} - S(\beta_n) / S^{\prime}(\beta_n)$$
to approach a root. And so for the multivariate root-finding problem for $S^{\prime}(\beta)$, we iteratively move forward as:
$$\beta_{n+1} = \beta_{n} - S^{\prime}(\beta_n) / S^{\prime \prime}(\beta_n)$$
to approach the derivative's root (a minimum/maximum).
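A minimal one-dimensional sketch of that idea in Octave (the loss $S(b)=b^4-4b^2$ and its derivatives here are placeholders chosen purely for illustration):
% Newton iteration on S'(b) to find a stationary point of a placeholder loss
S1 = @(b) 4*b.^3 - 8*b;     % S'(b)  for S(b) = b^4 - 4 b^2
S2 = @(b) 12*b.^2 - 8;      % S''(b)
b = 3;                      % arbitrary starting guess
for it = 1:20
  step = -S1(b)/S2(b);      % b_{n+1} = b_n - S'(b_n)/S''(b_n)
  b = b + step;
  if abs(step) < 1e-10, break; end
end
disp(b)                     % approaches the stationary point b = sqrt(2)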
This comparison is usually enough for my colleagues. However, to be more rigorous, I would start with the multivariate Taylor's expansion around $\beta_n$:
$$S^{\prime}(\beta_{n+1}) = S^{\prime}(\beta_n + \delta_n) = S^{\prime}(\beta_n) + S^{\prime \prime}(\beta_n) \delta_n + ...$$
Truncating after the first-order term and writing everything in vector notation:
$$\nabla S(\beta_{n+1}) = \nabla S (\beta_n) + \big( \nabla \nabla^T S(\beta_n) \big) \delta_n$$
Assume that $\beta_{n+1}$ is the solution so $\nabla S(\beta_{n+1}) = \mathbf{0}$ and, as in your notation, $\nabla S(\beta_{n}) = \mathbf{g}$ and $\nabla \nabla^T S(\beta_n) = \mathbf{H}$ (the Hessian). Thus:
$$\mathbf{0} = \mathbf{g} + \mathbf{H} (\beta_{n+1}-\beta_n)\\
\implies \beta_{n+1} = \beta_{n} - \mathbf{H}^{-1} \mathbf{g}$$
If $\mathbf{H}$ is not invertible or is ill-conditioned, there are tactics to solve for $\beta_{n+1}$. If it is not square, you can define a pseudo-inverse. If it is nearly singular (columns nearly linearly dependent), regularization can be applied; a damped step is sketched below. And, just to note, I removed the vector notation ( $\vec{\cdot}$ ) for $\beta$ above, but I still treated it as a vector.
Sections 1.2 and 1.4 in this are a nice summary: https://web.mit.edu/18.06/www/Spring17/Multidimensional-Newton.pdf
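As an illustration of the regularization tactic (this is essentially a Levenberg-Marquardt-style damped step; lambda is an arbitrary placeholder value, and Hf, JR are the approximate Hessian and gradient computed in the Octave script above):
lambda = 1.0e-3;                           % damping parameter (placeholder value)
p = -(Hf + lambda*eye(size(Hf,1))) \ JR;   % damped step: solve (H + lambda*I) p = -g
% lambda -> 0 recovers the plain (Gauss-)Newton step;
% a large lambda turns it into a small gradient-descent-like step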
Note 1: You are right, a stationary point will not necessarily be the global minimum. For a well-behaved loss function $S$, the global minimum will be a stationary point. Minimizing $S$ involves finding a local minimum (a stationary point). The only tricky thing is determining whether that is the global minimum...
Note 2: I think 'Householder's method' for root-finding involves transforming the loss function $S$ so that a root is found more efficiently; the actual iteration based on the derivative is the same as in Newton's method.
Best Answer
Let's comment on the algorithm in general, with the notation of your Wikipedia link. Since $J_{ij}=\frac{\partial r_i}{\partial\beta_j}$,
$$\frac{\partial}{\partial\beta_j}\sum_ir_i^2=2\sum_iJ_{ij}r_i=2(J^Tr)_j,$$
or in vector notation $\nabla\sum_ir_i^2=2J^Tr$. So minimising this sum of squares gives $J^Tr=0$ at the stationary point, but it does not require the entries of $J$ themselves to be especially small. (In the special case where all $r_i$ have the same nonzero value, $\sum_iJ_{ij}=0$ for each $j$, i.e. each column of $J$ sums to $0$.) So if you want a sanity check, compute $J^Tr$ at the algorithm's estimate of $\beta$.
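A minimal sketch of that sanity check in Octave, reusing the f (residuals) and g (Jacobian) functions defined above and assuming Xsol holds the estimate returned by GaussNewton:
% J'*r should be close to zero at a stationary point,
% even though the residuals r themselves need not be small
r  = f(Xsol(1), Xsol(2), Xsol(3));
Jr = g(Xsol(1), Xsol(2), Xsol(3))' * r;
disp(norm(Jr))   % expect a small value (limited by the stopping tolerance)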