[Math] Linear system of equations and multiple linear regression: Numerical solving

Tags: gaussian elimination, numerical linear algebra, regression

I am currently implementing a test procedure for data, namely a linear form of the Kramers-Kronig relations (paper here: http://jes.ecsdl.org/content/142/6/1885.abstract).

This involves solving a linear system of equations with either as many parameters as data points or fewer.

I have implemented this by creating an N x M matrix, where N (the number of rows) equals the number of data points and M (the number of columns) equals the number of parameters, and populating this matrix as outlined in the paper. I call this matrix X.

I have a second N x 1 matrix that contains the Y-values, called Y.

To solve this I compute

\begin{eqnarray}
(X^T X)^{-1} X^T Y .
\end{eqnarray}
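For reference, this computation corresponds to something like the following minimal numpy sketch (illustrative only; my actual implementation performs the inversion with my own Gauss-Jordan routine rather than a library call, and the names here are arbitrary):

```python
import numpy as np

def normal_equations_fit(X, Y):
    """Least-squares parameters via the normal equations, (X^T X)^{-1} X^T Y.
    The library inverse is used only for this sketch; in my code the
    inversion is done with my own Gauss-Jordan implementation."""
    XtX = X.T @ X              # M x M
    XtY = X.T @ Y              # M x 1
    return np.linalg.inv(XtX) @ XtY

# Hypothetical usage with random data standing in for the matrix
# built from the basis functions described in the paper:
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 10))   # N = 50 data points, M = 10 parameters
Y = rng.standard_normal(50)
params = normal_equations_fit(X, Y)
```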

It basically works, but I think I have quite large rounding errors in my inversion method (Gauss-Jordan) or somewhere else, which leads to the problem that the larger I make M, the worse the actual fit gets.

Now I was thinking that in the case M = N I don't need this approach, but could instead solve the system of linear equations directly by applying Gauss-Jordan to the augmented matrix

\begin{eqnarray}
(X|Y)
\end{eqnarray}

Is this basically correct? If so, why is performing Gauss-Jordan elimination on the augmented matrix equivalent to evaluating the first expression above in the case M = N?
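(For concreteness, here is a minimal sketch of what I mean by Gauss-Jordan on the augmented matrix, with partial pivoting; this is illustrative only and not my actual code:)

```python
import numpy as np

def gauss_jordan_solve(X, Y):
    """Solve X Z = Y for square, nonsingular X by Gauss-Jordan elimination
    with partial pivoting on the augmented matrix (X | Y)."""
    A = np.hstack([np.asarray(X, dtype=float),
                   np.asarray(Y, dtype=float).reshape(-1, 1)])
    n = A.shape[0]
    for i in range(n):
        # Partial pivoting: move the largest remaining entry of column i up.
        p = i + np.argmax(np.abs(A[i:, i]))
        A[[i, p]] = A[[p, i]]
        A[i] = A[i] / A[i, i]                  # normalise the pivot row
        for j in range(n):
            if j != i:
                A[j] = A[j] - A[j, i] * A[i]   # clear column i in row j
    return A[:, -1]                            # last column holds the solution
```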

(These questions may be hard to answer without further details about the implementation:) Where could the rounding issues stem from? I also fed the same data into my implementation of the Levenberg-Marquardt algorithm, which uses the same inversion routine, and there the solution fits the data much better. How can this be if the rounding errors were large?

Best Answer

First question: If $X$ is square and nonsingular, then the equivalence between the normal equations and the "ordinary" system follows from the fact that $(X^TX)^{-1}X^T=X^{-1}X^{-T}X^T=X^{-1}$.
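A quick numerical sanity check of this identity (an illustrative numpy sketch with an arbitrary nonsingular $X$):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((5, 5))   # square and (almost surely) nonsingular
Y = rng.standard_normal(5)

z_normal = np.linalg.inv(X.T @ X) @ X.T @ Y   # (X^T X)^{-1} X^T Y
z_direct = np.linalg.solve(X, Y)              # solve X Z = Y directly

print(np.allclose(z_normal, z_direct))        # True (up to rounding)
```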

Second question: Linear least squares (LS) problems are generally more sensitive to rounding errors than "ordinary" systems, and avoiding those errors requires some attention.

First, one should never compute $X^TX$ explicitly unless there is a very good reason for that (e.g., when only a very rough approximation to the LS solution is needed and $X$ is not very ill-conditioned). Usually, a good idea is to use some sort of orthogonalisation algorithm applied to $X$ instead.
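For instance, the standard orthogonalisation route is a QR factorisation of $X$; a minimal numpy sketch (illustrative, names arbitrary) might look like this:

```python
import numpy as np

def qr_least_squares(X, Y):
    """Least-squares solution via the thin QR factorisation X = Q R:
    solve R Z = Q^T Y.  This avoids forming X^T X, whose condition
    number is the square of that of X."""
    Q, R = np.linalg.qr(X)              # Q: N x M, orthonormal columns; R: M x M
    return np.linalg.solve(R, Q.T @ Y)

# In practice one would typically just call a library routine, e.g.
# Z, *_ = np.linalg.lstsq(X, Y, rcond=None)
```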

Second, the LS solution can be much more sensitive to the data perturbations (and hence rounding errors). Just for comparison, consider the following. Assume $X\in\mathbb{R}^{N\times M}$ has full column rank, $Y\in\mathbb{R}^{N}$, and we seek to solve for $Z=X^{\dagger}Y=(X^TX)^{-1}X^TY$. Assume that instead of this $Z$ we get a $\tilde{Z}=\tilde{X}^{\dagger}\tilde{Y}$, where $\tilde{X}=X+E$ and $\tilde{Y}=Y+F$ with $\|E\|_2\leq\epsilon\|X\|_2$, $\|F\|_2\leq\epsilon\|Y\|_2$, and $\epsilon\kappa_2(X)<1$. That is, $\tilde{Z}$ is a solution of the LS problem with some small perturbation of the given data.

Now in the case of the "ordinary" system with $M=N$, we have the following bound relating the forward error (the relative norm of $Z-\tilde{Z}$) to the data perturbations:
$$
\frac{\|Z-\tilde{Z}\|_2}{\|Z\|_2}\leq\frac{2\epsilon\kappa_2(X)}{1-\epsilon\kappa_2(X)}.
$$
This means that the sensitivity of the solution of the ordinary (consistent) linear system with respect to the data perturbations is essentially given by the condition number of $X$. However, for the general LS problem, we have
$$
\frac{\|Z-\tilde{Z}\|_2}{\|Z\|_2}\leq \frac{\epsilon\kappa_2(X)}{1-\epsilon\kappa_2(X)} \left[2+(\kappa_2(X)+1)\frac{\|R\|_2}{\|X\|_2\|Z\|_2}\right],
$$
where $R=Y-XZ$ is the residual. This means that for problems with relatively large residuals (where the data are not "well correlated") the sensitivity of the LS solution is in fact governed by the square of the condition number of $X$.
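To make the role of $\kappa_2(X)$ tangible, here is a small illustrative experiment (not from the original question): a nearly rank-deficient $X$ for which the squared condition number shows up as soon as $X^TX$ is formed explicitly, even though the system here is consistent, while the QR route retains far more accuracy.

```python
import numpy as np

eps = 1e-7
X = np.array([[1.0, 1.0],
              [eps, 0.0],
              [0.0, eps]])            # kappa_2(X) ~ 1.4e7
Z_true = np.array([1.0, 1.0])
Y = X @ Z_true

# Normal equations: accuracy governed by kappa_2(X)^2 ~ 2e14.
Z_ne = np.linalg.inv(X.T @ X) @ (X.T @ Y)

# QR factorisation: accuracy governed by kappa_2(X) ~ 1.4e7.
Q, R = np.linalg.qr(X)
Z_qr = np.linalg.solve(R, Q.T @ Y)

print(np.linalg.norm(Z_ne - Z_true))  # typically on the order of 1e-2
print(np.linalg.norm(Z_qr - Z_true))  # typically on the order of 1e-9
```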
