The SVD Solution to Linear Least Squares / Linear System of Equations

least squares, linear algebra, numerical linear algebra, svd, systems of equations

I'm a little confused about the various explanations for using the Singular Value Decomposition (SVD) to solve the Linear Least Squares (LLS) problem. I understand that LLS attempts to fit $Ax=b$ by minimizing $\|A\hat{x}-b\|$, then calculating the vector $\hat{x}$ such that $\hat{x}=(A^{\top}A)^{-1}A^{\top}b$.

But my questions are in relation to the two explanations given at "SVD and least squares proof" and "Why does SVD provide the least squares solution to $Ax=b$?":

  1. Why do we need (or care) to calculate $\hat{x}=V{\Sigma}^{-1}U^{\top}b$, where $\mathrm{SVD}(A)=U\Sigma V^{\top}$, when $\hat{x}$ can be calculated via the pseudo-inverse mentioned above ($\hat{x}=(A^{\top}A)^{-1}A^{\top}b$)?

  2. The first post mentions that we are subject to the constraint $\|\hat{x}\|=1$. What happens when the least squares solution does not have $\|\hat{x}\|=1$? Does this invalidate using the SVD for the solution of $\hat{x}$, or is there a "back-door" approach?

  3. How do the answers to the questions above (as well as our approach) change when we are minimizing $\|Ax\|$ (the homogeneous case $Ax=0$) versus fitting a generic $Ax=b$? Example: when the SVD of $A$ is $U\Sigma V^{\top}$ (that is, $A\hat{x}=U\Sigma V^{\top}\hat{x}$), I would think we only care about the smallest singular value $\sigma_i$ in $\Sigma$ when solving $Ax=0$, since using the smallest $\sigma_i$ does not necessarily give the best fit for $u_i \sigma_i v^{\top}_i \hat{x} = b$.

Much thanks,
Jeff

Best Answer

For the full rank least squares problem, where $A \in \mathbb{K}^{m \times n},m>n=\mathrm{rank}(A)$ ($\mathbb{K}$ is the base field), the solution is $(A^T A)^{-1} A^T b$. This is a very bad way to approach the problem numerically for condition number reasons: you roughly square the condition number, so a relatively tractable problem with $\kappa=10^8$ becomes a hopelessly intractable problem with $\kappa=10^{16}$ (where we think about tractability in double precision floating point). The condition number also enters into convergence rates for certain iterative methods, so such methods often perform poorly for the normal equations.
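To see the squaring concretely, here is a minimal NumPy sketch; the matrix construction and the constants are my own illustrative choices, not from the original answer:

```python
import numpy as np

# Construct a tall matrix with prescribed singular values from 1 down to 1e-8,
# so kappa(A) is about 1e8 by design.
rng = np.random.default_rng(0)
m, n = 100, 10
U, _ = np.linalg.qr(rng.standard_normal((m, n)))  # orthonormal columns
V, _ = np.linalg.qr(rng.standard_normal((n, n)))
A = U @ np.diag(np.logspace(0, -8, n)) @ V.T

print(np.linalg.cond(A))        # ~1e8: still tractable in double precision
print(np.linalg.cond(A.T @ A))  # ~1e16: forming the normal equations squares kappa
```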

The SVD pseudoinverse is exactly the same as the normal-equations pseudoinverse, i.e. $(A^T A)^{-1} A^T$: you simply compute it using the SVD and simplify. There is indeed a simplification; the end result is

$$(A^T A)^{-1} A^T = V (\Sigma^T \Sigma)^{-1} \Sigma^T U^T.$$

This means that if I know the singular vector matrices $U$ and $V$, then I can transform the problem of finding the pseudoinverse of $A$ into the (trivial) problem of finding the pseudoinverse of $\Sigma$: the factor $(\Sigma^T \Sigma)^{-1} \Sigma^T$ is exactly $\Sigma^+$, so the formula above is $V \Sigma^+ U^T$.
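Spelling out that simplification, substitute $A = U \Sigma V^T$ and use $U^T U = I$ and $V^T V = I$:

$$A^T A = V \Sigma^T \Sigma V^T \quad\Longrightarrow\quad (A^T A)^{-1} A^T = V (\Sigma^T \Sigma)^{-1} V^T \, V \Sigma^T U^T = V (\Sigma^T \Sigma)^{-1} \Sigma^T U^T = V \Sigma^+ U^T.$$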

The above is for the full rank problem. For the rank deficient problem with $m>n>\mathrm{rank}(A)$, the LS solution is not unique; in particular, $A^T A$ is not invertible. The usual choice is to take the solution of minimal Euclidean norm (I don't really know exactly why people do this, but you do need some criterion). It turns out that the SVD pseudoinverse gives you this minimal norm solution. Note that the SVD pseudoinverse still makes sense here, although it no longer takes the form I wrote above, since $\Sigma^T \Sigma$ is no longer invertible either. But you still obtain it in basically the same way (invert the nonzero singular values, leave the zeros alone), as sketched below.
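A minimal sketch of "invert the nonzero singular values, leave the zeros alone"; the function name and tolerance are hypothetical choices of mine, and in practice `numpy.linalg.pinv` or `numpy.linalg.lstsq` implement the same idea:

```python
import numpy as np

def minimum_norm_lstsq(A, b, rcond=1e-12):
    """Minimal Euclidean norm least squares solution x = V Sigma^+ U^T b.

    Singular values below rcond * sigma_max are treated as exact zeros,
    which handles the rank-deficient case.
    """
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    s_inv = np.zeros_like(s)
    keep = s > rcond * s[0]        # s is sorted in decreasing order
    s_inv[keep] = 1.0 / s[keep]    # invert the nonzero sigma, leave the zeros alone
    return Vt.T @ (s_inv * (U.T @ b))
```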

One nice thing about considering the rank-deficient problem is that even in the full rank case, if $A$ has some singular value "gap", one can forget about the singular values below this gap and obtain a good approximate solution to the full rank least squares problem. The SVD is the ideal method for elucidating this.
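Under the same assumptions as the sketch above, exploiting a gap just means keeping the top $k$ singular triplets; the cutoff $k$ (the location of the gap) is hypothetical and problem-dependent:

```python
import numpy as np

def truncated_svd_lstsq(A, b, k):
    """Approximate least squares solution from the top-k singular triplets,
    discarding everything below the singular value gap."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return Vt[:k].T @ ((U[:, :k].T @ b) / s[:k])
```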

The homogeneous problem is sort of unrelated to least squares; it is really an eigenvector problem which should be understood using different methods entirely. This is also where the $\|\hat{x}\|=1$ constraint from your second question comes from: without it, $x=0$ trivially minimizes $\|Ax\|$, so the constraint belongs to the homogeneous problem, not to ordinary least squares.
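For concreteness, in the same NumPy setting: the minimizer of $\|Ax\|$ over unit vectors is the right singular vector belonging to the smallest singular value (equivalently, an eigenvector of $A^T A$ for its smallest eigenvalue), which is a standard fact, though the helper below is my own sketch:

```python
import numpy as np

def homogeneous_solve(A):
    """argmin of ||A x|| over unit vectors ||x|| = 1: the right singular
    vector for the smallest singular value of A."""
    _, _, Vt = np.linalg.svd(A)
    return Vt[-1]  # singular values are sorted decreasingly, so take the last row
```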

Finally, a fourth comment, not directly related to your three questions: for reasonably small problems, there isn't much reason to do the SVD. You still should not use the normal equations, but the QR decomposition will do the job just as well, and it will terminate in an amount of time that you can know in advance.
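A sketch of the QR route, again assuming NumPy: for full column rank $A$, minimizing $\|Ax-b\|$ reduces to the triangular system $Rx = Q^T b$.

```python
import numpy as np

def qr_lstsq(A, b):
    """Full rank least squares via the reduced QR decomposition A = Q R."""
    Q, R = np.linalg.qr(A)              # Q has orthonormal columns, R is upper triangular
    return np.linalg.solve(R, Q.T @ b)  # a triangular solver could exploit R's structure
```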