[Math] SVD by QR and Cholesky decomposition – What is going on

cholesky-decomposition, linear-algebra, matrices, matrix-decomposition, svd

Here's an algorithm I found that performs Singular Value Decomposition. I preferred this algorithm because it can be parallelized, and I don't have to calculate the huge $AA^T$ matrix when the number of rows is very large:

Let $A$ be the matrix for which the SVD has to be performed.

$A = U{\Sigma}V^T$

1) First, perform a $QR$ decomposition of $A$:

$A=QR$

We obtain this factorization via steps 2 and 3 below.

2) To calculate the $R$ matrix, form $A^TA$ and take its Cholesky factorization; the upper-triangular factor is $R$, since $A^TA = R^TQ^TQR = R^TR$:

$R = \operatorname{Cholesky}(A^TA)$

3) Calculate the $Q$ matrix:

$Q = AR^{-1}$

4) Also, perform an SVD of the $R$ matrix (a small matrix compared to $A$ when $A$ has many more rows than columns):

$U_R{\Sigma}V^T = R$

The $\Sigma$ and $V$ matrices for $R$ will be the same as those in the SVD of $A$. Now calculate the $U$ matrix:

$U = QU_R$
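Putting the four steps together, here is a minimal NumPy/SciPy sketch of the procedure (the function name and the use of a triangular solve in place of an explicit $R^{-1}$ are my own choices; it assumes $A$ is tall with full column rank):

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

def svd_via_cholesky_qr(A):
    """SVD of a tall, full-column-rank A via the Cholesky-QR steps above."""
    # Step 2: R is the upper-triangular Cholesky factor of A^T A,
    # so A^T A = R^T R (valid because A^T A = R^T Q^T Q R = R^T R).
    R = cholesky(A.T @ A, lower=False)

    # Step 3: Q = A R^{-1}, computed by solving R^T Q^T = A^T
    # rather than forming R^{-1} explicitly.
    Q = solve_triangular(R, A.T, trans='T', lower=False).T

    # Step 4: SVD of the small square matrix R.
    U_R, S, Vt = np.linalg.svd(R)

    # A = Q R = (Q U_R) Sigma V^T, hence U = Q U_R.
    return Q @ U_R, S, Vt

# Quick check on a tall, well-conditioned matrix.
rng = np.random.default_rng(0)
A = rng.standard_normal((1000, 5))
U, S, Vt = svd_via_cholesky_qr(A)
print(np.allclose(U @ np.diag(S) @ Vt, A))  # True
```

Note that only the small Gram matrix $A^TA$ and the thin factors are ever formed, which is why the scheme parallelizes well over the rows of $A$.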

Now, I know how to construct the SVD through the eigenvectors and eigenvalues of $AA^T$ and $A^TA$, but what is going on here? Is there an intuitive explanation?

Best Answer

The reason calculating $R$ by the Cholesky factorization is worse than other methods is that it works on sums of squares and cross-products of floating-point numbers, which has been known since the 1980s to be numerically less accurate than the Householder and Givens QR algorithms. In the 1990s, during a short course on PLS, Svante Wold explained that many multivariate analysis software packages used the Householder algorithm to calculate PCs (for PCA) because it is the most numerically stable of the available algorithms. The Householder algorithm uses planar reflections and the Givens algorithm uses planar rotations. The advantage of the Cholesky algorithm is memory usage, which is much less than that of the other algorithms. However, this is less of an issue now; it was a serious concern from the 1960s into the early 1990s.
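As a small illustration of that stability gap (my own toy experiment; the matrix size and column scaling are arbitrary), compare the loss of orthogonality in $Q$ from Householder QR, which is what `numpy.linalg.qr` uses, with the Cholesky-based $Q$ from the question:

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

rng = np.random.default_rng(1)
# Columns scaled over six orders of magnitude -> ill-conditioned A.
A = rng.standard_normal((500, 6)) * np.logspace(0, -6, 6)

Q_house, _ = np.linalg.qr(A)                # Householder QR
R = cholesky(A.T @ A, lower=False)          # Cholesky-based QR
Q_chol = solve_triangular(R, A.T, trans='T', lower=False).T

# Loss of orthogonality: ||Q^T Q - I|| should be near machine epsilon.
I = np.eye(A.shape[1])
print(np.linalg.norm(Q_house.T @ Q_house - I))  # close to machine epsilon
print(np.linalg.norm(Q_chol.T @ Q_chol - I))    # orders of magnitude larger
```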

I have generated random matrices for which the above-mentioned algorithm does not give the same results as calculating the SVD of $A$ directly. This is likely to happen when $A$ contains large values which, when squared, have more digits than can be represented by the computer performing the calculations. (Such large numbers are rounded to the nearest representable floating-point number.) Keep in mind that intermediate steps in many data analyses often generate numbers that push the limits of floating-point number systems, and that floating-point number systems are not the same as the theoretical number systems we learned about in school.
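One way to reproduce this kind of discrepancy (the singular values $1$, $10^{-4}$, $10^{-7}$ are my own illustrative choice): squaring the smallest one in $A^TA$ leaves it barely above float64 round-off, so the Cholesky route loses most of its digits while the direct SVD keeps nearly all of them:

```python
import numpy as np
from scipy.linalg import cholesky

rng = np.random.default_rng(2)
# Build A with known singular values; 1e-7 squared is 1e-14, barely
# above the ~1e-16 rounding noise incurred when forming A^T A.
U0, _ = np.linalg.qr(rng.standard_normal((200, 3)))
V0, _ = np.linalg.qr(rng.standard_normal((3, 3)))
s_true = np.array([1.0, 1e-4, 1e-7])
A = (U0 * s_true) @ V0.T

s_direct = np.linalg.svd(A, compute_uv=False)
R = cholesky(A.T @ A, lower=False)
s_chol = np.linalg.svd(R, compute_uv=False)  # equals sigma of A in exact arithmetic

print(abs(s_direct - s_true))  # errors near machine epsilon
print(abs(s_chol - s_true))    # smallest value noticeably less accurate
```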

One nice feature of the Givens algorithm (pointed out to me by a friend who worked for SAS many years ago) when used in least-squares problems is that one can append rows to the end of $A$ and merely spend the time to update $Q$ and $R$ by performing a few more planar rotations, as sketched below. Also, the Givens algorithm yields least-squares residuals which are statistically independent/uncorrelated. In some situations having independent residuals is very important, and the Givens algorithm calculates them. However, this independence comes at a cost: if $A$ has $p$ columns, then $p$ of the least-squares residuals will be 0.
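Here is a minimal sketch of that row-update idea (my own illustration, updating only $R$; SciPy's `scipy.linalg.qr_insert` performs the analogous update of both factors). Each appended row is annihilated by one Givens rotation per column:

```python
import numpy as np

def qr_append_row(R, new_row):
    """Update the triangular factor R after appending one row to A,
    using Givens rotations; Q could be updated by the same rotations."""
    p = R.shape[0]
    work = np.vstack([R, new_row[None, :].astype(float)])
    for j in range(p):
        a, b = work[j, j], work[p, j]
        r = np.hypot(a, b)
        if r == 0.0:
            continue
        c, s = a / r, b / r
        # Plane rotation of rows j and p zeroes the new row's j-th entry.
        top = c * work[j, j:] + s * work[p, j:]
        bot = -s * work[j, j:] + c * work[p, j:]
        work[j, j:], work[p, j:] = top, bot
    return work[:p]

A = np.random.default_rng(3).standard_normal((10, 4))
_, R = np.linalg.qr(A)
row = np.random.default_rng(4).standard_normal(4)
# Same triangle (up to row signs) as refactoring the enlarged matrix.
_, R_full = np.linalg.qr(np.vstack([A, row]))
print(np.allclose(np.abs(qr_append_row(R, row)), np.abs(R_full)))  # True
```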