Solved – Confirming an understanding of SVD

matrix decomposition, svd

I'm trying to sift through the concepts of low-rank approximations, matrix factorizations and SVD. There's a lot of info out there and the rabbit hole is deep, so I just want to make sure my high-level understanding is accurate:

  1. A common application of SVD is to make low-rank approximations to a matrix, $A$
  2. A low-rank approximation is akin to finding an approximation that retains the minimum number of rows (or columns) that are linearly independent (the rank of $A$). Loosely speaking, we think of these remaining rows as a minimum bound on the useful information contained in $A$. So finding the minimum number is good for making matrices smaller but useful.
  3. One way to do this is to find another matrix, $B$, such that we minimize the Frobenius norm, $\|A - B\|_{F}$
  4. Matrix factorization is an application of low-rank approximation, wherein we can find some $RQ^{T} = B$
  5. SVD is a common matrix factorization algorithm of the form $A = U \Sigma V^{T}$
  6. In addition to factorization, SVD is used to solve equations of the form $Ax=b$. In this regard, under certain conditions, SVD is preferred to QR or Cholesky decomposition.
  7. SVD is a general case (for $m \times n$ matrices) of an eigen-decomposition
  8. The best way (or a very good way??) to find a low-rank approximation to $A$ is by performing SVD
  9. SVD is very popular because it exposes eigenvalues and eigenvectors, which have a plethora of useful properties used in Machine Learning/Statistical applications, making it preferential to QR or Cholesky for factorization.
  10. Using SVD, we can re-write $A$ as a linear combination of its singular values (from the diagonal of $\Sigma$) and the left and right singular vectors, i.e. the columns of $U$ and $V^{T}$
  11. Calculating the SVD consists of finding the eigenvalues and eigenvectors of $AA^{T}$ ($U$) and $A^{T}A$ ($V$).
  12. Calculating this product is very expensive, so other algorithms (like truncated SVD, randomized SVD) are being explored

Are the above points accurate? I understand there is a lot more to these subjects, but I hope to have captured their relationship at a high level. Please let me know if I'm missing something painfully obvious.

Best Answer

I have tried to dispel some of the popular myths regarding the SVD.

  1. ``A common application of SVD is to make low-rank approximations to a matrix, $A$''

This is one of the applications of the SVD, but the SVD supports very many mathematical operations: solving linear equations (including least-squares problems and minimum-norm solutions), computing inverses (including the Moore-Penrose pseudoinverse), computing null spaces, computing the numerical rank of a matrix, and data and image compression. The SVD is the numerical backbone of the PCA (Principal Component Analysis). When I googled "SVD" today, I got 16.4 million hits (I do not know how many were for this decomposition) and "Singular Value Decomposition" gave me 2.5 million. So there must be many applications.
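As a rough illustration (a sketch assuming NumPy; any linear algebra library would do), several of these operations can be read off from a single SVD of $A$:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 3)) @ rng.standard_normal((3, 4))  # 6x4 matrix of rank 3
b = rng.standard_normal(6)

U, s, Vt = np.linalg.svd(A, full_matrices=True)

# numerical rank: count singular values above a tolerance of the form f(m,n)*eps*sigma_1
tol = max(A.shape) * np.finfo(float).eps * s[0]
r = int(np.sum(s > tol))

# Moore-Penrose pseudoinverse and the minimum-norm least-squares solution of Ax = b
s_inv = np.where(s > tol, 1.0 / s, 0.0)
A_pinv = Vt.T @ np.diag(s_inv) @ U[:, :len(s)].T
x = A_pinv @ b

# null space of A: right singular vectors belonging to the (numerically) zero singular values
null_space = Vt[r:].T

print("numerical rank:", r)                                              # 3
print("normal-equation residual:", np.linalg.norm(A.T @ (A @ x - b)))    # ~ 0
print("A @ null_space ~ 0:", np.allclose(A @ null_space, 0))             # True
```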

  2. ``A low-rank approximation is akin to finding an approximation that retains the minimum number of rows (or columns) that are linearly independent (the rank of $A$). Loosely speaking, we think of these remaining rows as a minimum bound on the useful information contained in $A$. So finding the minimum number is good for making matrices smaller but useful.''

It depends on the definition you use to determine the numerical rank of the $m \times n$ matrix $A$. Usually, we compute the ratios $\sigma_{k}/\sigma_1$, where $\sigma_k$ is the $k$th singular value. If a ratio is less than some pre-determined tolerance, then we can set the corresponding small singular values to zero. Any ratio less than the machine precision $\epsilon$ ($2^{-53}$ for 8-byte floating-point numbers) is noise. A modest linear function of the form $f(m,n)\,\epsilon$ may be used for the tolerance.

A low rank approximation is of the form $$ \hat{A} = \hat{U} \hat{\Sigma} \hat{V}^t $$ where $\hat{U}$ is the first $k$ columns of $U$, $\hat{V}$ is the first $k$ columns of $V$, and $\hat{\Sigma}$ is the principal $k \times k$ submatrix of $\Sigma$.

Note that the dimensions of $\hat{A}$ are the same as those of $A$.
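Here is a minimal sketch of that truncation, again assuming NumPy and an illustrative tolerance on $\sigma_k/\sigma_1$:

```python
import numpy as np

rng = np.random.default_rng(1)
# a matrix that is numerically of low rank: rank 5 plus tiny noise
A = rng.standard_normal((100, 5)) @ rng.standard_normal((5, 40)) \
    + 1e-8 * rng.standard_normal((100, 40))

U, s, Vt = np.linalg.svd(A, full_matrices=False)

tol = 1e-6                        # illustrative tolerance on sigma_k / sigma_1
k = int(np.sum(s / s[0] > tol))   # number of retained singular values

# A_hat = U_hat Sigma_hat V_hat^t, built from the first k columns of U and V
A_hat = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

print("k =", k)                                                           # 5
print("same shape as A:", A_hat.shape == A.shape)                         # True
print("relative error:", np.linalg.norm(A - A_hat) / np.linalg.norm(A))   # tiny
```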

  3. ``One way to do this is to find another matrix, $B$, such that we minimise the Frobenius Norm, $\|A-B\|_{F}$.''

This is true for the Frobenius norm as well as the spectral norm, $\|A-B\|_2$, where the spectral norm is defined as the largest singular value of the matrix.
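This optimality is easy to check numerically (a sketch assuming NumPy): the rank-$k$ truncation has spectral-norm error $\sigma_{k+1}$ and Frobenius-norm error $\left(\sum_{i>k}\sigma_i^2\right)^{1/2}$.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((30, 20))
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 5
B = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]   # rank-k truncation of the SVD

print(np.linalg.norm(A - B, 2),     s[k])                        # spectral error = sigma_{k+1}
print(np.linalg.norm(A - B, 'fro'), np.sqrt(np.sum(s[k:]**2)))   # Frobenius error
```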

  4. ``Matrix factorisation is an application of low-rank approximation, wherein we can find some $RQ^t=B$.''

I do not understand this statement.

  5. ``SVD is a common matrix factorisation algorithm of the form $A=U\Sigma V^t$.''

The SVD is the most useful decomposition available for a rectangular matrix. It is also very useful for square matrices but there are other useful factorisations for square matrices.

  6. ``In addition to factorisation, SVD is used to solve equations of the form $Ax=b$. In this regard, under certain conditions, SVD is preferred to QR or Cholesky decomposition.''

If $A$ is a symmetric positive definite matrix, Cholesky factorisation is the preferred (and inexpensive) method for solving the linear system. If $A$ is a non-singular square matrix, then the LU decomposition (Gaussian elimination with pivoting) can be used. When there are rank deficiencies or the matrix is rectangular, then the QR decomposition and the SVD are preferable. These cases include least-squares and minimum-norm solutions. For difficult problems, the SVD is usually the best choice, but there are experts who can use QR and LU decompositions in ingenious ways.
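A sketch of these choices, assuming NumPy/SciPy as the environment (the routine names below are theirs, chosen here only for illustration):

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve, lu_factor, lu_solve

rng = np.random.default_rng(3)
b = rng.standard_normal(5)

# symmetric positive definite system: Cholesky
M = rng.standard_normal((5, 5))
S = M @ M.T + 5 * np.eye(5)
x_chol = cho_solve(cho_factor(S), b)

# non-singular square system: LU (Gaussian elimination with pivoting)
A_sq = rng.standard_normal((5, 5))
x_lu = lu_solve(lu_factor(A_sq), b)

# rectangular / possibly rank-deficient system: SVD-based least squares
A_rect = rng.standard_normal((8, 5))
b_rect = rng.standard_normal(8)
x_ls, residual, rank, sing_vals = np.linalg.lstsq(A_rect, b_rect, rcond=None)
```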

  7. ``SVD is a general case (for $m \times n$ matrices) of an eigen-decomposition.''

The SVD is based on two eigenvalue problems, $$ A^tA v_i = \sigma_i^2 v_i ,\;\;\; AA^t u_i = \sigma_i^2 u_i , $$ where $u_i$ and $v_i$ are the $i$th columns of the singular vector matrices $U$ and $V$, respectively. However, numerical analysts very rarely compute $A^tA$ and $AA^t$. These two eigenvalue problems are solved implicitly, without forming these numerically damaging products.

Note that $$ A v_i = \sigma_i u_i, \;\;\; A^tu_i = \sigma_i v_i $$ are the two primary singular vector expressions for the SVD. These two equations can be amalgamated to obtain an eigenvalue problem $$ \left[ \begin{array}{cc} 0 & A \\ A^t & 0 \end{array} \right] \left[ \begin{array}{c} u_i \\ v_i \end{array} \right] = \sigma_i \left[ \begin{array}{c} u_i \\ v_i \end{array} \right] . $$
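A quick numerical check of this augmented eigenvalue problem (a sketch assuming NumPy): the symmetric block matrix has eigenvalues $\pm\sigma_i$, plus zeros when $A$ is rectangular.

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((5, 3))
m, n = A.shape

aug = np.block([[np.zeros((m, m)), A],
                [A.T,              np.zeros((n, n))]])

eigvals = np.linalg.eigvalsh(aug)               # eigenvalues of the symmetric block matrix
sigma   = np.linalg.svd(A, compute_uv=False)    # singular values of A

print(np.sort(eigvals)[::-1][:n])   # the n positive eigenvalues ...
print(sigma)                        # ... equal the singular values of A
```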

  8. ``The best way (or a very good way??) to find a low-rank approximation to $A$ is by performing SVD.''

Yes.

  9. ``SVD is very popular because it exposes eigenvalues and eigenvectors, which have a plethora of useful properties used in Machine Learning/Statistical applications, making it preferential to QR or Cholesky for factorisation.''

In statistics, $A^tA$ (or $AA^t$) is proportional to a sample covariance matrix if $A$ is a data matrix (with the means removed). The right (or left) singular vectors of $A$ are therefore the eigenvectors of that covariance matrix, which is why the SVD is the numerical backbone of the PCA mentioned in point 1.
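A short sketch of this connection, assuming NumPy: for a centred data matrix $X$ with $n$ rows, the eigenvalues of the sample covariance $X^tX/(n-1)$ are $\sigma_i^2/(n-1)$.

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.standard_normal((200, 4)) @ rng.standard_normal((4, 4))
X = X - X.mean(axis=0)                       # remove the column means
n = X.shape[0]

U, s, Vt = np.linalg.svd(X, full_matrices=False)

cov = X.T @ X / (n - 1)                      # sample covariance matrix
eig = np.linalg.eigvalsh(cov)[::-1]          # its eigenvalues, in descending order

print(eig)
print(s**2 / (n - 1))                        # the same values: sigma_i^2 / (n - 1)
```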

  10. ``Using SVD, we can re-write $A$ as a linear combination of its singular values (from the diagonal of $\Sigma$) and the left and right singular vectors, i.e. the columns of $U$ and $V^t$.''

The matrix may be expressed as a sum of rank-one outer products $$ A = \sum_{i=1}^{r} \sigma_i u_i v_i^t, \;\;\; r=\min(m,n). $$ For reduced-rank approximations, use $r < \min(m,n)$.
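A minimal check of this expansion (a sketch assuming NumPy): the full sum recovers $A$, and truncating the sum gives a reduced-rank approximation.

```python
import numpy as np

rng = np.random.default_rng(6)
A = rng.standard_normal((6, 4))
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# full sum of rank-one terms sigma_i * u_i * v_i^t recovers A exactly
A_sum = sum(s[i] * np.outer(U[:, i], Vt[i, :]) for i in range(len(s)))
print(np.allclose(A, A_sum))        # True

# truncating the sum gives a reduced-rank approximation
A_2 = sum(s[i] * np.outer(U[:, i], Vt[i, :]) for i in range(2))
print(np.linalg.matrix_rank(A_2))   # 2
```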

  11. ``Calculating the SVD consists of finding the eigenvalues and eigenvectors of $AA^t$ (for $U$) and $A^tA$ (for $V$).''

No (as explained in point 7).

  12. ``Calculating this product is very expensive, so other algorithms (like truncated SVD, randomised SVD) are being explored.''

It is not expensive unless the matrix is extremely large. There are different algorithms (Lanczos methods) for very large (sparse) matrices.

For $m \ge n$ (if not, transpose the matrix), the nominal cost is $O(mn^2)$ as measured by multiplication counts. However, in modern computing architectures, some of the arithmetic can be done in parallel. Furthermore, cache management using level-3 BLAS (Basic Linear Algebra Subroutines) increases the efficiency. Thus, the $O(mn^2)$ complexity is often a crude measure.

The first part of the SVD algorithm reduces the matrix to an $n \times n$ triangular matrix using the QR decomposition, and the second part transforms that triangular matrix into an $n \times n$ bidiagonal matrix. These first two steps are not iterative; they are straightforward sequences of orthogonal transformations. The iterative part is finding the SVD of the bidiagonal matrix, which is performed using the QR algorithm (not to be confused with the QR decomposition) or a divide-and-conquer algorithm. If only a few singular values/vectors are required, then usually a bisection algorithm is used for the required singular values, followed by inverse iteration to get the corresponding singular vectors.
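If only a few singular triplets of a very large sparse matrix are needed, a Lanczos-type routine can be used instead of the dense algorithm above; here is a sketch assuming SciPy's `scipy.sparse.linalg.svds` (my choice of implementation, purely for illustration):

```python
import numpy as np
from scipy.sparse import random as sparse_random
from scipy.sparse.linalg import svds

# a large, very sparse matrix (about 50,000 non-zeros)
A = sparse_random(10_000, 5_000, density=1e-3, random_state=7, format='csr')

# six largest singular triplets, computed without forming A^t A or a dense SVD
U, s, Vt = svds(A, k=6)
print(np.sort(s)[::-1])
```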

Most of the details can be found in "Matrix Computations" by Gene Golub and Charles Van Loan.