PCA vs SVD – Understanding the Advantages and Differences

least squarespcasvd

I know how to calculate PCA and SVD mathematically, and I know that both can be applied to Linear Least Squares regression.

The main advantage of SVD mathematically seems to be that it can be applied to non-square matrices.

Both focus on the decomposition of the $X^\top X$ matrix. Other than the advantage of SVD mentioned, are there any additional advantages or insights provided by using SVD over PCA?

I'm really looking for the intuition rather than any mathematical differences.

Best Answer

As @ttnphns and @nick-cox said, SVD is a numerical method and PCA is an analysis approach (like least squares). You can do PCA using SVD, or you can do PCA doing the eigen-decomposition of $X^T X$ (or $X X^T$), or you can do PCA using many other methods, just like you can solve least squares with a dozen different algorithms like Newton's method or gradient descent or SVD etc.

So there is no "advantage" to SVD over PCA because it's like asking whether Newton's method is better than least squares: the two aren't comparable.

Related Solutions

Solved – LSA vs. PCA (document clustering)

PCA and LSA are both analyses which use SVD. PCA is a general class of analysis and could in principle be applied to enumerated text corpora in a variety of ways. In contrast LSA is a very clearly specified means of analyzing and reducing text. Both are leveraging the idea that meaning can be extracted from context. In LSA the context is provided in the numbers through a term-document matrix. In the PCA you proposed, context is provided in the numbers through providing a term covariance matrix (the details of the generation of which probably can tell you a lot more about the relationship between your PCA and LSA). You may want to look here for more details.
You are basically on track here. The exact reasons they are used will depend on the context and the aims of the person playing with the data.
The answer will probably depend on the implementation of the procedure you are using.
Carefully and with great art. Most consider the dimensions of these semantic models to be uninterpretable. Note that you almost certainly expect there to be more than one underlying dimension. When there is more than one dimension in factor analysis, we rotate the factor solution to yield interpretable factors. However, for some reason this is not typically done for these models. Your approach sounds like a principled way to start your art... although I'd be less than certain the scaling between dimensions is similar enough to trust a cluster analysis solution. If you want to play around with meaning, you might also consider a simpler approach in which the vectors have a direct relationship with specific words, e.g. HAL.

Principal Component Analysis – Why Use PCA of Data by Means of SVD

Here are my 2ct on the topic

The chemometrics lecture where I first learned PCA used solution (2), but it was not numerically oriented, and my numerics lecture was only an introduction and didn't discuss SVD as far as I recall.
If I understand Holmes: Fast SVD for Large-Scale Matrices correctly, your idea has been used to get a computationally fast SVD of long matrices.
That would mean that a good SVD implementation may internally follow (2) if it encounters suitable matrices (I don't know whether there are still better possibilities). This would mean that for a high-level implementation it is better to use the SVD (1) and leave it to the BLAS to take care of which algorithm to use internally.

Quick practical check: OpenBLAS's svd doesn't seem to make this distinction, on a matrix of 5e4 x 100, svd (X, nu = 0) takes on median 3.5 s, while svd (crossprod (X), nu = 0) takes 54 ms (called from R with microbenchmark).
The squaring of the eigenvalues of course is fast, and up to that the results of both calls are equvalent.

timing  <- microbenchmark (svd (X, nu = 0), svd (crossprod (X), nu = 0), times = 10)
timing
# Unit: milliseconds
#                      expr        min         lq    median         uq        max neval
#            svd(X, nu = 0) 3383.77710 3422.68455 3507.2597 3542.91083 3724.24130    10
# svd(crossprod(X), nu = 0)   48.49297   50.16464   53.6881   56.28776   59.21218    10

update: Have a look at Wu, W.; Massart, D. & de Jong, S.: The kernel PCA algorithms for wide data. Part I: Theory and algorithms , Chemometrics and Intelligent Laboratory Systems , 36, 165 - 172 (1997). DOI: http://dx.doi.org/10.1016/S0169-7439(97)00010-5

This paper discusses numerical and computational properties of 4 different algorithms for PCA: SVD, eigen decomposition (EVD), NIPALS and POWER.

They are related as follows:

computes on      extract all PCs at once       sequential extraction    
X                SVD                           NIPALS    
X'X              EVD                           POWER

The context of the paper are wide $\mathbf X^{(30 \times 500)}$, and they work on $\mathbf{XX'}$ (kernel PCA) - this is just the opposite situation as the one you ask about. So to answer your question about long matrix behaviour, you need to exchange the meaning of "kernel" and "classical".

performance comparison

Not surprisingly, EVD and SVD change places depending on whether the classical or kernel algorithms are used. In the context of this question this means that one or the other may be better depending on the shape of the matrix.

But from their discussion of "classical" SVD and EVD it is clear that the decomposition of $\mathbf{X'X}$ is a very usual way to calculate the PCA. However, they do not specify which SVD algorithm is used other than that they use Matlab's svd () function.

    > sessionInfo ()
    R version 3.0.2 (2013-09-25)
    Platform: x86_64-pc-linux-gnu (64-bit)

    locale:
     [1] LC_CTYPE=de_DE.UTF-8       LC_NUMERIC=C               LC_TIME=de_DE.UTF-8        LC_COLLATE=de_DE.UTF-8     LC_MONETARY=de_DE.UTF-8   
     [6] LC_MESSAGES=de_DE.UTF-8    LC_PAPER=de_DE.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C            
    [11] LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C       

    attached base packages:
    [1] stats     graphics  grDevices utils     datasets  methods   base     

    other attached packages:
    [1] microbenchmark_1.3-0

loaded via a namespace (and not attached):
[1] tools_3.0.2

$ dpkg --list libopenblas*
[...]
ii  libopenblas-base              0.1alpha2.2-3                 Optimized BLAS (linear algebra) library based on GotoBLAS2
ii  libopenblas-dev               0.1alpha2.2-3                 Optimized BLAS (linear algebra) library based on GotoBLAS2

Best Answer

Related Solutions

Solved – LSA vs. PCA (document clustering)

Principal Component Analysis – Why Use PCA of Data by Means of SVD

Related Question