Solved – Difference between scikit-learn implementations of PCA and TruncatedSVD

Tags: pca, scikit-learn, scipy, svd

I understand the relation between Principal Component Analysis and Singular Value Decomposition at an algebraic/exact level. My question is about the scikit-learn implementation.

The documentation says: "[TruncatedSVD] is very similar to PCA, but operates on sample vectors directly, instead of on a covariance matrix.", which would reflect the algebraic difference between the two approaches. However, it later says: "This estimator [TruncatedSVD] supports two algorithm: a fast randomized SVD solver, and a “naive” algorithm that uses ARPACK as an eigensolver on (X * X.T) or (X.T * X), whichever is more efficient.". Regarding PCA, it says: "Linear dimensionality reduction using Singular Value Decomposition of the data to project it …". The PCA implementation supports the same two solvers (randomized and ARPACK) plus a third one based on LAPACK. Looking into the code, I can see that both the ARPACK and LAPACK solvers, in both PCA and TruncatedSVD, perform SVD on the sample data X, with ARPACK additionally able to handle sparse matrices (via svds).

So, aside from exposing different attributes and methods, and the fact that PCA can additionally compute an exact full singular value decomposition using LAPACK, the scikit-learn implementations of PCA and TruncatedSVD appear to be exactly the same algorithm. First question: is this correct?

Second question: even though the LAPACK and ARPACK solvers use scipy.linalg.svd(X) and scipy.sparse.linalg.svds(X), where X is the sample matrix, do they internally compute the singular value decomposition or eigendecomposition of $X^T X$ or $X X^T$? The "randomized" solver, by contrast, doesn't need to compute that product. (This is relevant for numerical stability; see Why PCA of data by means of SVD of the data?). Is this correct?
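As a quick sanity check of the premise (a sketch, not part of either question's answer): the two scipy routines mentioned above should agree on the leading singular values of the same sample matrix, whatever they do internally.

```python
import numpy as np
from scipy.linalg import svd
from scipy.sparse.linalg import svds

rng = np.random.RandomState(0)
X = rng.randn(50, 20)  # toy sample matrix

# Full SVD via LAPACK; singular values come back in descending order
s_full = svd(X, compute_uv=False)

# Truncated SVD via ARPACK: k largest singular values,
# returned in ascending order, so reverse them
_, s_trunc, _ = svds(X, k=5)
s_trunc = s_trunc[::-1]

print(np.allclose(s_full[:5], s_trunc))  # True: top-5 values match
```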

Relevant code: PCA line 415. TruncatedSVD line 137.

Best Answer

PCA and TruncatedSVD scikit-learn implementations seem to be exactly the same algorithm.

No: PCA is (truncated) SVD on centered data (after per-feature mean subtraction). If the data is already centered, the two classes do the same thing.

In practice TruncatedSVD is useful on large sparse datasets which cannot be centered without making the memory usage explode.
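A small sketch illustrating the point above: on mean-centered data, PCA and TruncatedSVD recover the same components (up to sign flips, which neither decomposition pins down).

```python
import numpy as np
from sklearn.decomposition import PCA, TruncatedSVD

rng = np.random.RandomState(0)
X = rng.randn(100, 10)
Xc = X - X.mean(axis=0)  # per-feature mean subtraction

pca = PCA(n_components=3, svd_solver="full").fit(Xc)
tsvd = TruncatedSVD(n_components=3, algorithm="arpack").fit(Xc)

# Components may differ by a sign, so compare absolute values
same = np.allclose(np.abs(pca.components_), np.abs(tsvd.components_))
print(same)  # True on centered data
```

On the raw, uncentered X the two fits would generally differ, which is exactly the algebraic gap between SVD of the data and SVD of the centered data.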

  • numpy.linalg.svd and scipy.linalg.svd both rely on the LAPACK _GESDD routine (the divide-and-conquer driver) described here: http://www.netlib.org/lapack/lug/node32.html

  • scipy.sparse.linalg.svds relies on ARPACK to perform an eigenvalue decomposition of X.T @ X or X @ X.T (depending on the shape of the data) via the Arnoldi iteration method. The HTML user guide of ARPACK has broken formatting that hides the computational details, but the Arnoldi iteration is well described on Wikipedia: https://en.wikipedia.org/wiki/Arnoldi_iteration
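The relation exploited above can be checked directly (a sketch on a dense toy matrix): the singular values of X are the square roots of the eigenvalues of X.T @ X.

```python
import numpy as np
from scipy.sparse.linalg import svds, eigsh

rng = np.random.RandomState(0)
X = rng.randn(30, 12)  # toy sample matrix

# 4 largest singular values of X (ARPACK, sorted descending)
_, s, _ = svds(X, k=4)
s = np.sort(s)[::-1]

# 4 largest-magnitude eigenvalues of the Gram matrix X.T @ X
w = eigsh(X.T @ X, k=4, which="LM", return_eigenvectors=False)
w = np.sort(w)[::-1]

print(np.allclose(s, np.sqrt(w)))  # True: sigma_i = sqrt(lambda_i)
```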

Here is the code for the ARPACK-based SVD in scipy:

https://github.com/scipy/scipy/blob/master/scipy/sparse/linalg/eigen/arpack/arpack.py#L1642 (search for "def svds" in case the line number changes in the source).