Solved – Difference between scikit-learn implementations of PCA and TruncatedSVD

Tags: pca, scikit-learn, scipy, svd

I understand the relation between Principal Component Analysis and Singular Value Decomposition at an algebraic/exact level. My question is about the scikit-learn implementation.

The documentation says: "[TruncatedSVD] is very similar to PCA, but operates on sample vectors directly, instead of on a covariance matrix.", which would reflect the algebraic difference between the two approaches. However, it later says: "This estimator [TruncatedSVD] supports two algorithm: a fast randomized SVD solver, and a “naive” algorithm that uses ARPACK as an eigensolver on (X * X.T) or (X.T * X), whichever is more efficient.". Regarding PCA, it says: "Linear dimensionality reduction using Singular Value Decomposition of the data to project it …". The PCA implementation supports the same two solvers (randomized and ARPACK) plus a third one based on LAPACK. Looking into the code, I can see that both the ARPACK and LAPACK solvers, in both PCA and TruncatedSVD, perform SVD on the sample data X, with ARPACK additionally able to handle sparse matrices (via svds).

So, aside from exposing different attributes and methods, and the fact that PCA can additionally compute an exact full singular value decomposition using LAPACK, the scikit-learn implementations of PCA and TruncatedSVD appear to be exactly the same algorithm. First question: is this correct?

Second question: even though the LAPACK and ARPACK solvers use scipy.linalg.svd(X) and scipy.sparse.linalg.svds(X), where X is the sample matrix, do they internally compute the singular value decomposition or eigendecomposition of $X^T X$ or $X X^T$? The "randomized" solver, by contrast, doesn't need to compute that product. (This is relevant for numerical stability; see Why PCA of data by means of SVD of the data?). Is this correct?
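As a quick sanity check of the premise (a sketch, not part of either question's answer): the two scipy routines mentioned above should agree on the leading singular values of the same sample matrix, whatever they do internally.

```python
import numpy as np
from scipy.linalg import svd
from scipy.sparse.linalg import svds

rng = np.random.RandomState(0)
X = rng.randn(50, 20)  # toy sample matrix

# Full SVD via LAPACK; singular values come back in descending order
s_full = svd(X, compute_uv=False)

# Truncated SVD via ARPACK: k largest singular values,
# returned in ascending order, so reverse them
_, s_trunc, _ = svds(X, k=5)
s_trunc = s_trunc[::-1]

print(np.allclose(s_full[:5], s_trunc))  # True: top-5 values match
```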

Relevant code: PCA line 415. TruncatedSVD line 137.

Best Answer

PCA and TruncatedSVD scikit-learn implementations seem to be exactly the same algorithm.

No: PCA is (truncated) SVD on centered data (after per-feature mean subtraction). If the data is already centered, the two classes do the same thing.

In practice TruncatedSVD is useful on large sparse datasets which cannot be centered without making the memory usage explode.
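A small sketch illustrating the point above: on mean-centered data, PCA and TruncatedSVD recover the same components (up to sign flips, which neither decomposition pins down).

```python
import numpy as np
from sklearn.decomposition import PCA, TruncatedSVD

rng = np.random.RandomState(0)
X = rng.randn(100, 10)
Xc = X - X.mean(axis=0)  # per-feature mean subtraction

pca = PCA(n_components=3, svd_solver="full").fit(Xc)
tsvd = TruncatedSVD(n_components=3, algorithm="arpack").fit(Xc)

# Components may differ by a sign, so compare absolute values
same = np.allclose(np.abs(pca.components_), np.abs(tsvd.components_))
print(same)  # True on centered data
```

On the raw, uncentered X the two fits would generally differ, which is exactly the algebraic gap between SVD of the data and SVD of the centered data.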

  • numpy.linalg.svd and scipy.linalg.svd both rely on the LAPACK _GESDD routine (the divide-and-conquer driver) described here: http://www.netlib.org/lapack/lug/node32.html

  • scipy.sparse.linalg.svds relies on ARPACK to perform an eigenvalue decomposition of X.T @ X or X @ X.T (depending on the shape of the data) via the Arnoldi iteration method. The HTML user guide of ARPACK has broken formatting that hides the computational details, but the Arnoldi iteration is well described on Wikipedia: https://en.wikipedia.org/wiki/Arnoldi_iteration
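The relation exploited above can be checked directly (a sketch on a dense toy matrix): the singular values of X are the square roots of the eigenvalues of X.T @ X.

```python
import numpy as np
from scipy.sparse.linalg import svds, eigsh

rng = np.random.RandomState(0)
X = rng.randn(30, 12)  # toy sample matrix

# 4 largest singular values of X (ARPACK, sorted descending)
_, s, _ = svds(X, k=4)
s = np.sort(s)[::-1]

# 4 largest-magnitude eigenvalues of the Gram matrix X.T @ X
w = eigsh(X.T @ X, k=4, which="LM", return_eigenvectors=False)
w = np.sort(w)[::-1]

print(np.allclose(s, np.sqrt(w)))  # True: sigma_i = sqrt(lambda_i)
```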

Here is the code for the ARPACK-based SVD in scipy:

https://github.com/scipy/scipy/blob/master/scipy/sparse/linalg/eigen/arpack/arpack.py#L1642 (search for "def svds" in case the line number changes in the source).