I've been reading some documentation about PCA and trying to use scikit-learn to implement it, but I struggle to understand what the attributes returned by sklearn.decomposition.PCA are.
From what I read here and from the name of this attribute, my first guess would be that the attribute .components_ is the matrix of principal components, meaning that if we have a data set X which can be decomposed using SVD as
X = USV^T
then I would expect the attribute .components_ to be equal to
XV = US.
To check this I took the first example from the Wikipedia page on singular value decomposition (here) and tried to reproduce it, to see whether I obtain what is expected. But I get something different. To be sure I hadn't made a mistake, I used scipy.linalg.svd to do the singular value decomposition of my matrix X, and I obtained the result described on Wikipedia:
import numpy as np
from scipy.linalg import svd

X = np.array([[1, 0, 0, 0, 2],
              [0, 0, 3, 0, 0],
              [0, 0, 0, 0, 0],
              [0, 2, 0, 0, 0]])
U, s, Vh = svd(X)
print('U = %s' % U)
print('Vh = %s' % Vh)
print('s = %s' % s)
output:
U = [[ 0. 1. 0. 0.]
[ 1. 0. 0. 0.]
[ 0. 0. 0. -1.]
[ 0. 0. 1. 0.]]
Vh = [[-0. 0. 1. 0. 0. ]
[ 0.4472136 0. 0. 0. 0.89442719]
[-0. 1. 0. 0. 0. ]
[ 0. 0. 0. 1. 0. ]
[-0.89442719 0. 0. 0. 0.4472136 ]]
s = [ 3. 2.23606798 2. 0. ]
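As a quick sanity check (my own addition, not part of the original post), the factorization above can be verified numerically: scipy.linalg.diagsvd rebuilds the rectangular diagonal matrix S from the singular values, and U S V^T should reproduce X exactly.

```python
import numpy as np
from scipy.linalg import svd, diagsvd

X = np.array([[1, 0, 0, 0, 2],
              [0, 0, 3, 0, 0],
              [0, 0, 0, 0, 0],
              [0, 2, 0, 0, 0]])
U, s, Vh = svd(X)

# Rebuild the 4x5 rectangular diagonal matrix S from the singular values
# and confirm that U @ S @ Vh reproduces X.
S = diagsvd(s, *X.shape)
print(np.allclose(X, U @ S @ Vh))  # True
```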
But with scikit-learn I obtain this:
from sklearn.decomposition import PCA

pca = PCA(svd_solver='auto', whiten=True)
pca.fit(X)
print(pca.components_)
print(pca.singular_values_)
and the output is
[[ -1.47295237e-01 -2.15005028e-01 9.19398392e-01 -0.00000000e+00
-2.94590475e-01]
[ 3.31294578e-01 -6.62589156e-01 1.10431526e-01 0.00000000e+00
6.62589156e-01]
[ -2.61816759e-01 -7.17459719e-01 -3.77506920e-01 0.00000000e+00
-5.23633519e-01]
[ 8.94427191e-01 -2.92048264e-16 -7.93318415e-17 0.00000000e+00
-4.47213595e-01]]
[ 2.77516885e+00 2.12132034e+00 1.13949018e+00 1.69395499e-16]
which is not equal to SV^T (I'll spare you the matrix multiplication; in any case you can see that the singular values are different from the ones obtained above).
I tried setting the parameter whiten to False, and the parameter svd_solver to 'full', but neither changes the result.
Do you see a mistake somewhere, or do you have an explanation?
Best Answer
Annoyingly, there is no scikit-learn documentation for this attribute beyond the general description of the PCA method.
Here is a useful application of pca.components_ in a classic facial-recognition project (using data bundled with scikit-learn, so you don't have to download anything extra). Working through this concise notebook is the best way to get a feel for the definition and application of pca.components_.
From that project, and this answer over on Stack Overflow, we can learn that pca.components_ is the set of all eigenvectors (aka loadings) for your projection space (one eigenvector for each principal component). Once you have the eigenvectors from pca.components_, here's how to get the eigenvalues. For further info on the definitions and applications of eigenvectors vs. loadings (including the equation that links all three concepts), see here.
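A minimal sketch of that step, using the small matrix from the question (in scikit-learn, the eigenvalues of the covariance matrix are exposed as pca.explained_variance_, which equals the squared singular values divided by n_samples - 1):

```python
import numpy as np
from sklearn.decomposition import PCA

X = np.array([[1, 0, 0, 0, 2],
              [0, 0, 3, 0, 0],
              [0, 0, 0, 0, 0],
              [0, 2, 0, 0, 0]], dtype=float)

pca = PCA().fit(X)

# Eigenvectors (loadings), one row per principal component:
eigenvectors = pca.components_

# Eigenvalues of the covariance matrix; equivalently s_i**2 / (n - 1):
eigenvalues = pca.explained_variance_
print(np.allclose(eigenvalues,
                  pca.singular_values_**2 / (X.shape[0] - 1)))  # True
```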
For a second project/notebook applying pca.components_ to (the same) facial-recognition data, see here. It features a more traditional scree plot than the first project cited above.
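To back up the claim that pca.components_ holds the eigenvectors, here is a hedged numerical check on the question's matrix: PCA computes those eigenvectors from the covariance matrix of the *centered* data, which also explains why the question's uncentered SVD gave different numbers. Only the components with distinct non-zero eigenvalues are compared, since the others are not uniquely determined.

```python
import numpy as np
from sklearn.decomposition import PCA

X = np.array([[1, 0, 0, 0, 2],
              [0, 0, 3, 0, 0],
              [0, 0, 0, 0, 0],
              [0, 2, 0, 0, 0]], dtype=float)

pca = PCA().fit(X)

# Eigendecomposition of the feature covariance matrix (np.cov centers
# the data internally, just as PCA does before its SVD).
cov = np.cov(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]        # eigh returns ascending order
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Rows of pca.components_ match the eigenvectors up to sign:
k = 3  # compare only the non-degenerate components
print(np.allclose(np.abs(pca.components_[:k]), np.abs(eigvecs[:, :k].T)))
print(np.allclose(pca.explained_variance_[:k], eigvals[:k]))
```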