Solved – pca.components_ in scikit-learn

Tags: pca, scikit-learn, svd

I've been reading some documentation about PCA and trying to use scikit-learn to implement it, but I struggle to understand the attributes returned by sklearn.decomposition.PCA.
From what I read here and from the name of this attribute, my first guess would be that .components_ is the matrix of principal components, meaning that if we have a data set X which can be decomposed using SVD as

X = USV^T

then I would expect the attribute .components_ to be equal to

XV = US.

To check this, I took the first example from the Wikipedia page on singular value decomposition (here) and tried to implement it to see whether I would obtain what is expected. But I get something different. To be sure I didn't make a mistake, I used scipy.linalg.svd to perform the singular value decomposition of my matrix X, and I obtained the result described on Wikipedia:

import numpy as np
from scipy.linalg import svd

X = np.array([[1, 0, 0, 0, 2],
              [0, 0, 3, 0, 0],
              [0, 0, 0, 0, 0],
              [0, 2, 0, 0, 0]])
U, s, Vh = svd(X)
print('U = %s' % U)
print('Vh = %s' % Vh)
print('s = %s' % s)

output:

U = [[ 0.  1.  0.  0.]
     [ 1.  0.  0.  0.]
     [ 0.  0.  0. -1.]
     [ 0.  0.  1.  0.]]
Vh = [[-0.          0.          1.          0.          0.        ]
      [ 0.4472136   0.          0.          0.          0.89442719]
      [-0.          1.          0.          0.          0.        ]
      [ 0.          0.          0.          1.          0.        ]
      [-0.89442719  0.          0.          0.          0.4472136 ]]
s = [ 3.          2.23606798  2.          0.        ]
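
As a sanity check (continuing from the snippet above), these factors do reconstruct X once s is expanded into the full-size singular-value matrix; a minimal verification:

S = np.zeros(X.shape)                # embed s into a 4x5 matrix
S[:len(s), :len(s)] = np.diag(s)
print(np.allclose(X, U @ S @ Vh))    # True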

But with scikit-learn I obtain this:

from sklearn.decomposition import PCA

pca = PCA(svd_solver='auto', whiten=True)
pca.fit(X)
print(pca.components_)
print(pca.singular_values_)

and the output is

[[ -1.47295237e-01  -2.15005028e-01   9.19398392e-01  -0.00000000e+00
   -2.94590475e-01]
 [  3.31294578e-01  -6.62589156e-01   1.10431526e-01   0.00000000e+00
    6.62589156e-01]
 [ -2.61816759e-01  -7.17459719e-01  -3.77506920e-01   0.00000000e+00
   -5.23633519e-01]
 [  8.94427191e-01  -2.92048264e-16  -7.93318415e-17   0.00000000e+00
   -4.47213595e-01]]
[  2.77516885e+00   2.12132034e+00   1.13949018e+00   1.69395499e-16]

which is not equal to SV^T (I'll spare you the matrix multiplication, since you can already see that the singular values differ from the ones obtained above).
I tried setting the parameter whiten to False and the parameter svd_solver to 'full', but neither changes the result.

Do you see a mistake somewhere, or do you have an explanation?


Best Answer

Annoyingly there is no SKLearn documentation for this attribute, beyond the general description of the PCA method.

Here is a useful application of pca.components_ in a classic facial-recognition project (using data bundled with SKL, so you don't have to download anything extra). Working through this concise notebook is the best way to get a feel for the definition & application of pca.components_.

From that project, and this answer over on StackOverflow, we can learn that pca.components_ is the set of all eigenvectors (aka loadings) for your projection space (one eigenvector for each principal component). Once you have the eigenvectors using pca.components_, here's how to get eigenvalues.
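
For concreteness, a minimal sketch (reusing the X from the question): pca.explained_variance_ holds the eigenvalues of the covariance matrix of the centered data, which relate to the singular values by lambda_i = s_i^2 / (n_samples - 1):

import numpy as np
from sklearn.decomposition import PCA

X = np.array([[1, 0, 0, 0, 2],
              [0, 0, 3, 0, 0],
              [0, 0, 0, 0, 0],
              [0, 2, 0, 0, 0]])
pca = PCA().fit(X)

# Eigenvalues of the covariance matrix of the centered data:
print(pca.explained_variance_)
# Equivalently, squared singular values over (n_samples - 1):
print(pca.singular_values_ ** 2 / (X.shape[0] - 1))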

For further info on the definitions & applications of eigenvectors vs loadings (including the equation that links all three concepts), see here.
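
In one common convention (terminology varies between fields, and sklearn itself exposes no loadings attribute), the loadings are the eigenvectors scaled by the square roots of the eigenvalues. Continuing the sketch above:

# Loadings: eigenvectors scaled by sqrt(eigenvalues)
loadings = pca.components_.T * np.sqrt(pca.explained_variance_)
print(loadings)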

For a 2nd project/notebook applying pca.components_ to (the same) facial recognition data, see here. It features a more traditional scree plot than the first project cited above.
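
Incidentally, this also resolves the mismatch in the question: PCA centers the data (subtracts the column means) before taking the SVD, so pca.components_ is V^T of the centered matrix, not US. A quick check with the question's X:

import numpy as np
from scipy.linalg import svd
from sklearn.decomposition import PCA

X = np.array([[1, 0, 0, 0, 2],
              [0, 0, 3, 0, 0],
              [0, 0, 0, 0, 0],
              [0, 2, 0, 0, 0]])

Xc = X - X.mean(axis=0)                 # PCA centers the data first
U, s, Vh = svd(Xc, full_matrices=False)

pca = PCA().fit(X)
print(np.allclose(s, pca.singular_values_))           # True
# Each row of components_ matches the corresponding row of V^T up to sign:
for pc, v in zip(pca.components_, Vh):
    print(np.allclose(pc, v) or np.allclose(pc, -v))  # True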