I've been reading some documentation about PCA and trying to use scikit-learn to implement it, but I struggle to understand what the attributes returned by sklearn.decomposition.PCA are.
From what I read here and from the name of this attribute, my first guess would be that the attribute .components_ is the matrix of principal components, meaning that if we have a data set X which can be decomposed using SVD as
X = USV^T
then I would expect the attribute .components_ to be equal to
XV = US.
To check this I took the first example from the Wikipedia page on singular value decomposition (here) and tried to reproduce it, to see whether I obtain what is expected. But I get something different. To be sure I hadn't made a mistake, I used scipy.linalg.svd to do the singular value decomposition of my matrix X, and I obtained the result described on Wikipedia:
import numpy as np
from scipy.linalg import svd

X = np.array([[1, 0, 0, 0, 2],
              [0, 0, 3, 0, 0],
              [0, 0, 0, 0, 0],
              [0, 2, 0, 0, 0]])
U, s, Vh = svd(X)
print('U = %s' % U)
print('Vh = %s' % Vh)
print('s = %s' % s)
output:
U = [[ 0. 1. 0. 0.]
[ 1. 0. 0. 0.]
[ 0. 0. 0. -1.]
[ 0. 0. 1. 0.]]
Vh = [[-0. 0. 1. 0. 0. ]
[ 0.4472136 0. 0. 0. 0.89442719]
[-0. 1. 0. 0. 0. ]
[ 0. 0. 0. 1. 0. ]
[-0.89442719 0. 0. 0. 0.4472136 ]]
s = [ 3. 2.23606798 2. 0. ]
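As a quick sanity check (my own addition, not part of the original post), the factorization above can be verified numerically: scipy.linalg.diagsvd rebuilds the rectangular diagonal matrix S from the singular values, and U S V^T should reproduce X exactly.

```python
import numpy as np
from scipy.linalg import svd, diagsvd

X = np.array([[1, 0, 0, 0, 2],
              [0, 0, 3, 0, 0],
              [0, 0, 0, 0, 0],
              [0, 2, 0, 0, 0]])
U, s, Vh = svd(X)

# Rebuild the 4x5 rectangular diagonal matrix S from the singular values
# and confirm that U @ S @ Vh reproduces X.
S = diagsvd(s, *X.shape)
print(np.allclose(X, U @ S @ Vh))  # True
```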
But with scikit-learn I obtain this:
from sklearn.decomposition import PCA

pca = PCA(svd_solver='auto', whiten=True)
pca.fit(X)
print(pca.components_)
print(pca.singular_values_)
and the output is
[[ -1.47295237e-01 -2.15005028e-01 9.19398392e-01 -0.00000000e+00
-2.94590475e-01]
[ 3.31294578e-01 -6.62589156e-01 1.10431526e-01 0.00000000e+00
6.62589156e-01]
[ -2.61816759e-01 -7.17459719e-01 -3.77506920e-01 0.00000000e+00
-5.23633519e-01]
[ 8.94427191e-01 -2.92048264e-16 -7.93318415e-17 0.00000000e+00
-4.47213595e-01]]
[ 2.77516885e+00 2.12132034e+00 1.13949018e+00 1.69395499e-16]
which is not equal to SV^T (I'll spare you the matrix multiplication; in any case you can see that the singular values are different from the ones obtained above).
I tried setting the parameter whiten to False, and the parameter svd_solver to 'full', but neither changes the result.
Do you see a mistake somewhere, or do you have an explanation?
Best Answer
Annoyingly, there is no scikit-learn documentation for this attribute beyond the general description of the PCA method.
Here is a useful application of pca.components_ in a classic facial-recognition project (using data bundled with scikit-learn, so you don't have to download anything extra). Working through this concise notebook is the best way to get a feel for the definition and application of pca.components_.
From that project, and this answer over on Stack Overflow, we can learn that pca.components_ is the set of all eigenvectors (aka loadings) for your projection space (one eigenvector for each principal component). Once you have the eigenvectors from pca.components_, here's how to get the eigenvalues. For further info on the definitions and applications of eigenvectors vs. loadings (including the equation that links all three concepts), see here.
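A minimal sketch of that step, using the small matrix from the question (in scikit-learn, the eigenvalues of the covariance matrix are exposed as pca.explained_variance_, which equals the squared singular values divided by n_samples - 1):

```python
import numpy as np
from sklearn.decomposition import PCA

X = np.array([[1, 0, 0, 0, 2],
              [0, 0, 3, 0, 0],
              [0, 0, 0, 0, 0],
              [0, 2, 0, 0, 0]], dtype=float)

pca = PCA().fit(X)

# Eigenvectors (loadings), one row per principal component:
eigenvectors = pca.components_

# Eigenvalues of the covariance matrix; equivalently s_i**2 / (n - 1):
eigenvalues = pca.explained_variance_
print(np.allclose(eigenvalues,
                  pca.singular_values_**2 / (X.shape[0] - 1)))  # True
```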
For a second project/notebook applying pca.components_ to (the same) facial-recognition data, see here. It features a more traditional scree plot than the first project cited above.
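To back up the claim that pca.components_ holds the eigenvectors, here is a hedged numerical check on the question's matrix: PCA computes those eigenvectors from the covariance matrix of the *centered* data, which also explains why the question's uncentered SVD gave different numbers. Only the components with distinct non-zero eigenvalues are compared, since the others are not uniquely determined.

```python
import numpy as np
from sklearn.decomposition import PCA

X = np.array([[1, 0, 0, 0, 2],
              [0, 0, 3, 0, 0],
              [0, 0, 0, 0, 0],
              [0, 2, 0, 0, 0]], dtype=float)

pca = PCA().fit(X)

# Eigendecomposition of the feature covariance matrix (np.cov centers
# the data internally, just as PCA does before its SVD).
cov = np.cov(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]        # eigh returns ascending order
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Rows of pca.components_ match the eigenvectors up to sign:
k = 3  # compare only the non-degenerate components
print(np.allclose(np.abs(pca.components_[:k]), np.abs(eigvecs[:, :k].T)))
print(np.allclose(pca.explained_variance_[:k], eigvals[:k]))
```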