Solved – PCA principal components in sklearn not matching eigenvectors of covariance calculated by numpy

numpy, pca, python, scikit-learn

I was trying to replicate the results of sklearn's PCA API using numpy, following the post "PCA in numpy and sklearn produces different results".
I noticed that:

  • the eigenvalues are the same as the PCA object's explained_variance_ attribute, including their order
  • the eigenvectors are not the same. Here is my code:
import numpy as np
from sklearn.decomposition import PCA
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
X = datasets.load_iris()['data']
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=4)
pca.fit(X_scaled)

print('Explained Variance = ', pca.explained_variance_)
print('Principal Components = ', pca.components_)

This gives me:

Explained Variance =  [2.93808505 0.9201649  0.14774182 0.02085386]
Principal Components =  [[ 0.52106591 -0.26934744  0.5804131   0.56485654]
 [ 0.37741762  0.92329566  0.02449161  0.06694199]
 [-0.71956635  0.24438178  0.14212637  0.63427274]
 [-0.26128628  0.12350962  0.80144925 -0.52359713]]

Using Numpy:

cov = np.cov(X_scaled.T)               # np.cov expects variables as rows, hence the transpose
eig_val, eig_vec = np.linalg.eig(cov)  # eigendecomposition of the covariance matrix
print('Eigenvalues = ', eig_val)
print('Eigenvectors = ', eig_vec)

This gives me:

Eigenvalues =  [2.93808505 0.9201649  0.14774182 0.02085386]
Eigenvectors =  [[ 0.52106591 -0.37741762 -0.71956635  0.26128628]
 [-0.26934744 -0.92329566  0.24438178 -0.12350962]
 [ 0.5804131  -0.02449161  0.14212637 -0.80144925]
 [ 0.56485654 -0.06694199  0.63427274  0.52359713]]

Notice that the eigenvalues are exactly the same as pca.explained_variance_, i.e. unlike what the post "PCA in numpy and sklearn produces different results" suggests, we do get the eigenvalues in decreasing order from numpy (at least in this example); the eigenvectors, however, are not the same as pca.components_. Why is this, and how do I replicate the exact result of sklearn's PCA API manually?
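
For reference, np.linalg.eig makes no guarantee about the ordering of the eigenvalues, so the decreasing order above should not be relied on in general. A small sketch, continuing the numpy snippet above, that sorts the eigenpairs explicitly by decreasing eigenvalue:

# Sort eigenpairs by decreasing eigenvalue; numpy returns the eigenvectors as
# COLUMNS of eig_vec, so reorder the columns to keep values and vectors aligned.
order = np.argsort(eig_val)[::-1]
eig_val = eig_val[order]
eig_vec = eig_vec[:, order]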

Best Answer

While this is a pure Python question, and as such not a great fit for Cross Validated, let me help you anyway. Both procedures find the correct eigenvectors; the difference lies in their representation. PCA() lists the eigenvectors row-wise (each row of components_ is one eigenvector), whereas np.linalg.eig() lists them column-wise (each column of eig_vec is one eigenvector). Remember also that eigenvectors are only unique up to a sign. Indeed, a simple check yields:

print(abs(eig_vec.T.round(10)) == abs(pca.components_.round(10)))
[[ True  True  True  True]
 [ True  True  True  True]
 [ True  True  True  True]
 [ True  True  True  True]]
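
Finally, to replicate pca.components_ exactly, signs included: sklearn computes PCA via an SVD of the centered data and then resolves the sign ambiguity deterministically with its svd_flip helper. A minimal sketch, assuming the u-based flip convention (flip each component so that the largest-magnitude entry of the corresponding column of U is positive; the exact convention may differ across sklearn versions):

import numpy as np
from sklearn import datasets
from sklearn.preprocessing import StandardScaler

X = datasets.load_iris()['data']
X_scaled = StandardScaler().fit_transform(X)
n_samples = X_scaled.shape[0]

# PCA via SVD of the (already centered and scaled) data matrix.
U, S, Vt = np.linalg.svd(X_scaled, full_matrices=False)

# Sign convention as in sklearn's svd_flip (u-based): flip each component so
# that the entry with the largest absolute value in each column of U is positive.
max_abs_cols = np.argmax(np.abs(U), axis=0)
signs = np.sign(U[max_abs_cols, range(U.shape[1])])
U *= signs
Vt *= signs[:, np.newaxis]

print('Principal Components = ', Vt)                    # rows, like pca.components_
print('Explained Variance = ', S**2 / (n_samples - 1))  # like pca.explained_variance_

With this convention the rows of Vt should match pca.components_ sign for sign here, and the squared singular values divided by n_samples - 1 reproduce explained_variance_.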