Solved – PCA principal components in sklearn not matching eigenvectors of covariance calculated by numpy

numpy, pca, python, scikit-learn

I was trying to replicate the results of sklearn's PCA API using numpy, following the post "PCA in numpy and sklearn produces different results".
I noticed that:

  • the eigenvalues are the same as the PCA object's explained_variance_ attribute, including their order
  • the eigenvectors are not the same. Here is my code:
import numpy as np
from sklearn.decomposition import PCA
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
X = datasets.load_iris()['data']
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=4)
pca.fit(X_scaled)

print('Explained Variance = ', pca.explained_variance_)
print('Principal Components = ', pca.components_)

This gives me:

Explained Variance =  [2.93808505 0.9201649  0.14774182 0.02085386]
Principal Components =  [[ 0.52106591 -0.26934744  0.5804131   0.56485654]
 [ 0.37741762  0.92329566  0.02449161  0.06694199]
 [-0.71956635  0.24438178  0.14212637  0.63427274]
 [-0.26128628  0.12350962  0.80144925 -0.52359713]]

Using Numpy:

cov = np.cov(X_scaled.T)               # np.cov expects variables as rows, hence the transpose
eig_val, eig_vec = np.linalg.eig(cov)  # eigendecomposition of the covariance matrix
print('Eigenvalues = ', eig_val)
print('Eigenvectors = ', eig_vec)

This gives me:

Eigenvalues =  [2.93808505 0.9201649  0.14774182 0.02085386]
Eigenvectors =  [[ 0.52106591 -0.37741762 -0.71956635  0.26128628]
 [-0.26934744 -0.92329566  0.24438178 -0.12350962]
 [ 0.5804131  -0.02449161  0.14212637 -0.80144925]
 [ 0.56485654 -0.06694199  0.63427274  0.52359713]]

Notice that the eigenvalues are exactly the same as pca.explained_variance_, i.e. unlike what the post "PCA in numpy and sklearn produces different results" suggests, we do get the eigenvalues in decreasing order from numpy (at least in this example); the eigenvectors, however, are not the same as pca.components_. Why is this, and how do I replicate the exact result of sklearn's PCA API manually?
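
For reference, np.linalg.eig makes no guarantee about the ordering of the eigenvalues, so the decreasing order above should not be relied on in general. A small sketch, continuing the numpy snippet above, that sorts the eigenpairs explicitly by decreasing eigenvalue:

# Sort eigenpairs by decreasing eigenvalue; numpy returns the eigenvectors as
# COLUMNS of eig_vec, so reorder the columns to keep values and vectors aligned.
order = np.argsort(eig_val)[::-1]
eig_val = eig_val[order]
eig_vec = eig_vec[:, order]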

Best Answer

While this is a pure Python question, and as such not a great fit for Cross Validated, let me help you anyway. Both procedures find the correct eigenvectors; the difference lies in their representation. PCA() lists the eigenvectors row-wise (each row of components_ is one eigenvector), whereas np.linalg.eig() lists them column-wise (each column of eig_vec is one eigenvector). Remember also that eigenvectors are only unique up to a sign. Indeed, a simple check yields:

print(abs(eig_vec.T.round(10)) == abs(pca.components_.round(10)))
[[ True  True  True  True]
 [ True  True  True  True]
 [ True  True  True  True]
 [ True  True  True  True]]
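
Finally, to replicate pca.components_ exactly, signs included: sklearn computes PCA via an SVD of the centered data and then resolves the sign ambiguity deterministically with its svd_flip helper. A minimal sketch, assuming the u-based flip convention (flip each component so that the largest-magnitude entry of the corresponding column of U is positive; the exact convention may differ across sklearn versions):

import numpy as np
from sklearn import datasets
from sklearn.preprocessing import StandardScaler

X = datasets.load_iris()['data']
X_scaled = StandardScaler().fit_transform(X)
n_samples = X_scaled.shape[0]

# PCA via SVD of the (already centered and scaled) data matrix.
U, S, Vt = np.linalg.svd(X_scaled, full_matrices=False)

# Sign convention as in sklearn's svd_flip (u-based): flip each component so
# that the entry with the largest absolute value in each column of U is positive.
max_abs_cols = np.argmax(np.abs(U), axis=0)
signs = np.sign(U[max_abs_cols, range(U.shape[1])])
U *= signs
Vt *= signs[:, np.newaxis]

print('Principal Components = ', Vt)                    # rows, like pca.components_
print('Explained Variance = ', S**2 / (n_samples - 1))  # like pca.explained_variance_

With this convention the rows of Vt should match pca.components_ sign for sign here, and the squared singular values divided by n_samples - 1 reproduce explained_variance_.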