Solved – What does the PCA().transform() method do

pca, scikit-learn, self-study

I've been taught to think of PCA as a change-of-basis technique with a cleverly chosen basis. Say my initial data is an $m\times n$ matrix $X$, where $m$ is the number of features and $n$ is the number of measurements. I've computed the covariance matrix $S$ and obtained the $m\times m$ eigenbasis matrix $P$ (eigenvectors of $S$), which represents my new set of coordinates. I now want to transform my data to these new coordinates by $Y=PX$. Alternatively, I use the sklearn.decomposition.PCA class to perform the same procedure, but the transformed data differs from what I get manually.

import numpy as np
from sklearn.decomposition import PCA

# generate some random data
m = 10
n = 100
X = np.random.randn(m, n)
X = X - X.mean(axis=1).reshape((m, 1))  # center each feature (subtract row means)
S = X @ X.T / (n-1)  # sample covariance matrix of the features

# manual computation
P = np.linalg.eig(S)[1] # transformation matrix P
Y = P @ X # transformed data

# using sklearn
pca = PCA()
pca.fit(X.T)
Y_sklearn = pca.transform(X.T).T

The first transformed vector, Y[:, 0], is

array([-0.09133876, -1.53859883,  0.86409512, -2.52404208,  0.05910835,
        0.83063718,  0.52757518,  0.7412817 , -0.42611878, -0.71241571])

while Y_sklearn[:, 0] is

array([ 1.44259169,  1.05948004,  0.87768441,  0.60333571, -1.560406  ,
        0.11799914, -1.91440021, -0.96841104,  0.41010045, -0.38189462])

I am probably making a mistake somewhere, but I can't find where exactly. Thanks in advance.

Best Answer

Your P matrix contains the eigenvectors as columns, so to project your data you need P.T @ X rather than P @ X (the projection is the dot product of each eigenvector with each measurement). The results will then be more similar, but still not the same, because np.linalg.eig doesn't return the eigenvalues in any particular order, while sklearn sorts the components by decreasing explained variance. You can achieve the same ordering as follows:

eigvals, eigvecs = np.linalg.eig(S)
P = eigvecs[:, np.argsort(-eigvals)]  # columns sorted by decreasing eigenvalue
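With $P$ ordered this way, the projection onto the principal axes is

Y = P.T @ X  # rows of Y are the principal components, largest variance first

and the components come out in the same order as sklearn's.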

Finally, since an eigenvector is only defined up to sign ($v$ and $-v$ span the same axis), you may still see sign flips between some of your components and sklearn's. Apart from that, they will match.
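To see the whole fix end to end, here is a minimal sketch using the question's setup; the sign-alignment step (flipping each manual component according to the sign of its dot product with sklearn's) is my own illustration, and it assumes the eigenvalues are distinct, which holds for generic random data:

import numpy as np
from sklearn.decomposition import PCA

m, n = 10, 100
X = np.random.randn(m, n)
X = X - X.mean(axis=1).reshape((m, 1))  # center each feature
S = X @ X.T / (n - 1)                   # sample covariance matrix

eigvals, eigvecs = np.linalg.eig(S)     # np.linalg.eigh would also work, since S is symmetric
P = eigvecs[:, np.argsort(-eigvals)]    # columns sorted by decreasing eigenvalue
Y = P.T @ X                             # manual projection

pca = PCA()
pca.fit(X.T)                            # sklearn expects samples as rows
Y_sklearn = pca.transform(X.T).T

# flip each manual component so its sign matches sklearn's
signs = np.sign(np.sum(Y * Y_sklearn, axis=1))
print(np.allclose(Y * signs[:, None], Y_sklearn))  # True, up to numerical error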