Zero-centering the testing set after PCA on the training set

centering, machine learning, pca

I have a training set on which I run principal component analysis (PCA) and save the loadings/eigenvectors/coefficient matrix. I want to use the eigenvectors to transform my testing data into the same principal component space. I know this is just a matrix multiplication between the test data and the eigenvector matrix, and there are other posts that explain this.

However, I calculate the PCs from the training data after centering the data so the mean is zero (I call this zero-centering). My question is this: how do I handle zero-centering the testing data before the matrix multiplication? Do I just subtract the means of the training data, as I did to zero-center the training data? This seems correct, since the only other option I can think of is to use the mean of the testing data, and when the testing data consist of a single instance, that would reduce it to a zero vector. But maybe there are other options I am overlooking?

Can someone back me up that I just subtract the means of the training data from the test data and then multiply by the eigenvector matrix? Or refute me? Ideally provide a reference?

Best Answer

Do I just subtract the means of the training data as I did to zero-center the training data?

Yes.

You should apply to the test data exactly the same transformation that you applied to the training data. This includes centering, which should be done using the mean values obtained on the training set. If you standardized the training set, then you should also divide your test set by the standard deviations obtained on the training set. After that, you can project your test set onto the PCs of the training set.
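As a rough illustration, here is a minimal NumPy sketch of that procedure. The function names `fit_pca` and `transform`, the `standardize` flag, and the random data are all assumptions made up for this example; they are not from the question. PCA is computed here via SVD of the centered training matrix, which is one common way to obtain the eigenvectors.

```python
import numpy as np

def fit_pca(X_train, n_components, standardize=False):
    """Fit PCA on the training set; return what is needed to transform new data."""
    mean = X_train.mean(axis=0)
    # Scale comes from the training set only (ones if not standardizing).
    # Assumes no feature is constant, otherwise the std would be zero.
    if standardize:
        scale = X_train.std(axis=0, ddof=1)
    else:
        scale = np.ones(X_train.shape[1])
    Z = (X_train - mean) / scale
    # Rows of Vt are the principal axes, ordered by decreasing singular value.
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)
    components = Vt[:n_components].T  # shape (n_features, n_components)
    return mean, scale, components

def transform(X, mean, scale, components):
    """Center (and scale) with the TRAINING statistics, then project onto the PCs."""
    return ((np.atleast_2d(X) - mean) / scale) @ components

# Hypothetical data, for illustration only.
rng = np.random.default_rng(0)
X_train = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
X_test = rng.normal(loc=5.0, scale=2.0, size=(10, 3))

mean, scale, components = fit_pca(X_train, n_components=2)
train_scores = transform(X_train, mean, scale, components)
test_scores = transform(X_test, mean, scale, components)

# A single test instance is handled the same way: subtract the training mean.
single_score = transform(X_test[0], mean, scale, components)

# Centering a single instance with its own mean would collapse it to zero.
self_centered = transform(X_test[0], X_test[0], scale, components)
```

In this sketch, `single_score` matches the first row of `test_scores`, while `self_centered` is a zero vector, which is why the test-set mean is not a usable substitute. Passing `standardize=True` to `fit_pca` would additionally divide by the training-set standard deviations.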
