Solved – How to use a model generated from the principal components of a training data set to make predictions on a test data set

pca, r, regression

I've split my data set into a training and test set. I've performed a principal component analysis on the training set and have used the first 3 principal components to generate a logistic regression model for my response.

I now want to use this model to make predictions for my test data set and check how accurate those predictions are.

I've been trying to use the predict function, but the model uses the principal components of the training set as its predictors, whereas my test set only has the original predictors, so the two are not compatible.

How do I go about 'projecting' my test data onto the principal components I've already generated so I can use my model to make predictions?

Ideally I'd like to do this without using any external packages (it's for university). I am working in R.

Best Answer

You have to apply the same transformations you applied to your training set to your test set.

In this case, this means you have to use the same parameters for standardization (the training-set means and standard deviations, assuming you did PCA on the correlation matrix) and the same rotation (the PCA rotation matrix).
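Written out, the two steps look roughly like this. The notation is introduced here for illustration only: $\mu_{\text{train}}$ and $D_{\text{train}} = \operatorname{diag}(\sigma_{\text{train}})$ are the column means and standard deviations estimated on the training set, and $W_k$ holds the first $k$ columns of the training rotation matrix.

```latex
z_{\text{test}} = D_{\text{train}}^{-1}\,\bigl(x_{\text{test}} - \mu_{\text{train}}\bigr),
\qquad
t_{\text{test}} = W_k^{\top} z_{\text{test}}
```

Nothing on the right-hand side is re-estimated from the test data; only $x_{\text{test}}$ comes from the test set.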


If you're using prcomp (and assuming PCA on the correlation matrix):

#Test index
s = sample(150, 30)

#Train data
x = iris[-s,-5]

#Principal components of train data
pr = prcomp(x, center = TRUE, scale. = TRUE)

#Test data
y = iris[s,-5]

#Standardize and rotate test data into the same space as train data
#You can also keep only the first K columns if you want to retain X% of the variance
y = predict(pr, y)
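As a rough sketch of the "first K columns" idea mentioned in the comment, K can be picked from the cumulative share of variance of the training components. The 95% threshold, the seed, and the names var_share, threshold, K and test_scores are assumptions made for illustration, not part of the original answer.

```r
# Sketch only: pick K as the smallest number of components whose
# cumulative variance share reaches a chosen threshold.
set.seed(1)                        # arbitrary seed so the split is reproducible
s  = sample(150, 30)               # test index
pr = prcomp(iris[-s, -5], center = TRUE, scale. = TRUE)

# Proportion of total variance carried by each training component
var_share = pr$sdev^2 / sum(pr$sdev^2)

threshold = 0.95                   # hypothetical target, not from the post
K = which(cumsum(var_share) >= threshold)[1]

# Project the test rows with the training parameters, keep the first K scores
test_scores = predict(pr, iris[s, -5])[, 1:K, drop = FALSE]
```

drop = FALSE keeps test_scores a matrix even when K turns out to be 1.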

This is what predict implicitly does:

#Reuse the same test index s as above
#(calling sample() again would draw different rows)

#Train data
x2 = iris[-s,-5]

#Standardize train data
x2 = scale(x2)

#Principal components of train data
pr2 = prcomp(x2, center = FALSE, scale. = FALSE)

#Test data
y2 = iris[s,-5]

#Standardize test data with the same parameters used on train data
y2 = scale(y2, center = attr(x2,"scaled:center"), scale = attr(x2,"scaled:scale"))

#Rotate standardized test data to the same space as train data
#You can also keep only the first K columns if you want to retain X% of the variance
y2 = y2 %*% pr2$rotation

The transformed data y and y2 should be identical (provided you use the same indices s in both cases!)

all.equal(y, y2)
#[1] TRUE
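To tie this back to the question, here is a minimal base-R sketch of the whole workflow: fit a logistic regression on the first 3 training components, project the test set, then predict. The binary response (versicolor vs. the rest), the 0.5 cutoff, the seed, and the variable names are made up for illustration; substitute your own response and data.

```r
set.seed(1)                          # arbitrary seed for a reproducible split
s = sample(150, 30)                  # test index

train_x = iris[-s, -5]
test_x  = iris[s, -5]

# Hypothetical binary response: 1 if the flower is versicolor, else 0
train_y = as.integer(iris$Species[-s] == "versicolor")
test_y  = as.integer(iris$Species[s] == "versicolor")

# PCA on the training predictors only
pr = prcomp(train_x, center = TRUE, scale. = TRUE)

# Logistic regression on the first 3 training components
train_df = data.frame(resp = train_y, pr$x[, 1:3])
fit = glm(resp ~ PC1 + PC2 + PC3, data = train_df, family = binomial)

# Project the test predictors with the training parameters,
# then predict with the fitted model
test_df = data.frame(predict(pr, test_x)[, 1:3])
prob = predict(fit, newdata = test_df, type = "response")

# Classify at a 0.5 cutoff (an assumption) and compare with the truth
pred = as.integer(prob > 0.5)
accuracy = mean(pred == test_y)
```

The key point is that the columns of test_df carry the same names (PC1, PC2, PC3) as the predictors the model was fitted on, which is what lets predict match them up.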