Solved – How to use principal components as predictors in linear regression

Tags: pca, r, regression

I followed the instructions from this open Stanford lecture on PCR.

I have a couple of questions, but first I'll post the code with my comments.

#Principal Components Regression
library('pls')

#split data in half (seed set so the split is reproducible)
set.seed(1)
train <- sample(nrow(Boston_Ready), nrow(Boston_Ready)/2)

#calculates pcr and uses cross-validation to determine the number of components
pcr.fit <- pcr(medv ~., data=Boston_Ready, subset=train, scale=TRUE, validation="CV")
summary(pcr.fit)

#plots MSE by number of components
validationplot(pcr.fit, val.type='MSEP')

#choose number of components from the validation plot
pcr.pred <- predict(pcr.fit, Boston_Ready[-train,], ncomp=5)

#test RMSE for the model (sqrt of the mean squared prediction error)
sqrt(mean((pcr.pred - Boston_Ready[-train,]$medv)^2))

#refit on the whole dataset with the chosen number of components
pcr.fit_full <- pcr(medv ~., data=Boston_Ready, ncomp = 5, scale=TRUE)

My goal is to compare whether PCR creates a better model than linear regression.

I have a couple of questions. I hope that's alright.

1- Splitting my data into a train and test set seems costly, since the dataset is small: about 500 rows and 14 variables.

Could I run PCR with all of the data and then approximate test error with cross validation?

2- Are there other visuals that are helpful for understanding PCR? I looked at the loadings to get an understanding of the components.

3- Which is my final model? Is it pcr.fit_full?

I ask because some tutorials included a part where they take the principal components and run a linear regression model where the components are the predictors.
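For reference, the manual version those tutorials describe can be sketched like this. This is only a sketch: it uses MASS::Boston as a stand-in for Boston_Ready (I don't know your exact data frame), and keeps 5 components to match the code above.

```r
library(MASS)  #for the Boston data, used here as a stand-in for Boston_Ready

#PCA on the predictors only, centered and scaled as in pcr(..., scale=TRUE)
pr <- prcomp(Boston[, names(Boston) != "medv"], center = TRUE, scale. = TRUE)

#keep the first 5 component scores as predictors in an ordinary lm
scores <- as.data.frame(pr$x[, 1:5])
scores$medv <- Boston$medv
lm.fit <- lm(medv ~ ., data = scores)
summary(lm.fit)
```

The two routes give the same fitted values, because PCR is exactly least squares run on the leading principal-component scores.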

Best Answer

  1. That dataset is plenty large. With k-fold cross-validation you split the data into k partitions, where 2 <= k <= n (your sample size). At each iteration one partition is held out for testing while the remaining data is used for training; this is repeated until every partition has served as the test set, and the prediction error (in your case MSE, though RMSE is easier to interpret because it is in the same units as the response) is averaged across the folds. The caret package in R automates the training, testing, and error-calculation procedure, and lets you adjust tuning parameters (here, the number of components) to see which setting performs best. You can fit a linear regression with the same package, so both models are compared in the same way. Here is a tutorial for caret to start you off.
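A minimal sketch of that workflow, again using MASS::Boston in place of Boston_Ready (caret's "pcr" method wraps the pls package; treat the grid sizes as assumptions you should tune):

```r
library(caret)
library(MASS)  #Boston data as a stand-in for Boston_Ready

set.seed(1)
ctrl <- trainControl(method = "cv", number = 10)  #10-fold cross-validation

#tune the number of components; caret reports cross-validated RMSE per ncomp
pcr.cv <- train(medv ~ ., data = Boston, method = "pcr",
                preProcess = c("center", "scale"),
                tuneLength = 13, trControl = ctrl)
print(pcr.cv)

#fit plain linear regression the same way for an apples-to-apples comparison
lm.cv <- train(medv ~ ., data = Boston, method = "lm",
               preProcess = c("center", "scale"), trControl = ctrl)
print(lm.cv)
```

Because both models are evaluated with the same resampling scheme, you can compare their cross-validated RMSE directly.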

  2. You want to check for multicollinearity between the predictors, since strongly correlated predictors are exactly the situation PCR is designed to handle; an easy way to do this is a pairs plot of your variables. You can also plot the PCA itself (which PCR is built on) and inspect the scree plot and biplot for problematic patterns such as arches or clusters in the scores.
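For example (sketched on MASS::Boston again; swap in your own data frame and variable names):

```r
library(MASS)

#pairs plot of a few predictors to eyeball collinearity
pairs(Boston[, c("rm", "lstat", "nox", "dis", "medv")])

#PCA diagnostics on the predictors: variance explained and a biplot
pr <- prcomp(Boston[, names(Boston) != "medv"], scale. = TRUE)
plot(pr)    #scree plot of component variances
biplot(pr)  #scores and loadings in one picture
```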

  3. If you follow the approach above, this question answers itself: caret::train returns its final model (refit with the chosen number of components) in the $finalModel element of the result.

Let me know if anything is unclear! I'm happy to help if you have more questions.