The difference between PCA + Linear Regression versus PCR

linear, pca, python, regression, scikit-learn

I am trying to do linear regression to predict the time a user spends listening to music using the following dataset:

(image of the dataset)

My end goal is to know which characteristics (columns) lead to more listening time. (Sum is the total listening time.)

I was thinking of using PCA before linear regression because there are so many columns. However, when I googled this, I came across PCR, and I'm not sure what the difference is. Does PCA even improve the results of linear regression? If so, is PCA + LR better, or is PCR better?

I am trying to do this using scikit-learn's linear regression method. It looks like it only takes an X and a Y data set for training. Does this mean I need to reduce the dimensionality of my input data matrix to 2 dimensions using PCA before I can use this method?

EDIT: I ended up using pandas and statsmodels to take in multiple inputs when doing linear regression, in case that helps anyone else with the same question. However, my questions above still stand.

Best Answer

There is no difference.

Principal component regression (PCR) is linear regression after principal component analysis (PCA) is done on the set of predictors and (usually) only a small subset of principal components is retained.
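
To make the equivalence concrete, here is a minimal sketch in scikit-learn. It assumes synthetic data standing in for the listening dataset (the shapes, the random coefficients, and the choice of 3 components are all made up for illustration) and adds a standardization step, which is a common but optional choice before PCA. It fits "PCA, then linear regression" by hand and as a single pipeline and checks that the predictions agree. Note that X here is a 2-D array with one column per predictor, so it can have as many columns as needed.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real data (assumption: 200 users, 10 predictors).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = X @ rng.normal(size=10) + rng.normal(scale=0.1, size=200)

n_components = 3  # arbitrary value chosen for illustration

# "PCA + linear regression", done by hand in separate steps.
scaler = StandardScaler().fit(X)
X_std = scaler.transform(X)
pca = PCA(n_components=n_components).fit(X_std)
Z = pca.transform(X_std)              # scores on the retained components
manual = LinearRegression().fit(Z, y)
pred_manual = manual.predict(Z)

# "PCR": the same steps chained into one estimator.
pcr = make_pipeline(
    StandardScaler(),
    PCA(n_components=n_components),
    LinearRegression(),
)
pcr.fit(X, y)
pred_pcr = pcr.predict(X)

# Both routes give the same predictions.
same = np.allclose(pred_manual, pred_pcr)
print(same)

# Coefficients expressed per standardized original column,
# obtained by mapping the component coefficients back through the loadings.
beta = pca.components_.T @ manual.coef_
print(beta.shape)
```

The last two lines may help with the goal of seeing which columns matter: `beta` has one entry per original (standardized) column. In practice the number of components would usually be chosen by cross-validation rather than fixed in advance.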


For further reading on this topic, see the thread "How can top principal components retain the predictive power on a dependent variable (or even lead to better predictions)?" and the links therein.