Combining principal component regression and stepwise regression

pca, regression, stepwise regression

I want to use a combination of principal component analysis (PCA) and stepwise regression to develop a predictive model. I have 5 independent variables (which are correlated with one another to different extents, i.e. multicollinear) and 1 dependent variable. The steps I plan to follow are:

  1. Conduct principal component analysis on the independent variables and reconstruct the independent variables using only the largest 2-3 components.
  2. Conduct stepwise regression between the dependent variable and the reconstructed independent variables.

My questions are:

  1. Does the above procedure make sense?
  2. Does the reconstruction of independent variables in step 1 reduce the multicollinearity? If not, what can be done to remove/reduce multicollinearity?
  3. If the multicollinearity among the independent variables is somehow removed, does it make sense to use interaction terms in the stepwise regression?

Best Answer

Regression based on principal components analysis (PCA) of the independent variables is certainly a way to approach this problem; see this Cross Validated page for one extensive discussion of pros and cons, with links to further related topics. I don't see the point of the regression you propose after choosing the largest components. The "reconstructed" independent variables might suffer from being too highly dependent on the particular sample on which you based the model, and stepwise selection is generally not a good idea. Cross-validation would be a better way to choose the number of components to retain, finding the number of components that minimizes cross-validation error.
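As a rough illustration of choosing the number of components by cross-validation, here is a minimal NumPy sketch. The data are synthetic, and the helper names `pcr_fit_predict` and `cv_mse` are made up for this example; it is one plausible way to set this up, not a definitive implementation.

```python
import numpy as np

def pcr_fit_predict(X_train, y_train, X_test, n_components):
    """Fit principal component regression on a training fold, predict a test fold."""
    # Standardize with training-fold statistics only, so nothing leaks from the test fold.
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0, ddof=1)
    Z_train = (X_train - mean) / std
    Z_test = (X_test - mean) / std

    # Principal directions are the right singular vectors of the standardized predictors.
    _, _, Vt = np.linalg.svd(Z_train, full_matrices=False)
    V = Vt[:n_components].T

    # Least squares of centered y on the retained component scores.
    # The scores have mean zero, so the intercept is just the mean of y.
    scores = Z_train @ V
    y_mean = y_train.mean()
    gamma, *_ = np.linalg.lstsq(scores, y_train - y_mean, rcond=None)
    return y_mean + (Z_test @ V) @ gamma

def cv_mse(X, y, n_components, n_folds=5, seed=0):
    """Mean squared prediction error averaged over K folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    errors = []
    for test_idx in folds:
        train_idx = np.setdiff1d(np.arange(len(y)), test_idx)
        pred = pcr_fit_predict(X[train_idx], y[train_idx], X[test_idx], n_components)
        errors.append(np.mean((y[test_idx] - pred) ** 2))
    return float(np.mean(errors))

# Synthetic stand-in for the real data: 5 correlated predictors driven by 2 latent factors.
rng = np.random.default_rng(42)
n = 200
latent = rng.normal(size=(n, 2))
loadings = rng.normal(size=(2, 5))
X = latent @ loadings + 0.3 * rng.normal(size=(n, 5))
y = latent[:, 0] - 2.0 * latent[:, 1] + 0.5 * rng.normal(size=n)

cv_errors = {k: cv_mse(X, y, k) for k in range(1, 6)}
best_k = min(cv_errors, key=cv_errors.get)
print(cv_errors, "-> retain", best_k, "component(s)")
```

Note that the standardization and the PCA are redone inside each fold. Doing them once on the full data before splitting would let the test fold influence the components and make the cross-validation error look better than it is.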

In your situation, with only 5 predictors you might be just as well served by a standard linear model. Unless you have extremely high correlations among some of your variables, you are unlikely to have the numerical instability issues that can arise in extreme cases. (And if you do have two very highly correlated predictors, you should consider using your knowledge of the subject matter rather than an automated approach to choose one.) Paying attention to model diagnostics will help determine whether the linear model is reasonable.
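To judge whether the correlations are extreme enough to matter, one option (not mentioned above, so treat it as my suggestion) is to look at variance inflation factors and the condition number of the predictor correlation matrix. The sketch below uses made-up data in which two of the five predictors are near copies of each other; `variance_inflation_factors` is a hypothetical helper name.

```python
import numpy as np

def variance_inflation_factors(X):
    """VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing predictor j on the others."""
    # Equivalent shortcut: the VIFs are the diagonal of the inverse correlation matrix.
    corr = np.corrcoef(X, rowvar=False)
    return np.diag(np.linalg.inv(corr))

# Synthetic example data (an assumption, not the asker's data).
rng = np.random.default_rng(0)
n = 300
x1 = rng.normal(size=n)
x2 = x1 + 0.05 * rng.normal(size=n)   # nearly a copy of x1: extreme collinearity
x3 = 0.5 * x1 + rng.normal(size=n)    # only moderately correlated with x1
x4 = rng.normal(size=n)
x5 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3, x4, x5])

vif = variance_inflation_factors(X)
cond = np.linalg.cond(np.corrcoef(X, rowvar=False))
print("VIF per predictor:", np.round(vif, 2))
print("condition number of correlation matrix:", round(float(cond), 1))
```

A common rule of thumb flags VIFs above roughly 10, but that threshold is a convention rather than a hard cutoff. In this example only the near-duplicate pair would be flagged, which is the situation where choosing one of the two on subject-matter grounds makes sense.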

A standard regression model provides easier-to-interpret coefficients and might be easier to explain to others than PCA. For predictions from a linear model you should consider including all 5 independent variables (even those that aren't "statistically significant"), both because of the limitations of stepwise selection and because the relations of some predictors to the dependent variable will differ if other predictors are removed.

If you have very high collinearity in a standard linear regression, it should show up as large standard errors for the corresponding coefficients, and you might consider approaches noted here like ridge regression to get useful information from all your predictors without overfitting. Ridge regression can be considered a continuous version of the PCA-regression approach, where principal components are differentially weighted instead of being completely either in or out of the final model; see section 3.5 of The Elements of Statistical Learning.
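The "continuous version" remark can be checked numerically. With the standardized predictors written as Z = U D V', the ridge fitted values weight each principal component by d_j^2 / (d_j^2 + lambda), whereas PCA regression uses weights of exactly 1 (kept) or 0 (dropped). A small sketch on synthetic data, with the penalty `lam` and the number of retained components `k` chosen arbitrarily for illustration:

```python
import numpy as np

# Synthetic data: five mutually correlated predictors (an assumption for illustration).
rng = np.random.default_rng(1)
n, p = 100, 5
base = rng.normal(size=(n, 1))
X = base + 0.5 * rng.normal(size=(n, p))
y = X @ np.array([1.0, 0.5, 0.0, -0.5, 2.0]) + rng.normal(size=n)

# Standardize predictors and center the response.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
yc = y - y.mean()
lam = 10.0   # arbitrary ridge penalty
k = 2        # arbitrary number of components kept by PCA regression

# Ridge coefficients from the closed form (Z'Z + lam * I)^{-1} Z'y.
beta_ridge = np.linalg.solve(Z.T @ Z + lam * np.eye(p), Z.T @ yc)

# The same fit expressed through the principal components of Z.
U, d, Vt = np.linalg.svd(Z, full_matrices=False)
ridge_weights = d**2 / (d**2 + lam)             # smooth shrinkage, strongest on small d_j
pcr_weights = (np.arange(p) < k).astype(float)  # hard 1/0 selection

fit_ridge_svd = U @ (ridge_weights * (U.T @ yc))
fit_pcr = U @ (pcr_weights * (U.T @ yc))

print("ridge weights per component:", np.round(ridge_weights, 3))
print("PCR weights per component:  ", pcr_weights)
print("closed form matches SVD form:", np.allclose(Z @ beta_ridge, fit_ridge_svd))
```

The low-variance components, which are the ones collinearity makes unstable, are shrunk the most by ridge but never discarded outright.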

For your second and third questions:

The first page I linked above does a pretty good job of addressing your second question. Yes, choosing a limited number of principal components can help reduce the problems associated with collinearity, as the collinear variables will tend to enter the same principal components together. Two warnings: the predictors should be standardized so that differences in scale don't drive the construction of the principal components, and there's no assurance that the components that capture the greatest variation in the predictors will be those most closely related to the dependent variable.
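The first warning is easy to demonstrate. In the synthetic example below (invented data; `first_component` is a hypothetical helper), one predictor is unrelated to the others but is measured on a much larger scale. Without standardization it takes over the first principal component; after standardization the first component instead reflects the three genuinely correlated predictors.

```python
import numpy as np

def first_component(X):
    """Loadings of the first principal component of the column-centered data."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt[0]

rng = np.random.default_rng(2)
n = 200
shared = rng.normal(size=n)
X = np.column_stack([
    1000.0 * rng.normal(size=n),        # unrelated to the rest, but on a huge scale
    shared + 0.3 * rng.normal(size=n),  # three predictors sharing a common factor
    shared + 0.3 * rng.normal(size=n),
    shared + 0.3 * rng.normal(size=n),
])

# Standardize each column to mean 0 and unit standard deviation.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

pc1_raw = first_component(X)
pc1_std = first_component(Z)
print("PC1 loadings, raw scale:   ", np.round(pc1_raw, 3))
print("PC1 loadings, standardized:", np.round(pc1_std, 3))
```

This sketch only covers the scaling issue. The second warning, that high-variance components need not be the ones related to the dependent variable, can only be checked against the response, for instance through cross-validated prediction error.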

With respect to your third question, a stepwise approach is inappropriate, as you recognize. I don't see a reason why you couldn't include interaction terms among your selected principal components in a regression, but they would be extremely hard to interpret. That's another reason why I would lean here toward working with the original independent variables rather than with their transformations into principal components.

You seem very interested in using PCA for this predictive model, but remember that it's easy to get fixated on a particular approach. You are in a very good position to compare several approaches, combined with appropriate cross-validation or bootstrapping techniques, to see which works best for your particular needs. If that ends up being PCA that's good, but don't dismiss the other possibilities out of hand.
