Ridge or multiple linear regression following PCA

Tags: pca, regression, ridge regression

I have a real-world clinical dataset with a severe p >> n problem, so I decided to run PCA before modelling the data. This leaves a dataset of 150 samples and 85 features (principal component scores).

I would now like to use regression both to predict outcomes in a cross-validated fashion and to build a model on the training data to predict outcomes in an external validation dataset.

My question is: given that I have already done PCA (which, as I understand it, helps with the issue of collinearity), is ridge regression still meaningful to use? I am getting very similar results between the two in cross-validation, with ridge regression perhaps performing slightly better (based on RMSE). My understanding is that regularization helps prediction by reducing overfitting.
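For reference, the comparison I am running looks roughly like this (a minimal sketch using scikit-learn, with synthetic data standing in for the clinical dataset; all names and sizes here are illustrative):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression, RidgeCV
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a p >> n clinical dataset.
X, y = make_regression(n_samples=150, n_features=1000, n_informative=30,
                       noise=10.0, random_state=0)

# Reduce to 85 PCs, then fit either plain OLS or ridge with a CV-chosen penalty.
pca_ols = make_pipeline(StandardScaler(), PCA(n_components=85),
                        LinearRegression())
pca_ridge = make_pipeline(StandardScaler(), PCA(n_components=85),
                          RidgeCV(alphas=np.logspace(-3, 3, 25)))

for name, model in [("PCA + OLS", pca_ols), ("PCA + ridge", pca_ridge)]:
    rmse = -cross_val_score(model, X, y, cv=5,
                            scoring="neg_root_mean_squared_error")
    print(f"{name}: mean RMSE = {rmse.mean():.2f}")
```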

Any insight would be greatly appreciated.

Best Answer

85 predictor dimensions with only 150 samples is likely to lead to overfitting in clinical data, even though you now have p < n. You typically need 10-20 cases per predictor to avoid overfitting.

Ridge regression can be thought of as a continuous version of principal components regression (PCR). Ridge shrinks the principal components continuously, penalizing the low-variance directions most, rather than the all-or-none selection of components in PCR. In that sense, ridge is a better solution to your problem than unpenalized PCR, as the coefficient penalization imposed by ridge will minimize the overfitting that would occur if you simply went ahead with 85 PCs and 150 samples. Performing ridge with all of your initial predictors gives you the advantages of PCR without the overfitting disadvantage in this case.
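To make that concrete, here is a minimal sketch of the suggestion: ridge on all of the original predictors with a cross-validated penalty, instead of unpenalized regression on 85 PCs (illustrative scikit-learn code on synthetic data, not your actual workflow):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic p >> n data standing in for the original predictors.
X, y = make_regression(n_samples=150, n_features=5000, n_informative=30,
                       noise=10.0, random_state=0)

# Ridge handles p >> n directly; no PCA step is needed, and the penalty
# chosen by cross-validation controls the effective model complexity.
ridge_all = make_pipeline(StandardScaler(),
                          RidgeCV(alphas=np.logspace(-3, 3, 25)))
rmse = -cross_val_score(ridge_all, X, y, cv=5,
                        scoring="neg_root_mean_squared_error")
print(f"Ridge on all predictors: mean RMSE = {rmse.mean():.2f}")
```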

There's a practical issue for external validation and other future uses in this scenario: those future data sets will need to contain all of the predictors used in building the model. That's true whether you use PCR or ridge. If you can't assure that, you might be better off with a different penalization method like the elastic net or LASSO, so that only a smaller set of predictor values needs to be available for later application.
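If predictor availability in future data sets is the concern, a sparse penalized fit shows how few predictors you might get away with. A minimal sketch with LASSO (again illustrative scikit-learn code on synthetic data):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=150, n_features=5000, n_informative=30,
                       noise=10.0, random_state=0)

# LASSO with a cross-validated penalty zeroes out most coefficients, so only
# the retained predictors need to be measured in later data sets.
lasso = make_pipeline(StandardScaler(), LassoCV(cv=5, max_iter=50000))
lasso.fit(X, y)

coefs = lasso.named_steps["lassocv"].coef_
print(f"LASSO kept {np.count_nonzero(coefs)} of {X.shape[1]} predictors")
```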

A better solution overall might be to apply knowledge of the subject matter to make a rational choice of candidate predictors. See Frank Harrell's course notes and book for guidance. If your predictors are things like expression values for thousands of genes, that might not be possible, but in that case you should at least make sure to include known clinically relevant predictors, perhaps unpenalized, in your model.
