Detecting Significant Predictors out of Many Independent Variables

feature-selection, pca, regression, stepwise-regression, underdetermined

In a dataset of two non-overlapping populations (patients and healthy controls, total $n=60$) I would like to find, out of $300$ independent variables, significant predictors for a continuous dependent variable. Correlation between predictors is present. I am interested in finding out whether any of the predictors are related to the dependent variable "in reality" (rather than predicting the dependent variable as accurately as possible). As I got overwhelmed by the numerous possible approaches, I would like to ask which approach is most recommended.

  • From my understanding, stepwise inclusion or exclusion of predictors is not recommended.

  • E.g. run a linear regression separately for every predictor and correct the p-values for multiple comparisons using FDR (probably very conservative?) (see the sketch after this list).

  • Principal-component regression: difficult to interpret, as I won't be able to say anything about the predictive power of individual predictors, only about the components.

  • Any other suggestions?
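
A minimal sketch of the per-predictor approach in R, assuming the predictors are in a $60 \times 300$ matrix `X` and the continuous outcome in a vector `y` (both names are placeholders, not from the original post; the data below are simulated purely to make the snippet runnable):

```r
set.seed(1)
n <- 60; p <- 300
X <- matrix(rnorm(n * p), n, p, dimnames = list(NULL, paste0("x", 1:p)))
y <- rnorm(n)

# Fit one simple linear regression per predictor and collect the slope p-values.
pvals <- apply(X, 2, function(x) summary(lm(y ~ x))$coefficients[2, 4])

# Adjust for multiple testing with the Benjamini-Hochberg FDR procedure.
padj <- p.adjust(pvals, method = "BH")
names(padj)[padj < 0.05]   # predictors that survive the FDR threshold
```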

Best Answer

I would recommend trying a GLM with lasso regularization. The lasso adds a penalty on the absolute size of the coefficients; as you increase the penalty, more coefficients are shrunk to exactly zero, so fewer variables remain in the model.

You should use cross-validation to select the value of the penalty parameter. If you have R, I suggest using the glmnet package. Use alpha=1 for lasso regression and alpha=0 for ridge regression; a value between 0 and 1 uses a combination of the lasso and ridge penalties, also known as the elastic net.
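A minimal sketch of that workflow with glmnet, again assuming the placeholder objects `X` (numeric predictor matrix) and `y` (continuous outcome) from above:

```r
library(glmnet)
set.seed(1)

# Cross-validated lasso (alpha = 1); alpha = 0 would give ridge,
# values in between the elastic net.
cv_fit <- cv.glmnet(X, y, alpha = 1, family = "gaussian", nfolds = 10)

# Coefficients at the penalty chosen by cross-validation;
# the non-zero entries are the selected predictors.
coefs <- as.matrix(coef(cv_fit, s = "lambda.min"))  # or s = "lambda.1se" for a sparser model
selected <- rownames(coefs)[coefs[, 1] != 0]
setdiff(selected, "(Intercept)")
```

Using `lambda.1se` instead of `lambda.min` trades a little cross-validated fit for a sparser, more conservative set of selected variables.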
