In a dataset of two non-overlapping populations (patients & healthy, total $n=60$) I would like to find, out of $300$ independent variables, significant predictors for a continuous dependent variable. Correlation between predictors is present. I am interested in finding out whether any of the predictors are related to the dependent variable "in reality" (rather than predicting the dependent variable as accurately as possible). As I got overwhelmed by the numerous possible approaches, I would like to ask which approach is most recommended.
- From my understanding, stepwise inclusion or exclusion of predictors is not recommended.
- E.g. run a linear regression separately for every predictor and correct p-values for multiple comparison using FDR (probably very conservative?).
- Principal-component regression: difficult to interpret, as I won't be able to tell about the predictive power of individual predictors but only about the components.
- Any other suggestions?
Best Answer
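I would recommend trying a GLM with lasso regularization. The lasso adds a penalty on the sum of the absolute values of the coefficients (rather than directly on the number of variables). As you increase the penalty, more coefficients are shrunk to exactly zero, so the number of variables in the model decreases.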
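To make the penalty concrete, here is the lasso objective for a continuous response, written in the $\frac{1}{2n}$ scaling that glmnet uses for the Gaussian family. The symbols are my own notation rather than anything from the question: $y_i$ is the dependent variable, $x_i$ the vector of the $p=300$ predictors for observation $i$, and $\lambda \ge 0$ the penalty parameter.

$$
(\hat\beta_0, \hat\beta) = \arg\min_{\beta_0,\,\beta}\; \frac{1}{2n}\sum_{i=1}^{n}\bigl(y_i - \beta_0 - x_i^\top\beta\bigr)^2 \;+\; \lambda\sum_{j=1}^{p}\lvert\beta_j\rvert
$$

With $\lambda = 0$ this is ordinary least squares; as $\lambda$ grows, the second term forces more of the $\beta_j$ to exactly zero.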
You should use cross-validation to select the value of the penalty parameter. If you have R, I suggest using the glmnet package. Use `alpha=1` for lasso regression and `alpha=0` for ridge regression. Setting a value between 0 and 1 will use a combination of lasso and ridge penalties, also known as the elastic net.
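To illustrate the mechanics of choosing the lasso penalty by cross-validation, here is a minimal pure-Python sketch. It is not glmnet and not a substitute for it: the function names (`lasso_cd`, `cv_mse`), the simulated data, and the small grid of penalties are all assumptions made for illustration, and unlike glmnet's default the sketch does not standardize the predictors. It fits the lasso by cyclic coordinate descent and picks the penalty with the lowest 5-fold cross-validated error.

```python
import random


def soft_threshold(z, t):
    """Shrink z toward zero by t; values within [-t, t] become exactly 0."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def lasso_cd(X, y, lam, n_iter=100):
    """Minimise (1/(2n)) * RSS + lam * sum(|beta_j|) by cyclic coordinate descent.

    X is a list of rows, y a list of responses. Returns (intercept, beta).
    """
    n, p = len(X), len(X[0])
    y_mean = sum(y) / n
    col_mean = [sum(row[j] for row in X) / n for j in range(p)]
    Xc = [[row[j] - col_mean[j] for j in range(p)] for row in X]  # centred predictors
    resid = [v - y_mean for v in y]  # residuals when all coefficients are 0
    col_sq = [sum(Xc[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue  # a constant column carries no information
            # Covariance of x_j with the partial residual that excludes x_j.
            rho = sum(Xc[i][j] * resid[i] for i in range(n)) / n + col_sq[j] * beta[j]
            new = soft_threshold(rho, lam) / col_sq[j]
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= Xc[i][j] * delta
                beta[j] = new
    intercept = y_mean - sum(m * b for m, b in zip(col_mean, beta))
    return intercept, beta


def cv_mse(X, y, lam, k=5, seed=0):
    """Mean squared prediction error of the lasso at penalty lam, by k-fold CV."""
    idx = list(range(len(y)))
    random.Random(seed).shuffle(idx)
    sse = 0.0
    for f in range(k):
        test = idx[f::k]  # every k-th shuffled index forms one fold
        held = set(test)
        train = [i for i in idx if i not in held]
        b0, b = lasso_cd([X[i] for i in train], [y[i] for i in train], lam)
        for i in test:
            pred = b0 + sum(bj * xj for bj, xj in zip(b, X[i]))
            sse += (y[i] - pred) ** 2
    return sse / len(y)


# Simulated data (an assumption for illustration): 60 samples, 10 predictors,
# and only predictor 0 is truly related to the response.
rng = random.Random(1)
X = [[rng.gauss(0, 1) for _ in range(10)] for _ in range(60)]
y = [3.0 * row[0] + rng.gauss(0, 1) for row in X]

lambdas = [0.05, 0.2, 0.8]  # hypothetical grid of penalty values
best_lam = min(lambdas, key=lambda lam: cv_mse(X, y, lam))
intercept, beta = lasso_cd(X, y, best_lam)
selected = [j for j, b in enumerate(beta) if b != 0.0]
print("chosen penalty:", best_lam, "selected predictors:", selected)
```

The predictors with nonzero coefficients at the chosen penalty are the ones the lasso retains. In R, `cv.glmnet` in the glmnet package performs the corresponding cross-validation over a much finer grid of penalties.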