How to Use Lasso Before Random Forest for Feature Selection

Tags: classification, feature-selection, random-forest

I have a small dataset with 160 observations and more than 50 candidate predictors, some of which are correlated.
I would like to fit a classification model for prediction and to identify the best set of predictors.
I was thinking of using lasso regression for feature elimination and then fitting a random forest with the selected covariates.
Would this be legitimate, or should I use the random forest directly?
Do you think some other classification algorithm would be more suitable (e.g., boosted trees)?
Thank you.

Best Answer

One alternative is a feature selection method based on the random forest itself, such as the Boruta algorithm. The reason you might care is that the lasso is a linear model, so any predictor whose effect on the outcome isn't linear in the estimated parameters is at risk of being eliminated. See, for example, the question "What are disadvantages of using the lasso for variable selection for regression?" In other words, the disadvantage of the lasso for feature selection ahead of a nonlinear model is that the lasso makes more restrictive assumptions than the random forest does.
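To make the linearity point concrete, here is a minimal sketch (my own illustration on synthetic data, using scikit-learn; the data-generating process and all settings are assumptions, not from the original answer). A predictor whose effect is purely quadratic ends up with a near-zero lasso coefficient, while a random forest still ranks it as important:

```python
# Sketch: lasso misses a purely quadratic effect that random forest catches.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegressionCV
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 1000
X = rng.normal(size=(n, 3))  # x0: linear effect, x1: quadratic effect, x2: pure noise
logit = 2 * X[:, 0] + 2 * (X[:, 1] ** 2 - 1)
y = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

# L1-penalized ("lasso") logistic regression: the coefficient on x1 shrinks
# toward zero, because x1's *linear* association with y is essentially nil.
Xs = StandardScaler().fit_transform(X)
lasso = LogisticRegressionCV(penalty="l1", solver="liblinear", Cs=10).fit(Xs, y)
print("lasso coefficients:", lasso.coef_.round(2))

# The random forest can model the nonlinearity, so it ranks x1 highly.
rf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X, y)
print("RF importances:", rf.feature_importances_.round(2))
```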

Additionally, one might wonder why you care about feature selection at all. Random forest has a kind of feature selection built in, in the sense that it selects the best features to split on when building trees (but this isn't foolproof).
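As an illustration of selection driven by the random forest itself, here is a sketch using permutation importance on a held-out split (a simpler stand-in for Boruta; the synthetic data, threshold, and settings are my own illustrative choices):

```python
# Sketch: keep features whose permutation importance clearly exceeds its noise.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=160, n_features=50, n_informative=5,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X_tr, y_tr)
imp = permutation_importance(rf, X_te, y_te, n_repeats=20, random_state=0)

# Arbitrary rule of thumb: mean importance at least two SDs above zero.
keep = imp.importances_mean > 2 * imp.importances_std
print("selected feature indices:", np.flatnonzero(keep))
```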

In the particular case you outline, where you have a small number of observations overall, a random forest is probably not the right tool. Overfitting is a very real risk. Regularized regression might be the best you can do.
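If you go that route, one reasonable baseline is a cross-validated elastic-net logistic regression, which handles correlated predictors more gracefully than a pure lasso (the lasso tends to pick one member of a correlated group arbitrarily). A sketch, with stand-in data of the question's rough shape and all hyperparameters purely illustrative:

```python
# Sketch: regularized (elastic-net) logistic regression for n=160, p>50.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data with the question's rough shape; replace with your own X, y.
X, y = make_classification(n_samples=160, n_features=55, n_informative=8,
                           random_state=1)

model = make_pipeline(
    StandardScaler(),
    LogisticRegressionCV(penalty="elasticnet", solver="saga",
                         l1_ratios=[0.1, 0.5, 0.9], Cs=10,
                         cv=10, max_iter=10000, random_state=0),
)
model.fit(X, y)
print("training accuracy:", round(model.score(X, y), 2))
```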
