Solved – Can a random-forest-based feature selection method be used for multiple regression in machine learning

boruta, feature selection, machine learning, multiple regression, random forest

I would like to have a good feature selection method for a continuous response variable, given around 100 candidate predictors. I would like to keep my model as a multiple linear regression model rather than a tree-based model.

My current method: I could calculate the (linear) correlation between each predictor and the response, and select the subset of predictors with "strong" correlations for the final multiple regression. The prediction performance of the selected predictors would then be assessed in this final multiple regression model. However, feature selection done this way is subjective, and I am afraid of missing "important" features.
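For concreteness, here is a minimal sketch of this screening approach in Python with pandas and scikit-learn; the 0.3 cut-off and the toy data are arbitrary illustrations, not my real data:

    # Sketch of correlation screening followed by a multiple linear regression.
    # The cut-off of 0.3 and the simulated data are illustrative placeholders.
    import numpy as np
    import pandas as pd
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(0)
    X = pd.DataFrame(rng.normal(size=(500, 100)),
                     columns=[f"x{i}" for i in range(100)])
    y = 2 * X["x0"] + X["x1"] - X["x2"] + rng.normal(size=500)

    # 1. univariate (linear) correlation of every predictor with the response
    corr = X.apply(lambda col: np.corrcoef(col, y)[0, 1])

    # 2. keep predictors whose absolute correlation exceeds a subjective cut-off
    selected = corr.index[corr.abs() > 0.3]

    # 3. fit the final multiple linear regression on the selected subset
    final_model = LinearRegression().fit(X[selected], y)
    print(selected.tolist(), final_model.score(X[selected], y))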

I would like to apply a more objective and complete feature selection method, such as the "all-relevant" feature selection in Boruta or the variable importance measure of a random forest. However, as I understand it, both methods are based on tree-based random forests, which are not linear regression models.
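For reference, here is a rough sketch of what random-forest variable importance looks like, using scikit-learn's permutation importance as a loose stand-in for Boruta's shadow-feature procedure; the toy data, number of trees, and repeats are arbitrary illustrations:

    # Sketch of random-forest variable importance for a continuous response.
    # All settings here are illustrative; this is not Boruta itself.
    import numpy as np
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.inspection import permutation_importance

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 100))
    y = 2 * X[:, 0] + X[:, 1] - X[:, 2] + rng.normal(size=500)

    rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
    imp = permutation_importance(rf, X, y, n_repeats=10, random_state=0)

    # rank predictors by mean importance; note there is no principled cut-off here
    top = np.argsort(imp.importances_mean)[::-1][:10]
    print(list(zip(top, imp.importances_mean[top])))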

My questions are:

  1. Is my current method a proper way to address my research purpose?

  2. Can random-forest-based feature selection be used to select features for a multiple linear regression model?

  3. Are there any other recommended feature selection methods?

Best Answer

Firstly, a method that looks at univariate correlations to pre-identify what should go into a final model will tend to do badly, for several reasons:

  1. It ignores model uncertainty (a single selected model).

  2. It uses statistical significance/strength of correlation as the selection criterion; if the goal is prediction, you should rather assess how much something helps prediction, which is not necessarily the same thing.

  3. It can "falsely" identify predictors in univariate correlations (i.e. another predictor is actually better, but because the one you look at correlates a bit with it, it appears to correlate reasonably well with the outcome).

  4. It can miss predictors that only show up, or only become clear, once other predictors are adjusted for.

Additionally, not wrapping this procedure in some form of bootstrapping or cross-validation to get a realistic assessment of your model uncertainty is likely to mislead you.
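As a sketch of what I mean, the whole select-then-fit procedure can be put inside cross-validation so that selection is repeated on every training fold; here SelectKBest with f_regression is only a stand-in for your correlation filter, and the toy data, k = 10, and 5 folds are arbitrary choices:

    # Sketch of cross-validating the entire selection-plus-fit pipeline,
    # so the CV score reflects the variability of the whole procedure.
    import numpy as np
    from sklearn.pipeline import make_pipeline
    from sklearn.feature_selection import SelectKBest, f_regression
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(1)
    X = rng.normal(size=(500, 100))        # toy stand-in for ~100 predictors
    y = 2 * X[:, 0] + X[:, 1] - X[:, 2] + rng.normal(size=500)

    pipe = make_pipeline(SelectKBest(f_regression, k=10), LinearRegression())

    # selection happens inside each fold, not once on the full data
    scores = cross_val_score(pipe, X, y, cv=5, scoring="r2")
    print(scores.mean(), scores.std())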

Furthermore, treating continuous predictors as having linear effects can often be improved upon by methods that do not make such an assumption (e.g. RF).

Using RF as a pre-selection step for a linear model is not such a good idea. Variable importance is genuinely hard to interpret, and it is hard (or perhaps meaningless) to set a cut-off on it. You do not know whether a variable's importance reflects the variable itself or its interactions, and you also lose out on non-linear transformations of the variables.

It depends in part on what you want to do. If you want good predictions, maybe you should not care too much about whether your method is a traditional statistical model or not.

Of course, there are plenty of things like the elastic net, the LASSO, Bayesian models with the horseshoe prior etc. that fit better into a traditional modeling framework and could also accommodate e.g. splines for continuous covariates.
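As a hedged sketch of one such alternative, an elastic net on spline-expanded predictors with the penalty chosen by cross-validation might look like the following; the toy data, spline settings, and l1_ratio grid are illustrative assumptions, not prescriptions:

    # Sketch of an elastic net on spline-expanded continuous predictors,
    # with the penalty tuned by cross-validation. Settings are illustrative.
    import numpy as np
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import SplineTransformer, StandardScaler
    from sklearn.linear_model import ElasticNetCV

    rng = np.random.default_rng(2)
    X = rng.normal(size=(500, 100))
    y = np.sin(3 * X[:, 0]) + X[:, 1] ** 2 - X[:, 2] + rng.normal(size=500)

    enet = make_pipeline(
        SplineTransformer(degree=3, n_knots=5),  # non-linear basis per predictor
        StandardScaler(),                        # penalised coefficients need a common scale
        ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9, 1.0], n_alphas=20, cv=5),
    ).fit(X, y)
    print(enet[-1].alpha_, enet[-1].l1_ratio_)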
