Solved – Best practices for feature selection

dimensionality reduction, feature selection, predictor

I have datasets that range from roughly 2,000 to 9,000 columns of predictor variables. I'm usually charged with classification tasks, but sometimes regression. I know that I don't need this many variables for my models to be effective, but I can't reliably anticipate which ones matter.

I'm looking for ideas on general best practices that would cut this down to around 50-150 variables, which, in my experience, seems fairly effective for determining the outcome.

Currently I'm using lasso or random forest to whittle down the number of variables before running a final model. I want fewer variables so there's less noise, because I simply don't need that many, and to make the model easier to deploy to production.

Best Answer

I think you are already following a "best practice" approach to feature selection. Using a regularised regression approach like LASSO and complementing those insights with a distribution-free model like Random Forest to identify the most important features is probably the best way to go.
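As a rough illustration of that combination, here is a minimal scikit-learn sketch on synthetic data (the L1-penalised logistic regression stands in for LASSO in a classification setting; the dataset sizes, thresholds, and cut-offs are arbitrary placeholders, not recommendations):

```
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegressionCV
from sklearn.preprocessing import StandardScaler

# Toy stand-in for a wide dataset; in practice X, y would be your own data
X, y = make_classification(n_samples=500, n_features=1000, n_informative=40,
                           random_state=0)

# L1-penalised (LASSO-style) logistic regression with cross-validated strength
Xs = StandardScaler().fit_transform(X)
lasso = LogisticRegressionCV(penalty="l1", solver="saga", Cs=10,
                             max_iter=5000, random_state=0)
lasso.fit(Xs, y)
lasso_keep = set(np.flatnonzero(lasso.coef_.ravel() != 0))

# Random Forest impurity-based variable importances
rf = RandomForestClassifier(n_estimators=500, n_jobs=-1, random_state=0)
rf.fit(X, y)
rf_keep = set(np.argsort(rf.feature_importances_)[::-1][:100])  # top 100 is arbitrary

# Combine the two views, e.g. keep anything flagged by either method
selected = sorted(lasso_keep | rf_keep)
print(len(lasso_keep), len(rf_keep), len(selected))
```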

Some minor suggestions: I would propose using Elastic Net to add a small amount of $L_2$ regularisation. This should make our feature selection a bit more stable in the presence of correlated features. Similarly, taking a slightly more sophisticated approach and using Random Forests within a full Recursive Feature Elimination framework like Boruta (see Nilsson et al. for background, CRAN link) instead of relying on plain Random Forest variable importance will probably be beneficial; a sketch of both tweaks follows.
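Here is a hedged sketch of those two tweaks, using scikit-learn's elastic-net penalty and its RFE wrapper (Boruta itself is available as the R Boruta package or the Python BorutaPy port; the hyperparameter values below are illustrative, not tuned):

```
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=1000, n_informative=40,
                           random_state=0)
Xs = StandardScaler().fit_transform(X)

# Elastic net: mostly L1 with a small L2 share (l1_ratio < 1) to stabilise the
# selection when predictors are correlated
enet = LogisticRegression(penalty="elasticnet", solver="saga", l1_ratio=0.9,
                          C=0.1, max_iter=5000, random_state=0)
enet.fit(Xs, y)
enet_keep = np.flatnonzero(enet.coef_.ravel() != 0)

# Recursive feature elimination wrapped around a random forest, dropping 10% of
# the remaining features per iteration until 100 are left
rfe = RFE(RandomForestClassifier(n_estimators=200, n_jobs=-1, random_state=0),
          n_features_to_select=100, step=0.1)
rfe.fit(X, y)
rfe_keep = np.flatnonzero(rfe.support_)

print(len(enet_keep), len(rfe_keep))
```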

Having said the above, we should use such feature selection approaches only if we cannot work with our original full dataset and/or we expect problems collecting the features in question in the future (e.g. they are too costly to measure). Using a modelling approach that can actively regularise the resulting model (e.g. gradient boosting machines, where we can regularise the fit by properly picking the learning rate, tree depth, minimum number of samples per leaf node, etc.) is the best way to go. That way we know we are not reusing our data, and we are not losing valuable information that a separate feature selection step might have thrown away.
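For example, a regularised boosting fit along those lines might look as follows (using scikit-learn's HistGradientBoostingClassifier; the parameter values are placeholders meant to show which knobs do the regularising, not tuned recommendations):

```
from sklearn.datasets import make_classification
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=1000, n_informative=40,
                           random_state=0)

# Regularise through the booster itself rather than a separate selection step:
# small learning rate, shallow trees, a minimum leaf size, an explicit L2
# penalty on leaf values, and early stopping on a validation split.
gbm = HistGradientBoostingClassifier(
    learning_rate=0.05,
    max_depth=3,
    min_samples_leaf=20,
    l2_regularization=1.0,
    early_stopping=True,
    random_state=0,
)
print(cross_val_score(gbm, X, y, cv=5).mean())
```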

An issue not touched upon is performing data reduction using a dimensionality reduction technique like PCA, ICA, NNMF, etc. These techniques do not "select features" per se but rather "combine features" to create meta-features of varying informational value. They can be very useful if we need a small set of "information-rich" features. Nevertheless, these "information-rich" features are not guaranteed to carry more, less, or any of the information relevant to our modelling task, so they are not a silver bullet for feature selection. They usually provide a convenient, condensed representation of our original data when we cannot work with it in its raw form.
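To make the distinction concrete, a PCA-based reduction (again with placeholder sizes) returns combined components rather than a subset of the original columns:

```
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=1000, n_informative=40,
                           random_state=0)

# Each principal component is a weighted combination of *all* original features,
# so this compresses the data but does not select individual predictors.
pca = PCA(n_components=100, random_state=0)
Z = pca.fit_transform(StandardScaler().fit_transform(X))

print(Z.shape)                              # (500, 100) meta-features
print(pca.explained_variance_ratio_.sum())  # share of variance retained
```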