Feature selection and parameter tuning with caret for random forest

caret, feature selection, random forest, regression

I have data with a few thousand features and I want to remove the uninformative ones with recursive feature elimination (RFE), using caret's rfe. However, if I want the best regression fit (random forest, for example), when should I perform parameter tuning (mtry for RF)? As I understand it, caret trains the RF repeatedly on different feature subsets with a fixed mtry. I suppose the optimal mtry should be found after feature selection is finished, but will the mtry value that caret uses during RFE influence which subset of features gets selected? Using caret with a low mtry is much faster, of course.
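For concreteness, here is roughly what I am doing (a sketch; x, y, and the subset sizes are placeholders):

```r
library(caret)

set.seed(42)
ctrl <- rfeControl(functions = rfFuncs,  # random-forest helpers for rfe
                   method = "cv",
                   number = 5)
rfeFit <- rfe(x, y,
              sizes = c(10, 50, 100, 500),  # candidate feature-subset sizes
              rfeControl = ctrl)
# mtry stays fixed throughout: rfFuncs uses randomForest's default unless
# a value is passed through rfe's "..." argument (e.g. mtry = 5)
predictors(rfeFit)  # the selected feature subset
```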

Hope someone can explain this to me.

Best Answer

One thing you might want to look into is regularized random forests, which are specifically designed for feature selection. This paper explains the concept and how they differ from ordinary random forests:

Feature Selection via Regularized Trees

There's also a CRAN package, RRF, built on randomForest, that lets you implement them easily in R. I've had good luck with this methodology myself.
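As a minimal sketch of the guided variant for regression, assuming predictors x and a numeric response y (gamma = 0.5 is just an illustrative choice):

```r
library(RRF)

set.seed(42)
rf0 <- RRF(x, y, flagReg = 0)             # ordinary RF, used for guidance
imp <- rf0$importance[, "IncNodePurity"]  # regression importance scores
impNorm <- imp / max(imp)                 # normalize to [0, 1]
gamma <- 0.5                              # strength of importance guidance
coefReg <- (1 - gamma) + gamma * impNorm  # per-feature regularization weights
grrf <- RRF(x, y, flagReg = 1, coefReg = coefReg)
grrf$feaSet                               # indices of the selected features
```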

Regarding your initial question, the only advice I can give is that if you have a lot of collinearity, you should grow smaller trees. This lets the algorithm determine variable importance with less interference from collinearity effects.
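With randomForest (which RRF is built on), tree size can be capped along these lines; the exact values are only illustrative:

```r
library(randomForest)

set.seed(42)
rfSmall <- randomForest(x, y,
                        maxnodes = 16,     # cap terminal nodes per tree
                        nodesize = 10,     # larger leaves -> shallower trees
                        importance = TRUE) # permutation importance as well
importance(rfSmall)
```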