Solved – Random Forest: IncNodePurity and Feature Selection for Binary Logistic Regression

feature selection | gini | r | random forest

After creating a Random Forest object with randomForest using around 500 candidate variables, I used importance(object) to display the IncNodePurity for each candidate variable in relation to the binary outcome of interest (Payment/No Payment).
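For concreteness, a minimal sketch of that workflow in R, assuming a placeholder data frame dat with a 0/1 outcome column Payment and the candidate predictors (all names here are hypothetical):

```r
library(randomForest)

set.seed(1)
# Fit the forest on the binary outcome; 'dat' and 'Payment' are placeholders
rf_fit <- randomForest(Payment ~ ., data = dat, ntree = 500)

# One importance value per candidate variable; with a numeric 0/1 outcome the
# column is "IncNodePurity" (with a factor outcome it is "MeanDecreaseGini")
imp <- importance(rf_fit)
head(imp[order(imp[, 1], decreasing = TRUE), , drop = FALSE], 10)
```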

I am aware that IncNodePurity is the total decrease in node impurity, measured by the Gini index, from splitting on the variable, averaged over all trees. What I don't know is the cutoff for deciding which candidate variables to retain after using randomForest for feature selection ahead of a binary logistic regression model. For example, the smallest IncNodePurity among my 498 variables is 0.03, whereas the largest is 96.68.
In summary, I have one main question:

Is there a cutoff for IncNodePurity? If yes, what is it?

If not, how do you determine the cutoff? Do you simply take the 10 candidate variables with the largest IncNodePurity if you want a model with only 10 predictor variables?
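To make that second option concrete, this rough sketch (reusing the placeholder rf_fit and dat objects from above) is what I have in mind, feeding the top 10 into a logistic regression:

```r
imp <- importance(rf_fit)
top10 <- rownames(imp)[order(imp[, 1], decreasing = TRUE)][1:10]

# Binary logistic regression on just those 10 predictors
glm_fit <- glm(reformulate(top10, response = "Payment"),
               data = dat, family = binomial)
summary(glm_fit)
```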

Any thoughts or references are greatly appreciated. Thanks!

Best Answer

I don't believe such a cutoff exists, although the variable importance plots can be informative. Carry out two experiments: rerun the random forest and see how the ranked list changes; then delete an observation, rerun, and observe the ranking again. In my experience, the answer depends on the goal of the feature selection. For example, why not use every variable to make predictions? A random forest can easily do that. We usually perform feature selection for a reason, for example to find a rule that uses just a small number of features that can easily be measured in the future. In my case, the number was set by the technology I plan to use the diagnostic on. More importantly, if you are doing feature selection, it has to be repeated at each iteration of cross-validation, as sketched below. Searching this site for "cross-validation", "feature selection", and "stepwise regression" will give you a start.
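A rough sketch of both suggestions, using the same placeholder data frame dat and 0/1 outcome Payment as in the question (this is an illustration, not a definitive recipe):

```r
library(randomForest)

## 1) Stability check: refit the forest several times and see which variables
##    keep landing in the top 10.
top10_by_run <- lapply(1:5, function(i) {
  set.seed(i)
  fit <- randomForest(Payment ~ ., data = dat, ntree = 500)
  imp <- importance(fit)
  rownames(imp)[order(imp[, 1], decreasing = TRUE)][1:10]
})
Reduce(intersect, top10_by_run)   # variables that survive every rerun

## 2) Feature selection repeated inside each cross-validation fold, so the
##    selection step is part of what gets validated.
k <- 5
folds <- sample(rep(1:k, length.out = nrow(dat)))
cv_error <- sapply(1:k, function(f) {
  train <- dat[folds != f, ]
  test  <- dat[folds == f, ]
  fit   <- randomForest(Payment ~ ., data = train, ntree = 500)
  imp   <- importance(fit)
  top10 <- rownames(imp)[order(imp[, 1], decreasing = TRUE)][1:10]
  glm_f <- glm(reformulate(top10, response = "Payment"),
               data = train, family = binomial)
  pred  <- predict(glm_f, newdata = test, type = "response") > 0.5
  mean(pred != (test$Payment == 1))   # held-out misclassification rate
})
mean(cv_error)
```

The point of the second loop is that the top-10 list is re-derived from the training folds only, so the cross-validated error reflects the selection step as well as the final logistic fit.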