Solved – Feature selection with Random Forests

Tags: feature-selection, python, random-forest

I have a dataset of mostly financial variables (120 features, 4k examples) that are largely highly correlated and very noisy (technical indicators, for example), so I would like to select at most 20-30 of them for later use in model training (binary classification: increase / decrease).

I was thinking about using random forests for feature ranking. Is it a good idea to use them recursively? For example, say in the first round I drop the worst 20%, in the second round the same, and so on until I reach the desired number of features. Should I use cross-validation with RF? (Intuitively I would skip CV, because that's pretty much what RF does already.)

Also, if I go with random forests, should I use them as a classifier on the binary outcome or as a regressor on the actual increase / decrease to get feature importances?

By the way, the models I would like to try after feature selection are: SVM, neural nets, locally weighted regressions, and random forest. I'm mainly working in Python.

Best Answer

For feature selection, we need a scoring function as well as a search method to optimize the scoring function.

You may use RF as a feature-ranking method if you define a relevant importance score. RF builds each tree on a bootstrap sample (drawn with replacement) and restricts it to a random subset of the features, a so-called random subspace. One importance score could be based on assigning the accuracy of every tree to every feature in that tree's random subspace, doing this for every tree separately. Since the subspaces are generated at random, you may put a threshold on how often a feature must appear before computing its importance score.

Summary:

Step 1: If feature X2 appears in at least 25% of the trees, score it. Otherwise, do not rank it, because we do not have sufficient information about its performance.

Step 2: Assign the performance score of every tree in which X2 appears to X2, and average these scores. For example: perf(Tree1) = 0.85, perf(Tree2) = 0.70, perf(Tree3) = 0.30.

Then, the importance of feature X2 = (0.85+0.70+0.30)/3 = 0.6167
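Here is a minimal sketch of that scoring scheme in scikit-learn. One caveat: sklearn's RandomForestClassifier samples features per split rather than per tree, so the sketch treats the features actually used in a tree's splits as that tree's "subspace," and it scores each tree on a held-out validation set. The dataset shapes and all names here are hypothetical stand-ins:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Toy stand-in with the question's rough shape (120 features, 4k rows).
X, y = make_classification(n_samples=4000, n_features=120,
                           n_informative=15, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.25,
                                            random_state=0)

rf = RandomForestClassifier(n_estimators=500, random_state=0)
rf.fit(X_tr, y_tr)

n_features = X.shape[1]
appears_in = np.zeros(n_features)   # how many trees split on each feature
score_sum = np.zeros(n_features)    # summed per-tree accuracy per feature

for tree in rf.estimators_:
    # Features actually used in this tree's internal splits
    # (leaf nodes carry a negative feature index).
    used = np.unique(tree.tree_.feature[tree.tree_.feature >= 0])
    acc = tree.score(X_val, y_val)  # per-tree performance on held-out data
    appears_in[used] += 1
    score_sum[used] += acc

# Step 1: only rank features that appear in at least 25% of the trees.
mask = appears_in >= 0.25 * rf.n_estimators
# Step 2: average the accuracies of the trees each kept feature appears in.
importance = np.where(mask, score_sum / np.maximum(appears_in, 1), -np.inf)

top20 = np.argsort(importance)[::-1][:20]
print("Top 20 features:", top20)
```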

You may consider a more advanced setting by including the split depth of the feature or the information-gain value in the decision tree. There are many ways to design a scoring function based on decision trees and RF.
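As one sketch of that more advanced direction (reusing the fitted rf from the snippet above), you could credit a feature by the depth of the nodes that split on it, so splits near the root count more. This weighting is just illustrative, not a standard importance measure:

```python
import numpy as np

def depth_weighted_scores(rf, n_features):
    """Score features by 1 / (1 + depth) of every node that splits on them."""
    scores = np.zeros(n_features)
    for tree in rf.estimators_:
        t = tree.tree_
        # Node depths: children are stored after their parents in sklearn's
        # tree arrays, so a single forward pass fills in every depth.
        depth = np.zeros(t.node_count, dtype=int)
        for node in range(t.node_count):
            for child in (t.children_left[node], t.children_right[node]):
                if child != -1:                # -1 marks "no child" (leaf)
                    depth[child] = depth[node] + 1
        for node in np.where(t.feature >= 0)[0]:   # internal (split) nodes
            scores[t.feature[node]] += 1.0 / (1 + depth[node])
    return scores
```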

Regarding the search method, your recursive approach seems reasonable as a way to select the top-ranked features.
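In fact, your "drop the worst 20% per round" idea maps directly onto scikit-learn's RFE with a fractional step. A sketch, with X and y as above and the target of 25 features being just one assumed value in your 20-30 range:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

# Recursively drop the worst 20% of the remaining features per round
# (step=0.2) until 25 are left, ranking by RF feature importances.
selector = RFE(RandomForestClassifier(n_estimators=300, random_state=0),
               n_features_to_select=25, step=0.2)
selector.fit(X, y)
print("Selected features:", selector.get_support(indices=True))
```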

Finally, you may use RF either as a classifier or as a regression model to select your features, since both supply a performance score. The score is informative because it is based on the out-of-bag (OOB) samples, so in a simpler setting you may skip cross-validation.
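For example, both variants expose an OOB score in scikit-learn when oob_score=True. Here y is the binary label from the earlier sketch, and y_return is a hypothetical real-valued target (the actual size of the move) for the same rows of X:

```python
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

# Classifier on the binary increase/decrease label: OOB accuracy.
clf = RandomForestClassifier(n_estimators=500, oob_score=True,
                             random_state=0).fit(X, y)
print("OOB accuracy:", clf.oob_score_)

# Regressor on the actual magnitude (y_return, hypothetical): OOB R^2.
reg = RandomForestRegressor(n_estimators=500, oob_score=True,
                            random_state=0).fit(X, y_return)
print("OOB R^2:", reg.oob_score_)
```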