Solved – Random Forest Variable selection


So I just started dealing with random forests and I have some questions regarding feature selection.

  1. Should I omit features that are highly correlated with others? If so, is there a common threshold or a generally used method? I ask because random forests pick a random subset of features at each node split, and I assume correlated features might otherwise introduce some bias.

  2. Should I omit features with near-zero variance (say, a feature that is "0" for 997 observations and "1" for 3 observations)? They are most likely not going to show up in any of the trained trees (and even if they do, their variable importance will probably be quite low), but they are still noise and reduce the chance that another, possibly more valuable feature is randomly chosen. Or does it usually not matter when generating 100 trees or more?

  3. There may be different random forest algorithms, but does the random feature selection at each node usually work with replacement? I've read that observations are randomly picked with replacement (every time a tree is built), but I cannot find an answer about whether features are picked with replacement.

I've seen similar questions but I'm still not certain about these things (or maybe I've overlooked some questions).

Help is much appreciated!

Best Answer

If you have a lot of observations and a lot of variables, performance shouldn't suffer much from correlated or near-zero-variance variables. Still, it is better to remove them if possible. The difficulty is identifying them in a way that isn't just ad hoc.
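As a rough illustration of what a simple pre-filter could look like, here is a minimal sketch using only NumPy. The function names and both thresholds (a 1% minority share and |r| > 0.95) are my own assumptions for the example, not established standards, and the toy data mirrors the 997/3 case from the question.

```python
import numpy as np

def near_zero_variance(X, min_minority_frac=0.01):
    """Indices of columns where all but a tiny share of rows hold one value.

    The 1% cutoff is an illustrative assumption, not a standard.
    """
    n_rows = X.shape[0]
    flagged = []
    for j in range(X.shape[1]):
        _, counts = np.unique(X[:, j], return_counts=True)
        minority_frac = 1.0 - counts.max() / n_rows
        if minority_frac < min_minority_frac:
            flagged.append(j)
    return flagged

def highly_correlated(X, threshold=0.95):
    """Indices of columns to drop: the later member of each pair with |r| > threshold.

    The 0.95 cutoff is an illustrative assumption, not a standard.
    """
    corr = np.abs(np.corrcoef(X, rowvar=False))
    drop = set()
    for i in range(corr.shape[0]):
        if i in drop:
            continue
        for j in range(i + 1, corr.shape[0]):
            if j not in drop and corr[i, j] > threshold:
                drop.add(j)
    return sorted(drop)

# Toy data: column 1 is a near copy of column 0, column 3 is 997 zeros and 3 ones.
rng = np.random.default_rng(0)
n = 1000
a = rng.normal(size=n)
b = a + 0.01 * rng.normal(size=n)
c = rng.normal(size=n)
d = np.zeros(n)
d[:3] = 1.0
X = np.column_stack([a, b, c, d])

nzv_columns = near_zero_variance(X)
correlated_columns = highly_correlated(X)
print("near-zero variance:", nzv_columns)
print("highly correlated:", correlated_columns)
```

A filter like this is exactly the kind of ad hoc rule mentioned above: the cutoffs are arbitrary, so it is best treated as a first pass rather than a principled selection method.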

There are, however, a few variable selection algorithms that can fine-tune things. This paper ranks the variables and then performs a stepwise addition procedure to find the best subset.
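To make the "rank, then add stepwise" idea concrete, here is a minimal sketch with scikit-learn. It is not the paper's exact procedure: ranking by impurity-based importance, scoring with 3-fold cross-validated accuracy, and keeping a variable only if the score improves are all assumptions I made for the example, as are the function name and the synthetic data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def ranked_forward_selection(X, y, n_estimators=50, cv=3, seed=0):
    """Rank features by forest importance, then add them one at a time,
    keeping a feature only if the cross-validated score improves."""
    forest = RandomForestClassifier(n_estimators=n_estimators, random_state=seed)
    forest.fit(X, y)
    order = np.argsort(forest.feature_importances_)[::-1]  # most important first

    selected, best_score = [], -np.inf
    for j in order:
        trial = selected + [int(j)]
        model = RandomForestClassifier(n_estimators=n_estimators, random_state=seed)
        score = cross_val_score(model, X[:, trial], y, cv=cv).mean()
        if score > best_score:
            selected, best_score = trial, score
    return selected, best_score

# Synthetic data: 10 features, of which 3 are informative.
X, y = make_classification(n_samples=300, n_features=10, n_informative=3,
                           n_redundant=0, random_state=0)

selected, best_score = ranked_forward_selection(X, y)
print("selected features:", selected)
print("cross-validated accuracy:", round(best_score, 3))
```

Because the same data is used both to choose the subset and to score it, the reported accuracy is optimistic; a held-out set or nested cross-validation would give a fairer estimate.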

This paper generates p-values for variable importances. Variables with high p-values can be trimmed, which can improve results.
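One common way to attach p-values to importances is a permutation test: shuffle the response, refit the forest, and see how often the shuffled importance matches or exceeds the real one. The sketch below takes that approach as an assumption; it may differ from what the paper does, and the function name, the number of permutations, and the synthetic data are all made up for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

def importance_p_values(X, y, n_permutations=30, n_estimators=50, seed=0):
    """Empirical p-value per feature: how often does a forest trained on
    shuffled labels give that feature at least its observed importance?"""
    rng = np.random.default_rng(seed)
    forest = RandomForestClassifier(n_estimators=n_estimators, random_state=seed)
    observed = forest.fit(X, y).feature_importances_

    exceed = np.zeros(X.shape[1])
    for _ in range(n_permutations):
        y_shuffled = rng.permutation(y)  # breaks any link between X and y
        null_forest = RandomForestClassifier(n_estimators=n_estimators,
                                             random_state=seed)
        null_importances = null_forest.fit(X, y_shuffled).feature_importances_
        exceed += null_importances >= observed

    # Add-one correction so that no p-value is exactly zero.
    return (exceed + 1) / (n_permutations + 1)

# Synthetic data: 10 features, of which 3 are informative.
X, y = make_classification(n_samples=300, n_features=10, n_informative=3,
                           n_redundant=0, random_state=0)

p_values = importance_p_values(X, y)
for j, p in enumerate(p_values):
    print(f"feature {j}: p = {p:.3f}")
```

With only 30 permutations the smallest attainable p-value is 1/31, so a real analysis would use many more shuffles and would need to account for testing many variables at once.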
