Solved – Unstable variable importance ranking

random forest

I am new to R and to random forest regression. Right now I am working with a dataset of 60 input variables (dummy variables and continuous variables) and am trying to find the most important variables, i.e. the ones that best describe my dependent variable. To do this I am using the permutation-based OOB MSE importance measure.

My problem is that each time I run the random forest, the ranking of the variables changes, even when I just repeat the command with the same ntree and mtry. At first I thought it had something to do with the random subset of variables tried at each split, so I increased ntree. However, this didn't help either… Does anybody have an idea what the problem might be and how I can get stable results for the variable importance measure?

Thanks!

Best Answer

There are a few things that can affect importance.

If the variables in your data set are correlated, there can be a lot of instability in the variable importance, as the model can use the variables somewhat interchangeably. Ideally it will spread the importance over all of the correlated variables, but in practice it may require a lot of trees for this to happen. Reducing mtry or using extra/totally randomized trees is one way to combat this, though it may hurt your prediction accuracy or at least require retuning. The most accurate model may not be the best one for identifying feature importance.
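To make the "interchangeable" point concrete, here is a minimal sketch in plain Python (standard library only; it is not a random forest and does not use R's randomForest). It assumes two made-up features, `x1` and `x2`, that are noisy copies of the same latent signal, and counts which one gives the better first split on each bootstrap sample, the way the root of a bagged tree would choose. The helper names, sample size and noise levels are all illustrative assumptions.

```python
import random

def best_split_gain(x, y):
    """Largest reduction in squared error from a single threshold split on x."""
    pairs = sorted(zip(x, y))
    n = len(pairs)
    total = sum(v for _, v in pairs)
    total_sq = sum(v * v for _, v in pairs)
    sse_parent = total_sq - total * total / n
    best = 0.0
    left_sum = left_sq = 0.0
    for i in range(n - 1):
        xi, yi = pairs[i]
        left_sum += yi
        left_sq += yi * yi
        if xi == pairs[i + 1][0]:
            continue  # no threshold fits between tied x values
        nl, nr = i + 1, n - i - 1
        right_sum = total - left_sum
        right_sq = total_sq - left_sq
        sse_children = (left_sq - left_sum ** 2 / nl) + (right_sq - right_sum ** 2 / nr)
        best = max(best, sse_parent - sse_children)
    return best

def count_root_winners(n=200, n_trees=200, seed=0):
    """Count how often x1 vs x2 offers the better root split over bootstrap samples."""
    rng = random.Random(seed)
    z = [rng.gauss(0, 1) for _ in range(n)]        # shared latent signal (assumed)
    x1 = [v + rng.gauss(0, 0.3) for v in z]        # two correlated noisy copies
    x2 = [v + rng.gauss(0, 0.3) for v in z]
    y = [v + rng.gauss(0, 0.5) for v in z]
    wins = {"x1": 0, "x2": 0}
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap sample, as in bagging
        yb = [y[i] for i in idx]
        g1 = best_split_gain([x1[i] for i in idx], yb)
        g2 = best_split_gain([x2[i] for i in idx], yb)
        wins["x1" if g1 >= g2 else "x2"] += 1
    return wins

print(count_root_winners())
```

When the two features carry nearly the same information, I would expect the wins to be shared between them rather than go to one feature every time, which is why their individual importance scores can swap places from run to run.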

Masking scores and other methods for explicitly dealing with correlated features have also been proposed.

You could also try doing dimensionality reduction before building your models, but depending on how you do it, this may destroy some of the nonlinear structure in the data.

There are also biases in CART-style feature selection towards "high cardinality" or sparse features. Because such features offer many candidate split points, they can, purely by random chance, produce decreases in impurity that don't generalize well.
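The cardinality bias can be seen without fitting a forest at all. The sketch below (plain Python, standard library only) assumes a target that is pure noise and compares the best in-sample split gain of an unrelated binary feature with that of an unrelated feature with 50 levels, treated as ordered numbers. The function names, level counts and sample sizes are illustrative choices, not anything from a particular package.

```python
import random

def best_split_gain(x, y):
    """Largest reduction in squared error from a single threshold split on x."""
    pairs = sorted(zip(x, y))
    n = len(pairs)
    total = sum(v for _, v in pairs)
    total_sq = sum(v * v for _, v in pairs)
    sse_parent = total_sq - total * total / n
    best = 0.0
    left_sum = left_sq = 0.0
    for i in range(n - 1):
        xi, yi = pairs[i]
        left_sum += yi
        left_sq += yi * yi
        if xi == pairs[i + 1][0]:
            continue  # no threshold fits between tied x values
        nl, nr = i + 1, n - i - 1
        right_sum = total - left_sum
        right_sq = total_sq - left_sq
        sse_children = (left_sq - left_sum ** 2 / nl) + (right_sq - right_sum ** 2 / nr)
        best = max(best, sse_parent - sse_children)
    return best

def mean_chance_gain(n_levels, n=100, trials=200, seed=0):
    """Average best-split gain of a noise feature with n_levels distinct values."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        y = [rng.gauss(0, 1) for _ in range(n)]          # target is pure noise
        x = [rng.randrange(n_levels) for _ in range(n)]  # feature unrelated to y
        total += best_split_gain(x, y)
    return total / trials

binary = mean_chance_gain(2)   # only one possible threshold
many = mean_chance_gain(50)    # up to 49 possible thresholds
print(f"binary noise feature: {binary:.2f}, 50-level noise feature: {many:.2f}")
```

Neither feature has any relationship to the target, yet the one with more candidate thresholds gets to pick the luckiest of many splits, so its apparent impurity decrease is larger on average.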

Further, I'd only expect the importance ranking to be consistent for the top few features, which get used across most of the bagged trees. The less useful features get used less often, so their scores have a lot more variability.
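One way to check this on your own results is to refit the forest several times with different seeds and compare the ranks instead of trusting a single run. The sketch below (plain Python) assumes you have already exported one set of importance scores per run; the numbers in `toy_runs` are made-up placeholders for illustration, not results from any real model, and `rank_summary` is a hypothetical helper name.

```python
from statistics import mean

def rank_summary(runs):
    """runs: list of {feature: importance}, one dict per fitted forest.

    Returns {feature: (mean rank, best rank, worst rank)}, where rank 1 is
    the most important feature in a given run.
    """
    ranks = {feature: [] for feature in runs[0]}
    for scores in runs:
        ordered = sorted(scores, key=scores.get, reverse=True)
        for position, feature in enumerate(ordered, start=1):
            ranks[feature].append(position)
    return {f: (mean(r), min(r), max(r)) for f, r in ranks.items()}

# Placeholder importance scores from three hypothetical runs.
toy_runs = [
    {"a": 30.1, "b": 12.4, "c": 2.1, "d": 1.9},
    {"a": 29.5, "b": 13.0, "c": 1.7, "d": 2.2},
    {"a": 31.0, "b": 11.8, "c": 2.0, "d": 2.4},
]

summary = rank_summary(toy_runs)
for feature, (avg, lo, hi) in sorted(summary.items(), key=lambda kv: kv[1][0]):
    print(f"{feature}: mean rank {avg:.2f}, range {lo}-{hi}")
```

Features whose rank range is narrow across runs are the ones worth interpreting. Features whose ranks overlap are better reported as a group than in a strict order.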

See also: https://eranraviv.com/random-forest-importance-measures-are-not-important/