Random forests: feature importance changes with each run

machine learning, random forest

  1. I have 309 samples (tumors) and 3234 features (genes). I used the scikit-learn Python library to fit a random forest, setting only one parameter, n_estimators=100.
  2. I also used train_test_split to split my dataset 70-30.
  3. When I run the model several times (each time randomly splitting the data 70-30 and computing feature importances), I get different features ranked as important. Sometimes the runs overlap, but most of the time they do not: features that were NOT found in the original run often show up.
  4. Also, the scores of the top 10 ranked features all fall within 0.01-0.03.
  5. There are highly correlated features in this dataset (genes are often co-regulated, i.e. features are often interlinked in an organic network).

Is this commonly observed? If so, is there a way to arrive at a consensus set of important features by averaging the feature importance scores (feature_importances_ in scikit-learn) across 10-20 random runs?

If this is not common, any suggestions where I may be going wrong?

Best Answer

I'm currently doing something similar in R.

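A large change in feature importance between random forest runs comes from the influence of random number generation. First, the 70-30 split is not the same each time unless you fix its seed. Second, most random forest implementations use randomness during training as well (bootstrap sampling of the observations and a random subset of candidate features at each split), so even on identical training data two fits can rank features differently.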
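To see the two sources of randomness separately, here is a minimal sketch. It assumes a classification task and uses synthetic data from make_classification as a stand-in for the tumor/gene matrix; the helper name fit_importances and the seed values are made up for illustration. With both seeds fixed the importances should repeat exactly, while changing either seed should change them.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the tumor/gene matrix (assumption: not the real data).
X, y = make_classification(
    n_samples=150, n_features=60, n_informative=5, random_state=0
)


def fit_importances(split_seed, forest_seed):
    """Fit one forest on one 70-30 split and return its feature importances."""
    X_train, _, y_train, _ = train_test_split(
        X, y, test_size=0.3, random_state=split_seed
    )
    forest = RandomForestClassifier(n_estimators=100, random_state=forest_seed)
    forest.fit(X_train, y_train)
    return forest.feature_importances_


same_a = fit_importances(split_seed=0, forest_seed=0)
same_b = fit_importances(split_seed=0, forest_seed=0)
other_split = fit_importances(split_seed=1, forest_seed=0)   # only the split changes
other_forest = fit_importances(split_seed=0, forest_seed=1)  # only the forest changes

print("same seeds, identical:", np.allclose(same_a, same_b))
print("new split, identical:", np.allclose(same_a, other_split))
print("new forest seed, identical:", np.allclose(same_a, other_forest))
```

Fixing random_state makes a single run reproducible, but it does not make the ranking any more stable; it only hides the variation.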

With only 309 observations and 3234 features each, your model could easily overfit. Can you get more data?
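One rough way to check for overfitting is to compare accuracy on the training split with accuracy on the held-out 30%. The sketch below assumes a classification task and uses synthetic data with far more features than samples (the sizes are illustrative, not the real dataset); a large gap between the two scores would suggest the forest is memorizing the training set.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic "many features, few samples" data (assumption: stand-in only).
X, y = make_classification(
    n_samples=100, n_features=300, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

train_acc = forest.score(X_train, y_train)  # accuracy on data the forest has seen
test_acc = forest.score(X_test, y_test)     # accuracy on the held-out 30%
print(f"train accuracy: {train_acc:.2f}")
print(f"test accuracy:  {test_acc:.2f}")
print(f"gap:            {train_acc - test_acc:.2f}")
```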

I think your approach is fine and can work. But the result may simply be that all features are roughly equally important on the given training set, in which case you need a different approach.
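The averaging idea from the question could be sketched roughly as follows. This assumes a classification task and synthetic stand-in data; the function name averaged_importances and its parameters are hypothetical. It repeats the 70-30 split with a different seed each time, collects feature_importances_ from every fit, and reports the mean and standard deviation per feature.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split


def averaged_importances(X, y, n_runs=10, n_estimators=100):
    """Mean and std of feature_importances_ over repeated 70-30 splits."""
    runs = []
    for seed in range(n_runs):
        # A new seed per run gives a new split and a new forest.
        X_train, _, y_train, _ = train_test_split(
            X, y, test_size=0.3, random_state=seed, stratify=y
        )
        forest = RandomForestClassifier(
            n_estimators=n_estimators, random_state=seed
        )
        forest.fit(X_train, y_train)
        runs.append(forest.feature_importances_)
    runs = np.vstack(runs)  # shape: (n_runs, n_features)
    return runs.mean(axis=0), runs.std(axis=0)


# Synthetic stand-in data (assumption: replace with the real matrix and labels).
X, y = make_classification(
    n_samples=120, n_features=40, n_informative=5, random_state=0
)
mean_imp, std_imp = averaged_importances(X, y, n_runs=5, n_estimators=50)

# Indices of the 10 features with the highest mean importance.
top10 = np.argsort(mean_imp)[::-1][:10]
for idx in top10:
    print(f"feature {idx}: {mean_imp[idx]:.4f} +/- {std_imp[idx]:.4f}")
```

If the standard deviation of a feature is as large as its mean, its rank is probably not meaningful, which would be consistent with many features being about equally important.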
