Solved – Is feature importance in Random Forest useless?

Tags: feature selection, importance, machine learning, random forest, scikit-learn

For Random Forests or XGBoost, I understand how feature importance is calculated, for example using information gain or the decrease in impurity.

In particular, in sklearn (and also in other implementations) feature importance is normalized so that the importances across all features sum to 1.
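Below is a minimal sketch of that normalization in scikit-learn; the synthetic dataset and hyperparameters are arbitrary and only meant to show that the reported importances sum to 1.

```python
# Minimal sketch: impurity-based importances in scikit-learn are normalized to sum to 1.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=6, n_informative=3, random_state=0)
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

print(rf.feature_importances_)        # one normalized value per feature
print(rf.feature_importances_.sum())  # ~1.0 by construction
```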

But considering the following facts:

  1. Feature importance in Random Forest does not take into account co-dependence among features:
    For example, take the extreme case of two features that are both strongly related to the target (and therefore to each other): no matter what, they will each end up with a feature importance of about 0.5, whereas one would expect both to score something close to one.

  2. Feature importance is always relative to the feature set used and does not tell us anything about the statistical dependence between target and features.
    For example, take the extreme case of a target and a set of randomly generated features that are completely independent of the target: you would still be able to rank the features according to the feature importance metric, but the result is meaningless in this case, because you already know that all the features are independent of the target.

I did two examples where I knew the data-generating process of the features and the target, and explained why feature importance in Random Forest is completely useless; a minimal sketch along the same lines is shown below.
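This is an illustrative reconstruction of that kind of experiment, not the original code: the data are synthetic, the hyperparameters are arbitrary, and the correlated case is simulated with an assumed near-duplicate feature.

```python
# Illustrative sketch of the two facts above (synthetic data, arbitrary settings).
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 2000

# Fact 1: two near-duplicate features, both strongly related to the target,
# end up sharing the importance (~0.5 each) instead of both scoring near 1.
x = rng.normal(size=n)
y = (x > 0).astype(int)
X_dup = np.column_stack([x, x + rng.normal(scale=0.01, size=n)])
rf_dup = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_dup, y)
print(rf_dup.feature_importances_)    # roughly [0.5, 0.5]

# Fact 2: purely random features still get ranked, and the scores still sum to 1,
# even though there is no dependence on the target at all.
X_noise = rng.normal(size=(n, 5))
y_noise = rng.integers(0, 2, size=n)
rf_noise = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_noise, y_noise)
print(rf_noise.feature_importances_)  # a ranking exists, but it is meaningless here
```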

So my questions are:

  1. If you are in a situation, as in 99.9% of cases, where you don't know anything about the relationship between features and target, how can you use this method to infer feature importance?

  2. In general, instead of using just the decrease in impurity or the information gain in absolute terms, wouldn't it be better to use a relative measure such as the ratio between the decrease in impurity and the total impurity, so that the number would still be bounded between 0 and 1 (as it is now) but would also reflect some sort of strength of association? (In my opinion it doesn't make any sense that the importances sum to 1 in the first place.)

Thank you for taking the time to read my question.

Best Answer

Standard feature importances simply tell you which of your features were more useful when building the model. They are not to be interpreted as a direct measure of dependence between a predictor and the target.

As a consequence:

  • They are completely useless if your model is weak. If your model does not generalize to validation data - as in the case you mentioned of completely random predictors - then feature importances have no meaning: all the splits are simply overfitting the training data rather than capturing any real trend, so the Gini impurity decreases you sum up are meaningless.
  • They are strongly influenced by correlated features. As you mentioned, this is a fact. Just be aware of it and do some good old feature engineering beforehand to avoid having features that are too correlated.
  • They are biased towards numerical and high-cardinality features. This is definitely a problem, and there are alternative approaches that help alleviate it (a small sketch of this bias follows this list).
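As a small illustration of the cardinality bias: with a purely random target, a continuous feature (many candidate split points) typically receives a larger impurity-based importance than a binary one, even though neither carries any signal. The data and settings below are arbitrary.

```python
# Hedged sketch of the cardinality bias: the target is pure noise, yet the
# continuous column usually dominates the impurity-based importances.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 2000
x_binary = rng.integers(0, 2, size=n)   # low-cardinality noise
x_continuous = rng.normal(size=n)       # high-cardinality noise
y = rng.integers(0, 2, size=n)          # target independent of both

X = np.column_stack([x_binary, x_continuous])
rf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X, y)
print(rf.feature_importances_)          # the continuous feature tends to score higher
```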

Therefore you MUST not interpret them as "correlations" or "strength coefficients", as they do not represent a dependency with the target. Yet, this does not mean at all that they are useless!
Some alternative approaches to limit the drawbacks are:

  • Permutation Importances: these are computed on VALIDATION data, and therefore address the first, overfitting-related issue. If a feature's splits are overfitting the training data, its importance will be low on held-out data. Moreover, since they are computed on a metric of your choice, they are easier to interpret and can in some sense be seen as a "strength coefficient", since they answer the question: "How much does the performance of my model degrade if I shuffle this predictor?" Boruta - which was mentioned in the comments - uses an algorithm based on this idea. A minimal example is given after this list.
  • Unbiased Feature Importances: there are multiple works on this, and the one linked is one of the newer ones. They are not yet implemented in the major packages, but they allow for a better measurement of importance that does not suffer from the overfitting problems above.
  • Oblivious Trees: this approach to building trees, used for example in CatBoost, forces all the splits at the same level of a tree to be made on the same feature. This favours splits on features that generalize better, and often gives importances that suffer much less from overfitting the training data.
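As a minimal example of the first alternative, scikit-learn's permutation_importance can be run on a held-out validation set; the dataset, metric, and hyperparameters below are placeholders.

```python
# Minimal sketch: permutation importances computed on held-out data.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=8, n_informative=3, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

# Shuffle each column of the validation set and measure the drop in the chosen metric.
result = permutation_importance(rf, X_val, y_val, scoring="accuracy",
                                n_repeats=20, random_state=0)
print(result.importances_mean)  # drop in accuracy per feature; near 0 => no real signal
```

If a feature is pure noise, its permutation importance will hover around zero on the validation set, which is exactly the behaviour impurity-based importances lack.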

Finally - feature importances are very useful and help discern important features from unimportant ones when using very powerful algorithms such as GBMs and RFs - however, they need to be used with care and interpreted the right way. At the same time, there are alternatives and packages that solve some of the major flaws of classic feature importances, making them even easier to use and interpret.