Solved – Why can variable importance be negative/zero while its correlation with the response variable is high?

correlation, feature selection, importance, random forest

I don't have a working example for this, as I'm using a large dataset in R with the ranger package (Random Forest algorithm).

I fit a model using the ranger package with predictors $X_1,…,X_k$ and a response variable $Y$ with the purpose of looking at the variable importance of each predictor. After fitting the model, I calculated variable importance using the permutation method and importance().

One of the variables (say $X_1$) is highly correlated with the response variable $Y$ (~0.7), but according to the Random Forest model the variable importance of $X_1$ is negative! I would have assumed that a variable highly correlated with the response would be ranked as more important.

Is there a simple explanation for this?

Thanks so much!

Best Answer

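The feature importance reflects how much the fitted trees actually rely on each feature, and the splits themselves are chosen by some estimate of information gain (Gini, entropy, etc.). If the predictors are correlated with each other, it can happen that after splitting on, for example, $X_5$, there is no more information gain to be had from later also splitting on $X_1$. In this case the feature importance of $X_5$ will be high, and that of $X_1$ very low or zero. With permutation importance, a value that is essentially zero can also come out slightly negative, because shuffling a variable the model barely uses changes the out-of-bag error only by random noise.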

If you believe that $X_1$ might actually be a better or preferable predictor, then leave $X_5$ out and run the training again.
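
The masking effect and the suggested check can be reproduced on synthetic data. The following is only a minimal sketch under stated assumptions, not the asker's ranger setup: it is written in Python with the standard library, it uses an ordinary least squares fit as a stand-in for the forest, and it measures permutation importance on a held-out sample rather than on out-of-bag data. The data-generating process (`make_data`), the noise levels, and all helper names are made up for illustration. $Y$ depends only on $X_5$, while $X_1$ is a noisy copy of $X_5$.

```python
import random

random.seed(0)

def make_data(n):
    """Hypothetical rows (x1, x5, y): y depends on x5 only; x1 is a noisy copy of x5."""
    rows = []
    for _ in range(n):
        x5 = random.gauss(0, 1)
        x1 = x5 + 0.8 * random.gauss(0, 1)
        y = x5 + 0.6 * random.gauss(0, 1)
        rows.append((x1, x5, y))
    return rows

def pearson(a, b):
    """Pearson correlation of two equally long lists."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    va = sum((u - ma) ** 2 for u in a)
    vb = sum((v - mb) ** 2 for v in b)
    return cov / (va * vb) ** 0.5

def fit_ols(rows, use_x5=True):
    """Least squares without intercept (all variables are zero-mean by construction).
    Returns (b1, b5); b5 is fixed at 0 when x5 is left out."""
    s11 = sum(x1 * x1 for x1, _, _ in rows)
    s1y = sum(x1 * y for x1, _, y in rows)
    if not use_x5:
        return (s1y / s11, 0.0)
    s55 = sum(x5 * x5 for _, x5, _ in rows)
    s15 = sum(x1 * x5 for x1, x5, _ in rows)
    s5y = sum(x5 * y for _, x5, y in rows)
    det = s11 * s55 - s15 * s15
    b1 = (s55 * s1y - s15 * s5y) / det
    b5 = (s11 * s5y - s15 * s1y) / det
    return (b1, b5)

def mse(rows, coefs):
    b1, b5 = coefs
    return sum((y - b1 * x1 - b5 * x5) ** 2 for x1, x5, y in rows) / len(rows)

def permutation_importance(rows, coefs, col, repeats=20):
    """Mean increase in MSE after shuffling column `col` (0 = x1, 1 = x5)."""
    base = mse(rows, coefs)
    total = 0.0
    for _ in range(repeats):
        shuffled = [r[col] for r in rows]
        random.shuffle(shuffled)
        permuted = []
        for r, v in zip(rows, shuffled):
            r = list(r)
            r[col] = v
            permuted.append(tuple(r))
        total += mse(permuted, coefs) - base
    return total / repeats

train, test = make_data(2000), make_data(2000)
corr_x1_y = pearson([r[0] for r in train], [r[2] for r in train])

# Model with both predictors: x5 carries the signal, x1 is redundant.
full = fit_ols(train)
imp_x1 = permutation_importance(test, full, 0)
imp_x5 = permutation_importance(test, full, 1)

# Model refitted without x5: x1 now has to carry the signal.
reduced = fit_ols(train, use_x5=False)
imp_x1_alone = permutation_importance(test, reduced, 0)

print(f"corr(x1, y)              = {corr_x1_y:.3f}")
print(f"importance of x5 (both)  = {imp_x5:.4f}")
print(f"importance of x1 (both)  = {imp_x1:.4f}")
print(f"importance of x1 (alone) = {imp_x1_alone:.4f}")
```

In this setup $X_1$ is clearly correlated with $Y$, yet its importance in the two-predictor model stays near zero and its sign depends only on noise, while the same variable becomes clearly important once $X_5$ is removed. A random forest would not give the same numbers, but the redundancy argument is the same.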
