Solved – Summing feature importance in Scikit-learn for a set of features

classification, machine learning, random forest, scikit-learn

I have built a random forest using a set of ~100 features, and I want to compare the feature importance of two subsets of those features. In scikit-learn, the feature importances sum to 1 across all features, in contrast to R, which provides the unbounded MeanDecreaseGini; see the related thread "Relative importance of a set of predictors in a random forests classification in R". My question: is it possible to simply sum the feature importances of a set of features, or should one do something similar to the R solution and use some weighted average?

I have used Gini impurity as the splitting criterion, and how RF uses that measure to estimate feature importance is unclear to me.

Best Answer

TL;DR: yes, it is totally correct to sum importances over sets of features.

In scikit-learn, the importance of a node $j$ in a single decision tree is computed (source code) as: $$ ni_j = w_j C_j - w_{left(j)}C_{left(j)}- w_{right(j)}C_{right(j)} $$ where $w_j$ is the weighted number of samples reaching node $j$ as a fraction of the total weighted number of samples, $C_j$ is the impurity of node $j$, and $left(j)$ and $right(j)$ are its left and right child nodes.
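If you want to see these node importances concretely, here is a minimal sketch (not sklearn's actual Cython source) that recomputes $ni_j$ for a single fitted DecisionTreeClassifier from the public tree_ arrays; the variable names are my own:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(random_state=0).fit(X, y)
t = tree.tree_

total_weight = t.weighted_n_node_samples[0]   # all weighted samples reach the root
node_importance = np.zeros(t.node_count)
for j in range(t.node_count):
    left, right = t.children_left[j], t.children_right[j]
    if left == -1:                             # leaf: no split, contributes nothing
        continue
    w_j = t.weighted_n_node_samples[j] / total_weight
    w_l = t.weighted_n_node_samples[left] / total_weight
    w_r = t.weighted_n_node_samples[right] / total_weight
    node_importance[j] = (w_j * t.impurity[j]
                          - w_l * t.impurity[left]
                          - w_r * t.impurity[right])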

The importance of feature $i$ is then computed as: $$ fi_i = \frac{\sum_{j : \text{node j splits on feature i}} ni_j}{\sum_{j \in \text{all nodes}} ni_j} $$ In RandomForest or GradientBoosting, feature importances are then averaged over all the trees (source code).
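As a sanity check, the following sketch (again written against the public tree_ attributes, not the actual source) sums the node importances per splitting feature, normalizes, and compares the result to feature_importances_, both for a single tree and averaged over a small forest; under these assumptions both allclose checks should print True:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

# Single tree: sum node importances per splitting feature, then normalize.
tree = DecisionTreeClassifier(random_state=0).fit(X, y)
t = tree.tree_
fi = np.zeros(X.shape[1])
for j in range(t.node_count):
    left, right = t.children_left[j], t.children_right[j]
    if left == -1:                              # skip leaves
        continue
    ni_j = (t.weighted_n_node_samples[j] * t.impurity[j]
            - t.weighted_n_node_samples[left] * t.impurity[left]
            - t.weighted_n_node_samples[right] * t.impurity[right])
    fi[t.feature[j]] += ni_j                    # attribute to the feature the node splits on
fi /= fi.sum()                                  # normalize so importances sum to 1
print(np.allclose(fi, tree.feature_importances_))        # expected: True

# Forest: the forest-level importances are the per-tree importances averaged over trees.
rf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)
averaged = np.mean([est.feature_importances_ for est in rf.estimators_], axis=0)
print(np.allclose(averaged, rf.feature_importances_))    # expected: True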

In short, the (un-normalized) importance of a feature is the sum of the importances of the corresponding nodes. So if you take a set of features, it is totally consistent to represent the importance of that set as the sum of the importances of all the corresponding nodes, and the latter exactly equals the sum of the individual feature importances. The normalizing denominator is the same for all features, so it does not change relative importances.

A simple example:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
rf = RandomForestClassifier(random_state=42).fit(iris.data, iris.target)
print(rf.feature_importances_)

# Group the four iris features into two subsets and sum their importances.
sepal_features = [0, 1]  # sepal length, sepal width
petal_features = [2, 3]  # petal length, petal width
print(sum(rf.feature_importances_[sepal_features]))
print(sum(rf.feature_importances_[petal_features]))

It will give the following output:

[ 0.1292683   0.01582194  0.4447404   0.41016935]
0.145090242144
0.854909757856

From this, you can conclude that the petal features contributed about 85% of the predictive power of your random forest, and the sepal features only about 15%. If your features are not strongly correlated, these numbers are meaningful.