Solved – Random forest positive/negative feature importance

python, random forest, scikit-learn

I have built a random forest regression model in sklearn and can obtain a list of features along with their importances. However, is there a way to determine whether these features have a positive or negative impact on the predicted variable?

Best Answer

I wrote a function (a hack, really) that does something similar for classification; it could be adapted for regression. The essence: sort the features by importance, then consult the actual data to see how they differ between outcomes, with the caveat that decision trees are nonlinear classifiers, so it is difficult to make statements about isolated feature effects.

If you're truly interested in the positive and negative effects of predictors, you might consider boosting (e.g., GradientBoostingRegressor), which works well with stumps (max_depth=1). With stumps you have an additive model: each tree splits on a single feature, so per-feature effects don't interact and each feature's contribution can be read off directly.
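As a minimal sketch of that idea (assuming a reasonably recent scikit-learn, where sklearn.inspection.partial_dependence returns a Bunch with an 'average' key; the dataset and hyperparameters here are just for illustration), you can fit a stump ensemble and read the direction of each feature's partial-dependence curve:

from sklearn.datasets import load_diabetes
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import partial_dependence

data = load_diabetes()
X, y = data.data, data.target

# With stumps (max_depth=1) every tree splits on one feature, so the
# ensemble is additive and per-feature effects are well defined.
gbr = GradientBoostingRegressor(max_depth=1, n_estimators=500).fit(X, y)

for i, name in enumerate(data.feature_names):
    # Average prediction as feature i sweeps over its grid of values
    curve = partial_dependence(gbr, X, [i])['average'][0]
    direction = 'positive' if curve[-1] > curve[0] else 'negative'
    print(f'{name}: {direction} effect (span {curve.max() - curve.min():.1f})')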

For the random forest itself, though, you can still get a general idea (in the figure, the most important features are to the left):

[figure: example output of the function below, features ordered by importance from left to right]

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import sklearn.datasets
import pandas
import numpy as np
from matplotlib import pyplot as plt
%matplotlib inline

data = sklearn.datasets.load_breast_cancer()
X, y = data.data, data.target
X = pandas.DataFrame(X, columns=data.feature_names)
X_train, X_test, y_train, y_test = train_test_split(X, y)

forest = RandomForestClassifier().fit(X_train, y_train)

# Predicted probability of the positive class on the held-out set
forest_prob = forest.predict_proba(X_test)[:, 1]

importances = pandas.DataFrame(forest.feature_importances_,
                               index=data.feature_names,
                               columns=['importance'])

def forest_insight(X_val, y_val, thr, forest_prob, importances, nfeat):
    # Threshold the predicted probabilities to get hard class decisions
    dec = (forest_prob > thr).astype(int)

    # Scale features so the per-group means are visually comparable
    val_c = pandas.DataFrame(StandardScaler().fit_transform(X_val),
                             columns=X_val.columns)

    # Keep only the nfeat most important features, in descending order
    top = importances.sort_values('importance', ascending=False).index[:nfeat]
    val_c = val_c[top]
    val_c['t'] = np.asarray(y_val)
    val_c['p'] = dec
    val_c['err'] = np.nan

    # Label each row by its confusion-matrix cell
    val_c.loc[(val_c['t'] == 0) & (val_c['p'] == 1), 'err'] = 3  # FP
    val_c.loc[(val_c['t'] == 0) & (val_c['p'] == 0), 'err'] = 2  # TN
    val_c.loc[(val_c['t'] == 1) & (val_c['p'] == 1), 'err'] = 1  # TP
    val_c.loc[(val_c['t'] == 1) & (val_c['p'] == 0), 'err'] = 4  # FN

    # Counts per confusion-matrix cell
    n_fp = ((val_c['t'] == 0) & (val_c['p'] == 1)).sum()
    n_tn = ((val_c['t'] == 0) & (val_c['p'] == 0)).sum()
    n_tp = ((val_c['t'] == 1) & (val_c['p'] == 1)).sum()
    n_fn = ((val_c['t'] == 1) & (val_c['p'] == 0)).sum()
    print(f'TP={n_tp}, FP={n_fp}, TN={n_tn}, FN={n_fn}')

    # Mean (scaled) feature value within each cell
    tp = np.round(val_c[(val_c['t'] == 1) & (val_c['p'] == 1)].mean(), 2)
    fp = np.round(val_c[(val_c['t'] == 0) & (val_c['p'] == 1)].mean(), 2)
    tn = np.round(val_c[(val_c['t'] == 0) & (val_c['p'] == 0)].mean(), 2)
    fn = np.round(val_c[(val_c['t'] == 1) & (val_c['p'] == 0)].mean(), 2)

    c = pandas.concat([tp, fp, tn, fn], axis=1)
    c = c.drop(['t', 'p', 'err'])  # drop the helper rows
    c.columns = ['TP', 'FP', 'TN', 'FN']
    pandas.set_option('display.max_colwidth', 900)
    return c
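A quick usage sketch (the 0.5 threshold and nfeat=10 are arbitrary choices): calling the function on the held-out split returns a table of mean scaled feature values per confusion-matrix cell, and plotting it gives the figure above. A feature whose TP column sits well above its TN column is, loosely, "pushing" predictions toward the positive class.

c = forest_insight(X_test, y_test, 0.5, forest_prob, importances, 10)
c.plot(kind='bar', figsize=(12, 4))
plt.ylabel('mean scaled feature value')
plt.show()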