Solved – Random forest positive/negative feature importance

python, random forest, scikit-learn

I have built a random forest regression model in sklearn and can obtain a list of features along with their importances. However, is there a way to determine whether these features have a positive or negative impact on the predicted variable?

Best Answer

I wrote a function (a hack, really) that does something similar for classification; it could be adapted for regression. The essence: sort the features by importance, then consult the actual data to see how they differ between outcomes, with the caveat that decision trees are nonlinear classifiers, so it is difficult to make statements about isolated feature effects.

If you're truly interested in the positive and negative effects of predictors, you might consider boosting (e.g., GradientBoostingRegressor), which works well with stumps (max_depth=1). With stumps you have an additive model: each tree splits on a single feature, so per-feature effects don't interact and each feature's contribution can be read off directly.
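As a minimal sketch of that idea (assuming a reasonably recent scikit-learn, where sklearn.inspection.partial_dependence returns a Bunch with an 'average' key; the dataset and hyperparameters here are just for illustration), you can fit a stump ensemble and read the direction of each feature's partial-dependence curve:

from sklearn.datasets import load_diabetes
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import partial_dependence

data = load_diabetes()
X, y = data.data, data.target

# With stumps (max_depth=1) every tree splits on one feature, so the
# ensemble is additive and per-feature effects are well defined.
gbr = GradientBoostingRegressor(max_depth=1, n_estimators=500).fit(X, y)

for i, name in enumerate(data.feature_names):
    # Average prediction as feature i sweeps over its grid of values
    curve = partial_dependence(gbr, X, [i])['average'][0]
    direction = 'positive' if curve[-1] > curve[0] else 'negative'
    print(f'{name}: {direction} effect (span {curve.max() - curve.min():.1f})')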

For the random forest itself, though, you can still get a general idea (in the figure, the most important features are to the left):

[figure: example output of the function below, features ordered by importance from left to right]

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import sklearn.datasets
import pandas
import numpy as np
from matplotlib import pyplot as plt
%matplotlib inline

data = sklearn.datasets.load_breast_cancer()
X, y = data.data, data.target
X = pandas.DataFrame(X, columns=data.feature_names)
X_train, X_test, y_train, y_test = train_test_split(X, y)

forest = RandomForestClassifier().fit(X_train, y_train)

# Predicted probability of the positive class on the held-out set
forest_prob = forest.predict_proba(X_test)[:, 1]

importances = pandas.DataFrame(forest.feature_importances_,
                               index=data.feature_names,
                               columns=['importance'])

def forest_insight(X_val, y_val, thr, forest_prob, importances, nfeat):
    # Threshold the predicted probabilities to get hard class decisions
    dec = (forest_prob > thr).astype(int)

    # Scale features so the per-group means are visually comparable
    val_c = pandas.DataFrame(StandardScaler().fit_transform(X_val),
                             columns=X_val.columns)

    # Keep only the nfeat most important features, in descending order
    top = importances.sort_values('importance', ascending=False).index[:nfeat]
    val_c = val_c[top]
    val_c['t'] = np.asarray(y_val)
    val_c['p'] = dec
    val_c['err'] = np.nan

    # Label each row by its confusion-matrix cell
    val_c.loc[(val_c['t'] == 0) & (val_c['p'] == 1), 'err'] = 3  # FP
    val_c.loc[(val_c['t'] == 0) & (val_c['p'] == 0), 'err'] = 2  # TN
    val_c.loc[(val_c['t'] == 1) & (val_c['p'] == 1), 'err'] = 1  # TP
    val_c.loc[(val_c['t'] == 1) & (val_c['p'] == 0), 'err'] = 4  # FN

    # Counts per confusion-matrix cell
    n_fp = ((val_c['t'] == 0) & (val_c['p'] == 1)).sum()
    n_tn = ((val_c['t'] == 0) & (val_c['p'] == 0)).sum()
    n_tp = ((val_c['t'] == 1) & (val_c['p'] == 1)).sum()
    n_fn = ((val_c['t'] == 1) & (val_c['p'] == 0)).sum()
    print(f'TP={n_tp}, FP={n_fp}, TN={n_tn}, FN={n_fn}')

    # Mean (scaled) feature value within each cell
    tp = np.round(val_c[(val_c['t'] == 1) & (val_c['p'] == 1)].mean(), 2)
    fp = np.round(val_c[(val_c['t'] == 0) & (val_c['p'] == 1)].mean(), 2)
    tn = np.round(val_c[(val_c['t'] == 0) & (val_c['p'] == 0)].mean(), 2)
    fn = np.round(val_c[(val_c['t'] == 1) & (val_c['p'] == 0)].mean(), 2)

    c = pandas.concat([tp, fp, tn, fn], axis=1)
    c = c.drop(['t', 'p', 'err'])  # drop the helper rows
    c.columns = ['TP', 'FP', 'TN', 'FN']
    pandas.set_option('display.max_colwidth', 900)
    return c
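A quick usage sketch (the 0.5 threshold and nfeat=10 are arbitrary choices): calling the function on the held-out split returns a table of mean scaled feature values per confusion-matrix cell, and plotting it gives the figure above. A feature whose TP column sits well above its TN column is, loosely, "pushing" predictions toward the positive class.

c = forest_insight(X_test, y_test, 0.5, forest_prob, importances, 10)
c.plot(kind='bar', figsize=(12, 4))
plt.ylabel('mean scaled feature value')
plt.show()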