Solved – Interpreting conflicting results from Random Forest & Logistic Regression

feature selection · logistic · machine learning · random forest

I am using scikit-learn and statsmodels in Python to build a random forest and a logistic regression, respectively.

I have a feature that the RF indicates is important (feature importance of 0.202, closely behind the #1 and #2 most important features).

However, when I run the logistic regression, the coefficient associated with this same feature is 0.0009, nearly zero.

What is going on here? Why would splitting on this variable lead to higher information gain in the Random Forest, but contribute little to the logistic regression model?

Best Answer

The absolute value of a coefficient is not proportional to the importance of the corresponding feature — among other things, it depends on the feature's scale. There are two ways to assess the significance of a given feature in logistic regression (and more generally in Generalized Linear Models):

  • Look at the p-value of this parameter in the output of the logistic regression
  • or: fit two models — one with all the features except the feature of interest (the one whose contribution you want to assess), and a second with all the features, including the feature of interest. You can then perform a likelihood ratio test between the two models to see whether this feature contributes significantly to the prediction task.

The second approach is more reliable.
