Solved – Interpreting conflicting results from Random Forest & Logistic Regression

feature selection · logistic · machine learning · random forest

I am using scikit-learn and statsmodels in Python to build a random forest and a logistic regression, respectively.

I have a feature that the RF indicates is important (feature importance of 0.202, closely behind the #1 and #2 most important features).

However, when I run the logistic regression, the coefficient associated with this same feature is 0.0009, nearly zero.

What is going on here? Why would splitting on this variable lead to higher information gain in the Random Forest, but contribute little to the logistic regression model?

Best Answer

The absolute value of a coefficient is not proportional to the importance of the corresponding feature — among other things, it depends on the feature's scale. There are two ways to assess the significance of a given feature in logistic regression (and more generally in Generalized Linear Models):

  • Look at the p-value of this parameter in the output of the logistic regression
  • or: fit two models — one with all the features except the feature of interest (the one whose contribution you want to assess), and a second with all the features, including the feature of interest. You can then perform a likelihood ratio test between the two models to see whether this feature contributes significantly to the prediction task.

The second approach is more reliable.
