I have created a model to predict whether a candidate is present or not, using Logistic Regression and Random Forest. From Logistic Regression I got coefficients for 100 features and sorted them by value, and I am assuming that features with larger positive coefficients have a positive impact while features with negative coefficients have a negative impact.
But when I computed the feature importances from Random Forest, the ranking is not the same as the one from Logistic Regression, and the Random Forest importances are never negative. So I was wondering: based on the larger values I got from Random Forest, can I interpret the impact of these variables or features as positive or negative?
Please help.
Best Answer
The short answer is No.
The long answer follows, for which I fit a random forest to demonstrate variable importance (a.k.a. variable ranking):
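As a minimal sketch of that fit in R, assuming the `randomForest` package and the built-in `iris` data (the exact call is not shown in the original, so the seed and defaults here are assumptions):

```r
# Sketch: fit a random forest on iris and compute variable importance.
library(randomForest)

set.seed(42)                                   # assumed seed, for reproducibility
rf <- randomForest(Species ~ ., data = iris,
                   importance = TRUE)          # enables MeanDecreaseAccuracy

importance(rf)  # per-variable MeanDecreaseAccuracy and MeanDecreaseGini
```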
Let's look at the class label distribution for each of the 4 numeric variables:
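The figure itself is not reproduced here, but a pairs plot like the following sketch conveys the same information, with `Species` as the last (bottom) row:

```r
# Sketch: scatter-plot matrix of the four numeric variables plus Species.
# The bottom row shows Species against each variable, i.e. the class
# label distribution per variable.
pairs(iris, col = as.integer(iris$Species))
```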
Focus on the bottom row of the figure (`Species`): which of the 4 variables carry more class-discriminatory information?
Hopefully, you will answer the ones that correspond to subplots 3 and 4, i.e. `Petal.Length` and `Petal.Width`. So, this is what the variable importance is capturing:
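The table referred to below is the output of `importance(rf)` from the fit above; a sketch that sorts it so the most informative variables come first (the column names are the ones the `randomForest` package produces):

```r
# Sketch: importance table sorted by the permutation-based measure.
imp <- importance(rf)
imp[order(imp[, "MeanDecreaseAccuracy"], decreasing = TRUE), ]
```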
Take the `Petal.Length` variable for instance. The `MeanDecreaseAccuracy` column tells us that if we exclude `Petal.Length` from our classification exercise, the accuracy (max possible value 100) of our classification decreases by 37.700686. This is related to the concept of Mutual Information.

If you focus on the `MeanDecreaseGini` column, this is another indicator of variable importance, which gives the average node impurity over the forest, as measured by the Gini coefficient.

I hope it is clear how these two measures differ from the coefficient estimates in a logistic regression. They do not signify a positive or negative impact on the class label; they judge how much class-discriminatory information each variable contains.
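If you prefer a picture over the table, `varImpPlot` from the same package draws both measures side by side for the `rf` object fit above:

```r
# Sketch: dot charts of MeanDecreaseAccuracy and MeanDecreaseGini,
# one panel per measure, variables sorted by importance.
varImpPlot(rf)
```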
You can interpret that `Petal.Width` and `Petal.Length` are the most useful variables for the classification task. Knowing these two variables for an observation (plant) decreases uncertainty and helps us make more accurate predictions.

One thing to be careful about is that, while computing the importances, this technique looks at the variables individually. In some cases it may be that, for instance, `Sepal.Length` does not contain an awful lot of class-discriminatory information on its own, but when combined with `Sepal.Width` it does carry a lot of information. That is not the case here, but it is worth keeping in mind.

This last concept is discussed thoroughly in Sections 2.3 and 2.4 of the brilliant feature selection paper by Guyon and Elisseeff, "An Introduction to Variable and Feature Selection" (JMLR, 2003).