Solved – important features in RandomForest – Sklearn

classification, machine learning, random forest

1) How can I find the important features of a RandomForest classifier (in sklearn) with high statistical significance?

2) My input data is unbalanced, and I compensate by simply repeating data points. When I shuffle, some of these repeated points go to the training set and some go to the test set, which definitely increases the prediction accuracy. I know this is not really correct, but what if I only want to find the important features?

3) What does a feature's value (for example, 0.05) mean in a random forest? In Feature importance it is said that this value means 5% of the data is classified correctly by this feature; is that right? To find the most important features, can I sum their values until the total reaches 0.9 and then say that 90% of the data is classified correctly by these features?

Best Answer

The answer to this question describes how feature importances are computed in sklearn. Maybe it will help you with your questions #1 and #3.
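To make this concrete, here is a minimal sketch of how you can inspect the importances yourself. The synthetic data, sizes and seeds are my own assumptions for illustration, not anything from your problem. It shows that the forest-level `feature_importances_` is the average of the per-tree importances, rescaled to sum to 1, and it prints the spread across trees as a rough indication of how stable each value is (this spread is not a significance test).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data; sample count, feature count and seed are arbitrary choices.
X, y = make_classification(
    n_samples=500, n_features=8, n_informative=3, random_state=0
)

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Forest-level importances: one non-negative value per feature, summing to 1.
importances = forest.feature_importances_

# Per-tree importances, one row per tree in the ensemble.
per_tree = np.array([tree.feature_importances_ for tree in forest.estimators_])
mean_of_trees = per_tree.mean(axis=0)
spread = per_tree.std(axis=0)

# Print features from most to least important, with the spread across trees.
for i in np.argsort(importances)[::-1]:
    print(f"feature {i}: {importances[i]:.3f} (std across trees {spread[i]:.3f})")
```

Because the values are normalized to sum to 1, a value such as 0.05 is a share of the total impurity reduction attributed to that feature, relative to the other features.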

Regarding question #1: It does not seem that this definition of importance is explicitly related to statistical significance.

Regarding question #2: You could still report the feature importances reported by sklearn, but they would be defined relative to your augmented data set and therefore seem more difficult to interpret. Using this augmented data set would at least change the "proportion of samples reaching that node" portion of the importance score equation (when compared with the original data set). By the way, I've had good results using SMOTE over-sampling of the minority class, combined with under-sampling of the majority class, when dealing with imbalanced data sets.
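If you want to keep the simple "repeat the minority class" approach but avoid repeated rows ending up on both sides of the split, one option is to split first and duplicate only inside the training portion. The sketch below assumes synthetic data and arbitrary sizes and seeds, and uses plain random duplication via `sklearn.utils.resample` rather than SMOTE. It also fits one forest on the original training data and one on the duplicated data so you can see how the importances shift.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Imbalanced synthetic data (roughly 90% / 10%); all settings are illustrative.
X, y = make_classification(
    n_samples=400, n_features=6, n_informative=3,
    weights=[0.9, 0.1], random_state=0,
)

# Split first, so the test set contains only original, unduplicated rows.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Duplicate minority rows (class 1 here) within the training portion only.
minority = y_train == 1
n_majority = int((~minority).sum())
X_min_up, y_min_up = resample(
    X_train[minority], y_train[minority],
    replace=True, n_samples=n_majority, random_state=0,
)
X_bal = np.vstack([X_train[~minority], X_min_up])
y_bal = np.concatenate([y_train[~minority], y_min_up])

# Importances are defined relative to whichever data set the forest was fit on.
forest_orig = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
forest_bal = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_bal, y_bal)
print("original training data:  ", forest_orig.feature_importances_.round(3))
print("duplicated minority rows:", forest_bal.feature_importances_.round(3))
```

The test set stays untouched here, so any accuracy you measure on it is not inflated by copies of training rows.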

Regarding question #3: I'm not sure, but I don't think this definition of importance means that you could make the statement you make about 90% of the data being classified correctly by these features.