Solved – determining how “important” a feature is in predicting a target in decision trees

feature selection, importance, machine learning, random forest

Random forests allow us to compute a heuristic for determining how
"important" a feature is in predicting a target. This heuristic
measures the change in prediction accuracy if we take a given feature
and permute (scramble) it across the datapoints in the training set.
The more the accuracy drops when the feature is permuted, the more
"important" we can conclude the feature is. Importance can be a useful
way to select a small number of features for visualization.
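For concreteness, here is a minimal sketch of this idea in Python using scikit-learn's `permutation_importance` helper (the synthetic dataset and all parameter choices below are illustrative, not from the quoted text):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: only 2 of the 5 features are informative.
X, y = make_classification(n_samples=1000, n_features=5,
                           n_informative=2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=200, random_state=0)
rf.fit(X_train, y_train)

# For each feature: shuffle its column, re-score the model, and
# report the mean accuracy drop over n_repeats shuffles.
result = permutation_importance(rf, X_test, y_test,
                                n_repeats=10, random_state=0)
for i, drop in enumerate(result.importances_mean):
    print(f"feature {i}: accuracy drop = {drop:.3f}")
```

The informative features should show a clearly larger accuracy drop than the noise features.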

I have read about this method for selecting important features in random forests. Can someone explain what scrambling a feature means in a random forest?

Best Answer

Suppose you have 1000 men and 1000 women, measure everyone's height, and model height as predicted by sex. In your random forest context, you can assess predictive accuracy out-of-bag.

Now, if sex were irrelevant to height, you would get (roughly) the same predictive accuracy after randomly shuffling everyone's M/F labels.

That is exactly what RF variable importance does: shuffle each observation's value on one variable, reassess predictive accuracy, and compare it to the accuracy on the unshuffled data. If the shuffled data predict as well as the unshuffled data, the variable is evidently not very important for prediction. If the shuffled data predict worse, the variable is important.

In this toy example, we have only one predictor. In a more general setting, you would do this for every predictor separately: shuffle one variable while leaving all the others unchanged.
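To make the mechanics of that per-variable shuffle explicit, here is a hand-rolled sketch on simulated data mirroring the height/sex example (the sample size, effect sizes, and the extra "noise" column are invented for illustration):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Simulated data in the spirit of the answer: sex drives height,
# while a second predictor is pure noise.
n = 2000
sex = rng.integers(0, 2, size=n)                       # 0 = F, 1 = M
noise = rng.normal(size=n)                             # irrelevant predictor
height = 165 + 13 * sex + rng.normal(scale=6, size=n)  # cm
X = np.column_stack([sex, noise])

X_tr, X_te, y_tr, y_te = train_test_split(X, height, random_state=0)
rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_tr, y_tr)
baseline = rf.score(X_te, y_te)                        # R^2 on unshuffled data

# Shuffle one column at a time, leaving the others unchanged,
# and compare to the unshuffled score.
for j, name in enumerate(["sex", "noise"]):
    X_shuf = X_te.copy()
    X_shuf[:, j] = rng.permutation(X_shuf[:, j])
    drop = baseline - rf.score(X_shuf, y_te)
    print(f"{name}: R^2 drop = {drop:.3f}")
# Expected: a large drop for "sex", a drop near zero for "noise".
```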
