Solved – Getting Classification Direction for Predictors in RandomForest

classification, random forest

I'm using R to run RandomForest and generate a Variable Importance Plot from a classification. However, I am interested in understanding the direction in which the important variables help to explain the outcome. For this question, I'll use the Titanic results presented here: http://trevorstephens.com/kaggle-titanic-tutorial/r-part-5-random-forests/

Here, they were trying to determine predictors of survival or non-survival on the Titanic, and found that the top predictors producing the greatest Mean Decrease in Accuracy were title, fare, Pclass, etc. In the case of "Fare", we could likely guess that a higher fare predicts higher survival, but is there a good way of testing this statistically? I've run t-tests and chi-square tests to see whether these variables differ significantly between my classification groups (and if so, in what direction), and often they do not. Are there other methods for understanding the role of the top predictors from RF?

Best Answer

Diagnostics in a random forest typically try to measure the information content of each feature. Directionality (e.g. "if this feature goes up, then the probability of such-and-such a class goes up") does not fit well with the way decision trees and forests work. The notion that each feature has a direction of influence is likely a hold-over from linear models, where such a concept does make sense.

A decision tree can be viewed as chopping feature space up into a set of hyper-rectangles. Each hyper-rectangular region is assigned a value (e.g. class prevalence). That is, a decision tree creates a piece-wise constant function with hyper-rectangular pieces. There is no constraint that these regions have any kind of ordering on their assigned values. So, if you imagine moving along one axis, the value may jump down or up as you pass from one region to another, and then jump again (possibly in the other direction) as you move into the next region. A random forest can be thought of in the same way, except that everything is 'smoothed' by averaging a collection of piece-wise constant functions.
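One standard way to look at this averaged, not-necessarily-monotone relationship is a partial dependence plot, which the randomForest package provides via partialPlot(). A minimal sketch, using the built-in iris data in place of the Titanic data (an assumption purely for illustration):

```r
library(randomForest)

set.seed(42)
# Fit a classification forest; iris stands in for the questioner's data.
rf <- randomForest(Species ~ ., data = iris, importance = TRUE)

# Partial dependence of the (log-odds of the) "virginica" class on Petal.Width.
# The curve is an average over the forest's piece-wise constant trees, so it
# need not be monotone -- it can rise, fall, and rise again along the axis.
partialPlot(rf, pred.data = iris, x.var = "Petal.Width",
            which.class = "virginica")
```

If the partial dependence curve is close to monotone for a feature like Fare, that is reasonable evidence for a consistent direction of influence; if it wiggles, no single direction describes the relationship.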

If you really want to understand the relationship between a single feature and the predicted value in isolation, then I suggest just looking at the marginal distribution. This of course has the significant shortcoming of not considering multivariate interactions.
