Solved – Random forest – binary classification vs. regression

classification, r, random-forest, regression

I have a dataset that I'm trying to classify into 2 groups, A and B, using a random forest model. I know the true grouping and I'm trying to see how well I can model it using the other available variables. I've tried 2 different approaches that I thought would be equivalent, but which are actually giving me quite different results (both setups are sketched in code after this list):

  1. Reading in the grouping as a (non-numeric) factor in R, growing a classification forest, and taking the proportion of trees that vote for group A as my prediction.
  2. Constructing an indicator variable for membership of group A, growing a regression forest, and taking the ensemble prediction as usual.
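For concreteness, here's roughly what the two setups look like, assuming a hypothetical data frame df whose group column holds the true labels ("A"/"B") and whose remaining columns are the predictors:

```r
library(randomForest)

## `df` is a hypothetical data frame: `group` holds the true labels
## ("A"/"B") and the remaining columns are the predictors.

## Approach 1: classification forest on the (non-numeric) factor response.
df$group <- factor(df$group)
rf_class <- randomForest(group ~ ., data = df,
                         ntree = 240, nodesize = 200)
## Proportion of trees voting for A, per observation (OOB votes):
p_class <- predict(rf_class, type = "prob")[, "A"]

## Approach 2: regression forest on a 0/1 indicator of membership of A.
df$indA <- as.numeric(df$group == "A")
rf_reg  <- randomForest(indA ~ . - group, data = df,
                        ntree = 240, nodesize = 200)
## Ensemble prediction: average of the individual tree predictions (OOB).
p_reg <- predict(rf_reg)
```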

The split between the 2 groups is roughly 90-10 A vs. B. I'm growing 240 trees from ~200k observations, using the same predictor variables in both approaches. I've left most of the settings at the defaults for the R randomForest package, but to keep the processing time down to a manageable level I've increased the node size to 200. The results are as follows:

  1. In the vast majority of cases, all 240 trees vote for A. The average predicted chance of any one observation being in A is about 99.9%. Worse still, not a single member of group B gets a majority of votes for group B!
  2. I get a wide range of predictions, with the mean prediction lying close to the observed mean of ~90%.
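The contrast is easy to see by plotting the two sets of out-of-bag predictions side by side, reusing p_class and p_reg from the sketch above:

```r
## Compare the two prediction distributions (hypothetical objects
## `p_class` and `p_reg` from the sketch above).
par(mfrow = c(1, 2))
hist(p_class, breaks = 50, xlab = "Proportion of trees voting A",
     main = "Classification forest")
hist(p_reg, breaks = 50, xlab = "Mean predicted 0/1 indicator",
     main = "Regression forest")
summary(p_class)
summary(p_reg)
```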

How can two apparently similar methods give such different results?

As for how I ended up trying this – I was initially trying to classify my dataset into a larger number of groups, of which B was one, but I noticed that B was being classified almost 100% incorrectly. The other groups are all much better behaved, even though most of them make up a far smaller proportion of my data.

Best Answer

Due to the class imbalance, you should have a look at the probabilities that your forest outputs. I'm not familiar with the randomForest R package, but I believe there is an option (type = "prob") in the predict function that will give you a matrix of class probabilities.
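Something like the following sketch, assuming a fitted classification forest rf_class and a held-out data frame test_df (hypothetical names; if newdata is omitted, predict returns the out-of-bag estimates instead):

```r
## Assuming a fitted classification forest `rf_class` and a held-out
## data frame `test_df` (hypothetical names):
prob <- predict(rf_class, newdata = test_df, type = "prob")
head(prob)           # one column per class, rows sum to 1
p_B  <- prob[, "B"]  # estimated probability of the minority class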

I believe the next thing to do with these probabilities is to derive a ROC curve and see whether thresholding them performs better than the majority vote. If it does, it just means you should use a 'soft' voting approach, optimising the threshold that determines the predicted class based on the ROC curve (which is straightforward in the binary case), instead of a 'majority' voting one.
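As a sketch of that thresholding step, here is one common option using the pROC package (my choice of package, not something from the question), with truth the vector of true labels and p_B the predicted probabilities of class B from above:

```r
library(pROC)

## `truth` is the vector of true labels ("A"/"B"), `p_B` the predicted
## probability of class B (hypothetical objects from above).
roc_obj <- roc(response = truth, predictor = p_B, levels = c("A", "B"))
plot(roc_obj)
auc(roc_obj)

## Pick the threshold maximising Youden's J (sensitivity + specificity - 1)
## and use it in place of the implicit 0.5 majority-vote cut-off:
thr <- unlist(coords(roc_obj, "best", ret = "threshold"))
pred_class <- ifelse(p_B > thr, "B", "A")
table(pred_class, truth)
```

With a 90-10 split, the optimised threshold will typically sit well below 0.5 for the minority class, which is exactly why the hard majority vote never assigns anything to B.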