Solved – Random Forest classwt

machine-learning, random-forest, unbalanced-classes

I have a random forest algorithm that performs reasonably well.
I read here about the importance of classwt (the priors of the classes) and decided to try them out.
I have 18 columns, each with over 1000 data points and only 2 classes.
Class -1 is present about 75% of the time, while class 1 is the remaining 25%.

Now the questions:

  1. Why don't the priors need to add up to 1?
  2. I used the priors above (.75, .25) and the performance has been either the same or significantly worse. Why doesn't classwt improve the performance? Am I doing something wrong?

Best Answer

1 - why classwt does not need to add to one

The classwt vector is normalized once and for all at the C level, so only the relative weights matter.

    /* Normalize class weights. */
    normClassWt(cl, nsample, nclass, *ipi, classwt, classFreq);

line 169, v. 4.6-12 - https://github.com/cran/randomForest/blob/master/src/rf.c
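Because only the relative weights matter after that normalization, rescaling the classwt vector should not change the fitted forest. A minimal sketch with simulated data (the column count, row count and 75/25 split mirror the question; everything else is illustrative):

    ## Because classwt is normalized internally, rescaling it should not
    ## change the fitted forest.
    library(randomForest)

    set.seed(1)
    d <- data.frame(matrix(rnorm(1000 * 18), ncol = 18))
    d$y <- factor(ifelse(runif(1000) < 0.25, "1", "-1"))   # ~75% '-1', ~25% '1'

    set.seed(42)
    rf_a <- randomForest(y ~ ., data = d, classwt = c(0.75, 0.25))
    set.seed(42)
    rf_b <- randomForest(y ~ ., data = d, classwt = c(75, 25))

    ## With the same seed, the out-of-bag confusion matrices should agree.
    all.equal(rf_a$confusion, rf_b$confusion)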

2 - why does my performance decrease when I add priors?

Short story: you increased the prior in the opposite direction from what you intended. You also need a clearer idea of what your model performance metric is and what it should be. Check out this thread on practical implementation, with further links to an article comparing classwt against stratification, and the related thread(s) on Cross Validated.

First of all you need to define performance. Some performance measures can be gamed (classification accuracy, recall, etc.), so choose your performance metric carefully. Since you have 75% of class '-1', I can simply predict every new observation to be a member of this class and achieve a classification accuracy of 75%. Here I 'game' the metric by changing the cutoff threshold.
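As a quick illustration of that trivial majority-class predictor (toy vectors, not your data):

    ## With a 75/25 class split, the constant rule "always predict -1"
    ## already scores 75% classification accuracy without learning anything.
    y_true <- factor(rep(c("-1", "1"), times = c(750, 250)))
    y_hat  <- factor(rep("-1", length(y_true)), levels = levels(y_true))
    mean(y_hat == y_true)   # 0.75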

You can split prediction performance into resolution and calibration. Resolution is the ability to rank predictions from most likely to least likely. I think the AUC of the ROC curve is a good overall measure of this, although you may prefer sensitivity over specificity or the other way around, depending on your task at hand.

Calibration is the model's ability to predict, on average, the correct class probability given the true class probability. Models can separate classes well and still be over- or under-confident in their ability to do so. It may be problematic if a model predicts 99.99% when the true probability is only 95%. Achieving good calibration alone is easy: just predict the base rate for every new observation, 75%-25% in your case. Sometimes you don't care about exact probability estimation and just pick predictions from the top. But well-calibrated predictions are very useful if you want to calculate the impact of acting on a prediction. Should you, as a telephone company, call a customer identified as likely to leave and offer her a better deal to stay? If your model is over-confident, you may offer a cheaper mobile plan to too many customers who would not have left anyway. (Always be careful with a causal interpretation of observed relationships; in the phone example, estimates/assumptions of the effect of the intervention must also be established.)

The Brier score is a proper overall metric that assesses both resolution and calibration, and it can only be improved by making better models overall.
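A small sketch of the Brier score, contrasting a pure base-rate predictor (well calibrated, but no resolution) with a hypothetical sharper model (the probabilities are made up for illustration):

    ## Brier score for class '1' coded 0/1: mean squared error of the
    ## predicted probability. It punishes both poor resolution and poor
    ## calibration.
    brier <- function(p, y) mean((p - y)^2)

    y <- rep(c(1, 0), times = c(250, 750))        # 25% class '1'
    p_base  <- rep(0.25, length(y))               # constant base-rate predictor
    p_model <- ifelse(y == 1, 0.7, 0.1)           # hypothetical sharper model

    brier(p_base, y)    # 0.1875
    brier(p_model, y)   # 0.03, i.e. better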

You got the priors part wrong. I see the manual is a little confusing here, as it just states that classwt is the same as the priors. Your default prior is your base rate, i.e. the target class distribution of your training set. In the R implementation you can modify the prior either through classwt or through sampsize/strata. The two approaches seem to perform equally well with respect to resolution. For very unbalanced data sets (1:10 and beyond), re-balancing can improve resolution; your data set is only 1:4, so I don't expect much improvement.
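For reference, a sketch of both ways to shift the prior toward the rare class in the R randomForest package, re-using the simulated data frame d with factor response y from the sketch above (object names are illustrative):

    library(randomForest)

    ## 1) classwt: weights follow levels(d$y) = c("-1", "1"),
    ##    so this up-weights the rare class '1'
    rf_wt <- randomForest(y ~ ., data = d, classwt = c(0.25, 0.75))

    ## 2) sampsize + strata: draw an equal number of '-1' and '1' rows per tree
    n_rare <- min(table(d$y))
    rf_strat <- randomForest(y ~ ., data = d,
                             strata = d$y,
                             sampsize = c(n_rare, n_rare))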

If you disagree that the default prior should represent the base rate of your training set, you can adjust it. I prefer sampsize/strata because it is easier and more transparent to adjust the prior that way. In your case you started at 75%-25%; what you did was to increase the class weight of the dominant class relative to the rare class. Thus you effectively adjusted the prior for class '-1' to roughly $.75^2 /(.75^2+.25^2) = .9$ (90%), and vice versa 10% for class '1'. This is only an approximation, but the point is that you increased your expected occurrence of class '-1' even beyond 75%. When you then test your performance with out-of-bag cross-validated classification accuracy, this adjusted prior will lower the performance score. Lastly, you are free to apply any correction to the predictions afterwards, by any function you like. E.g. I like to play with Elkan's correction, where I first stratify to increase the resolution of the random forest model and then correct the raw predicted probabilities with Elkan's method to re-calibrate them back to the base rate.
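For completeness, a sketch of Elkan's (2001) base-rate correction as I understand it (the function name and numbers are illustrative, not part of the randomForest package):

    ## Elkan's correction: p_hat was estimated by a model trained (e.g. via
    ## stratification) at base rate b_train; rescale it to the true base rate
    ## b_true of the population you predict on.
    elkan_correct <- function(p_hat, b_train, b_true) {
      b_true * (p_hat - p_hat * b_train) /
        (b_train - p_hat * b_train + b_true * p_hat - b_true * b_train)
    }

    ## e.g. a 50/50-stratified forest predicting P(class '1') = 0.6,
    ## corrected back to the 25% base rate of class '1':
    elkan_correct(0.6, b_train = 0.5, b_true = 0.25)   # ~ 0.33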