Solved – CART (rpart) balanced vs. unbalanced dataset

cart, categorical data, classification, rpart, unbalanced-classes

I am fitting a classification tree (CART) to the olives dataset. The training data has 436 observations (test data: 136). The response, the 'Region' variable, has 3 classes, which split the training data into 116 / 74 / 246 observations.

If I plot the variables eicosenoic and linoleic against each other, the classes are almost perfectly separated.

I also fitted the tree to a balanced dataset with 74 observations per class (by the way, is that correct, or should I use fewer than 74 observations per class?) and got almost the same predictions on the test data as with the unbalanced dataset.

That is why I am wondering whether a balanced dataset is required in this case.
I assume that balancing is not required, but I am not sure and would like to hear other opinions.

Best Answer

If the classes are well separated in the feature space, it makes little difference to the predictions on the test data whether the training data set is balanced or unbalanced, as long as you have enough data to identify each class reasonably well.

If the class distributions of the features overlap considerably, it's a different story. The right thing to do then depends on your loss function and on the class distribution in the future samples that you want to predict.

If the class distribution in future samples is approximately 0.27 / 0.17 / 0.56, as in the training data (116 / 74 / 246 out of 436), and you use the 0-1 loss function to count the number of misclassifications, you will in general get fewer misclassifications if you keep the training data unbalanced.
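To make the 0-1 loss argument concrete, here is a minimal sketch in plain Python. It is not rpart and not a tree; it only compares two decision rules on a single binned feature. The class counts come from the question, but the overlapping class-conditional distributions in `likelihood` are made-up numbers chosen purely for illustration.

```python
# Training counts of the three 'Region' classes, taken from the question.
counts = [116, 74, 246]
total = sum(counts)
priors = [n / total for n in counts]  # class shares in the training data

# HYPOTHETICAL class-conditional distributions P(bin | class) of one feature
# cut into three bins. The rows overlap on purpose; they are not olive data.
likelihood = [
    [0.6, 0.3, 0.1],  # class 0
    [0.3, 0.4, 0.3],  # class 1
    [0.1, 0.3, 0.6],  # class 2
]


def predict(bin_index, assumed_priors):
    """Return the class that maximizes assumed prior * likelihood in this bin."""
    scores = [p * lik[bin_index] for p, lik in zip(assumed_priors, likelihood)]
    return scores.index(max(scores))


def error_rate(assumed_priors):
    """Expected 0-1 loss when future samples follow the training class shares."""
    correct = 0.0
    for b in range(3):
        c = predict(b, assumed_priors)
        # Probability that a future sample lands in bin b AND has class c.
        correct += priors[c] * likelihood[c][b]
    return 1.0 - correct


# Rule that uses the unbalanced class shares versus a rule that acts as if
# the classes were equally frequent (the effect of balancing by subsampling).
err_unbalanced = error_rate(priors)
err_balanced = error_rate([1 / 3, 1 / 3, 1 / 3])

print(f"expected error, unbalanced priors: {err_unbalanced:.3f}")
print(f"expected error, balanced priors:   {err_balanced:.3f}")
```

With these illustrative numbers, the rule that keeps the unbalanced class shares has the lower expected misclassification rate. The two rules differ only in the middle bin, where the classes overlap most and the large class wins once its frequency is taken into account.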

As a general comment, I would always avoid actually throwing away data unless the training data set is huge. If you expect future samples to have a class distribution that differs from that of the training data, I would try to incorporate that into the model instead. In a classification tree, that can be done by weighting the observations. If you use (naive) Bayes, you can simply change the prior class probabilities.
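Both adjustments come down to simple arithmetic, sketched below in plain Python under stated assumptions. The class labels, the equal-shares target distribution and the example posterior are all hypothetical. In rpart itself the same idea would presumably be expressed through the `weights` argument or `parms = list(prior = ...)`; the code here only illustrates the calculation and is not the rpart API.

```python
# Training counts from the question; the class labels are hypothetical.
counts = {"class_a": 116, "class_b": 74, "class_c": 246}
total = sum(counts.values())
train_prior = {c: n / total for c, n in counts.items()}

# ASSUMPTION: future samples are expected to have equal class shares.
target_prior = {c: 1 / len(counts) for c in counts}

# Weighting: give every observation of class c the weight
# target share / training share, so no observation is discarded.
obs_weight = {c: target_prior[c] / train_prior[c] for c in counts}

# After weighting, each class carries total * target share of the weight.
weighted_total = {c: counts[c] * obs_weight[c] for c in counts}


def shift_prior(posterior, old_prior, new_prior):
    """Re-express class probabilities under a new prior (Bayes' rule)."""
    unnormalized = {
        c: posterior[c] * new_prior[c] / old_prior[c] for c in posterior
    }
    norm = sum(unnormalized.values())
    return {c: v / norm for c, v in unnormalized.items()}


# HYPOTHETICAL posterior for one test observation, as it might come from a
# model fitted on the unbalanced training data.
posterior = {"class_a": 0.30, "class_b": 0.25, "class_c": 0.45}
adjusted = shift_prior(posterior, train_prior, target_prior)

print({c: round(w, 3) for c, w in obs_weight.items()})
print({c: round(p, 3) for c, p in adjusted.items()})
```

In this example the most probable class changes from the largest class to the smallest one once the prior is shifted, which is the effect that balancing by subsampling tries to achieve, but here without throwing away any training data.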