Solved – How to balance classes in binary classification

classification, unbalanced-classes

I have a binary classification problem where my training data is 70% positive-labeled and 30% negative-labeled. I use a logistic loss, and the model classifies every example on the test data as positive.

How can I make it classify some examples as negative as well? One solution I thought of is to remove some positive examples from the training data so that the labels are 50/50 positive and negative. Another is to use a non-linear classifier, since the problem might be non-linear.

Best Answer

If it is only 70%-30%, there is probably no need to balance the dataset. The class imbalance problem is caused by not having enough patterns from the minority class, rather than by a high ratio of positive to negative patterns; generally, if you have enough data, the "class imbalance problem" doesn't arise. Also note that if you artificially balance the dataset, you are implying an equal prior probability of positive and negative patterns. If that isn't true, your model may give bad predictions by over-predicting the minority class.
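To illustrate that last point, here is a minimal sketch with simulated Gaussian data and scikit-learn; the 70/30 mixture, the class means, and the downsampling scheme are assumptions for illustration, not from the question:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Simulated data: 70% positive, 30% negative, class means 1 and 0.
n = 10_000
y = (rng.random(n) < 0.7).astype(int)
X = rng.normal(loc=y, scale=1.0).reshape(-1, 1)

# Model fitted on the full data versus one fitted on a dataset
# artificially balanced by discarding positive examples.
clf_full = LogisticRegression().fit(X, y)
pos, neg = np.flatnonzero(y == 1), np.flatnonzero(y == 0)
keep = np.concatenate([rng.choice(pos, size=neg.size, replace=False), neg])
clf_bal = LogisticRegression().fit(X[keep], y[keep])

# The balanced model over-predicts the minority (negative) class
# relative to its true 30% prevalence.
print("true fraction negative        :", (y == 0).mean())
print("predicted negative, full model:", (clf_full.predict(X) == 0).mean())
print("predicted negative, balanced  :", (clf_bal.predict(X) == 0).mean())
```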

More importantly, there may be overlap between the classes such that the Bayes-optimal decision is always to assign patterns to the positive class, in which case your model is doing exactly the right thing. Consider the case of a single explanatory variable that follows a standard normal distribution under both classes. Since the positive class then has the higher prior probability, the optimal model assigns every pattern to the positive class. Similar examples can be constructed where the class means are not the same, but the difference is small compared with the within-class variation.
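To make the Bayes step in that example explicit (using the priors from the question, 0.7 and 0.3, and writing $\varphi$ for the common density of $x$ under both classes): because the class-conditional densities are identical, they cancel in Bayes' rule,

$$P(y = {+} \mid x) = \frac{0.7\,\varphi(x)}{0.7\,\varphi(x) + 0.3\,\varphi(x)} = 0.7 > 0.5 \quad \text{for every } x,$$

so the error-minimising decision is "positive" at every $x$, and no classifier can beat 70% accuracy on this problem.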

If classifying everything as the majority class is a problem, that suggests the misclassification costs of false positives and false negatives are not the same. This can be built in by changing the decision threshold on the predicted probability, rather than the model, since you are using a logistic loss.
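For example (a sketch with scikit-learn on simulated data; the 4:1 cost ratio is an assumed value for illustration): if a false positive costs $C_{FP}$ and a false negative costs $C_{FN}$, predicting positive is the cheaper decision when $C_{FP}(1 - p) < C_{FN}\,p$, i.e. when $p > C_{FP}/(C_{FP} + C_{FN})$, and that quantity replaces the default 0.5 threshold:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Heavily overlapping classes with a 70/30 prior, so the default 0.5
# threshold labels almost everything positive (as in the question).
n = 10_000
y = (rng.random(n) < 0.7).astype(int)
X = rng.normal(loc=0.5 * y, scale=1.0).reshape(-1, 1)

clf = LogisticRegression().fit(X, y)
p = clf.predict_proba(X)[:, 1]  # estimated P(y = positive | x)

# Assumed costs: a false positive is 4x as costly as a false negative,
# so predict positive only when p > C_FP / (C_FP + C_FN) = 0.8.
C_FP, C_FN = 4.0, 1.0
threshold = C_FP / (C_FP + C_FN)

print(f"negatives predicted at 0.5: {(p <= 0.5).sum()}")
print(f"negatives predicted at {threshold}: {(p <= threshold).sum()}")
```

Note that only the decision rule changes: the fitted model and its predicted probabilities stay exactly as they were.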
