Solved – Does class balancing introduce bias

cross-validation, gaussian-process, k-nearest-neighbour, machine-learning, unbalanced-classes

I have an imbalanced data set, and without any class balancing the prediction rate is not much better than the baseline. I have two classes and I can't collect more data.

What I have done:

  • Random undersampling performs poorly because it greatly shrinks the data set.
  • Oversampling with SMOTE performs slightly better than the baseline (around 60-70%).
  • Oversampling with SMOTE and then undersampling with ENN improves accuracy to 95% for some classifiers (KNN and Gaussian process perform best; see the sketch below).
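
For reference, a rough sketch of the three resampling variants above (this is illustrative rather than my exact code; it assumes scikit-learn and the imbalanced-learn package, with a toy dataset standing in for my real data):

```python
# Toy demonstration of the resampling calls only; the split-vs-resample
# ordering question is addressed below.
from sklearn.datasets import make_classification
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import SMOTE
from imblearn.combine import SMOTEENN

# Imbalanced toy data in place of the real set (assumption: two classes, 90/10 split)
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

X_rus, y_rus = RandomUnderSampler(random_state=0).fit_resample(X, y)  # shrinks the data set
X_sm, y_sm = SMOTE(random_state=0).fit_resample(X, y)                 # synthetic oversampling
X_se, y_se = SMOTEENN(random_state=0).fit_resample(X, y)              # SMOTE followed by ENN cleaning

for name, labels in [("undersampled", y_rus), ("SMOTE", y_sm), ("SMOTE+ENN", y_se)]:
    print(name, len(labels))
```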

For these reasons I am skeptical of my methodology and suspect that bias has been introduced.

My main questions are the following:

  • I have seen people doing class balancing before splitting into training and test data. In the case of oversampling, doing it before splitting would bias the accuracy, because a data point could end up being compared against its artificially generated counterpart. Does oversampling before splitting introduce bias?
  • Is it more scientifically correct to oversample the training and test sets individually, after splitting?
  • Do we have to balance the test set?
  • ENN and SMOTE come from widely cited papers, and both use KNN as the core method for undersampling and oversampling. If I then use a KNN classifier, I would expect some underlying bias. I understand they include a randomization element and the literature is strong, but could ENN and/or SMOTE introduce bias for specific classifiers?
  • How do we know whether a dataset is imbalanced? For example, a 5% class disparity affects 10,000 data points differently than a few hundred, and 2 classes differently than 20. Is there a way to quantitatively determine that a dataset is imbalanced?

Best Answer

For the main question:

Does class balancing introduce bias?

Yes, in most cases it does. Since the new data points are generated from the old ones, they can't introduce much variance to the dataset. In most cases they are only slightly different from the original ones.
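
To make that concrete, SMOTE builds each synthetic sample by interpolating between an existing minority-class point and one of its nearest minority-class neighbours, so the new point always lies on a segment between two originals. A minimal sketch of that single step (simplified, with made-up numbers):

```python
# Simplified illustration of one SMOTE interpolation step (made-up numbers).
import numpy as np

rng = np.random.default_rng(0)
x_i = np.array([1.0, 2.0])        # an original minority-class sample
x_nn = np.array([1.4, 2.3])       # one of its k nearest minority-class neighbours
lam = rng.uniform(0.0, 1.0)       # random interpolation factor in [0, 1]
x_new = x_i + lam * (x_nn - x_i)  # synthetic point lies between the two originals
print(x_new)
```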

Does oversampling before splitting introduce bias?

Yes, and this is why you should perform the splitting before balancing the training set. You want your test set to be as unbiased as possible in order to get an objective evaluation of the model's performance. If balancing were performed before splitting, the model might have seen information from the test set during training, through the generated data points.
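
A minimal sketch of that workflow, assuming scikit-learn and imbalanced-learn (the dataset, classifier, and parameters are illustrative): split first, resample only the training data, and keep the resampler inside the cross-validation folds.

```python
# Leakage-free sketch: split first, resample only the training side.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from imblearn.pipeline import Pipeline
from imblearn.combine import SMOTEENN

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)

# 1. Split before any balancing; the test set is never resampled.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.2, random_state=0)

# 2. Putting the resampler in an imblearn Pipeline keeps it inside each CV fold,
#    so synthetic points never leak into the validation fold either.
model = Pipeline([
    ("resample", SMOTEENN(random_state=0)),
    ("clf", KNeighborsClassifier()),
])
print(cross_val_score(model, X_train, y_train, cv=5, scoring="f1").mean())

# 3. Final fit on the training data, evaluation on the untouched test set.
model.fit(X_train, y_train)
print(model.score(X_test, y_test))
```

The same rule applies to any preprocessing that looks at the labels: fit it on the training data only, and leave the test set untouched.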

Is it more scientifically correct to oversample the training and test sets individually, after splitting?

You shouldn't oversample the test set. The test set should be as objective as possible. If you generate new test data and evaluate your model on it, the procedure loses its objectivity.

Do we have to balance the test set?

No, you shouldn't balance the test set under any circumstances.

Could ENN and/or SMOTE introduce bias for specific classifiers?

I don't think that k-NN or any other specific classifier would be more biased to the test set than the others. I'm not sure about this, though.
