Solved – Testing Classification on Oversampled Imbalanced Data

classification, dataset, oversampling, resampling, unbalanced-classes

I am working on severely imbalanced data. In the literature, several methods are used to re-balance the data using re-sampling (over- or under-sampling). Two good approaches are:

  • SMOTE: Synthetic Minority Over-sampling Technique

  • ADASYN: Adaptive Synthetic Sampling Approach for Imbalanced Learning

I have implemented ADASYN because of its adaptive nature and its ease of extension to multi-class problems.

My question is how to test the oversampled data produced by ADASYN (or any other oversampling method). It is not clear from the two papers mentioned how the experiments were performed. There are two scenarios:

1- Oversample the whole dataset, then split it into training and testing sets (or use cross-validation).

2- After splitting the original dataset, perform oversampling on the training set only and test on the original test set (this could also be done with cross-validation).

In the first case the results are much better than without oversampling, but I am concerned that there may be overfitting. In the second case the results are slightly better than without oversampling but much worse than in the first case. My concern with the second case is that if all minority class samples go to the test set, then no benefit will be achieved by oversampling.
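For concreteness, here is a minimal sketch of scenario 2 (split first, then oversample the training set only), assuming scikit-learn and the imbalanced-learn package are available; the toy dataset, classifier, and parameters are placeholders, not my actual setup:

```python
from imblearn.over_sampling import ADASYN
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Toy imbalanced data as a stand-in for the real dataset.
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)

# Split FIRST, so the test set never contains synthetic points.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Oversample the training set only.
X_res, y_res = ADASYN(random_state=0).fit_resample(X_train, y_train)

clf = RandomForestClassifier(random_state=0).fit(X_res, y_res)
print(classification_report(y_test, clf.predict(X_test)))
```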

I am not sure whether there are any other settings for testing such data.

Best Answer

A few comments:

Option (1) is a very bad idea. Copies of the same point may end up in both the training and test sets. This allows the classifier to cheat, because when making predictions on the test set it will already have seen identical points in the training set. The whole point of having separate training and test sets is that the test set should be independent of the training set.
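To make the leakage concrete, here is a small check, a sketch assuming scikit-learn and imbalanced-learn; it uses RandomOverSampler rather than ADASYN because random oversampling duplicates points exactly, so the overlap is easy to count (with ADASYN/SMOTE the leaked points are near-duplicates rather than exact copies):

```python
from imblearn.over_sampling import RandomOverSampler
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=0)

# Scenario 1: oversample the WHOLE dataset, then split.
X_res, y_res = RandomOverSampler(random_state=0).fit_resample(X, y)
X_train, X_test, _, _ = train_test_split(X_res, y_res, test_size=0.2, random_state=0)

# Count test points that also appear verbatim in the training set.
train_rows = {row.tobytes() for row in X_train}
leaked = sum(row.tobytes() in train_rows for row in X_test)
print(f"{leaked} of {len(X_test)} test points are exact copies of training points")
```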

Option (2) is honest. If you don't have enough data, you could try $k$-fold cross-validation. For example, you could divide your data into 10 folds. Then, for each fold individually, use that fold as the test set and the remaining 9 folds as the training set, oversampling only those training folds each time. You can then average the test-set accuracy over the 10 runs. The point of this method is that since only 1/10 of your data is in the test set at a time, it is unlikely that all your minority class samples end up in the test set.
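A sketch of this setup, assuming imbalanced-learn's Pipeline, which re-fits the oversampler on the training folds only and never applies it to the held-out fold; the stratified folds are my own addition, which further guarantees that some minority samples appear in every fold:

```python
from imblearn.over_sampling import ADASYN
from imblearn.pipeline import Pipeline
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)

# The sampler is a pipeline step, so it is re-fit on the 9 training folds
# of each CV iteration and never applied to the held-out fold.
pipe = Pipeline([
    ("oversample", ADASYN(random_state=0)),
    ("clf", RandomForestClassifier(random_state=0)),
])

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(pipe, X, y, cv=cv, scoring="balanced_accuracy")
print(f"mean balanced accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```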
