Solved – Should I first oversample or standardize (when cross-validating on imbalanced data)

classification, cross-validation, model-evaluation, oversampling, standardization

I have an imbalanced (two-class) classification dataset, based on which I am trying to train and cross-validate a classifier.

During k-fold cross-validation, I set aside the test subsets before I oversample the remaining (training) subsets. I also standardize (z-score) the training subsets and store the Mu and Sigma to later use them when standardizing the test subsets.

The problem is that I do not know whether I should do the oversampling first or the standardization first! Any recommendations/suggestions would be appreciated.
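For concreteness, here is a minimal sketch of the pipeline described above in Python/NumPy (the helper name `oversample_minority`, the plain random-duplication oversampler, and the toy data are my own illustration, not part of the original setup). It shows oversampling done before standardization, which is exactly the ordering in question:

```python
import numpy as np

# Illustrative helper: plain random duplication of the minority class.
def oversample_minority(X, y, rng):
    """Duplicate random minority-class rows until both classes are equally frequent."""
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    extra = rng.choice(np.flatnonzero(y == minority),
                       size=counts.max() - counts.min(), replace=True)
    return np.vstack([X, X[extra]]), np.concatenate([y, y[extra]])

# Toy imbalanced two-class data (~15% positives), just so the sketch runs.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = (rng.random(300) < 0.15).astype(int)

k = 5
for test_idx in np.array_split(rng.permutation(len(y)), k):
    train_mask = np.ones(len(y), dtype=bool)
    train_mask[test_idx] = False
    X_tr, y_tr = X[train_mask], y[train_mask]
    X_te, y_te = X[test_idx], y[test_idx]

    # Oversample the training fold only (the test fold is never touched).
    X_tr, y_tr = oversample_minority(X_tr, y_tr, rng)

    # Standardize with Mu/Sigma computed on the (oversampled) training fold ...
    mu, sigma = X_tr.mean(axis=0), X_tr.std(axis=0)
    X_tr = (X_tr - mu) / sigma
    # ... and reuse the same Mu/Sigma for the held-out test fold.
    X_te = (X_te - mu) / sigma

    # ... fit the classifier on (X_tr, y_tr) and evaluate on (X_te, y_te).
```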

Best Answer

The question is, why do you oversample?

  • If the relative class frequencies in your data differ from those you expect in the real application and the oversampling is meant to correct for this, then the oversampling should be done first (or, to put it differently, you compute a weighted mean and standard deviation and train a classifier for the corrected prior probabilities; see the sketch after this list).

  • If you oversample "only" because you have imbalanced classes and want to generate a balanced data set, then IMHO you need to think twice about whether this oversampling is a good idea at all: there is no use in a classifier that is optimized for balanced classes if in reality the classes are just as imbalanced as your data.

  • I take from your question that you are already aware that the split into training and test sets needs to be done first, so that the test cases stay independent of the training cases.
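To make the first point concrete: oversampling to the corrected class priors and then computing the z-scoring statistics amounts, up to the randomness of the duplication, to computing a class-weighted mean and standard deviation on the original training data. A minimal sketch, assuming Python/NumPy and an illustrative helper name:

```python
import numpy as np

def weighted_zscore_stats(X, y, target_priors):
    """Mean and standard deviation in which each class is weighted to match
    the desired prior probabilities, e.g. target_priors = {0: 0.7, 1: 0.3}.

    Standardizing with these statistics is, up to sampling noise, what you get
    by oversampling to those priors first and then taking a plain mean/std.
    """
    w = np.empty(len(y), dtype=float)
    for cls, prior in target_priors.items():
        mask = (y == cls)
        w[mask] = prior / mask.sum()   # each class contributes `prior` in total weight
    w /= w.sum()
    mu = np.average(X, axis=0, weights=w)
    sigma = np.sqrt(np.average((X - mu) ** 2, axis=0, weights=w))
    return mu, sigma
```

Whether you plug in the priors you expect in the real application or artificially balanced ones is exactly the decision the bullets above distinguish.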
