Solved – Should I first oversample or standardize (when cross-validating on imbalanced data)

classification, cross-validation, model-evaluation, oversampling, standardization

I have an imbalanced (two-class) classification dataset, based on which I am trying to train and cross-validate a classifier.

During k-fold cross-validation, I set aside the test subsets before I oversample the remaining (training) subsets. I also standardize (z-score) the training subsets and store the Mu and Sigma to later use them when standardizing the test subsets.

The problem is that I do not know whether I should do the oversampling first or the standardization first! Any recommendations/suggestions would be appreciated.
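For concreteness, here is a minimal sketch of the pipeline described above in Python/NumPy (the helper name `oversample_minority`, the plain random-duplication oversampler, and the toy data are my own illustration, not part of the original setup). It shows oversampling done before standardization, which is exactly the ordering in question:

```python
import numpy as np

# Illustrative helper: plain random duplication of the minority class.
def oversample_minority(X, y, rng):
    """Duplicate random minority-class rows until both classes are equally frequent."""
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    extra = rng.choice(np.flatnonzero(y == minority),
                       size=counts.max() - counts.min(), replace=True)
    return np.vstack([X, X[extra]]), np.concatenate([y, y[extra]])

# Toy imbalanced two-class data (~15% positives), just so the sketch runs.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = (rng.random(300) < 0.15).astype(int)

k = 5
for test_idx in np.array_split(rng.permutation(len(y)), k):
    train_mask = np.ones(len(y), dtype=bool)
    train_mask[test_idx] = False
    X_tr, y_tr = X[train_mask], y[train_mask]
    X_te, y_te = X[test_idx], y[test_idx]

    # Oversample the training fold only (the test fold is never touched).
    X_tr, y_tr = oversample_minority(X_tr, y_tr, rng)

    # Standardize with Mu/Sigma computed on the (oversampled) training fold ...
    mu, sigma = X_tr.mean(axis=0), X_tr.std(axis=0)
    X_tr = (X_tr - mu) / sigma
    # ... and reuse the same Mu/Sigma for the held-out test fold.
    X_te = (X_te - mu) / sigma

    # ... fit the classifier on (X_tr, y_tr) and evaluate on (X_te, y_te).
```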

Best Answer

The question is, why do you oversample?

  • If the relative class frequencies in your data differ from those you expect in the real application and the oversampling is meant to correct for this, then the oversampling should be done first (or, to put it differently, you compute a weighted mean and standard deviation and train a classifier for the corrected prior probabilities; see the sketch after this list).

  • If you oversample "only" because you have imbalanced classes and want to generate a balanced data set, then IMHO you need to think twice about whether this oversampling is a good idea at all: there is no use in a classifier that is optimized for balanced classes if in reality the classes are just as imbalanced as your data.

  • I take from your question that you are already aware that the split into training and test sets needs to be done first, so that the test cases stay independent of the training cases.
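To make the first point concrete: oversampling to the corrected class priors and then computing the z-scoring statistics amounts, up to the randomness of the duplication, to computing a class-weighted mean and standard deviation on the original training data. A minimal sketch, assuming Python/NumPy and an illustrative helper name:

```python
import numpy as np

def weighted_zscore_stats(X, y, target_priors):
    """Mean and standard deviation in which each class is weighted to match
    the desired prior probabilities, e.g. target_priors = {0: 0.7, 1: 0.3}.

    Standardizing with these statistics is, up to sampling noise, what you get
    by oversampling to those priors first and then taking a plain mean/std.
    """
    w = np.empty(len(y), dtype=float)
    for cls, prior in target_priors.items():
        mask = (y == cls)
        w[mask] = prior / mask.sum()   # each class contributes `prior` in total weight
    w /= w.sum()
    mu = np.average(X, axis=0, weights=w)
    sigma = np.sqrt(np.average((X - mu) ** 2, axis=0, weights=w))
    return mu, sigma
```

Whether you plug in the priors you expect in the real application or artificially balanced ones is exactly the decision the bullets above distinguish.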
