Solved – Random sampling methods for handling class imbalance

classification, unbalanced-classes

https://www.svds.com/learning-imbalanced-classes/ explains quite nicely the different ways to handle an imbalanced dataset. However, there is some information under the random undersampling and random oversampling techniques that I am not sure is correct, as I could not find the same information in other research articles (paper1, paper2). The points on which I would highly appreciate clarification are:

[image from the linked article]

1) Does random oversampling the minority class increase the size of the final data set such that each class is the same size as that of the majority?

2) Does random undersampling decrease the total size of the dataset such that each class is the same size as that of the minority?

For example, if the minority class has 20 examples and the majority class has 80 examples, would the result of random oversampling be (20+80) + 80 = 180,

and the result of random undersampling be 20 + (80-60) = 20 + 20 = 40?

3) Are these methods random sampling with or without replacement?

Best Answer

1) Does random oversampling the minority class increase the size of the final data set such that each class is the same size as that of the majority?

It does, but it doesn't have to. Depending on the implementation, one could also oversample to a size larger than the original majority class if the misclassification costs associated with the problem warrant that.

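The same goes for undersampling: the general idea is to balance the data set, so bringing the classes to matching sizes makes sense unless we have a reason to tip the class balance the other way. In your 20/80 example, balancing by oversampling gives 80 + 80 = 160 examples, and balancing by undersampling gives 20 + 20 = 40.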
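To make the sizes concrete, here is a minimal NumPy sketch of both techniques on the 20/80 example from the question. The helper names `random_oversample` and `random_undersample` and the optional `target` parameter are my own illustrative choices, not the API of any particular library, and the sketch works on row indices only:

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

# Hypothetical labels: 20 minority (1) and 80 majority (0) examples.
y = np.array([1] * 20 + [0] * 80)
minority_idx = np.flatnonzero(y == 1)
majority_idx = np.flatnonzero(y == 0)


def random_oversample(minority_idx, majority_idx, target=None):
    """Keep every row and add random duplicates of minority rows.

    By default the minority class is grown to the majority class size;
    a larger `target` tips the balance further toward the minority.
    """
    target = len(majority_idx) if target is None else target
    extra = rng.choice(minority_idx, size=target - len(minority_idx), replace=True)
    return np.concatenate([majority_idx, minority_idx, extra])


def random_undersample(minority_idx, majority_idx, target=None):
    """Keep all minority rows and a random subset of majority rows."""
    target = len(minority_idx) if target is None else target
    kept = rng.choice(majority_idx, size=target, replace=False)
    return np.concatenate([kept, minority_idx])


over = random_oversample(minority_idx, majority_idx)
under = random_undersample(minority_idx, majority_idx)
print(len(over), len(under))  # 160 40
```

Passing, say, `target=120` to `random_oversample` would produce the "bigger than the majority class" case mentioned above, which only makes sense if the misclassification costs justify it.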

3) Are these methods random sampling with or without replacement?

In the case of the first graphic you posted, and assuming our objective is to even out the class sizes, we have to sample with replacement, because the minority class contains fewer instances than we need to draw. The alternative is to synthesize new instances from the known ones, as SMOTE does.

For undersampling, sampling both with and without replacement is possible, although I'm only aware of it being done without replacement.
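The replacement question can be checked directly. The following is a small sketch using NumPy's `Generator.choice`, with made-up index ranges standing in for the 20 minority and 80 majority rows; it is an illustration of the argument, not a recommended implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical row indices: 20 minority rows and 80 majority rows.
minority = np.arange(20)
majority = np.arange(20, 100)

# Oversampling: drawing 80 rows from only 20 without replacement is impossible.
try:
    rng.choice(minority, size=80, replace=False)
    without_replacement_ok = True
except ValueError:
    without_replacement_ok = False

# With replacement it works, but the result necessarily contains duplicates.
oversampled = rng.choice(minority, size=80, replace=True)

# Undersampling: 20 of the 80 majority rows can be drawn without replacement,
# so every kept row is distinct.
undersampled = rng.choice(majority, size=20, replace=False)

print(without_replacement_ok)        # False
print(len(np.unique(undersampled)))  # 20
```

Undersampling with `replace=True` would also run, but it could pick the same majority row twice while discarding others, which is presumably why it is rarely done.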
