I have a classification problem with 2 classes. I have nearly 5000 samples, each represented as a vector with 570 features. Nearly 600 of the samples belong to the positive class, meaning I have roughly a 1:8 ratio of positive to negative samples in the dataset. I mitigate this imbalance using SMOTE and then perform classification with 10-fold CV. I get an f-measure of 0.91.
To study the effect of the imbalance, I also tried using the imbalanced data as is (i.e. without SMOTE). This time around, I observed an f-measure of 0.92. I understand the difference between using accuracy and f-measure to interpret the classifier predictions, and since I have an unbalanced dataset, I chose to use f-measure.
There seems to be no big difference in the end result whether or not I have an imbalance in the dataset in my case. In this context, I have the following questions:
- Why does there seem to be no big difference in the f-measure between the two cases?
- After I use SMOTE to mitigate the imbalance, the dataset becomes balanced, yet I still use f-measure to evaluate the classification results. Is it right to use f-measure in this case, or should I use accuracy?
- SMOTE oversamples the minority class. Similarly, downsampling (or undersampling) the majority class could also rectify the imbalance. Why is this method not preferred (if I may say so)? What effect does undersampling have on classifier training and accuracy compared to oversampling?
Best Answer
I would also like to bring to your attention that in the original SMOTE paper, the good results came from combining SMOTE with random under-sampling. This is because applying SMOTE to achieve an equal balance with the majority class is not necessarily the best case for the classifier, as your results show. Thus, you may under-sample the majority class to different percentages of its original size (say 25%, 50%, 75%) and then apply SMOTE to the minority samples with different numbers of synthetically generated samples (say 2, 3, 4). You end up with a combination of cases, and you may choose the one showing the best cross-validated results.
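To make the combination of cases concrete, here is a minimal sketch in plain Python. It is not the reference SMOTE implementation: the helper names (`undersample`, `smote`, `resample`), the toy 2-feature data, and `k=5` neighbours are all assumptions for illustration, and I read "2, 3, 4" as the number of synthetic points generated per minority sample.

```python
import random
from itertools import product


def undersample(majority, keep_fraction, rng):
    """Randomly keep `keep_fraction` of the majority samples (no replacement)."""
    n_keep = max(1, round(len(majority) * keep_fraction))
    return rng.sample(majority, n_keep)


def smote(minority, n_per_sample, k, rng):
    """Simplified SMOTE: for each minority sample, create `n_per_sample`
    synthetic points by interpolating towards one of its k nearest
    minority neighbours (brute-force search, fine for small data)."""
    synthetic = []
    for i, x in enumerate(minority):
        others = [m for j, m in enumerate(minority) if j != i]
        # Sort the other minority samples by squared Euclidean distance to x.
        others.sort(key=lambda m: sum((a - b) ** 2 for a, b in zip(x, m)))
        neighbours = others[:k]
        for _ in range(n_per_sample):
            nb = rng.choice(neighbours)
            gap = rng.random()  # position along the segment from x to nb
            synthetic.append([a + gap * (b - a) for a, b in zip(x, nb)])
    return synthetic


def resample(majority, minority, keep_fraction, n_per_sample, k=5, seed=0):
    """Under-sample the majority class, then oversample the minority class."""
    rng = random.Random(seed)
    kept = undersample(majority, keep_fraction, rng)
    new_minority = minority + smote(minority, n_per_sample, k, rng)
    return kept, new_minority


# Toy data (assumption): 80 majority and 10 minority points, 2 features each.
data_rng = random.Random(42)
majority = [[data_rng.gauss(0, 1), data_rng.gauss(0, 1)] for _ in range(80)]
minority = [[data_rng.gauss(2, 1), data_rng.gauss(2, 1)] for _ in range(10)]

# Every (under-sampling fraction, synthetic samples per point) combination.
grid = list(product([0.25, 0.5, 0.75], [2, 3, 4]))
sizes = {}
for keep_fraction, n_per_sample in grid:
    kept, new_minority = resample(majority, minority, keep_fraction, n_per_sample)
    sizes[(keep_fraction, n_per_sample)] = (len(kept), len(new_minority))
    print(keep_fraction, n_per_sample, len(kept), len(new_minority))
```

Each of the nine resampled datasets would then be scored by cross-validation (for example by f-measure), and the combination with the best score kept.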