Solved – After using SMOTE, classification on the validation set is bad

Tags: classification, python, resampling, scikit-learn, unbalanced-classes

I want to do binary classification (two classes). When I train without SMOTE I get:

Precision      Recall         F1
0.640950987    0.815410434    0.714925374

When I use SMOTE (oversampling the minority class at 200% with k = 5 nearest neighbours), I get:

Precision      Recall         F1
0.831024643    0.783434343    0.804894232

As you can see, this works well.

However, when I test the trained model on validation data (which does not contain any synthetic samples), I get:

Precision      Recall         F1
0.644335755    0.799044453    0.709791138

which is awful. I used a random forest classifier.

Does anyone have any idea why this is happening? Any tips for extra tests I could run to get more insight are welcome too.

More info:
I do not touch the majority class. I work in Python with scikit-learn and this algorithm for SMOTE.
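For reference, my setup is roughly equivalent to the following minimal sketch, using imbalanced-learn's SMOTE in place of my own implementation (variable names such as X_train and X_val are placeholders for a pre-split dataset):

```python
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# X_train, y_train, X_val, y_val: pre-split, real (non-synthetic) data.
# Note: imbalanced-learn expresses the amount of oversampling as a target
# minority/majority ratio (sampling_strategy), not the paper's N% parameter.
smote = SMOTE(k_neighbors=5, random_state=0)
X_res, y_res = smote.fit_resample(X_train, y_train)

# Train on the resampled data; evaluate only on untouched real data.
clf = RandomForestClassifier(random_state=0)
clf.fit(X_res, y_res)
print(classification_report(y_val, clf.predict(X_val)))
```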

The confusion matrix on the test data (which contains synthetic samples):

[confusion matrix image]

The confusion matrix on the validation data for the same model (real data, not generated by SMOTE):

[confusion matrix image]


Edit: I read that the problem may lie in the fact that SMOTE creates Tomek links. I therefore wrote some code to remove the Tomek links, but this did not improve the classification scores.
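A library-based sketch of that idea, assuming imbalanced-learn's TomekLinks in place of my hand-written removal code:

```python
from imblearn.under_sampling import TomekLinks

# By default, TomekLinks removes the non-minority member of each Tomek
# pair, cleaning the class boundary after oversampling.
tl = TomekLinks()
X_clean, y_clean = tl.fit_resample(X_res, y_res)
```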

Edit 2: I read that the problem may instead be too much overlap between the classes, and that a more intelligent synthetic-sample generation algorithm can help. I therefore implemented

ADASYN: Adaptive Synthetic Sampling Approach for Imbalanced Learning

My implementation can be found here. It performed worse than SMOTE.
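(For anyone who wants to reproduce this without my code, imbalanced-learn ships an ADASYN implementation; a minimal sketch, with X_train and y_train as placeholders:)

```python
from imblearn.over_sampling import ADASYN

# ADASYN biases synthetic-sample generation toward minority points that are
# hardest to learn, i.e. those with many majority-class neighbours;
# n_neighbors plays the role of k.
ada = ADASYN(n_neighbors=5, random_state=0)
X_ada, y_ada = ada.fit_resample(X_train, y_train)
```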

Best Answer

If you oversample the training data to change the relative class frequencies, you are implicitly telling the classifier to expect the validation set to have those same class frequencies. Since the class frequencies influence where the decision boundary sits, a validation set with different class frequencies will be classified with a mis-placed boundary, and performance will be sub-optimal.
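One way to see this concretely: probabilities calibrated to the (oversampled) training prior can be mapped back onto the true deployment prior via Bayes' rule, the standard prior-shift correction (Elkan, 2001; Saerens et al., 2002). A minimal sketch, where p_train is the model's predicted positive-class probability:

```python
import numpy as np

def correct_prior_shift(p_train, pi_train, pi_deploy):
    """Map probabilities calibrated to training-set positive-class prior
    pi_train onto the deployment prior pi_deploy (Bayes' rule)."""
    num = p_train * (pi_deploy / pi_train)
    den = num + (1.0 - p_train) * ((1.0 - pi_deploy) / (1.0 - pi_train))
    return num / den

# Example: probabilities learned on 50/50 oversampled data, mapped back
# to a 10% positive-class prevalence; scores shrink toward the negatives.
p = np.array([0.2, 0.5, 0.8])
print(correct_prior_shift(p, pi_train=0.5, pi_deploy=0.1))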

Now it is true that for some classifiers, having imbalanced classes can cause them to perform badly, and re-sampling the data can help to correct this. However, if you oversample the minority class too much, you will over-correct, and performance will be suboptimal because the difference in class frequencies between the training and validation sets is then larger than is needed to correct the imbalance problem.

My recipe, then, would be to use cross-validation to choose the degree of oversampling that just compensates for the "class imbalance problem" experienced by your classifier, but no more (see the sketch below). Note, however, that this introduces an additional parameter to tune, which means greater computational expense and a higher likelihood of over-fitting in model selection (if the total number of tunable parameters is non-negligible).
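A sketch of that recipe, assuming imbalanced-learn's pipeline (which applies SMOTE only to the training folds, so each held-out fold stays synthetic-free, as required); X and y are placeholders for the full training data:

```python
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# The sampler runs only when fitting each training fold; the held-out fold
# is untouched, so CV scores reflect real data.
pipe = Pipeline([
    ("smote", SMOTE(k_neighbors=5, random_state=0)),
    ("rf", RandomForestClassifier(random_state=0)),
])

# sampling_strategy = target minority/majority ratio after resampling;
# values must exceed the data's existing minority/majority ratio.
grid = GridSearchCV(
    pipe,
    param_grid={"smote__sampling_strategy": [0.3, 0.5, 0.7, 1.0]},
    scoring="f1",
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_)
```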