Solved – Problem with classifier after using SMOTE to balance the data

classification, oversampling, unbalanced-classes

We've run into a problem while training a classifier on an unbalanced data set.

The response is binary, with 0 indicating 'non-defaulter' and 1 indicating 'defaulter' (it's a credit-scoring task).

The defaulters account for only 0.47 % of the data (233 observations out of roughly 47k). We used the SMOTE algorithm to oversample the minority class and thus balance the data set. This is a well-known method, highly encouraged in credit-scoring applications and in any other classification setting where the response distribution is skewed.

We tested different volumes of SMOTE oversampling but finally settled on a data set where the ratio of defaulters is about 40 %, which means the SMOTE algorithm produced about 26567 artificially created observations. This set now holds the true defaulters, the non-defaulters, and the artificially created observations, which of course are all labeled as defaulters.
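For concreteness, here is a minimal sketch of that resampling step using Python's imbalanced-learn package. The question doesn't name a tool, so the library, the synthetic stand-in data, and all parameters below are assumptions:

```python
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# Synthetic stand-in for the credit data: ~47k rows, ~0.5 % defaulters (label 1).
X, y = make_classification(n_samples=47_000, n_features=10,
                           weights=[0.995, 0.005], random_state=0)

# sampling_strategy is the minority:majority ratio after resampling;
# 0.4 / 0.6 ~ 0.67 yields a set where roughly 40 % of the observations
# are (mostly synthetic) defaulters, mirroring the ratio described above.
smote = SMOTE(sampling_strategy=0.67, random_state=0)
X_res, y_res = smote.fit_resample(X, y)
print(Counter(y_res))  # class counts after resampling, roughly 60/40
```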

After partitioning the data, we trained different kinds of classifiers and compared their results using the holdout method (i.e. a separate test set). The most successful classifier was a feed-forward neural network with one hidden layer of 30 neurons, trained by backpropagation, with the hyperbolic tangent as activation function. We also used an ensemble, since we noticed that boosting (10 rounds) increased the prediction accuracy.
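Continuing the sketch above, a holdout evaluation of a network matching that description might look as follows. MLPClassifier is an assumption (the question doesn't say which implementation was used), and the boosting ensemble is omitted:

```python
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Holdout split of the ALREADY-oversampled data: exactly the order of
# operations that turns out to be the problem (see the answer below).
X_tr, X_te, y_tr, y_te = train_test_split(X_res, y_res, test_size=0.3,
                                          stratify=y_res, random_state=0)

# Feed-forward net: one hidden layer of 30 units, tanh activation,
# weights fitted via backpropagation.
clf = MLPClassifier(hidden_layer_sizes=(30,), activation='tanh',
                    max_iter=500, random_state=0)
clf.fit(X_tr, y_tr)
```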

The model produces very good results overall. For example, the true positive rate is about 93 %, and the true negative rate is even higher. The area under the classifier's ROC curve is well over 0.9.
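In the sketch, those figures correspond to the following quantities on the holdout set (metric names follow scikit-learn's conventions):

```python
from sklearn.metrics import confusion_matrix, roc_auc_score

y_pred = clf.predict(X_te)
tn, fp, fn, tp = confusion_matrix(y_te, y_pred).ravel()
print("TPR:", tp / (tp + fn))  # true positive rate (defaulters caught)
print("TNR:", tn / (tn + fp))  # true negative rate
print("AUC:", roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))
```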

We were strutting around like glorified roosters after creating such an über model, until we decided to label the 233 true defaulters and follow up on how they were classified. To our horror, the model only classified about 60 % of them into the defaulting class.
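In terms of the sketch above, that follow-up amounts to scoring the model on the original, pre-SMOTE defaulters only (hypothetical stand-ins for the 233 real ones):

```python
# Score only the ORIGINAL defaulters, i.e. rows of the raw data with y == 1;
# none of the synthetic SMOTE points are involved here.
orig_defaulters = X[y == 1]
share_caught = clf.predict(orig_defaulters).mean()
print(f"original defaulters classified as defaulters: {share_caught:.0%}")
```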

Our guess is that the SMOTE algorithm might have gone a little crazy and overlapped too far into the non-defaulting group when creating the artificial defaulters.

Is there a way to prevent this from happening? Is undersampling the majority class of non-defaulters and combining it with, say, 20 % SMOTE oversampling a good approach? Why or why not?
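The question doesn't prescribe an implementation, but such a combination can be expressed as a resampling pipeline. A sketch under the same assumptions as above, with the ratios chosen purely for illustration:

```python
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline

# First a mild SMOTE: raise the defaulters to 20 % of the majority count,
# then randomly undersample the non-defaulters down to a 2:1 ratio.
combo = Pipeline([
    ('over',  SMOTE(sampling_strategy=0.2, random_state=0)),
    ('under', RandomUnderSampler(sampling_strategy=0.5, random_state=0)),
])
X_mix, y_mix = combo.fit_resample(X, y)
```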

We ran into something called Tomek links, which seems to "reverse" the risky effects of SMOTEing a bit.
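imbalanced-learn ships that combination directly as SMOTETomek: SMOTE first, then removal of Tomek links (pairs of mutual nearest neighbours from opposite classes), which prunes exactly the borderline synthetic defaulters that overlap the non-defaulters. Parameters below are again assumptions:

```python
from imblearn.combine import SMOTETomek
from imblearn.over_sampling import SMOTE

# Oversample to the ~40 % ratio, then delete Tomek links: cross-class
# nearest-neighbour pairs sitting right on the class boundary.
smt = SMOTETomek(smote=SMOTE(sampling_strategy=0.67, random_state=0),
                 random_state=0)
X_clean, y_clean = smt.fit_resample(X, y)
```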

Best Answer

It seems like you are oversampling (i.e. generating synthetic data instances) before splitting into training and test data. This causes overfitting, and hence your optimistic initial results. As pointed out here, you should apply oversampling only after splitting your data.
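In code, the fix is to split the raw data first and resample only the training fold, e.g. with imbalanced-learn's Pipeline, which applies samplers during fit but never when predicting or scoring. Again a sketch under the assumptions above, not the answerer's actual code:

```python
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Split the ORIGINAL data first; the test set stays 100 % real.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          stratify=y, random_state=0)

# Inside the pipeline, SMOTE runs only on the training fold during fit;
# predictions and scores on X_te never touch synthetic observations.
pipe = Pipeline([
    ('smote', SMOTE(sampling_strategy=0.67, random_state=0)),
    ('net',   MLPClassifier(hidden_layer_sizes=(30,), activation='tanh',
                            max_iter=500, random_state=0)),
])
pipe.fit(X_tr, y_tr)
print("honest holdout accuracy:", pipe.score(X_te, y_te))
```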