Solved – Binary Classification in Imbalanced Data; Oversampling and Imputation

boosting, data-imputation, oversampling, smote, unbalanced-classes

Together with two friends, I am taking a university course on data mining in R, and we chose bankruptcy prediction as our topic. We started with some "clean" data from an in-class Kaggle competition, and judging by the leaderboard our classifier seemed to perform very well.

We have now moved on to the public UCI dataset on Polish companies, which has some missing values; apart from that, it is basically the same data. We get AUC scores of 0.96-0.98 depending on the chosen seed, so the classifier seems to perform quite well; the score is even better than the one reported by Zieba et al. (2016), who gathered and first worked with the data.

We apply k-fold cross-validation and use the extreme gradient boosting method "xgbLinear" from the caret package in R.
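
To make this concrete, here is a minimal sketch of what such a pipeline looks like in caret; the data frame "df", the outcome column "class", and its levels are placeholders, not our actual code:

```r
# Minimal sketch, not our exact script: assumes a data frame `df`
# whose outcome `class` is a two-level factor ("bankrupt"/"solvent").
library(caret)

set.seed(42)
ctrl <- trainControl(method = "cv",
                     number = 5,
                     classProbs = TRUE,
                     summaryFunction = twoClassSummary)

fit <- train(class ~ ., data = df,
             method    = "xgbLinear",
             trControl = ctrl,
             metric    = "ROC")
```

Working with this data, we discovered a few strange facts about the performance: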

For the Kaggle data without missing values, SMOTE oversampling clearly improved classification performance. For the UCI data, however, running without oversampling performs slightly better. Any ideas why that might be?
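
For reference, the only change between the two runs is the sampling argument in trainControl; a sketch (caret's "smote" option relies on the DMwR package):

```r
# Same setup as above, but caret now applies SMOTE inside each CV
# fold; the "smote" option requires the DMwR package to be installed.
ctrl_smote <- trainControl(method = "cv",
                           number = 5,
                           classProbs = TRUE,
                           summaryFunction = twoClassSummary,
                           sampling = "smote")

fit_smote <- train(class ~ ., data = df,
                   method    = "xgbLinear",
                   trControl = ctrl_smote,
                   metric    = "ROC")
```

Sampling inside the folds (rather than once before splitting) avoids leaking synthetic copies of a point into both the training and validation sets.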

Next, we experimented with imputing the missing values. The best-scoring method is to set all of them to "-99999" (which, as far as I can tell, is what Zieba et al. did). This makes some sense, since the distribution of missing values differs between the two classes in our data and therefore carries predictive power.
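
In R this is essentially a one-liner; a sketch, assuming all predictors are numeric and the outcome column is named "class":

```r
# Sentinel imputation: every NA among the predictors becomes -99999,
# a value the boosted trees can easily split off from real values.
predictors <- setdiff(names(df), "class")
df[predictors][is.na(df[predictors])] <- -99999
```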

We thought we could improve slightly on that method by using KNN imputation for the missing values and then adding, for every predictor, a dummy variable that is 1 where the value was missing (prior to imputation) and 0 otherwise. It turns out this method performs worse than the simpler (and more arbitrary?) "-99999" approach. My intuition was that the dummies should capture the information contained in the distribution of missing values, and that on top of that the KNN imputation should give a more realistic picture of the missing values than -99999 does. But apparently my intuition is wrong here. ;)
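
What we tried looks roughly like the sketch below (placeholder names again, not our actual code):

```r
library(caret)

# 1) Missing-value indicators: one 0/1 dummy per predictor that has
#    at least one NA, created before any imputation happens.
predictors <- setdiff(names(df), "class")
has_na     <- predictors[colSums(is.na(df[predictors])) > 0]
for (p in has_na) {
  df[[paste0(p, "_missing")]] <- as.integer(is.na(df[[p]]))
}

# 2) KNN imputation of the original predictors; note that caret's
#    "knnImpute" also centers and scales them as a side effect.
pp <- preProcess(df[predictors], method = "knnImpute")
df[predictors] <- predict(pp, df[predictors])
```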

So in short:

1. Why does oversampling lead to worse performance here? Is it connected to the occurrence of missing values in our data?
2. Why does the "-99999" method perform better than KNN imputation + dummies?

I'd be incredibly thankful for any ideas and input on this! If helpful, I can supply our code.

Best Answer

The fact that you are having to consider balance in the Y distribution means that you have not recognized the need to predict tendencies (probabilities) rather than make arbitrary classifications (premature decisions). This is discussed in detail here. If you have to re-sample from your data, there is a fundamental logic flaw in the approach.
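
In concrete terms, that means working with predicted probabilities and a proper scoring rule rather than with hard class labels. A sketch, reusing the caret-style fit from the question, with "test" as a placeholder held-out set and "bankrupt" as a placeholder positive class level:

```r
# Predicted class probabilities rather than thresholded labels.
p <- predict(fit, newdata = test, type = "prob")[, "bankrupt"]

# Brier score: a proper scoring rule evaluated on the probabilities
# themselves; no class-imbalance "correction" is needed for this.
brier <- mean((p - as.integer(test$class == "bankrupt"))^2)
```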