Solved – Can a classifier trained with oversampled data be used to classify unbalanced data

classification, unbalanced-classes

I am developing a random forest model for predicting fraudulent credit card transactions. I made a train/test split of my dataset and chose a model using several metrics, including accuracy, recall, and AUC. Previously, I had problems caused by the extreme class imbalance (only 2% of transactions are fraudulent), so I oversampled the minority class and trained the model on a 50%/50% fraud/no-fraud dataset. This model, with its fitted classifier, will now be used in production. Is it legitimate to use a classifier trained on a balanced dataset when the transactions it will classify are mostly not fraudulent? Won't it be biased towards classifying transactions as fraudulent?
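The setup described above can be sketched as follows. This is a minimal illustration, not the asker's actual pipeline: the data is synthetic (via `make_classification` with a ~2% positive class), and the oversampling is done by naively resampling the minority class with `sklearn.utils.resample`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Hypothetical stand-in for the transaction data: ~2% fraud (class 1).
X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.98, 0.02], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

# Oversample the fraud class in the TRAINING set up to a 50/50 mix.
fraud = X_train[y_train == 1]
n_extra = (y_train == 0).sum() - (y_train == 1).sum()
extra = resample(fraud, replace=True, n_samples=n_extra, random_state=0)
X_bal = np.vstack([X_train, extra])
y_bal = np.concatenate([y_train, np.ones(len(extra), dtype=int)])

clf = RandomForestClassifier(random_state=0).fit(X_bal, y_bal)

# The test set keeps its original imbalanced distribution; comparing the
# predicted fraud rate with the ~2% base rate shows whether the balanced
# training data has shifted the classifier towards predicting fraud.
predicted_fraud_rate = clf.predict(X_test).mean()
print(predicted_fraud_rate)
```

The key point is that the test set is left untouched, so the comparison at the end reflects the production setting the question is about.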

EDIT:

To evaluate my model (implemented in scikit-learn), I was using the scores computed on the held-out test set from the train/test split. I realized this might be giving optimistic accuracy, recall, and AUC scores, and that the model was probably overfitting, so I switched to scikit-learn's k-fold cross-validation. The results obtained through this method are much worse: recall was 69% when evaluating against the test set, but it is 18% under 5-fold cross-validation (mean of the recall scores per fold). It improves a little if I set the class_weight parameter to {0: 0.99, 1: 0.01}, but I don't think that makes sense, as it penalizes misclassified 0's more heavily rather than errors on the rarer class (the 1's, i.e. positives). Does this mean my model is overfitting? Which measure better reflects the real-world performance of my model? Does it even make sense to use cross-validation with random forests?
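The cross-validated evaluation described in the edit can be reproduced with `cross_val_score` and a `StratifiedKFold` splitter (stratification keeps the ~2% fraud rate in every fold). This is a sketch on synthetic data, not the asker's dataset, so the numbers will differ from the 69%/18% reported above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Hypothetical stand-in for the transaction data: ~2% positives.
X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.98, 0.02], random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Mean recall over the 5 folds -- the quantity the edit compares
# against the single train/test-split recall.
recalls = cross_val_score(clf, X, y, cv=cv, scoring="recall")
mean_recall = recalls.mean()
print(mean_recall)
```

Note that `cross_val_score` here evaluates on folds that keep the original imbalance, which is one reason its recall can come out far lower than a score measured after oversampling.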

Best Answer

This is actually an interesting question that comes up a lot with medical data. One way to understand oversampling and the classification of unbalanced data is this: because oversampling deliberately biases the sampling of the data, the results will be biased. When compensating for the minority class, remember that the goal of classification is to identify the characteristics that determine which class an outcome belongs to, and then to understand how the independent variables interact.

When oversampling data for classification, make sure to use cross-validation properly: oversample within each cross-validation fold, not before cross-validation. This gives more accurate sensitivity and specificity estimates and limits (though does not eliminate) the bias and overfitting that come from combining oversampling and cross-validation incorrectly.
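The "oversample inside the fold" advice can be sketched with a manual cross-validation loop. This is a minimal example on synthetic data, with naive resampling via `sklearn.utils.resample` standing in for whatever oversampling method is actually used: the minority class is duplicated only in each training fold, while every test fold keeps its original imbalanced distribution.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import StratifiedKFold
from sklearn.utils import resample

# Hypothetical stand-in for the imbalanced data: ~2% positives.
X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.98, 0.02], random_state=0)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_recalls = []
for train_idx, test_idx in cv.split(X, y):
    X_tr, y_tr = X[train_idx], y[train_idx]

    # Oversample the minority class in the TRAINING fold only.
    minority = X_tr[y_tr == 1]
    n_extra = (y_tr == 0).sum() - (y_tr == 1).sum()
    extra = resample(minority, replace=True, n_samples=n_extra,
                     random_state=0)
    X_bal = np.vstack([X_tr, extra])
    y_bal = np.concatenate([y_tr, np.ones(len(extra), dtype=int)])

    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    clf.fit(X_bal, y_bal)

    # The test fold is untouched, so its class ratio matches production.
    fold_recalls.append(recall_score(y[test_idx], clf.predict(X[test_idx])))

mean_recall = np.mean(fold_recalls)
print(mean_recall)
```

Oversampling before splitting would let copies of the same minority example land in both the training and test folds, which is exactly the leakage that inflates the scores.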

Here is a good reference using preterm births: http://www.marcoaltini.com/blog/dealing-with-imbalanced-data-undersampling-oversampling-and-proper-cross-validation