Solved – Hyperparameter Tuning for Unbalanced Data

hyperparameter, machine learning, model selection, overfitting, unbalanced-classes

My question regards hyperparameter tuning for ML algorithms. It has more to do with the theoretical aspects than with the actual coding.
Suppose there is a dataset that is extremely imbalanced, as seen in fraud detection, electricity theft detection, etc., with very few positive cases (fraud) compared to negative cases (normal). For such cases, we need to train the model on resampled data but test it on the actual, untouched test data.
However, for hyperparameter tuning (without using GridSearchCV), how do we assess model performance without exposing the model to the actual holdout data? Tuning against the test data would lead to overfitting, and the whole purpose of the holdout set would be defeated.
I have read that we can split the training data into a training set (for model fitting) and an evaluation set (for parameter tuning), select the parameters based on this validation set, and finally build the model on the combined train + eval sets.
But can this approach be applied to imbalanced datasets too? It seems suspicious here, since we would be choosing our parameters based on resampled data. Or is there another way around this?

Best Answer

Unless you have a reason not to, you should probably use cross-validation for hyperparameter tuning. The approach you describe (and, indeed, pretty much any preprocessing you want to perform on the data) can be applied within cross-validation; the important concept to understand is that you should apply your transformations within the cross-validation folds (with few exceptions). That is, if you want to upsample your positive cases, do it on the training portion of each fold in isolation, as if you knew nothing about the data outside your current fold's training set.
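
For concreteness, here is a minimal sketch of what "resample inside the folds" can look like, assuming a scikit-learn-style setup with NumPy arrays; the `cv_score` helper, the `make_model` callback, and the choice of average precision as the metric are illustrative placeholders, not a prescribed recipe.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.utils import resample
from sklearn.metrics import average_precision_score

def cv_score(X, y, make_model, n_splits=5, random_state=0):
    """Upsample the minority class on each training fold only,
    then score on the untouched validation fold."""
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=random_state)
    scores = []
    for train_idx, val_idx in skf.split(X, y):
        X_tr, y_tr = X[train_idx], y[train_idx]
        X_val, y_val = X[val_idx], y[val_idx]

        # Upsample the positives using information from the training fold only
        pos, neg = X_tr[y_tr == 1], X_tr[y_tr == 0]
        pos_up = resample(pos, replace=True, n_samples=len(neg),
                          random_state=random_state)
        X_res = np.vstack([neg, pos_up])
        y_res = np.array([0] * len(neg) + [1] * len(pos_up))

        model = make_model()
        model.fit(X_res, y_res)

        # The validation fold keeps its original class balance
        scores.append(average_precision_score(
            y_val, model.predict_proba(X_val)[:, 1]))
    return np.mean(scores)
```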

In other words, treat your validation sets the way you would treat your test set - leave them as they are and do not peek at them. Transform your training sets (in your case, resample them) without using information from the validation sets. You can then evaluate your model on the validation sets and get a reasonably unbiased estimate of its performance. Once you have tuned your model, evaluate it on the test set - but aim to do so only once; do not adjust your model based on the test set results.
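
To tie it together, here is a sketch of the full workflow under the same assumptions as the snippet above (scikit-learn, NumPy arrays, logistic regression with `C` as the hyperparameter being tuned); the grid of `C` values, the hypothetical `cv_score` helper, and the placeholder `X_train`/`y_train`/`X_test`/`y_test` names are illustrative only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score
from sklearn.utils import resample

# Tune on the training data only, using the cv_score helper sketched above.
best_C, best_score = None, -np.inf
for C in [0.01, 0.1, 1.0, 10.0]:              # candidate hyperparameter values
    score = cv_score(X_train, y_train,
                     lambda: LogisticRegression(C=C, max_iter=1000))
    if score > best_score:
        best_C, best_score = C, score

# Refit on the full training data (resampled the same way) with the chosen C.
pos, neg = X_train[y_train == 1], X_train[y_train == 0]
pos_up = resample(pos, replace=True, n_samples=len(neg), random_state=0)
final_model = LogisticRegression(C=best_C, max_iter=1000)
final_model.fit(np.vstack([neg, pos_up]),
                np.array([0] * len(neg) + [1] * len(pos_up)))

# Touch the test set exactly once, in its original (imbalanced) form.
test_score = average_precision_score(
    y_test, final_model.predict_proba(X_test)[:, 1])
```

If you are open to using GridSearchCV after all, the imbalanced-learn package offers a Pipeline that applies resamplers only when fitting, so the resampling happens only on the training portion of each fold and the same principle is followed with less manual bookkeeping.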