Solved – Does threshold selection of F1-score in Cross-validation lead to overfitting

cross-validation, machine learning, metric, model selection, unbalanced-classes

I have a highly imbalanced binary classification problem. Right now I perform 10-fold cross-validation while training my model (a convolutional neural network). Each fold produces its own F1-score, and I then average the 10 F1-scores to get a mean F1-score.
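A minimal sketch of that per-fold averaging, using a scikit-learn classifier as a stand-in for the CNN and a synthetic imbalanced dataset (the data, estimator, and numbers are illustrative assumptions, not the actual pipeline):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold

# synthetic, highly imbalanced binary data (about 5% positives)
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)

fold_f1 = []
for train_idx, val_idx in StratifiedKFold(n_splits=10, shuffle=True, random_state=0).split(X, y):
    clf = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    proba = clf.predict_proba(X[val_idx])[:, 1]
    fold_f1.append(f1_score(y[val_idx], (proba >= 0.5).astype(int)))  # fixed 0.5 threshold

print("mean F1 across folds:", np.mean(fold_f1))
```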

The question is: if I select the threshold that maximizes the F1-score within each fold, and then average the resulting F1-scores (which will almost certainly be higher than with a fixed threshold of 0.5), would that be considered overfitting? After all, I have already looked at the labels (the precisions and recalls at different thresholds) while choosing the threshold, and then "chose" the best F1-score.
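For concreteness, the per-fold threshold tuning being asked about could look like the sketch below, which picks the F1-maximizing threshold from that fold's own validation predictions (the helper name is an assumption made for the example):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def best_f1_threshold(y_val, proba_val):
    """Pick the threshold that maximizes F1 on one fold's validation data."""
    precision, recall, thresholds = precision_recall_curve(y_val, proba_val)
    # drop the final (recall=0, precision=1) point, which has no threshold
    f1 = 2 * precision[:-1] * recall[:-1] / (precision[:-1] + recall[:-1] + 1e-12)
    best = np.argmax(f1)
    return thresholds[best], f1[best]
```

Plugging this into the CV loop above and averaging the returned F1 values reproduces the optimistic score the question describes, because the same fold's labels are used both to choose and to score the threshold.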

Additionally, I did not split off a test set. I assume that 10 repetitions of 10-fold CV should be a good approximation of a test set, since it is difficult to overfit in this setting.
The final prediction at test time would be either an average of the 10 fold models or a single model trained on the whole dataset; I am not sure which option is better.
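The first of those two options amounts to a small ensemble over the fold models; a hedged sketch, assuming a list of fitted estimators with a `predict_proba` method (the names are placeholders):

```python
import numpy as np

def ensemble_predict(fold_models, X_test, threshold=0.5):
    """Average the fold models' predicted probabilities, then apply one threshold."""
    proba = np.mean([m.predict_proba(X_test)[:, 1] for m in fold_models], axis=0)
    return (proba >= threshold).astype(int)
```

The second option simply replaces the list with a single estimator refit on all of the data.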

Best Answer

A proper procedure always includes a training, a validation, and a test set. A small dataset is usually not a good reason to omit the test set, because only the test set gives you a reasonable estimate of your model's performance.

Your approach does not necessarily lead to overfitting, but you will not notice if it happens. How useful is a model that does not generalize well enough? Especially in a highly imbalanced classification problem, you should hold out a test set.
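A minimal sketch of such a hold-out, assuming a synthetic dataset and an 80/20 split (both are arbitrary choices for illustration); the stratified split preserves the class imbalance, and the test part is never touched during threshold selection or tuning:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)

X_dev, X_test, y_dev, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
# run the 10-fold CV and threshold selection only on (X_dev, y_dev);
# evaluate the final, frozen model exactly once on (X_test, y_test)
```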

Notice that in k-fold CV every instance is predicted exactly once, so strictly speaking you are not "averaging" anything here; you are simply "merging" the predictions of the individual folds.
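That merging can be made explicit by pooling the out-of-fold predictions and computing a single F1-score; a sketch with the same illustrative setup as before:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import cross_val_predict

X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)

# one out-of-fold prediction per instance, collected across the 10 folds
pooled_pred = cross_val_predict(LogisticRegression(max_iter=1000), X, y, cv=10)
print("pooled F1:", f1_score(y, pooled_pred))
```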

After you have selected your threshold, tuned hyperparameters, and so on, you want to use all of your data to train a final model in order to get the best possible results.
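A small sketch of that last step, under the assumption that the threshold and hyperparameters have already been fixed by the CV procedure (the value 0.3 and the estimator are placeholders, not the asker's CNN):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)
chosen_threshold = 0.3  # placeholder for the CV-selected threshold

# refit once on all available training data, then apply the frozen threshold
final_model = LogisticRegression(max_iter=1000).fit(X, y)

def predict_labels(X_new):
    return (final_model.predict_proba(X_new)[:, 1] >= chosen_threshold).astype(int)
```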
