Is it possible for an ML model to ‘overfit’ the training data but still perform fairly well on the validation set, i.e., generalize well?

Tags: machine-learning, overfitting

I have trained a VGG-16 model to classify images as good or bad. The images are fairly difficult to classify: there are clear-cut cases of good and bad images, but also a good chunk of borderline images that could reasonably be labelled either way (different human annotators, or even the same annotator at different times, may mark them differently).

Upon training I see that the training accuracy rises quite rapidly, and from about the 0.87-0.88 mark climbs gradually but surely to 0.996 or even 1.0. The validation accuracy, however, stops rising after a certain point and fluctuates around the 0.85 to 0.90 mark. It does not go down, however, even with more and more epochs.
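
For what it's worth, below is a minimal sketch of the per-epoch comparison I am describing (Keras assumed; `model`, `train_ds`, and `val_ds` are placeholders standing in for my actual compiled model and data pipelines):

```python
# Minimal sketch: track the train/val accuracy gap per epoch.
# Assumes TensorFlow/Keras; `model`, `train_ds`, `val_ds` are placeholders
# for the real compiled model and tf.data pipelines (not shown here).
history = model.fit(train_ds, validation_data=val_ds, epochs=50)

for epoch, (tr, va) in enumerate(zip(history.history["accuracy"],
                                     history.history["val_accuracy"]), 1):
    # Training accuracy creeps toward 1.0 while validation accuracy
    # plateaus around 0.85-0.90 instead of falling.
    print(f"epoch {epoch:2d}  train={tr:.3f}  val={va:.3f}  gap={tr - va:.3f}")
```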

I suspect that there is a pretty prevalent overfitting problem, but I am a bit puzzled too, since the validation accuracy doesn't go down after that mark and is in fact similar to a human annotator's accuracy. I have tried loads of anti-overfitting techniques: dropout regularization, standard weight regularization, making sure the train and validation sets are identically distributed, and removing redundant images. (One thing I haven't tried yet is making the model simpler, but I am not sure that will help much.) I would also like to add that the data annotation for training is somewhat ambiguous for the aforementioned borderline images, since different human annotators may mark them differently; for clear-cut good and bad images, that's not a problem.
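
To be concrete, here is roughly the kind of regularized setup I mean; a minimal Keras sketch, where the layer sizes and rates are illustrative rather than my actual configuration:

```python
# Illustrative regularized classifier head on a frozen VGG-16 backbone.
# TensorFlow/Keras assumed; the sizes and rates are made-up examples.
import tensorflow as tf
from tensorflow.keras import layers, regularizers

base = tf.keras.applications.VGG16(weights="imagenet", include_top=False,
                                   pooling="avg", input_shape=(224, 224, 3))
base.trainable = False  # freeze the convolutional features

model = tf.keras.Sequential([
    base,
    layers.Dense(256, activation="relu",
                 kernel_regularizer=regularizers.l2(1e-4)),  # weight reg
    layers.Dropout(0.5),                                     # dropout reg
    layers.Dense(1, activation="sigmoid"),                   # good vs. bad
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
```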

The thing is that the model does indeed perform well in tests and in our app-integration phases. It seems to generalize well and gives the output you would expect from a human annotator, or slightly worse to be more accurate. Please let me know what you think of these observations, and whether you require further information.

Best Answer

If by overfitting you mean that the training error is 0 or near 0, then yes. This is a pretty common phenomenon, and there are lots of papers trying to explain why some models that perfectly interpolate the training data perform well while others perform badly. For example, here is a recent preprint about this phenomenon: https://arxiv.org/abs/2106.03212
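
You can reproduce a toy version of this with almost any interpolating learner. Here is a minimal scikit-learn sketch (the dataset and model are illustrative choices of mine, not taken from that paper): a random forest with fully grown trees fits noisy training labels essentially perfectly yet still generalizes near the noise ceiling.

```python
# Toy illustration of benign interpolation (scikit-learn; illustrative only).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# ~10% of labels are flipped, so a model perfect on train has memorized noise.
X, y = make_classification(n_samples=2000, n_features=20, flip_y=0.1,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
print("train acc:", clf.score(X_tr, y_tr))  # ~1.0: interpolates noisy labels
print("test acc:", clf.score(X_te, y_te))   # stays well above chance
```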

I think you shouldn't be too worried about your model's performance, as long as you have a test set to check the performance against. Val set performance will be a biased estimate of production performance if you are using it for early stopping.
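
A minimal Keras-style sketch of what I mean (the `model` and the `train_ds`/`val_ds`/`test_ds` pipelines are placeholders, assumed to exist):

```python
# Sketch: early-stop on the validation set, then report once on a held-out
# test set. Assumes TensorFlow/Keras; `model` and the datasets are
# placeholders, not an actual implementation.
import tensorflow as tf

stop = tf.keras.callbacks.EarlyStopping(monitor="val_accuracy", patience=5,
                                        restore_best_weights=True)
model.fit(train_ds, validation_data=val_ds, epochs=100, callbacks=[stop])

# The val set chose the stopping point, so its score is optimistically
# biased; evaluate once on data that never influenced training decisions.
test_loss, test_acc = model.evaluate(test_ds)
print("held-out test accuracy:", test_acc)
```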

As an aside, I think it's kind of unfortunate that people use "overfitting" in this way. If the out-of-sample error never starts going up, then in what sense have you fit too much? Oh well.
