Cross Validation – Reconciling Conflicting Advice on Train vs Test Error Gap and Overfitting

cross-validation, overfitting

There seems to be conflicting advice out there about how to compare train vs. test error, particularly when there is a gap between the two. There appear to be two schools of thought that, to me, conflict. I'm looking to understand how to reconcile them (or to understand what I'm missing here).

Thought #1: A gap between train and test set performance alone does not indicate overfitting

First (also discussed here: How can training and testing error comparisons be indicative of overfitting?), there is the idea that a difference between train and test set performance alone cannot indicate overfitting. This agrees with my practical experience with, for example, ensemble tree methods, where even after cross-validation-based hyperparameter tuning, the gap between train and test error can remain somewhat large. But (irrespective of model type) as long as your validation error isn't going back up, you're good. At least, that's the thinking.
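To make that concrete, here is a minimal sketch (assuming scikit-learn and a synthetic dataset, both my own choices for illustration) of the situation described above: a random forest tuned via cross-validation can still show a sizeable gap between train and test accuracy, even though the cross-validated model is the best one found.

```python
# Sketch: a cross-validation-tuned random forest can still show a train/test gap.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=2000, n_features=20, n_informative=5,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=0)

# Hyperparameter tuning via cross-validation, as described in the question.
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"max_depth": [4, 8, None], "n_estimators": [100, 300]},
    cv=5,
)
search.fit(X_train, y_train)
best = search.best_estimator_

print("train accuracy:", best.score(X_train, y_train))  # often close to 1.0
print("test accuracy: ", best.score(X_test, y_test))    # noticeably lower, yet the best CV could find
```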

Thought #2: When you see a gap between train and test performance: Do things that would combat overfitting

However, there is also advice, from very good sources, suggesting that a gap between train and test error is indicative of overfitting. Here's an example: the "Nuts and Bolts of Deep Learning" talk by Andrew Ng (a fantastic talk), https://www.youtube.com/watch?v=F1ka6a13S9I, where at around timestamp 48:00 he draws a flow chart that says "if your train set error is low and your train-dev set error is high, you should add regularization, get more data, or change the model architecture", which are all actions you might take to combat overfitting.
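For what it's worth, that branch of the flow chart can be written out as a tiny decision function; the function name, tolerance value, and structure below are my own paraphrase of the quoted advice, not code from the talk.

```python
# Rough paraphrase of the quoted branch of the flow chart; `diagnose` and the
# tolerance threshold are hypothetical, chosen here only for illustration.
def diagnose(train_error: float, train_dev_error: float, tolerance: float = 0.02):
    """Suggest actions when train error is low but train-dev error is high."""
    if train_error <= tolerance and train_dev_error > train_error + tolerance:
        # The actions listed for this branch of the flow chart:
        return ["add regularization", "get more data", "change the model architecture"]
    return []

print(diagnose(train_error=0.01, train_dev_error=0.10))
```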

Which brings me to my question: Am I missing something here? Is this a model-specific rule of thumb (simpler models generally do appear to have a smaller gap between train and test error)? Or are there simply two different schools of thought?

Best Answer

I do not think this is conflicting advice. What we are really interested in is good out-of-sample performance, not in reducing the gap between training and test set performance. If the test set performance is representative of out-of-sample performance (i.e., the test set is large enough, uncontaminated, and a representative sample of the data our model will be applied to), then as long as we get good performance on the test set we are not overfitting, regardless of the gap.

Often, however, a large gap may indicate that we could get better test set performance with more regularization/introducing more bias to the model. But that does not mean a smaller gap implies a better model; it's just that if we have a small or no gap between training and test set performance, we know we are definitely not overfitting, so adding regularization/introducing more bias to the model will not help.
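As an illustration of this point, here is a minimal sketch (using scikit-learn's Ridge on synthetic data; the dataset and alpha values are my own, chosen only to make the contrast visible) in which the lightly regularized model has the larger train/test gap yet the better test score, while the heavily regularized model has a tiny gap and much worse performance.

```python
# Sketch: a larger train/test gap does not by itself mean the model is worse
# out of sample. Exact numbers depend on the random data, but the pattern is
# typical: light regularization -> bigger gap, better test R^2.
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=120, n_features=50, n_informative=10,
                       noise=50.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5,
                                                    random_state=0)

for alpha in (1000.0, 1.0):  # heavy vs. light regularization (illustrative values)
    model = Ridge(alpha=alpha).fit(X_train, y_train)
    train_r2 = model.score(X_train, y_train)
    test_r2 = model.score(X_test, y_test)
    print(f"alpha={alpha:7.1f}  train R^2={train_r2:.2f}  "
          f"test R^2={test_r2:.2f}  gap={train_r2 - test_r2:.2f}")
```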
