Solved – What exactly is overfitting

Tags: definition, overfitting

Many people (including me) think, or used to think, that an overfitted model is one whose training error is far smaller than its validation error. But after reading this very interesting comment by @Firebug, I suddenly realized that this is not true. Random Forest is a perfect example: its training error is often close to 0, and its out-of-sample error is often far larger than the training error, yet still close to the error on an independent test sample.
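The Random Forest point can be sketched with scikit-learn on a synthetic dataset (the dataset and parameters below are illustrative assumptions, not from the original post): the forest nearly memorizes the training set, yet its larger test error is still perfectly respectable.

```python
# Sketch (assumes scikit-learn is available): a random forest whose training
# error is near 0 can still generalize well, so a large train/validation gap
# by itself does not prove the model is overfitted in a harmful way.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
train_err = 1 - rf.score(X_tr, y_tr)  # typically near 0: the trees memorize
test_err = 1 - rf.score(X_te, y_te)   # larger than train_err, but still small
print(train_err, test_err)
```

The gap between the two errors is large in relative terms, yet the test error itself is low, which is exactly the situation the paragraph describes.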

Another example is presented below:

[Figure: a scatter of two classes separated by a smooth black decision boundary and a wiggly green boundary that classifies every training point correctly]

People often refer to the green curve as overfitted and say the black curve is better, because the green curve's low training error is not expected to carry over to new data. But it can happen that, even though the green curve's test error is higher than its training error, on a blind test the green curve still outperforms the black curve.

So my questions are:

  1. Is the black curve better than the green curve?
  2. What exactly is overfitting, and what is the proper way to identify an overfitted model?
  3. Is it true that an overfitted model is always worse than a non-overfitted model?

Best Answer

  1. You can't determine which curve is better by staring at them. And by "staring" I mean analyzing them based on pure statistical features of this particular sample.

For instance, the black curve is better than the green one if the blue dots that stick out of the blue area into the red are there by pure chance, i.e. random. If you obtained another sample and the blue dots in the red area disappeared, while other blue dots showed up elsewhere, this would mean that the black curve is truly capturing the separation, and the deviations are random. BUT how would you know this by looking at this ONE sample?! You can't.

Therefore, lacking context, it is impossible to say which curve is better just by staring at this sample and the curves drawn on it. You need exogenous information, which could be other samples or your knowledge of the domain.
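A minimal sketch of why you need that extra sample, using 1-NN as a stand-in for the wiggly green curve and 25-NN for the smooth black one (the data generator and neighbor counts are my assumptions for illustration): on its own training sample the wiggly model always looks perfect, so only fresh data from the same process can rank the two.

```python
# Sketch (assumes scikit-learn): the "exogenous information" here is a second
# sample from the same process. The training sample alone cannot distinguish
# real separation from random deviations.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

def sample(n):
    # two noisy classes, a stand-in for the blue/red scatter in the figure
    X = rng.normal(size=(n, 2))
    y = (X[:, 0] + X[:, 1] + rng.normal(scale=1.0, size=n) > 0).astype(int)
    return X, y

X_tr, y_tr = sample(200)
wiggly = KNeighborsClassifier(n_neighbors=1).fit(X_tr, y_tr)   # green-like
smooth = KNeighborsClassifier(n_neighbors=25).fit(X_tr, y_tr)  # black-like

print(wiggly.score(X_tr, y_tr))  # 1.0 on its own sample, hence uninformative
X_new, y_new = sample(2000)      # the extra sample you actually need
print(wiggly.score(X_new, y_new), smooth.score(X_new, y_new))
```

On the fresh sample the smoother model scores higher here, because the deviations the 1-NN model chased really were random.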

  2. Overfitting is a concept, and there is no single way of identifying the issue that works for every domain and every sample. It is case by case.

As you wrote, the dynamics of error reduction on the training and testing samples is one way. It comes back to the same idea I wrote above: detecting whether the deviations from the model are random. For instance, if you obtained another sample and it produced different blue points in the red area, but these new points were very close to the old ones, this would mean that the deviations from the black line are systematic. In this case you would naturally gravitate towards the green curve.

So, overfitting in my world is treating random deviations as systematic.

  3. An overfitted model is worse than a non-overfitted model, ceteris paribus. However, you can certainly construct an example where the overfitted model has some other feature that the non-overfitted model lacks, and argue that this makes the former better than the latter.

The main issue with overfitting (treating the random as systematic) is that it messes up the model's forecasts. Mathematically, this happens because the model becomes very sensitive to inputs that are not important: it converts noise in the inputs into a false signal in the response, while a non-overfitted model ignores the noise and produces a smoother response, hence a higher signal-to-noise ratio in the output.
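That input sensitivity can be measured directly (again an illustrative setup of my own): jitter the inputs by a tiny amount and compare how far each fit's predictions move.

```python
# Sketch: the overfitted fit converts noise in the inputs into a false signal
# in the response - a tiny input perturbation moves its predictions far more
# than it moves the smooth fit's.
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(-1, 1, 30)
y = 2 * x + rng.normal(scale=0.3, size=x.size)   # truth: linear + noise

lin = np.polyfit(x, y, 1)    # smooth fit
wig = np.polyfit(x, y, 15)   # overfitted fit

x0 = np.linspace(-0.9, 0.9, 50)
dx = 1e-3                    # tiny jitter on the inputs
shift = lambda c: np.max(np.abs(np.polyval(c, x0 + dx) - np.polyval(c, x0)))
print(shift(lin), shift(wig))  # the overfitted fit reacts far more strongly
```

The linear fit's response moves by roughly its slope times the jitter, while the high-degree fit, with its steep local wiggles, amplifies the same jitter much more, which is the lower signal-to-noise ratio in action.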