Solved – Why is cross validation error high upon overfitting

Tags: cross-validation, error, machine learning, overfitting

http://www.cs.cornell.edu/courses/cs4780/2015fa/web/lecturenotes/lecturenote13.html

ref: Figure 1: overfitting and underfitting

Shouldn't the cross validation error follow the training error and remain low?
Is this because the cross validation data set is smaller than the training data set?
Overfitting by definition means the model fits perfectly and produces the expected result, so the error is supposed to be low or none. What am I missing?

Best Answer

Sorry, my rep is too low to comment, so I will post this as an answer.

The benefit of conducting CV is that you can train your model on all of the data you have and still obtain a good estimate of its true error.

The more variables you include in your model, the lower the training error will get. However, doing so leads to overfitting: the model becomes so specialized to its training data that it performs worse when unseen data comes along. As Michael said, this is because the model, in order to minimize training error, ends up fitting the noise present in the data. When you then use that model to predict unseen data, which carries a different noise signature, you end up with a greater prediction error.
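To make this concrete, here is a minimal sketch, not taken from the lecture notes or the original answer, using made-up synthetic data: polynomials of increasing degree are fit to noisy samples, and while the training error keeps shrinking, the error on a held-out set eventually grows.

```python
# Sketch: training error vs. held-out error as model flexibility grows.
# The data-generating function, noise level, and degrees are illustrative choices.
import numpy as np

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 1, 40))
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, x.size)  # true signal + noise

x_train, y_train = x[::2], y[::2]    # half the points for training
x_test, y_test = x[1::2], y[1::2]    # the rest stand in for "unseen" data

for degree in (1, 3, 9):
    coefs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coefs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coefs, x_test) - y_test) ** 2)
    print(f"degree {degree}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")
```

The high-degree fit chases the noise in the training points, so its training error is lowest while its error on the held-out points is highest.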

CV simulates this situation by holding out part of the data for testing; the held-out fold plays the role of the unseen data. CV repeats this K times, each time holding out a different fold, and averages the K errors into the validation error. Hence the validation error increases when the model is overfitted.
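For reference, a minimal K-fold sketch, assuming scikit-learn and a placeholder linear model on synthetic data: each fold is held out once, the model is trained on the remaining folds, and the per-fold errors are averaged into the reported validation error.

```python
# Sketch of K-fold cross-validation; model and data are placeholders.
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
y = X[:, 0] + rng.normal(0, 0.5, size=100)  # synthetic target

kf = KFold(n_splits=5, shuffle=True, random_state=1)
fold_errors = []
for train_idx, val_idx in kf.split(X):
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    fold_errors.append(mean_squared_error(y[val_idx], model.predict(X[val_idx])))

print("cross-validation error:", np.mean(fold_errors))
```

Because every fold is scored on data the model never trained on, an overfitted model shows a high average here even though its training error is low.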
