Machine Learning – Is Overfitting Better Than Underfitting?

Tags: bias-variance tradeoff, machine learning, neural networks, overfitting

I understand the main concepts behind overfitting and underfitting, even though some of the reasons why they occur are not entirely clear to me.

But what I am wondering is: isn't overfitting "better" than underfitting?

If we compare how well the model does on each dataset, we would get something like:

Overfitting: Training: good vs. Test: bad

Underfitting: Training: bad vs. Test: bad

Looking at how well each scenario does on the training and test data, it seems that in the overfitting scenario the model at least does well on the training data.

My intuition is that when the model does badly on the training data, it will also do badly on the test data, which seems worse overall to me.

Best Answer

Overfitting is likely to be worse than underfitting. The reason is that there is no real upper limit to the degradation of generalisation performance that can result from over-fitting, whereas there is for underfitting.

Consider a non-linear regression model, such as a neural network or polynomial model. Assume we have standardised the response variable. A maximally underfitted solution might completely ignore the training set and have a constant output regardless of the input variables. In this case the expected mean squared error on test data will be approximately the variance of the response variable in the training set.
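
To make that concrete, here is a minimal sketch (my own toy data and setup, not anything taken from the answer): a model that ignores the inputs and always predicts the training mean has a test MSE of roughly the variance of the standardised response, i.e. about 1, no matter how badly it underfits.

```python
import numpy as np

# Toy data: a non-linear signal plus Gaussian noise (assumed for illustration).
rng = np.random.default_rng(0)
x = rng.uniform(-5, 5, size=1000)
y = np.sin(x) + rng.normal(scale=0.3, size=1000)
y = (y - y.mean()) / y.std()                 # standardise the response

x_train, y_train = x[:500], y[:500]
x_test, y_test = x[500:], y[500:]

# A maximally underfitted "model": a constant output regardless of the inputs.
constant_prediction = y_train.mean()
test_mse = np.mean((y_test - constant_prediction) ** 2)
print(f"test MSE of the constant model: {test_mse:.3f}  (≈ variance of y ≈ 1)")
```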

Now consider an over-fitted model that exactly interpolates the training data. Doing so may require large excursions from the true conditional mean of the data-generating process between points in the training set, for example the spurious peak at about x = -5 in the figure below. If the first three training points were closer together on the x-axis, the peak would likely be even higher. As a result, the test error for such points can be arbitrarily large, and hence the expected MSE on test data can likewise be arbitrarily large.

[Figure: an overfitted model that interpolates the training data, with a spurious peak at about x = -5]

Source: https://en.wikipedia.org/wiki/Overfitting (it is actually a polynomial model in this case, but see below for an MLP example)

Edit: As @Accumulation suggests, here is an example where the extent of overfitting is much greater (10 randomly selected data points from a linear model with Gaussian noise, fitted to the utmost degree by a 10th order polynomial). Happily, the random number generator gave some points that were not very well spaced out first time!

[Figure: a 10th order polynomial fitted exactly through 10 noisy points from a linear model, with large excursions between the training points]
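
A minimal sketch along the lines of that experiment (my own seed, noise level, and test grid; the data behind the figure are not given) shows the same effect numerically: a degree-9 polynomial passes exactly through 10 noisy points from a linear model, and its test error can be far larger than that of a plain linear fit, depending on how the points happen to be spaced.

```python
import numpy as np

rng = np.random.default_rng(1)
x_train = np.sort(rng.uniform(-1, 1, size=10))
y_train = 2 * x_train + rng.normal(scale=0.2, size=10)   # linear model + Gaussian noise

# Degree 9 is the highest degree identifiable from 10 points: the polynomial
# interpolates the training data exactly (polyfit may warn about conditioning).
overfit_coeffs = np.polyfit(x_train, y_train, deg=9)
linear_coeffs = np.polyfit(x_train, y_train, deg=1)

x_test = np.linspace(-1, 1, 200)
y_true = 2 * x_test                                      # true conditional mean
mse_overfit = np.mean((np.polyval(overfit_coeffs, x_test) - y_true) ** 2)
mse_linear = np.mean((np.polyval(linear_coeffs, x_test) - y_true) ** 2)
print(f"test MSE, interpolating polynomial: {mse_overfit:.3f}")
print(f"test MSE, plain linear fit:         {mse_linear:.3f}")
```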

It is worth making a distinction between "overfitting" and "overparameterisation". Overparameterisation means you have used a model class that is more flexible than necessary to represent the underlying structure of the data, which normally implies a larger number of parameters. "Overfitting" means that you have optimised the parameters of a model in a way that gives a better "fit" to the training sample (i.e. a better value of the training criterion), but to the detriment of generalisation performance. You can have an over-parameterised model that does not overfit the data.

Unfortunately the two terms are often used interchangeably, perhaps because in earlier times the only real control of overfitting was achieved by limiting the number of parameters in the model (e.g. feature selection for linear regression models). However, regularisation (cf. ridge regression) decouples overparameterisation from overfitting, but our use of the terminology has not reliably adapted to that change (even though ridge regression is almost as old as I am!).
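
As a hedged illustration of that distinction (data, polynomial degree, and penalty value are my own choices), the same over-parameterised degree-9 basis can be fitted by ordinary least squares, which interpolates the noise, or with a ridge penalty, which uses exactly the same number of parameters but need not overfit:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(2)
x_train = np.sort(rng.uniform(-1, 1, size=10)).reshape(-1, 1)
y_train = 2 * x_train.ravel() + rng.normal(scale=0.2, size=10)
x_test = np.linspace(-1, 1, 200).reshape(-1, 1)
y_true = 2 * x_test.ravel()

for name, estimator in [("OLS, degree 9 (overfits)     ", LinearRegression()),
                        ("ridge, degree 9 (regularised)", Ridge(alpha=1e-2))]:
    model = make_pipeline(PolynomialFeatures(degree=9), estimator)
    model.fit(x_train, y_train)
    mse = np.mean((model.predict(x_test) - y_true) ** 2)
    print(f"{name}: test MSE = {mse:.3f}")
```

Both fits use the same ten basis coefficients; only the fitting criterion differs, which is the sense in which regularisation decouples overparameterisation from overfitting.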

Here is an example that was actually generated using an (overparameterised) MLP:

[Figure: a similar interpolating fit generated by an overparameterised MLP]
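
The code behind that figure is not shown; a rough sketch of how a similar picture could be produced (my own configuration, not the author's) is to push an over-parameterised MLP hard against a small noisy sample:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(3)
x_train = np.sort(rng.uniform(-1, 1, size=10)).reshape(-1, 1)
y_train = 2 * x_train.ravel() + rng.normal(scale=0.2, size=10)

# Far more hidden units than 10 points can justify, a tiny weight penalty,
# and optimisation run until the training error is essentially zero.
mlp = MLPRegressor(hidden_layer_sizes=(100, 100), activation="tanh",
                   solver="lbfgs", alpha=1e-8, max_iter=50_000, tol=1e-12)
mlp.fit(x_train, y_train)

x_test = np.linspace(-1, 1, 200).reshape(-1, 1)
print("training MSE:", np.mean((mlp.predict(x_train) - y_train) ** 2))
print("test MSE:    ", np.mean((mlp.predict(x_test) - 2 * x_test.ravel()) ** 2))
```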
