Solved – Can overfitting and underfitting occur simultaneously?


I am trying to understand overfitting and underfitting better. Consider a data generating process (DGP)
$$
Y=f(X)+\varepsilon
$$

where $f(\cdot)$ is a deterministic function, $X$ are some regressors and $\varepsilon$ is a random error term independent of $X$. Suppose we have a model
$$
Y=g(Z)+u
$$

where $g(\cdot)$ is a deterministic function, $Z$ are some regressors (perhaps partly overlapping with $X$ but not necessarily equal to $X$) and $u$ is a random error term independent of $Z$.

Overfitting

I think overfitting means the estimated model has captured some noise patterns due to $\varepsilon$ in addition to the deterministic patterns due to $f(X)$. According to James et al. "An Introduction to Statistical Learning" (2013) p. 32,

[Overfitting] happens because our statistical learning procedure is working too hard to find patterns in the training data, and may be picking up some patterns that are just caused by random chance rather than by true properties of the unknown function $f$.

A similar take is available on Wikipedia:

In statistics, overfitting is "the production of an analysis that corresponds too closely or exactly to a particular set of data, and may therefore fail to fit additional data or predict future observations reliably". An overfitted model is a statistical model that contains more parameters than can be justified by the data. The essence of overfitting is to have unknowingly extracted some of the residual variation (i.e. the noise) as if that variation represented underlying model structure.

A difference between the first and the second quote seems to be that Wikipedia mentions how many parameters are justified by the data, while James et al. only consider whether $g(\cdot)$ is capturing patterns due to $\varepsilon$. If we follow James et al. but not Wikipedia, the line between overfitting and absence thereof seems a bit blurry. Typically, even a very simple $g(\cdot)$ will capture at least some of the random patterns due to $\varepsilon$. However, making $g(\cdot)$ more flexible might nevertheless improve predictive performance, as a more flexible $g(\cdot)$ will be able to approximate $f(\cdot)$ better. As long as the improvement in approximating $f(\cdot)$ outweighs the deterioration due to approximating patterns in $\varepsilon$, it pays to make $g(\cdot)$ more flexible.
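
To make this trade-off concrete, here is a minimal simulation sketch; the sine-shaped $f$, the noise level and the degree grid are illustrative assumptions of mine, not something from the references.

```python
# A minimal simulation sketch of the flexibility trade-off described above.
# The true f, the noise level and the degree grid are illustrative choices.
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    return np.sin(2 * np.pi * x)  # assumed "true" deterministic f

n_train, n_test, noise_sd = 50, 10_000, 0.3
x_train = rng.uniform(0, 1, n_train)
x_test = rng.uniform(0, 1, n_test)
y_train = f(x_train) + rng.normal(0, noise_sd, n_train)  # Y = f(X) + eps
y_test = f(x_test) + rng.normal(0, noise_sd, n_test)

for degree in (1, 3, 6, 12):
    coefs = np.polyfit(x_train, y_train, degree)          # g: polynomial in x
    mse_train = np.mean((np.polyval(coefs, x_train) - y_train) ** 2)
    mse_test = np.mean((np.polyval(coefs, x_test) - y_test) ** 2)
    print(f"degree {degree:2d}: train MSE {mse_train:.3f}, test MSE {mse_test:.3f}")
```

In runs like this the training MSE typically falls monotonically in the degree, while the test MSE is roughly U-shaped: extra flexibility pays off only as long as the better approximation of $f$ outweighs the noise being fitted.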

Underfitting

I think underfitting means $g(Z)$ is insufficiently flexible to nest $f(X)$. The approximation of $f(X)$ by $g(Z)$ would be imperfect even given perfect estimation precision of the model's parameters, and thus $g(Z)$ would do worse than $f(X)$ in predicting $Y$. According to Wikipedia,

Underfitting occurs when a statistical model cannot adequately capture the underlying structure of the data. An under-fitted model is a model where some parameters or terms that would appear in a correctly specified model are missing. Under-fitting would occur, for example, when fitting a linear model to non-linear data.
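
To illustrate the last sentence of that quote, here is a small sketch with a quadratic $f$ of my own choosing; the sample is made large on purpose, so that the remaining gap to the noise floor is bias from the misspecified $g$ rather than estimation error.

```python
# A tiny sketch of the "linear fit to non-linear data" example. The quadratic
# f, the noise level and the sample size are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(2)
n, sigma = 100_000, 0.5
x = rng.uniform(-1, 1, n)
y = x ** 2 + rng.normal(0, sigma, n)          # Y = f(X) + eps with f(x) = x^2

X_lin = np.column_stack([np.ones(n), x])      # g: straight line in x
coef, *_ = np.linalg.lstsq(X_lin, y, rcond=None)
mse = np.mean((X_lin @ coef - y) ** 2)
print(f"linear-fit MSE {mse:.3f} vs irreducible sigma^2 {sigma**2:.3f}")
# Even with 100,000 observations the MSE stays near sigma^2 + Var(x^2),
# about 0.34, well above sigma^2 = 0.25: the straight line cannot capture
# the curvature of f, i.e. it underfits no matter how much data we have.
```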

Simultaneous over- and underfitting

If we follow the definition of overfitting by James et al., I think overfitting and underfitting can occur simultaneously. Take a very simple $g(Z)$ which does not nest $f(X)$, and there will obviously be underfitting. There will be a bit of overfitting, too, because in all likelihood, $g(Z)$ will capture at least some of the random patterns due to $\varepsilon$.

If we follow the definition of overfitting by Wikipedia, I think overfitting and underfitting can still occur simultaneously. Take a rather rich $g(Z)$ which does not nest $f(X)$ but is rich enough to capture lots of random patterns due to $\varepsilon$. As $g(Z)$ does not nest $f(X)$, there will be underfitting. As $g(Z)$ captures lots of random patterns due to $\varepsilon$, there will be overfitting, too; a simpler $g(Z)$ could be found which would improve predictive performance by learning less of the random patterns.

Question

Does my reasoning make sense? Can overfitting and underfitting occur simultaneously?

Best Answer

Your reasoning makes sense to me.

Here is an extremely simple example. Suppose that $X$ consists of only two columns $x_1$ and $x_2$, and the true DGP is

$$ y=\beta_1x_1+\beta_2x_2+\varepsilon $$

with nonzero $\beta_1$ and $\beta_2$, and noise $\varepsilon$.

Next, assume that $Z$ contains the columns $x_1, x_1^2, x_1^3, \dots$, but not $x_2$.

If we now fit $g(Z)$ (using OLS, or any other approach), we cannot capture the effect of $x_2$, simply because $x_2$ is unknown to $g(Z)$, so we will have underfitting. But conversely, including spurious powers of $x_1$ (or any other spurious predictors) means that we can overfit, and usually will do so, unless we regularize in some way.
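
To put rough numbers on this, here is a sketch of the same setup; the coefficient values, sample sizes, noise level and the maximum power of $x_1$ are arbitrary assumptions of mine, and the fit is plain OLS via least squares.

```python
# A rough numerical illustration of the example above; the coefficients,
# sample sizes, noise level and maximum power are arbitrary assumptions.
import numpy as np

rng = np.random.default_rng(1)
n_train, n_test, sigma = 100, 10_000, 1.0
beta1, beta2 = 1.0, 1.0

def simulate(n):
    x1, x2 = rng.normal(size=n), rng.normal(size=n)
    y = beta1 * x1 + beta2 * x2 + rng.normal(0, sigma, n)   # true DGP
    return x1, x2, y

x1_tr, x2_tr, y_tr = simulate(n_train)
x1_te, x2_te, y_te = simulate(n_test)

def design(x1, max_power=10):
    # Misspecified design Z: an intercept and powers of x1 only; x2 is absent.
    return np.column_stack([x1 ** d for d in range(max_power + 1)])

Z_tr, Z_te = design(x1_tr), design(x1_te)
coef, *_ = np.linalg.lstsq(Z_tr, y_tr, rcond=None)          # OLS fit of g(Z)

mse_train = np.mean((Z_tr @ coef - y_tr) ** 2)
mse_test = np.mean((Z_te @ coef - y_te) ** 2)
print(f"train MSE {mse_train:.2f}, test MSE {mse_test:.2f}")
# Benchmarks: the irreducible error is sigma^2 = 1, and the best predictor
# that ignores x2 has expected squared error sigma^2 + beta2^2 * Var(x2) = 2.
# Typically the test MSE lands above 2, well above sigma^2 (the missing x2
# means underfitting), while the training MSE sits clearly below the test MSE
# because the spurious powers of x1 have soaked up noise (overfitting).
```

Penalizing the powers of $x_1$ would shrink the train-test gap (less overfitting), but no regularization can push the expected test error below the $\sigma^2 + \beta_2^2$ floor created by the missing $x_2$ (with $\sigma^2$ the noise variance), which is the simultaneous under- and overfitting in a nutshell.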
