Solved – Can overfitting and underfitting occur simultaneously?


I am trying to understand overfitting and underfitting better. Consider a data generating process (DGP)
$$
Y=f(X)+\varepsilon
$$

where $f(\cdot)$ is a deterministic function, $X$ are some regressors and $\varepsilon$ is a random error term independent of $X$. Suppose we have a model
$$
Y=g(Z)+u
$$

where $g(\cdot)$ is a deterministic function, $Z$ are some regressors (perhaps partly overlapping with $X$ but not necessarily equal to $X$) and $u$ is a random error term independent of $Z$.

Overfitting

I think overfitting means the estimated model has captured some noise patterns due to $\varepsilon$ in addition to the deterministic patterns due to $f(X)$. According to James et al. "An Introduction to Statistical Learning" (2013) p. 32,

[Overfitting] happens because our statistical learning procedure is working too hard to find patterns in the training data, and may be picking up some patterns that are just caused by random chance rather than by true properties of the unknown function $f$.

A similar take is available on Wikipedia:

In statistics, overfitting is "the production of an analysis that corresponds too closely or exactly to a particular set of data, and may therefore fail to fit additional data or predict future observations reliably". An overfitted model is a statistical model that contains more parameters than can be justified by the data. The essence of overfitting is to have unknowingly extracted some of the residual variation (i.e. the noise) as if that variation represented underlying model structure.

A difference between the first and the second quote seems to be that Wikipedia mentions how many parameters are justified by the data, while James et al. only consider whether $g(\cdot)$ is capturing patterns due to $\varepsilon$. If we follow James et al. but not Wikipedia, the line between overfitting and absence thereof seems a bit blurry. Typically, even a very simple $g(\cdot)$ will capture at least some of the random patterns due to $\varepsilon$. However, making $g(\cdot)$ more flexible might nevertheless improve predictive performance, as a more flexible $g(\cdot)$ will be able to approximate $f(\cdot)$ better. As long as the improvement in approximating $f(\cdot)$ outweighs the deterioration due to approximating patterns in $\varepsilon$, it pays to make $g(\cdot)$ more flexible.
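
To make this trade-off concrete, here is a minimal simulation sketch; the sine-shaped $f$, the noise level and the degree grid are illustrative assumptions of mine, not something from the references.

```python
# A minimal simulation sketch of the flexibility trade-off described above.
# The true f, the noise level and the degree grid are illustrative choices.
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    return np.sin(2 * np.pi * x)  # assumed "true" deterministic f

n_train, n_test, noise_sd = 50, 10_000, 0.3
x_train = rng.uniform(0, 1, n_train)
x_test = rng.uniform(0, 1, n_test)
y_train = f(x_train) + rng.normal(0, noise_sd, n_train)  # Y = f(X) + eps
y_test = f(x_test) + rng.normal(0, noise_sd, n_test)

for degree in (1, 3, 6, 12):
    coefs = np.polyfit(x_train, y_train, degree)          # g: polynomial in x
    mse_train = np.mean((np.polyval(coefs, x_train) - y_train) ** 2)
    mse_test = np.mean((np.polyval(coefs, x_test) - y_test) ** 2)
    print(f"degree {degree:2d}: train MSE {mse_train:.3f}, test MSE {mse_test:.3f}")
```

In runs like this the training MSE typically falls monotonically in the degree, while the test MSE is roughly U-shaped: extra flexibility pays off only as long as the better approximation of $f$ outweighs the noise being fitted.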

Underfitting

I think underfitting means $g(Z)$ is insufficiently flexible to nest $f(X)$. The approximation of $f(X)$ by $g(Z)$ would be imperfect even given perfect estimation precision of the model's parameters, and thus $g(Z)$ would do worse than $f(X)$ in predicting $Y$. According to Wikipedia,

Underfitting occurs when a statistical model cannot adequately capture the underlying structure of the data. An under-fitted model is a model where some parameters or terms that would appear in a correctly specified model are missing. Under-fitting would occur, for example, when fitting a linear model to non-linear data.
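
To illustrate the last sentence of that quote, here is a small sketch with a quadratic $f$ of my own choosing; the sample is made large on purpose, so that the remaining gap to the noise floor is bias from the misspecified $g$ rather than estimation error.

```python
# A tiny sketch of the "linear fit to non-linear data" example. The quadratic
# f, the noise level and the sample size are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(2)
n, sigma = 100_000, 0.5
x = rng.uniform(-1, 1, n)
y = x ** 2 + rng.normal(0, sigma, n)          # Y = f(X) + eps with f(x) = x^2

X_lin = np.column_stack([np.ones(n), x])      # g: straight line in x
coef, *_ = np.linalg.lstsq(X_lin, y, rcond=None)
mse = np.mean((X_lin @ coef - y) ** 2)
print(f"linear-fit MSE {mse:.3f} vs irreducible sigma^2 {sigma**2:.3f}")
# Even with 100,000 observations the MSE stays near sigma^2 + Var(x^2),
# about 0.34, well above sigma^2 = 0.25: the straight line cannot capture
# the curvature of f, i.e. it underfits no matter how much data we have.
```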

Simultaneous over- and underfitting

If we follow the definition of overfitting by James et al., I think overfitting and underfitting can occur simultaneously. Take a very simple $g(Z)$ which does not nest $f(X)$, and there will obviously be underfitting. There will be a bit of overfitting, too, because in all likelihood, $g(Z)$ will capture at least some of the random patterns due to $\varepsilon$.

If we follow the definition of overfitting by Wikipedia, I think overfitting and underfitting can still occur simultaneously. Take a rather rich $g(Z)$ which does not nest $f(X)$ but is rich enough to capture lots of random patterns due to $\varepsilon$. As $g(Z)$ does not nest $f(X)$, there will be underfitting. As $g(Z)$ captures lots of random patterns due to $\varepsilon$, there will be overfitting, too; a simpler $g(Z)$ could be found which would improve predictive performance by learning less of the random patterns.

Question

Does my reasoning make sense? Can overfitting and underfitting occur simultaneously?

Best Answer

Your reasoning makes sense to me.

Here is an extremely simple example. Suppose that $X$ consists of only two columns $x_1$ and $x_2$, and the true DGP is

$$ y=\beta_1x_1+\beta_2x_2+\varepsilon $$

with nonzero $\beta_1$ and $\beta_2$, and noise $\varepsilon$.

Next, assume that $Z$ contains the columns $x_1, x_1^2, x_1^3, \dots$, but not $x_2$.

If we now fit $g(Z)$ (using OLS, or any other approach), we cannot capture the effect of $x_2$, simply because $x_2$ is unknown to $g(Z)$, so we will have underfitting. But conversely, including spurious powers of $x_1$ (or any other spurious predictors) means that we can overfit, and usually will do so, unless we regularize in some way.
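
To put rough numbers on this, here is a sketch of the same setup; the coefficient values, sample sizes, noise level and the maximum power of $x_1$ are arbitrary assumptions of mine, and the fit is plain OLS via least squares.

```python
# A rough numerical illustration of the example above; the coefficients,
# sample sizes, noise level and maximum power are arbitrary assumptions.
import numpy as np

rng = np.random.default_rng(1)
n_train, n_test, sigma = 100, 10_000, 1.0
beta1, beta2 = 1.0, 1.0

def simulate(n):
    x1, x2 = rng.normal(size=n), rng.normal(size=n)
    y = beta1 * x1 + beta2 * x2 + rng.normal(0, sigma, n)   # true DGP
    return x1, x2, y

x1_tr, x2_tr, y_tr = simulate(n_train)
x1_te, x2_te, y_te = simulate(n_test)

def design(x1, max_power=10):
    # Misspecified design Z: an intercept and powers of x1 only; x2 is absent.
    return np.column_stack([x1 ** d for d in range(max_power + 1)])

Z_tr, Z_te = design(x1_tr), design(x1_te)
coef, *_ = np.linalg.lstsq(Z_tr, y_tr, rcond=None)          # OLS fit of g(Z)

mse_train = np.mean((Z_tr @ coef - y_tr) ** 2)
mse_test = np.mean((Z_te @ coef - y_te) ** 2)
print(f"train MSE {mse_train:.2f}, test MSE {mse_test:.2f}")
# Benchmarks: the irreducible error is sigma^2 = 1, and the best predictor
# that ignores x2 has expected squared error sigma^2 + beta2^2 * Var(x2) = 2.
# Typically the test MSE lands above 2, well above sigma^2 (the missing x2
# means underfitting), while the training MSE sits clearly below the test MSE
# because the spurious powers of x1 have soaked up noise (overfitting).
```

Penalizing the powers of $x_1$ would shrink the train-test gap (less overfitting), but no regularization can push the expected test error below the $\sigma^2 + \beta_2^2$ floor created by the missing $x_2$ (with $\sigma^2$ the noise variance), which is the simultaneous under- and overfitting in a nutshell.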
