Solved – Overfitting and Underfitting

dataset, machine learning, overfitting

I have done some research on overfitting and underfitting, and I understand what exactly they are, but I cannot find the reasons behind them.

What are the main reasons for overfitting and underfitting?

Why do we face these two problems when training a model?

Best Answer

I'll try to answer in the simplest way. Each of those problems has its own main origin:

Overfitting: Data is noisy, meaning there are deviations from reality (because of measurement errors, random influences, unobserved variables and spurious correlations) that make it harder to see the true relationship between our explanatory factors and the outcome. The data is also usually incomplete (we don't have examples of everything).

As an example, let's say I am trying to classify boys and girls based on their height, because that's the only information I have about them. Even though boys are taller on average than girls, there is a huge overlap region, making it impossible to separate them perfectly with that single piece of information. Depending on the density of the data, a sufficiently complex model might achieve a higher success rate on the training dataset than is theoretically possible on new data, because it can draw boundaries that isolate individual points. So, if the only 2.04-meter-tall person in our data happens to be a woman, the model could draw a little circle around that region, implying that a random person who is 2.04 meters tall is most likely a woman.

The underlying reason for all of this is trusting the training data too much (in the example, since there is no man of height 2.04 m in the data, the model concludes that this height is only possible for women).
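To make this concrete, here is a minimal sketch in Python, assuming scikit-learn is available; the height data is synthetic and the exact numbers are illustrative, not taken from the example above. An unconstrained decision tree memorizes the training set, noise included:

```python
# A minimal sketch of overfitting; the height data is synthetic and illustrative.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Heights in meters: overlapping distributions for girls (0) and boys (1).
n = 200
heights = np.concatenate([rng.normal(1.65, 0.07, n),   # girls
                          rng.normal(1.78, 0.07, n)])  # boys
labels = np.concatenate([np.zeros(n), np.ones(n)])

# Split even/odd indices into train and test halves.
X_train, y_train = heights[::2].reshape(-1, 1), labels[::2]
X_test, y_test = heights[1::2].reshape(-1, 1), labels[1::2]

# An unconstrained tree can isolate every training point in its own region.
overfit = DecisionTreeClassifier().fit(X_train, y_train)
print("train accuracy:", overfit.score(X_train, y_train))  # ~1.0: memorized the noise
print("test accuracy: ", overfit.score(X_test, y_test))    # noticeably lower
```

The near-perfect training score is exactly the "little circle around 2.04 m" behavior: the tree carves out regions for individual training points, and the test score pays for it.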

Underfitting is the opposite problem, in which the model fails to capture the real complexity in our data (i.e. the non-random structure in it). The model assumes the noise is greater than it really is and therefore settles on an overly simplistic shape. So, if the dataset contains many more girls than boys for whatever reason, the model might just classify everyone as a girl.

In this case, the model didn't trust the data enough and assumed that all deviations are noise (in the example, the model assumes that boys simply do not exist).
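The opposite failure can be sketched the same way; the imbalanced dataset and the majority-class baseline below are hypothetical illustrations, again assuming scikit-learn. A model that always predicts the majority class looks deceptively accurate while learning nothing about height:

```python
# A minimal sketch of underfitting; dataset and baseline are hypothetical.
import numpy as np
from sklearn.dummy import DummyClassifier

rng = np.random.default_rng(1)

# Imbalanced data: many more girls (0) than boys (1).
heights = np.concatenate([rng.normal(1.65, 0.07, 300),   # girls (majority)
                          rng.normal(1.78, 0.07, 100)])  # boys
labels = np.concatenate([np.zeros(300), np.ones(100)])
X = heights.reshape(-1, 1)

# "Boys simply do not exist": always predict the majority class.
underfit = DummyClassifier(strategy="most_frequent").fit(X, labels)
print("accuracy:", underfit.score(X, labels))           # 0.75, yet it ignores height
print("classes predicted:", set(underfit.predict(X)))   # {0.0} -- only 'girl'
```

Despite 75% accuracy, the model misclassifies every single boy, just as described above.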

Bottom line is that we face these problems because:

  • We don't have complete information.
  • We don't know how noisy the data is (we don't know how much we should trust it).
  • We don't know in advance the underlying function that generated our data, and thus the optimal model complexity (the sketch after this list makes that trade-off concrete).
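Since the optimal complexity is unknown in advance, a common move is to sweep over it and compare training and test performance. A minimal sketch, once more assuming scikit-learn and the same kind of synthetic height data:

```python
# A sketch of the model-complexity trade-off; data is synthetic and illustrative.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(2)

# Overlapping height distributions again: girls (0) and boys (1).
heights = np.concatenate([rng.normal(1.65, 0.07, 300),
                          rng.normal(1.78, 0.07, 300)]).reshape(-1, 1)
labels = np.repeat([0, 1], 300)
X_tr, X_te, y_tr, y_te = train_test_split(heights, labels, random_state=0)

# We don't know the right complexity in advance, so we try several tree depths
# and watch the train/test gap open up as the model starts trusting the noise.
for depth in (1, 2, 4, 8, None):  # None = grow the tree without limit
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    print(f"max_depth={depth}: train={tree.score(X_tr, y_tr):.2f}, "
          f"test={tree.score(X_te, y_te):.2f}")
```

Training accuracy keeps climbing with depth while test accuracy levels off or drops: the shallowest trees underfit, the unconstrained one overfits, and the gap between the two scores is the signature of trusting noise.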