Solved – Maximum likelihood estimators and overfitting

maximum-likelihood, overfitting

In his book, Bishop claims that overfitting is caused by an unfortunate property of the maximum likelihood estimator (MLE). I don't really understand how the MLE relates to overfitting.

To me, roughly, overfitting is related to model complexity, i.e., the more parameters I have, the more my model tends to overfit (i.e., to model the random noise).

Maximum likelihood estimation, however, is just a way to estimate parameters from my sample (or training set). As far as I understand it, it does not regulate the number of parameters whatsoever, and therefore I do not see the connection between MLE and overfitting.

Also, maximum likelihood estimators are often biased, but biased models tend to underfit rather than overfit.

1.) How are these two things related, and how does the MLE induce overfitting?

2.) Is there a "mathematical" justification, i.e., is it possible to show in terms of formulae how these two things are connected? (A similar question was already asked here, but only with rather hand-waving answers.)

3.) Which "unfortunate property" of MLE does Bishop claim to be the reason for overfitting?

Best Answer

The key to understanding Bishop's statement lies in the first paragraph, second sentence, of section 3.2: "... the use of maximum likelihood, or equivalently least squares, can lead to severe over-fitting if complex models are trained using data sets of limited size".
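
As a partial answer to question 2, the "equivalently least squares" part is easy to show. Sketching it in Bishop's notation, and assuming i.i.d. Gaussian noise with precision $\beta$ on the targets $t_n$, the log-likelihood of the training data is

$$ \ln p(\mathbf{t}\mid\mathbf{w},\beta) \;=\; -\frac{\beta}{2}\sum_{n=1}^{N}\bigl\{t_n - y(x_n,\mathbf{w})\bigr\}^2 \;+\; \frac{N}{2}\ln\beta \;-\; \frac{N}{2}\ln 2\pi, $$

so maximizing it over $\mathbf{w}$ is the same as minimizing the sum-of-squares error $E(\mathbf{w})=\tfrac{1}{2}\sum_{n}\{t_n-y(x_n,\mathbf{w})\}^2$. Notice that nothing in this objective penalizes how many components $\mathbf{w}$ has or how large they are: every extra parameter can only drive the training error further down, noise and all.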

The problem comes about because no matter how many parameters you add to the model, the MLE technique will use them to fit more and more of the data (up to the point at which you have a 100% accurate fit), and a lot of that "fit more and more of the data" is fitting randomness - i.e., overfitting. For example, if I have $100$ data points and am fitting a polynomial of degree $99$ to the data, MLE will give me a perfect in-sample fit, but that fit won't generalize at all well - I really cannot expect anywhere near 100% accurate predictions from this model.

Because MLE is not regularized in any way, there is no mechanism within the maximum likelihood framework to prevent this overfitting from occurring; you have to do that yourself, by hand, by structuring and restructuring your model, hopefully appropriately. This is the "unfortunate property" Bishop refers to. Your statement "... it does not regulate the number of parameters whatsoever..." is actually the crux of the connection between MLE and overfitting!
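
Here is a minimal sketch of that polynomial experiment, scaled down to $15$ points and degree $14$ so the linear algebra stays well behaved (the target function, noise level, and seed are just illustrative choices):

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)

# A smaller-scale version of the 100-points / degree-99 experiment:
# 15 noisy observations of a smooth function, fit with a degree-14 polynomial.
n = 15
x_train = np.sort(rng.uniform(0.0, 1.0, n))
t_train = np.sin(2 * np.pi * x_train) + rng.normal(0.0, 0.2, n)

# Least squares == maximum likelihood under Gaussian noise.
# With degree n-1 the polynomial can pass through every training point.
p = Polynomial.fit(x_train, t_train, deg=n - 1)

x_test = np.linspace(0.05, 0.95, 200)
t_test = np.sin(2 * np.pi * x_test)          # noise-free "truth" for evaluation

train_rmse = np.sqrt(np.mean((p(x_train) - t_train) ** 2))
test_rmse = np.sqrt(np.mean((p(x_test) - t_test) ** 2))

print(f"training RMSE: {train_rmse:.1e}")    # close to zero: the fit (nearly) interpolates the data
print(f"test RMSE:     {test_rmse:.1e}")     # far larger: the wiggles between points are noise-chasing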

Now this is all well and good, but if there were no other model estimation approaches that helped with overfitting, we wouldn't be able to say that this was an unfortunate property specifically of MLE - it would be an unfortunate property of all model estimation techniques, and therefore not really worth discussing in the context of comparing MLE to other techniques.

However, there are other model estimation approaches - Lasso, Ridge regression, and Elastic Net, to name three from a classical statistics tradition, and Bayesian approaches as well - that do attempt to limit overfitting as part of the estimation procedure. One could also think of the entire field of robust statistics as being about deriving estimators and tests that are less prone to overfitting than the MLE. Naturally, these alternatives do not eliminate the need to take some care with the model-specification process, but they help - a lot - and therefore provide a valid contrast with MLE, which does not help at all.
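
To make the contrast concrete, here is the same toy setup fit two ways - plain least squares (the MLE under Gaussian noise) and ridge regression, which differs only by the penalty term $\lambda\,\mathbf{w}^\top\mathbf{w}$ added to the objective. The value of $\lambda$ below is a hypothetical choice; in practice you would pick it by cross-validation:

```python
import numpy as np

rng = np.random.default_rng(0)
n, degree = 15, 14
x = np.sort(rng.uniform(0.0, 1.0, n))
t = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.2, n)

# Rescale inputs to [-1, 1] before building polynomial features,
# purely to keep the design matrix reasonably conditioned.
z = 2.0 * x - 1.0
Phi = np.vander(z, degree + 1, increasing=True)      # Phi[i, j] = z_i ** j

# Plain least squares == MLE under Gaussian noise: nothing penalizes w.
w_mle = np.linalg.lstsq(Phi, t, rcond=None)[0]

# Ridge: minimize ||t - Phi w||^2 + lam * ||w||^2, via its closed form.
lam = 1e-3                                           # hypothetical penalty strength
w_ridge = np.linalg.solve(Phi.T @ Phi + lam * np.eye(degree + 1), Phi.T @ t)

# Evaluate both fits against the noise-free truth on a dense grid.
x_test = np.linspace(0.05, 0.95, 200)
Phi_test = np.vander(2.0 * x_test - 1.0, degree + 1, increasing=True)
truth = np.sin(2 * np.pi * x_test)

for name, w in [("MLE  ", w_mle), ("ridge", w_ridge)]:
    train_rmse = np.sqrt(np.mean((Phi @ w - t) ** 2))
    test_rmse = np.sqrt(np.mean((Phi_test @ w - truth) ** 2))
    print(f"{name}  ||w|| = {np.linalg.norm(w):10.2f}   "
          f"train RMSE = {train_rmse:.3f}   test RMSE = {test_rmse:.3f}")
```

Typically the ridge coefficients come out far smaller and the test error far lower, even though the in-sample fit is slightly worse - exactly the trade-off that plain maximum likelihood has no way to make.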
