Solved – Is overfitting a problem in unsupervised learning

density-estimation, overfitting, unsupervised-learning

Consider the density estimation problem for some training set $(x_1, \dots, x_N)$. A Gaussian mixture model consisting of $N$ normal distributions, one centered on each $x_i$ and each with a very small variance, will "overfit": the likelihood will be very high on the training data and very low on unseen data points.
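To make the example concrete, here is a minimal sketch in plain Python. The data (1-D standard normal samples), the sample size of 200, the equal mixture weights, and the value `sigma = 1e-4` are all assumptions chosen for illustration, and `mixture_avg_loglik` is a hypothetical helper, not part of any library.

```python
import math
import random

def log_mean_exp(values):
    """Numerically stable log of the mean of exp(values)."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values) / len(values))

def mixture_avg_loglik(centers, sigma, points):
    """Average log-density of `points` under an equal-weight mixture of
    1-D Gaussians N(c, sigma^2), one component per element of `centers`."""
    log_norm = -0.5 * math.log(2 * math.pi) - math.log(sigma)
    total = 0.0
    for x in points:
        comps = [log_norm - 0.5 * ((x - c) / sigma) ** 2 for c in centers]
        total += log_mean_exp(comps)
    return total / len(points)

# Assumed toy data: train and unseen points from the same distribution.
rng = random.Random(0)
train = [rng.gauss(0.0, 1.0) for _ in range(200)]
test = [rng.gauss(0.0, 1.0) for _ in range(200)]

sigma = 1e-4  # "very small variance" (illustrative value)
train_ll = mixture_avg_loglik(train, sigma, train)
test_ll = mixture_avg_loglik(train, sigma, test)

print(f"average log-likelihood on training data: {train_ll:.2f}")
print(f"average log-likelihood on unseen data:   {test_ll:.2f}")
```

Under these assumptions the training value should come out positive while the value on unseen points is far lower, because almost no unseen point falls within a few `sigma` of a training point.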

My questions are:

  • Is overfitting considered a problem in unsupervised learning, just as it is in supervised learning? (it is certainly not discussed as frequently!)
  • Should cross-validation be used to prevent overfitting of unsupervised models?
  • Are there theoretical results similar to the generalization bounds derived in the supervised setting? (results that would for example relate the expected likelihood, the likelihood on the training set, the sample size and the model complexity)

Best Answer

We talk about overfitting when a model performs better on the training sample than on the validation sample. First of all, how would you define overfitting for unsupervised learning? If you conduct, say, a cluster analysis of your data, there is no objective criterion for saying that some output is "correct". Moreover, there is no "correct" clustering solution, because there are no labels in the unsupervised scenario. How would you judge the performance of a clustering? How would you say that it performs "worse" on the validation sample? The same applies to cross-validation. You can check how stable a clustering solution is when it is learned on multiple subsamples, but this has nothing to do with underfitting or overfitting.

On the other hand, you can speak of a sort of overfitting in the unsupervised case. If you fit $n$ clusters to $n$ cases, you end up with a (useless) clustering solution that does not carry over to external data. In that case the clustering would overfit by design, but this is not really measurable.

The same goes for density estimation. There is no single "correct" solution. On the other hand, if you let the bandwidth in kernel density estimation shrink toward zero, you end up with a density estimate that fits your data perfectly (a spike at every observation) but does not carry over to external data. The whole trick here is to find a solution that is general enough to be useful and detailed enough to capture specific features of your data, but there is no single best solution of this kind.
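One possible way to see the bandwidth trade-off is to sweep the bandwidth of a Gaussian kernel density estimate and compare the average log-likelihood on the fitting sample with that on a second sample. The sketch below is only an illustration: the toy data, the bandwidth grid, and the helper `kde_avg_loglik` are assumptions, and held-out log-likelihood is just one criterion among several, not a definition of the "correct" estimate.

```python
import math
import random

def kde_avg_loglik(train, h, points):
    """Average log-density of `points` under a Gaussian KDE fitted on
    `train` with bandwidth h (computed with a stable log-sum-exp)."""
    log_norm = -0.5 * math.log(2 * math.pi) - math.log(h)
    total = 0.0
    for x in points:
        logs = [log_norm - 0.5 * ((x - c) / h) ** 2 for c in train]
        m = max(logs)
        total += m + math.log(sum(math.exp(v - m) for v in logs) / len(train))
    return total / len(points)

# Assumed toy data: two independent samples from a standard normal.
rng = random.Random(1)
train = [rng.gauss(0.0, 1.0) for _ in range(200)]
held_out = [rng.gauss(0.0, 1.0) for _ in range(200)]

bandwidths = [1.0, 0.3, 0.1, 0.01, 0.001]  # illustrative grid

# Training points are scored with their own kernel included (no
# leave-one-out), which is what rewards ever smaller bandwidths.
train_ll = {h: kde_avg_loglik(train, h, train) for h in bandwidths}
held_ll = {h: kde_avg_loglik(train, h, held_out) for h in bandwidths}

for h in bandwidths:
    print(f"h={h:<6} train={train_ll[h]:12.2f} held-out={held_ll[h]:12.2f}")

best_h = max(bandwidths, key=lambda h: held_ll[h])
print("bandwidth with the highest held-out log-likelihood:", best_h)
```

With this setup the smallest bandwidth should score best on the fitting sample and very poorly on the second sample, while an intermediate bandwidth scores best on the second sample. Which bandwidth is "best" still depends on the criterion and on the data.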
