Is it necessary to split data in clustering like in supervised learning?

clustering, unsupervised learning

I'm learning clustering analysis and one book I read says the clustering model should be applied to a disjoint data set to examine the consistency of the model.

I think in clustering analysis we don't need to split the data into train and test sets like in supervised learning since without labels there is nothing to "train".

So what is the possible meaning of this "consistency"? How is it evaluated? Is this disjoint data set really necessary?

Thank you!

Edit: There isn't really a broader context. The text talks about how to select the optimal number of clusters and then mentions this. I don't think this consistency is about the number of clusters…

Best Answer

A train-test split is used to detect overfitting in supervised machine learning.

In unsupervised clustering there are no labels to evaluate predictions against, so you cannot overfit in this way.

You can, however, overfit in other ways, e.g. by choosing an unsupervised evaluation criterion that measures a quantity your clustering procedure also optimizes. Then you don't identify the best result; you simply favor the algorithm that is most closely related to your evaluation procedure. Don't use such measures to compare different algorithms.
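As a rough illustration of that last point, here is a minimal sketch using scikit-learn. The data set (two half-moons), the two algorithms (k-means and single-linkage agglomerative clustering) and the Calinski-Harabasz index are my own choices for the example, not something taken from the book in the question. For a fixed number of clusters, maximizing the Calinski-Harabasz index is equivalent to minimizing the within-cluster sum of squares, which is exactly what k-means optimizes, so the index is biased toward k-means by construction. The generating labels `y` would not exist in a real clustering task; they are used here only as a reference.

```python
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score, calinski_harabasz_score

# Toy data (an assumption for this sketch): two interleaved half-moons.
# y holds the generating labels and is used only as a reference.
X, y = make_moons(n_samples=500, noise=0.05, random_state=0)

# k-means minimizes the within-cluster sum of squares.
km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Single linkage merges clusters by nearest-neighbour distance instead.
sl_labels = AgglomerativeClustering(n_clusters=2, linkage="single").fit_predict(X)

# Internal criterion: for fixed k it rewards a small within-cluster sum of
# squares, i.e. the same quantity k-means optimizes.
ch_km = calinski_harabasz_score(X, km_labels)
ch_sl = calinski_harabasz_score(X, sl_labels)

# External reference: agreement with the generating labels.
ari_km = adjusted_rand_score(y, km_labels)
ari_sl = adjusted_rand_score(y, sl_labels)

print(f"k-means:        CH = {ch_km:.1f}, ARI = {ari_km:.2f}")
print(f"single linkage: CH = {ch_sl:.1f}, ARI = {ari_sl:.2f}")
```

On data like this I would expect k-means to get the higher Calinski-Harabasz score even though single linkage agrees better with the generating labels. The internal score mostly tells you which algorithm shares its objective, not which clustering is better.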