Classification and Clustering – Why Choose Clustering with Known Ground Truth?

classification, clustering, model-evaluation

If the ground truth (the class, cluster, or segment each observation belongs to) is known in advance, why would someone choose to perform clustering instead of classification? In fact, doesn't the problem "automatically" become a classification problem?

This question came to my mind as I was going through some clustering performance evaluation criteria. I came across the Rand index, an evaluation metric that requires the ground truth to be known in advance, hence my question.

Best Answer

You would want to cluster instead of classify when the real-world problems you will face don't share the same categories as the evaluation set you use.

For instance, let's say you know the true partition of a small network into six groups. If you were to learn a classifier from it, then for all future networks you'd only be able to split them into those same six groups. By contrast, with clustering, you can divide each new network into an arbitrary number of groups, which may be more appropriate.

You validate the clustering model on datasets whose structure you know, in the hope that it generalizes to ones whose structure you don't.
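To make the validation step concrete, here is a minimal sketch of the Rand index mentioned in the question, written in plain Python. It scores a clustering by the fraction of observation pairs on which it agrees with the ground truth, where "agree" means the pair is grouped together in both labelings or separated in both. The function name `rand_index` and the toy label lists are assumptions made up for illustration, not taken from any particular library.

```python
from itertools import combinations


def rand_index(labels_true, labels_pred):
    """Fraction of observation pairs on which two labelings agree."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("labelings must have the same length")
    pairs = list(combinations(range(len(labels_true)), 2))
    if not pairs:
        raise ValueError("need at least two observations")
    # A pair counts as an agreement when both labelings put the two
    # observations together, or both labelings keep them apart.
    agree = sum(
        (labels_true[i] == labels_true[j]) == (labels_pred[i] == labels_pred[j])
        for i, j in pairs
    )
    return agree / len(pairs)


# Hypothetical toy data: six observations with two true groups.
truth = [0, 0, 0, 1, 1, 1]
renamed = [2, 2, 2, 0, 0, 0]  # same partition, different label names
finer = [0, 0, 1, 2, 2, 3]    # four clusters instead of two

print(rand_index(truth, renamed))  # 1.0
print(rand_index(truth, finer))    # 11/15, about 0.733
```

Note that the score depends only on which observations are grouped together, not on the label names or on the number of clusters. That is why it can evaluate a clustering against known groups without turning the task into classification.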


At a broader level, if the ground truth is known, then there's nothing left for you to predict, so the only reasonable goal is to understand (or explain) that structure. You can have competing hypotheses (competing models) of how that ground-truth structure arose. Some might be drawn from the clustering literature, others from the classification literature. Each has its own set of inductive biases.
