External validation of clustering requires labels, but why cluster at all if you have labels?

Tags: classification, clustering, supervised learning, unsupervised learning, validation

There are two types of validation in clustering, using:

  • Internal indexes: Used to measure the goodness of a clustering structure without reference to external information (e.g., sum of squared errors)

  • External indexes: Used to compare the results of a cluster analysis to an externally known result, such as externally provided class labels (e.g., Rand index, purity)
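To make the distinction concrete, here is a minimal pure-Python sketch of one internal index (sum of squared errors) and two external ones (Rand index and purity). The function names and the toy data are made up for illustration; they do not come from any particular library.

```python
from collections import Counter
from itertools import combinations


def sse(points, labels):
    """Internal index: squared distance of every point to its cluster centroid.

    Uses only the data and the cluster assignment, no class labels.
    """
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    total = 0.0
    for members in clusters.values():
        dim = len(members[0])
        centroid = [sum(m[d] for m in members) / len(members) for d in range(dim)]
        total += sum((m[d] - centroid[d]) ** 2 for m in members for d in range(dim))
    return total


def rand_index(true_labels, pred_labels):
    """External index: fraction of point pairs on which both partitions agree
    (same group in both, or different groups in both)."""
    pairs = list(combinations(range(len(true_labels)), 2))
    agree = sum(
        (true_labels[i] == true_labels[j]) == (pred_labels[i] == pred_labels[j])
        for i, j in pairs
    )
    return agree / len(pairs)


def purity(true_labels, pred_labels):
    """External index: fraction of points that belong to the majority class
    of the cluster they were assigned to."""
    by_cluster = {}
    for t, p in zip(true_labels, pred_labels):
        by_cluster.setdefault(p, Counter())[t] += 1
    return sum(max(c.values()) for c in by_cluster.values()) / len(true_labels)


# Toy data (illustrative only): two well-separated groups.
points = [(0, 0), (0, 1), (5, 5), (5, 6)]
true_labels = ["a", "a", "b", "b"]
good_clustering = [1, 1, 0, 0]   # cluster ids are arbitrary, only the grouping matters
poor_clustering = [0, 0, 0, 1]

print(sse(points, good_clustering))
print(rand_index(true_labels, good_clustering), purity(true_labels, good_clustering))
print(rand_index(true_labels, poor_clustering), purity(true_labels, poor_clustering))
```

Note that `sse` never looks at `true_labels`, while the two external indexes never look at `points`: they only compare two partitions of the same items.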

I'm confused about the use of external validation indexes in clustering. Since the class labels are known, why use clustering (i.e., unsupervised learning) instead of supervised learning (e.g., an SVM)?

Best Answer

External validity indices are used when you propose a new clustering technique and want to validate it or compare it to existing techniques. In these cases, you take a bunch of datasets for which you know the ground truth and see whether your clustering technique produces clustering solutions that are similar to it. The labels are only used to score the result afterwards; the clustering algorithm itself never sees them.
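As a rough sketch of that workflow, the snippet below scores two toy 1-D clustering methods against known labels on two tiny benchmark datasets using the Rand index. Everything here is hypothetical and chosen for illustration: `split_at_largest_gap` stands in for the "new" technique, `split_at_mean` for an existing baseline, and the datasets are invented.

```python
from itertools import combinations


def rand_index(truth, pred):
    """Fraction of point pairs on which the two partitions agree."""
    pairs = list(combinations(range(len(truth)), 2))
    agree = sum((truth[i] == truth[j]) == (pred[i] == pred[j]) for i, j in pairs)
    return agree / len(pairs)


def split_at_largest_gap(values):
    """Hypothetical 'new' method: sort the values and cut at the widest gap."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    gaps = [values[order[k + 1]] - values[order[k]] for k in range(len(order) - 1)]
    cut = gaps.index(max(gaps))
    labels = [0] * len(values)
    for rank, i in enumerate(order):
        labels[i] = 0 if rank <= cut else 1
    return labels


def split_at_mean(values):
    """Hypothetical baseline: assign each value to one side of the mean."""
    mean = sum(values) / len(values)
    return [0 if v <= mean else 1 for v in values]


# Invented benchmark datasets: (values, ground-truth labels).
# The methods only ever receive the values; the labels are used for scoring.
benchmarks = {
    "balanced": ([1.0, 1.2, 0.9, 5.0, 5.1, 4.8], [0, 0, 0, 1, 1, 1]),
    "unbalanced": ([0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 20.0], [0, 0, 0, 0, 0, 0, 0, 1]),
}


def mean_rand_index(method, benchmarks):
    """Average Rand index of a clustering method over all benchmark datasets."""
    scores = [rand_index(truth, method(values)) for values, truth in benchmarks.values()]
    return sum(scores) / len(scores)


for name, method in [("largest gap", split_at_largest_gap), ("mean split", split_at_mean)]:
    print(name, mean_rand_index(method, benchmarks))
```

The point of the exercise is the comparison between methods, not the labels themselves: once a technique has been shown to recover known structure on benchmarks like these, it can be applied with more confidence to data where no labels exist.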