Solved – Why is it not advised to use k-means for classification

classification, clustering, k-means

If you have labelled training data, a seemingly trivial extension of k-means clustering is to assign each centroid a label and then classify a new test point with the label of its nearest centroid.
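To make the idea concrete, here is a minimal pure-Python sketch of the scheme described above. The helper names (`kmeans`, `label_centroids`, `predict`), the majority-vote rule for labelling a centroid, and the toy data are assumptions made for illustration, not something taken from a particular library.

```python
import random
from collections import Counter

def sq_dist(a, b):
    # Squared Euclidean distance between two points given as tuples.
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(x, centers):
    # Index of the center closest to x (ties go to the lowest index).
    return min(range(len(centers)), key=lambda j: sq_dist(x, centers[j]))

def kmeans(points, k, iters=50, seed=0):
    # Plain Lloyd's algorithm; initial centroids are k distinct data points.
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        # An empty cluster keeps its previous centroid.
        new = [tuple(sum(col) / len(c) for col in zip(*c)) if c else centroids[j]
               for j, c in enumerate(clusters)]
        if new == centroids:
            break
        centroids = new
    return centroids

def label_centroids(centroids, points, labels):
    # Assumption: each centroid takes the majority label of the points nearest to it.
    votes = [Counter() for _ in centroids]
    for p, y in zip(points, labels):
        votes[nearest(p, centroids)][y] += 1
    return [v.most_common(1)[0][0] if v else None for v in votes]

def predict(centroids, centroid_labels, x):
    # Classify x with the label of its nearest centroid.
    return centroid_labels[nearest(x, centroids)]

# Toy data: two well-separated groups, one label each.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
labels = ["a", "a", "a", "a", "b", "b", "b", "b"]

centroids = kmeans(points, k=2)
centroid_labels = label_centroids(centroids, points, labels)
print(predict(centroids, centroid_labels, (0.5, 0.2)))
print(predict(centroids, centroid_labels, (10.2, 10.9)))
```

On data this cleanly separated the scheme works, which is why it looks attractive. The clustering step never sees the labels, though, so nothing guarantees that clusters line up with classes on harder data.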

However, I could not find much online about the (ab)use of k-means clustering for this purpose, and people advise using just k-nearest neighbors or SVMs directly.

Why would (ab)using k-means clustering via this relatively straightforward method not yield good results?

Best Answer

Where is the benefit of doing this?

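A cluster found by k-means may contain points with many different labels. If you give the whole cluster a single label, every point with a different label is misclassified, so you decrease quality!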
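As a rough illustration of that loss, the sketch below counts the labels inside each cluster and computes the training error you get when each cluster is labelled by majority vote. The function names and the hand-written cluster assignments are hypothetical, chosen only to show the arithmetic.

```python
from collections import Counter, defaultdict

def label_counts_per_cluster(assignments, labels):
    # Map each cluster id to a Counter of the labels that fall inside it.
    counts = defaultdict(Counter)
    for cluster_id, y in zip(assignments, labels):
        counts[cluster_id][y] += 1
    return counts

def majority_vote_error(assignments, labels):
    # Fraction of points whose label differs from their cluster's majority label.
    counts = label_counts_per_cluster(assignments, labels)
    wrong = sum(sum(c.values()) - max(c.values()) for c in counts.values())
    return wrong / len(labels)

# Hypothetical clustering result: both clusters mix the two labels.
assignments = [0, 0, 0, 1, 1, 1]
labels = ["a", "a", "b", "b", "b", "a"]

print(dict(label_counts_per_cluster(assignments, labels)))
print(majority_vote_error(assignments, labels))
```

Here each cluster holds two points of one label and one of the other, so one point in three is wrong before any test data is seen. Nearest-neighbor classification on the raw training points does not pay this cost.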

Just do nearest-neighbor classification.

But if you look into the kNN classification literature, you will find different ways of reducing the "training" set. I am certain there is at least one paper that suggests using k-means with a very large k to reduce redundancy in your training data. But you need to handle clusters that contain more than one label smartly.
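One possible way to handle mixed clusters, sketched below, is to replace each single-label cluster by its mean and to keep the original points of any cluster that mixes labels, then run 1-nearest-neighbor on the reduced set. This rule is an assumption chosen for illustration and is not taken from a specific paper. The names `condense` and `predict_1nn` are hypothetical.

```python
import random
from collections import defaultdict

def sq_dist(a, b):
    # Squared Euclidean distance between two points given as tuples.
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(x, centers):
    # Index of the center closest to x (ties go to the lowest index).
    return min(range(len(centers)), key=lambda j: sq_dist(x, centers[j]))

def mean(pts):
    # Coordinate-wise mean of a non-empty list of points.
    return tuple(sum(col) / len(pts) for col in zip(*pts))

def kmeans(points, k, iters=100, seed=0):
    # Plain Lloyd's algorithm; initial centroids are k distinct data points.
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        new = [mean(c) if c else centroids[j] for j, c in enumerate(clusters)]
        if new == centroids:
            break
        centroids = new
    return centroids

def condense(points, labels, k, seed=0):
    # Reduce the training set to a list of (point, label) prototypes.
    centroids = kmeans(points, k, seed=seed)
    members = defaultdict(list)
    for p, y in zip(points, labels):
        members[nearest(p, centroids)].append((p, y))
    prototypes = []
    for group in members.values():
        group_labels = {y for _, y in group}
        if len(group_labels) == 1:
            # Pure cluster: a single prototype at the mean of its members.
            prototypes.append((mean([p for p, _ in group]), group_labels.pop()))
        else:
            # Mixed cluster: keep the original points so no label is lost.
            prototypes.extend(group)
    return prototypes

def predict_1nn(prototypes, x):
    # 1-nearest-neighbor classification against the reduced set.
    return prototypes[nearest(x, [p for p, _ in prototypes])][1]

points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
labels = ["a", "a", "a", "a", "b", "b", "b", "b"]

prototypes = condense(points, labels, k=4)
print(len(prototypes), "prototypes instead of", len(points), "points")
print(predict_1nn(prototypes, (0.2, 0.3)), predict_1nn(prototypes, (10.5, 10.5)))
```

With a large k most clusters tend to be small and pure, so the reduced set stays close to the original decision boundary while the mixed regions keep their original points.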