Solved – k-means clustered data: how to label newly incoming data

Tags: classification, clustering, k-means, machine learning, svm

I have a data set with labels that were produced by a k-means clustering algorithm. Now there is some data (with the same data structure) from another source, and I wonder what the most sensible way is to label this new, as yet unseen data. I was thinking about either

  • calculating the distance to the prior k-means centroids and labeling the data according to the nearest centroid
  • running a new algorithm (e.g. SVM) on the new data using the old data as the training set

Unfortunately, I couldn't find anything about this particular problem. There are only a few questions about the general use of k-means as a classification model:

  • Can k-means clustering do classification?
  • How to segment new data with existing K-means model?

Best Answer

You are correct about

calculating the distance to the prior k-means centroids and labeling the data according to the nearest centroid

The reason running a new algorithm (e.g., SVM) will not work is that clustering differs from supervised learning, where you have a label for each data point. If we have new data, we still do not have their labels. So all we can use is the output of the clustering, i.e., the centroids.
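The nearest-centroid rule can be sketched in a few lines. This is only a minimal illustration, assuming Euclidean distance (the usual choice for k-means) and two features; the function name `assign_to_nearest_centroid`, the centroid coordinates, and the new points are all made up for the example and do not come from the question.

```python
import math


def assign_to_nearest_centroid(points, centroids):
    """Return, for each point, the index of the closest centroid (Euclidean)."""
    labels = []
    for p in points:
        # Distance from this point to every stored centroid.
        distances = [math.dist(p, c) for c in centroids]
        # The label is the position of the smallest distance.
        labels.append(distances.index(min(distances)))
    return labels


# Hypothetical centroids kept from the earlier k-means run (illustrative values).
centroids = [(0.0, 0.0), (5.0, 5.0), (10.0, 0.0)]

# Hypothetical new, unlabeled observations with the same two features.
new_points = [(0.5, -0.2), (4.0, 6.0), (9.0, 1.0), (7.0, 1.0)]

new_labels = assign_to_nearest_centroid(new_points, centroids)
print(new_labels)  # [0, 1, 2, 2]
```

If the original clustering was done with scikit-learn, the fitted `KMeans` object's `predict` method performs the same closest-cluster assignment. Either way, any scaling or other preprocessing applied before the original clustering should presumably be applied to the new data as well, otherwise the distances are not comparable.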