Solved – k-means clustered data: how to label newly incoming data

classification, clustering, k-means, machine learning, svm

I have a data set with labels that were produced by a k-means clustering algorithm. Now there is some data (with the same structure) from another source, and I wonder what the most sensible way to label this new, as yet unseen data would be. I was thinking about either

calculating the distance to the prior k-means centroids and labelling each new point with its nearest centroid, or
running a new algorithm (e.g. SVM) on the new data, using the old data as the training set

Unfortunately, I couldn't find anything about this particular problem. There are only a few questions about the general use of k-means as a classification model:

Can k-means clustering do classification?
How to segment new data with existing K-means model?

Best Answer

You are correct about

calculating the distance to the prior k-means centroids and labelling each new point with its nearest centroid

The reason running a new algorithm (e.g., SVM) will not work is that clustering differs from supervised learning, where you have a true label for each data point. The labels you have are only the output of the clustering, and the new data arrive without labels as well. So the only thing we can use is the output of the clustering itself, i.e., the centroids.
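As a minimal sketch of the nearest-centroid rule, assuming the original clustering was fitted with R's `kmeans` and Euclidean distance: the helper `assign_to_centroids` and the toy data are hypothetical names introduced only for this illustration, not part of any package.

```r
# Hypothetical helper: label each row of `newdata` with the index of the
# nearest centroid (squared Euclidean distance).
assign_to_centroids <- function(newdata, centers) {
  newdata <- as.matrix(newdata)
  # One column of squared distances per centroid
  d2 <- sapply(seq_len(nrow(centers)), function(k) {
    rowSums(sweep(newdata, 2, centers[k, ])^2)
  })
  # sapply() returns a plain vector when there is a single new point
  d2 <- matrix(d2, nrow = nrow(newdata))
  # Column with the smallest distance = nearest centroid
  max.col(-d2, ties.method = "first")
}

# Toy "old" data: two well-separated groups, clustered with k-means
set.seed(1)
old <- rbind(matrix(rnorm(200, mean = 0), ncol = 2),
             matrix(rnorm(200, mean = 6), ncol = 2))
km <- kmeans(old, centers = 2, nstart = 10)

# "New" data from another source, same two columns
new_points <- rbind(c(0.2, -0.1), c(5.8, 6.3))
new_labels <- assign_to_centroids(new_points, km$centers)
new_labels
```

The returned integers use the same numbering as `km$cluster`, so new and old labels are directly comparable. If the original data were scaled before clustering, the new data would need the same scaling first.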

Related Solutions

Solved – K-Means clustering after first iteration

I am not sure there is a "standard" thing to do when one of the initial centroids is completely off.

You can easily test this by specifying the initial centroids yourself and seeing how things evolve!

For instance, R will just give you an error.

Say you do:

# Set the RNG seed to ensure reproducibility
set.seed(12345)

# Let's create 3 visually distinct clusters
n <- c(1000, 500, 850)
classifier.1 <- c(rnorm(n[1], 10, 0.9), 
                  rnorm(n[2], 25, 2),
                  rnorm(n[3], 35, 2))
classifier.2 <- c(rnorm(n[1], 5, 1),
                  rnorm(n[2], 10, 0.4),
                  rnorm(n[3], 2, .9))

# Colours used to tell the clusters apart in the plot
col <- c("blue", "darkgreen", "darkred")

# Run k-means with 3 clusters and random initial centroids
# to check the clusters are correctly recognized
km <- kmeans(cbind(classifier.1, classifier.2), 3)
# Plot the data, colored by cluster
plot(classifier.1, classifier.2, pch=20, col=col[km$cluster])

# Mark the final centroids
points(km$centers, pch=20, cex=2, col="orange")

# Now impose some obviously "wrong" starting centroids
start.x <- c(10, 25, 3000)
start.y <- c(10, 10, -10000)
km.2 <- kmeans(cbind(classifier.1, classifier.2), 
               centers=cbind(start.x, start.y))

R obviously has no trouble separating the 3 clusters when you let it choose the initial centroids, but the second call, with the imposed starting centroids, just fails with:

Error: empty cluster: try a better set of initial centers

I guess that if you are implementing your own algorithm, you can either adopt this behaviour (stop with an error) or give the user a warning and let the algorithm choose the centroids by itself.
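The "warn and choose the centroids yourself" option could look roughly like the sketch below. The wrapper `kmeans_with_fallback` and the toy data are hypothetical, introduced only for illustration; the sketch assumes that base R's `kmeans` signals an error for the bad centres, as described above.

```r
# Hypothetical wrapper (not part of base R): try the supplied centres first,
# and fall back to random starts if kmeans() raises an error.
kmeans_with_fallback <- function(x, centers, nstart = 10) {
  tryCatch(
    kmeans(x, centers = centers),
    error = function(e) {
      warning("supplied centres rejected (", conditionMessage(e),
              "); using random initial centres instead", call. = FALSE)
      # Same number of clusters, but let kmeans() pick the starting points
      kmeans(x, centers = nrow(centers), nstart = nstart)
    }
  )
}

# Toy data with three groups and one deliberately absurd starting centre
set.seed(12345)
x <- cbind(c(rnorm(100, 10), rnorm(100, 25), rnorm(100, 35)),
           c(rnorm(100, 5),  rnorm(100, 10), rnorm(100, 2)))
bad_centers <- cbind(c(10, 25, 3000), c(10, 10, -10000))

fit <- kmeans_with_fallback(x, bad_centers)
fit$size
```

Whether a silent fallback is desirable depends on the use case: someone who supplies centres on purpose may prefer the hard error.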

Obviously, as others pointed out, there are algorithms such as k-means++ that help in choosing a good set of starting centroids.

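Also, in R you can use the nstart parameter of the kmeans function to try several random sets of initial centroids and keep the best result (this applies when centers is given as a number of clusters rather than a matrix): this will improve clustering in certain situations.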
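A small sketch of how `nstart` might be used, comparing one random start with the best of 25 on toy data (the data and variable names are made up for this example):

```r
# Toy data: three groups of different sizes
set.seed(12345)
x <- cbind(c(rnorm(100, 10, 0.9), rnorm(50, 25, 2),   rnorm(85, 35, 2)),
           c(rnorm(100, 5, 1),    rnorm(50, 10, 0.4), rnorm(85, 2, 0.9)))

set.seed(1)
km_single <- kmeans(x, centers = 3)               # one random start
set.seed(1)
km_multi  <- kmeans(x, centers = 3, nstart = 25)  # best of 25 random starts

# Lower total within-cluster sum of squares means a tighter clustering
c(single = km_single$tot.withinss, multi = km_multi$tot.withinss)
```

On data this well separated the two values will often coincide; the benefit of several starts shows up mainly when a single unlucky start gets stuck in a poor local optimum.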

EDIT: also, note from the R kmeans help page

The algorithm of Hartigan and Wong (1979) is used by default. Note that some authors use k-means to refer to a specific algorithm rather than the general method: most commonly the algorithm given by MacQueen (1967) but sometimes that given by Lloyd (1957) and Forgy (1965). The Hartigan–Wong algorithm generally does a better job than either of those, but trying several random starts (nstart> 1) is often recommended. For ease of programmatic exploration, k=1 is allowed, notably returning the center and withinss.

Except for the Lloyd–Forgy method, k clusters will always be returned if a number is specified. If an initial matrix of centres is supplied, it is possible that no point will be closest to one or more centres, which is currently an error for the Hartigan–Wong method.

Solved – Class labels in data partitions

If you consider stratified sampling, I think something similar could be done here, assuming your class is not so underrepresented that it does not even have 3 examples (one for training, one for testing and one for cross-validation).

Using a method like stratified sampling, you would make certain that each class is represented by randomly selecting instances of that class for each data set.
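One way this could be sketched in base R is shown below. The helper `stratified_split`, the 60/20/20 proportions and the toy labels are all assumptions made for illustration: each class first contributes one randomly chosen example to every partition, and the remainder is distributed roughly by proportion.

```r
# Hypothetical helper: assign every observation to train/validation/test
# so that each class appears at least once in every partition.
stratified_split <- function(labels,
                             props = c(train = 0.6, validation = 0.2, test = 0.2)) {
  part <- character(length(labels))
  # Process the row indices of each class separately
  for (idx in split(seq_along(labels), labels, drop = TRUE)) {
    n <- length(idx)
    if (n < length(props)) {
      stop("a class has fewer examples than there are partitions")
    }
    idx <- idx[sample.int(n)]                    # shuffle within the class
    # One guaranteed example per partition, the rest split by proportion
    counts <- 1 + floor((n - length(props)) * props)
    counts[1] <- counts[1] + (n - sum(counts))   # leftovers go to training
    part[idx] <- rep(names(props), times = counts)
  }
  factor(part, levels = names(props))
}

set.seed(42)
labels <- factor(c(rep("a", 50), rep("b", 30), rep("rare", 4)))
part <- stratified_split(labels)
table(labels, part)
```

The explicit `stop()` mirrors the caveat above: with fewer than 3 examples of a class, no three-way split can represent that class everywhere.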

If you are running into this problem, you also might question whether you have enough data to train your algorithm well. Sometimes the correct answer is to get more data, assuming that is possible.
