Why not first look at other clustering algorithms? For example, DBSCAN does not need to know the number of clusters, and it can work with any distance function (assuming that you want to go with something like cosine distance for your sparse vectors).
Given the characteristics of your data, k-means is bound to fail. The problem is that the means will most likely no longer be sparse, so they are actually outliers, and by no means central objects. Don't use k-means with sparse vectors or non-Euclidean distances! (k-means may not converge for other distance functions, because the mean might no longer minimize the objective function!)
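A quick illustration of why the means stop being sparse (a minimal sketch in base R; the 0/1 term vectors are made up for the example):

```r
# Three "documents" as sparse 0/1 term vectors over a 10-term vocabulary:
# each document uses only 2 terms, so each vector is 80% zeros.
docs <- rbind(
  c(1, 1, 0, 0, 0, 0, 0, 0, 0, 0),
  c(0, 0, 1, 1, 0, 0, 0, 0, 0, 0),
  c(0, 0, 0, 0, 1, 1, 0, 0, 0, 0)
)

centroid <- colMeans(docs)
# The mean has a nonzero entry wherever *any* document does,
# so it is far denser than the vectors it is supposed to represent.
sum(docs[1, ] != 0)   # 2 nonzero entries per document
sum(centroid != 0)    # 6 nonzero entries in the centroid
```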
Seriously, there are at least 100 more modern clustering algorithms. Try these first.
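For example, with cosine distance you can hand DBSCAN a precomputed distance matrix. A minimal sketch in base R (the toy vectors are made up for illustration; dbscan::dbscan from the dbscan package accepts a dist object, so that last step is shown only as a comment):

```r
# Toy vectors as rows
x <- rbind(c(1, 0, 2, 0),
           c(2, 0, 4, 0),   # same direction as row 1
           c(0, 3, 0, 1))

# Cosine distance: 1 - cosine similarity between all row pairs
norms <- sqrt(rowSums(x^2))
cos.sim <- (x %*% t(x)) / (norms %o% norms)
cos.dist <- as.dist(1 - cos.sim)
# Rows 1 and 2 point in the same direction, so their distance is ~0.

# With the dbscan package installed you could then run, e.g.:
# library(dbscan)
# db <- dbscan(cos.dist, eps = 0.3, minPts = 2)
```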
I am not sure if there is a "standard" thing to do in the case where one of the initial centroids is completely off.
You can easily test this by specifying the initial centroids and see how things evolve!
For instance, R will just give you an error.
Say you do:
# Set the RNG seed to ensure reproducibility
set.seed(12345)
# Let's create 3 visually distinct clusters
n <- c(1000, 500, 850)
classifier.1 <- c(rnorm(n[1], 10, 0.9),
                  rnorm(n[2], 25, 2),
                  rnorm(n[3], 35, 2))
classifier.2 <- c(rnorm(n[1], 5, 1),
                  rnorm(n[2], 10, 0.4),
                  rnorm(n[3], 2, 0.9))
col <- c("blue", "darkgreen", "darkred")
# Run k-means with 3 clusters and random initial centroids
# to check the clusters are correctly recognized
km <- kmeans(cbind(classifier.1, classifier.2), 3)
# Plot the data, colored by cluster
plot(classifier.1, classifier.2, pch=20, col=col[km$cluster])
# Mark the final centroids
points(km$centers, pch=20, cex=2, col="orange")
# Now impose some obviously "wrong" starting centroids
start.x <- c(10, 25, 3000)
start.y <- c(10, 10, -10000)
km.2 <- kmeans(cbind(classifier.1, classifier.2),
               centers=cbind(start.x, start.y))
Now, R obviously has no issue discriminating the 3 clusters when you let it choose the initial centroids, but when you run it the second time it will just say:
Error: empty cluster: try a better set of initial centers
I guess that if you are implementing your own algorithm you may choose to adopt this behaviour, or rather issue a warning and let the algorithm pick the centroids by itself.
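If you do roll your own, one common fallback (a sketch, not R's actual behaviour) is to reseed any cluster that ends up empty with a random data point instead of raising an error:

```r
# One Lloyd-style k-means iteration with empty-cluster reseeding.
# x: numeric matrix (rows = points); centers: k x ncol(x) matrix.
# This helper is hypothetical, written for illustration only.
reseed.step <- function(x, centers) {
  k <- nrow(centers)
  # Distances from every point to every center (Euclidean)
  d <- as.matrix(dist(rbind(centers, x)))[-(1:k), 1:k, drop = FALSE]
  cl <- apply(d, 1, which.min)    # assign each point to nearest center
  for (j in 1:k) {
    pts <- x[cl == j, , drop = FALSE]
    if (nrow(pts) == 0) {
      # Empty cluster: reseed from a random data point instead of failing
      centers[j, ] <- x[sample(nrow(x), 1), ]
    } else {
      centers[j, ] <- colMeans(pts)
    }
  }
  centers
}
```

With the deliberately "wrong" starting centroids above, the far-away center would simply be reseeded on the first iteration rather than triggering the error.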
Obviously, as others pointed out, there are algorithms such as k-means++ that help in choosing a good set of starting centroids.
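For reference, the k-means++ seeding rule is simple enough to sketch in a few lines of base R (an illustrative implementation, not the one used by any package):

```r
# k-means++ seeding: pick the first center uniformly at random, then pick
# each subsequent center with probability proportional to the squared
# distance from the nearest center chosen so far.
kmeanspp.centers <- function(x, k) {
  centers <- x[sample(nrow(x), 1), , drop = FALSE]
  while (nrow(centers) < k) {
    # Squared distance of every point to its nearest chosen center
    d2 <- apply(x, 1, function(p) min(colSums((t(centers) - p)^2)))
    centers <- rbind(centers, x[sample(nrow(x), 1, prob = d2), ])
  }
  centers
}
# The result can be passed straight to kmeans() via the centers argument:
# km <- kmeans(x, centers = kmeanspp.centers(x, 3))
```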
Also, in R you can use the nstart parameter of the kmeans function to run several starts with different random centroids and keep the best result: this will improve clustering in certain situations.
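For instance (on simulated data, made up for this example):

```r
set.seed(12345)
x <- cbind(c(rnorm(100, 0), rnorm(100, 5), rnorm(100, 10)),
           c(rnorm(100, 0), rnorm(100, 5), rnorm(100, 0)))
# One random start vs. the best of 25 random starts:
km.1  <- kmeans(x, 3)               # nstart = 1 (the default)
km.25 <- kmeans(x, 3, nstart = 25)  # keeps the run with lowest tot.withinss
km.1$tot.withinss
km.25$tot.withinss
```

Since nstart = 25 keeps the run with the lowest total within-cluster sum of squares out of 25 attempts, it is much less likely to get stuck in a bad local optimum than a single random start.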
EDIT: also, note from the R kmeans help page:
The algorithm of Hartigan and Wong (1979) is used by default. Note
that some authors use k-means to refer to a specific algorithm rather
than the general method: most commonly the algorithm given by MacQueen
(1967) but sometimes that given by Lloyd (1957) and Forgy (1965). The
Hartigan–Wong algorithm generally does a better job than either of
those, but trying several random starts (nstart > 1) is often
recommended. For ease of programmatic exploration, k=1 is allowed,
notably returning the center and withinss.
Except for the Lloyd–Forgy method, k clusters will always be returned
if a number is specified. If an initial matrix of centres is supplied,
it is possible that no point will be closest to one or more centres,
which is currently an error for the Hartigan–Wong method.
Best Answer
First, to be clear, the term "centroid" is just another way of saying "mean". K-means clustering, when performed in accordance with its definition, always recomputes each cluster's centroid as the actual mean of all the data points that were categorized into that cluster on that iteration. It seems like your point of confusion is the re-classification step, but I'm not completely sure what your question is.
Yes, when re-classifying, you are using the mean of one distinct set of points in order to define a cluster of another distinct set of points - whose mean will likely be different from the mean of the first set of points. But these centroids/means are always the correct "real" mean of a particular set of points (after all, that's how you find these points), except perhaps your first assignments. I hope that helps, and that I'm not just restating a bunch of stuff you may already know. Perhaps try to clarify your question a little more.
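You can check this invariant directly in R: at convergence, each row of km$centers is exactly the mean of the points assigned to that cluster (the simulated data here is made up for the check):

```r
set.seed(1)
x <- matrix(rnorm(200), ncol = 2)
km <- kmeans(x, 3, nstart = 10)
# Recompute each cluster's mean by hand and compare with km$centers
by.hand <- t(sapply(1:3, function(j)
  colMeans(x[km$cluster == j, , drop = FALSE])))
all.equal(unname(km$centers), unname(by.hand))  # TRUE
```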
There are other algorithms similar to k-means that don't always necessarily use the means - like k-medians, for example.
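In k-medians the update step just swaps the mean for a coordinate-wise median, which makes the centers much more robust to outliers (a tiny illustrative sketch; the helper name is made up):

```r
# k-medians update step for one cluster: coordinate-wise median
# instead of the mean (hypothetical helper, for illustration only).
median.center <- function(pts) apply(pts, 2, median)

pts <- rbind(c(1, 1), c(2, 2), c(100, 100))  # one outlier
colMeans(pts)        # the mean is dragged toward the outlier
median.center(pts)   # the median stays with the bulk of the data: (2, 2)
```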