Solved – What does minimising the loss function mean in k-means clustering

clustering, k-means, loss-functions

I am learning about the k-means clustering algorithm, and I have read that the algorithm is "Trying to minimise a loss function in which the goal of clustering is not met".

I understand the basic concept of the algorithm, which initialises arbitrary centroids/means in the first iteration and then assigns data points to these clusters. The centroids are then updated after the points are all assigned, and points are re-assigned again. The algorithm continues to iterate until the clusters do not change anymore. The algorithm tries to minimise the within-cluster sum of squares (WCSS) value which is a measure of the variance within the clusters.

However, I am having trouble understanding what is meant by a loss function in the context of this algorithm. Any insights are appreciated.

Best Answer

Given $n$ points $\{x_i\}_{i=1}^n$ and a known number of clusters $k$, I think a possible loss function would be something like: $$L(c_1,\dots,c_k) = \sum_{i=1}^n \min_j \| x_i - c_j \|^2 .$$ This would be the loss function for the k-means problem, but that doesn't mean the k-means algorithm is explicitly trying to decrease this loss (as gradient descent would).
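To make the formula concrete, here is a minimal R sketch that evaluates $L$ and tracks it across a few Lloyd-style iterations (assign each point to its nearest centre, then recompute the means). The helper names `sq_dist`, `kmeans_loss` and `lloyd_step` and the toy data are assumptions made for illustration, not part of any library. Each of the two steps minimises $L$ with respect to one set of variables while the other is held fixed, so the recorded loss should never go up, even though no gradient is computed anywhere.

```r
# Squared Euclidean distances: entry [i, j] is the distance from point i to centre j
sq_dist <- function(x, centers) {
  do.call(cbind, lapply(seq_len(nrow(centers)), function(j) {
    rowSums(sweep(x, 2, centers[j, ])^2)
  }))
}

# Loss from the formula above: each point contributes its squared
# distance to the nearest centre
kmeans_loss <- function(x, centers) {
  sum(apply(sq_dist(x, centers), 1, min))
}

# One Lloyd-style iteration (hypothetical helper): assign, then update means
lloyd_step <- function(x, centers) {
  cl <- apply(sq_dist(x, centers), 1, which.min)
  for (j in seq_len(nrow(centers))) {
    # An empty cluster keeps its old centre
    if (any(cl == j)) centers[j, ] <- colMeans(x[cl == j, , drop = FALSE])
  }
  centers
}

# Toy data: two well-separated blobs in two dimensions
set.seed(1)
x <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 5), ncol = 2))

# Deliberately poor start: both centres are taken from the same blob
centers <- x[c(1, 2), ]
losses <- kmeans_loss(x, centers)
for (it in 1:10) {
  centers <- lloyd_step(x, centers)
  losses <- c(losses, kmeans_loss(x, centers))
}
print(round(losses, 2))

# For a converged fit, L evaluated at the fitted centres matches the
# total within-cluster sum of squares reported by kmeans()
km <- kmeans(x, centers = 2, nstart = 5)
print(c(loss = kmeans_loss(x, km$centers), tot.withinss = km$tot.withinss))
```

The last line ties the loss back to the WCSS mentioned in the question: once every point sits with its nearest centre, the two quantities coincide.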

Related Solutions

Solved – Comparison of k-means clustering output

Since k-means can converge to a different local minimum on each run, the results of two runs can differ by an arbitrary amount. Conversely, if two values are close but not identical, I'd consider it much more likely that there is some slight error in one of the two implementations.

If there are multiple local minima, multiple runs with different seedings should give you a number of candidates so there is a high chance of actually finding the same result.
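As a rough sketch of how two runs might be compared in R (the toy data and the variable names `run_a`, `run_b` and `same_partition` are assumptions for illustration): cluster labels are arbitrary, so comparing the label vectors directly is not meaningful. Comparing the objective values and cross-tabulating the two partitions is one simple alternative.

```r
# Toy data: three blobs along the diagonal
set.seed(42)
x <- rbind(matrix(rnorm(200, mean = 0), ncol = 2),
           matrix(rnorm(200, mean = 4), ncol = 2),
           matrix(rnorm(200, mean = 8), ncol = 2))

# Two runs that differ only in their random seeding
set.seed(1); run_a <- kmeans(x, centers = 3)
set.seed(2); run_b <- kmeans(x, centers = 3)

# Objective values: equal (up to rounding) suggests the same local minimum
print(c(a = run_a$tot.withinss, b = run_b$tot.withinss))

# Cross-tabulate the two labelings; identical partitions give exactly
# one non-zero cell in every row and every column
agreement <- table(run_a$cluster, run_b$cluster)
print(agreement)
same_partition <- all(rowSums(agreement > 0) == 1) &&
  all(colSums(agreement > 0) == 1)
print(same_partition)
```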

But in the end, k-means is so simple, and such a crude heuristic, that it is unclear what is gained by comparing two results. On many data sets it is still pretty much a random partitioning: optimized for a local minimum, but still meaningless.

Solved – K-Means clustering after first iteration

I am not sure if there is a "standard" thing to do in the case one of the initial centroids is completely off.

You can easily test this by specifying the initial centroids and see how things evolve!

For instance, R will just give you an error.

Say you do:

# Set the RNG seed to ensure reproducibility
set.seed(12345)

# Let's create 3 visually distinct clusters
n <- c(1000, 500, 850)
classifier.1 <- c(rnorm(n[1], 10, 0.9), 
                  rnorm(n[2], 25, 2),
                  rnorm(n[3], 35, 2))
classifier.2 <- c(rnorm(n[1], 5, 1),
                  rnorm(n[2], 10, 0.4),
                  rnorm(n[3], 2, .9))

col = c("blue", "darkgreen", "darkred")
# Run k-means with 3 clusters and random initial centroids 
# to check the clusters are correctly recognized
km <- kmeans(cbind(classifier.1, classifier.2), 3)
# Plot the data, colored by cluster
plot(classifier.1, classifier.2, pch=20, col=col[km$cluster])

# Mark the final centroids
points(km$centers, pch=20, cex=2, col="orange")

# Now impose some obviously "wrong" starting centroids
start.x <- c(10, 25, 3000)
start.y <- c(10, 10, -10000)
km.2 <- kmeans(cbind(classifier.1, classifier.2), 
               centers=cbind(start.x, start.y))

R obviously has no trouble separating the 3 clusters when you let it choose the initial centroids, but when you run it the second time, with the imposed centroids, it will just say:

Error: empty cluster: try a better set of initial centers

I guess that if you are implementing your own algorithm you may choose to use this behaviour or rather give the user a warning and let the algorithm choose the centroids by itself.
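One way the "warn and let the algorithm choose" option could look in R is sketched below. The wrapper `kmeans_with_fallback` is a hypothetical helper, not part of base R, and it assumes `centers` is a matrix with one row per cluster. It only relies on `tryCatch()` to intercept the error shown above and on `kmeans()` accepting a cluster count in place of a centre matrix.

```r
# Hypothetical wrapper: try the user-supplied centres first; if kmeans()
# raises an error, warn and fall back to random starts
kmeans_with_fallback <- function(x, centers, nstart = 10) {
  tryCatch(
    kmeans(x, centers = centers),
    error = function(e) {
      warning("supplied centres failed (", conditionMessage(e),
              "); falling back to random starts")
      kmeans(x, centers = nrow(centers), nstart = nstart)
    }
  )
}

# Small data set in the spirit of the example above
set.seed(12345)
x <- cbind(c(rnorm(100, 10, 0.9), rnorm(100, 25, 2), rnorm(100, 35, 2)),
           c(rnorm(100, 5, 1), rnorm(100, 10, 0.4), rnorm(100, 2, 0.9)))

# Third starting centre is far away from every point
bad_start <- cbind(c(10, 25, 3000), c(10, 10, -10000))

fit <- kmeans_with_fallback(x, bad_start)
print(fit$size)
```

Whether a silent fallback is desirable depends on the application; raising the error, as R does, at least makes the bad initialisation visible.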

Obviously, as others pointed out, there are algorithms such as k-means++ that help in choosing a good set of starting centroids.

Also, in R you can use the nstart parameter of the kmeans function to run the algorithm from several random sets of initial centroids and keep the best result: this will improve clustering in certain situations.
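A small illustration of nstart, using made-up toy data (the variable names `single` and `multi` are assumptions for the example). With the same seed, the best of 25 starts should be no worse than a single start in terms of the total within-cluster sum of squares.

```r
# Toy data: four blobs, which gives a single random start more room to go wrong
set.seed(7)
x <- rbind(matrix(rnorm(200, mean = 0), ncol = 2),
           matrix(rnorm(200, mean = 3), ncol = 2),
           matrix(rnorm(200, mean = 6), ncol = 2),
           matrix(rnorm(200, mean = 9), ncol = 2))

# One random start versus the best of 25 random starts
set.seed(100); single <- kmeans(x, centers = 4, nstart = 1)
set.seed(100); multi  <- kmeans(x, centers = 4, nstart = 25)

# Lower is better: kmeans() keeps the start with the smallest objective
print(c(single = single$tot.withinss, multi = multi$tot.withinss))
```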

EDIT: also, note from the R kmeans help page

The algorithm of Hartigan and Wong (1979) is used by default. Note that some authors use k-means to refer to a specific algorithm rather than the general method: most commonly the algorithm given by MacQueen (1967) but sometimes that given by Lloyd (1957) and Forgy (1965). The Hartigan–Wong algorithm generally does a better job than either of those, but trying several random starts (nstart> 1) is often recommended. For ease of programmatic exploration, k=1 is allowed, notably returning the center and withinss.

Except for the Lloyd–Forgy method, k clusters will always be returned if a number is specified. If an initial matrix of centres is supplied, it is possible that no point will be closest to one or more centres, which is currently an error for the Hartigan–Wong method.
