Solved – Does the average distance in k-means have to be monotone decreasing?

clustering, k-means

I'm implementing the k-means algorithm myself. I don't see any obvious mistake in my code and it seems to work well. However, there's something I don't understand.

My algorithm, working on vectors $x_i \in X$, centroids $c_j \in C$ and labels $l_i \in L$, looks like:

  1. choose centers $C=\{c_1,c_2,\dots,c_k\}$ from $X$ at random (no repetitions)
  2. label each vector with the index of nearest centroid
    $$l_i = \arg\min_{j \in \{1,\dots,k\}} \|x_i - c_j\|$$
  3. find new means of each cluster
    $$c_j = \operatorname{mean}\{x_i \in X : l_i = j\}\quad\text{for}\;j=1,\dots,k$$
  4. if the centroids have moved significantly in the previous step, go to step 2

A plain vanilla k-means, I guess.
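For reference, here is a minimal sketch of those four steps in NumPy. This is not the asker's actual code; the function name `kmeans`, the tolerance-based stopping rule, and the empty-cluster handling are my own assumptions.

```python
import numpy as np

def kmeans(X, k, tol=1e-6, seed=None):
    """Plain k-means on the rows of X (an n-by-d NumPy array)."""
    rng = np.random.default_rng(seed)
    # step 1: choose k distinct data points as the initial centroids
    C = X[rng.choice(len(X), size=k, replace=False)].copy()
    while True:
        # step 2: label each point with the index of its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - C[None, :, :], axis=2)  # shape (n, k)
        L = dists.argmin(axis=1)
        # step 3: move each centroid to the mean of its cluster
        # (keep the old centroid if a cluster happens to be empty)
        C_new = np.array([X[L == j].mean(axis=0) if np.any(L == j) else C[j]
                          for j in range(k)])
        # step 4: stop once the centroids no longer move significantly
        if np.linalg.norm(C_new - C) < tol:
            return C_new, L
        C = C_new
```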

However, when I make a plot of the average of point-to-respective-centroid distances after each iteration,
$$\operatorname{mean}\{\|x_i - c_{l_i}\| : x_i \in X\},$$
I can see that the mean distance is not monotone decreasing (although in general it is decreasing and converging nicely):

http://i.stack.imgur.com/ySgWf.png (sorry, it won't let me embed the image)

I remember being told it should always be decreasing, and my intuition says the same. What's wrong: my algorithm (implementation) or my assumption about the mean of distances?

Best Answer

K-means minimizes the total sum of squared deviations, so don't think of it in terms of distances. Yes, you assign objects by their distance, but that distance (squared Euclidean) is just the sum of squared one-dimensional deviations. So what you are really optimizing is the sum of squared 1-d deviations.

The key part, however, is the mean update, and here it is more obvious that this method optimizes single dimensions: the arithmetic mean is the least-squares estimator, again.
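In one dimension this is the familiar fact that the arithmetic mean minimizes the sum of squared deviations; setting the derivative to zero gives
$$\frac{d}{dc}\sum_{i=1}^{n}(x_i - c)^2 = -2\sum_{i=1}^{n}(x_i - c) = 0 \quad\Longrightarrow\quad c = \frac{1}{n}\sum_{i=1}^{n} x_i .$$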

So you really should plot the sum of squared deviations, not the mean of Euclidean distances.
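As a rough illustration (the function name and signature are mine, not from the question), one way to record both quantities after every labelling step and confirm that only the first is guaranteed to be non-increasing:

```python
import numpy as np

def cluster_metrics(X, C, L):
    """Sum of squared deviations and mean Euclidean distance for one iteration."""
    d = np.linalg.norm(X - C[L], axis=1)  # distance of each point to its own centroid
    return np.sum(d ** 2), np.mean(d)
```

Calling this once per iteration and plotting the first return value should give the monotone curve the asker expected.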

Yes: Euclidean distance and squared Euclidean distance are monotone in each other, but only for a single object.

Example in one dimension:

Objects at 0 and 2, mean at 0.5:

average distance: (0.5 + 1.5) / 2 = 1

squared deviations: 0.25 + 2.25 = 2.5

Objects at 0 and 2, mean at 1:

average distance: (1 + 1) / 2 = 1

squared deviations: 1 + 1 = 2

So the second case is better only with respect to squared deviations, not with respect to linear deviations. In particular, an optimum for the linear deviations isn't necessarily optimal for the squared deviations.
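A throwaway check of the arithmetic above (values hard-coded from the example):

```python
import numpy as np

x = np.array([0.0, 2.0])
for c in (0.5, 1.0):
    d = np.abs(x - c)
    print(f"centre {c}: average distance {d.mean()}, squared deviations {np.sum(d**2)}")
# centre 0.5: average distance 1.0, squared deviations 2.5
# centre 1.0: average distance 1.0, squared deviations 2.0
```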

The above example highlights why one shouldn't just plug arbitrary distances into k-means: technically, it is not distance based. It is based on squared deviations in each single dimension, which happen to sum up to the squared Euclidean distance. Since that is monotone with the regular Euclidean distance, it appears as if we are assigning every object to the nearest cluster center, although what we are actually doing is minimizing the sum of squared 1-d deviations.
