Solved – K-Means Clustering – Calculating Euclidean distances in a multiple variable dataset

clustering, data visualization, k-means

I have just completed a simple exercise with two variables (X and Y) to understand how K-Means clustering works. The results look like this:

[Image: simple K-Means clustering result]

My question is: if I have another column Z, how should the scatter plot be drawn to include the new variable? Would a 3-dimensional scatter plot be required?

And how do I calculate the Euclidean distance between points when each has X, Y, and Z coordinates?

To clarify, I am not looking for software like R or Weka to solve the problem; I want to understand the details of how the calculation works.

Best Answer

You are talking about two distinct problems here:

  1. How do I visualise what k-means is doing in N > 2 dimensions?
  2. How do I calculate k-means in N > 2 dimensions?

The second one is much easier than the first to answer.

To calculate the Euclidean distance between two points when you have X, Y and Z coordinates, you take the difference in each coordinate, sum the squares of those differences, and take the square root. This works for any number of dimensions: for points $p$ and $q$,

$D=\sqrt{\sum_i (p_i - q_i)^2}$

so with your three variables, $D=\sqrt{(X_1-X_2)^2+(Y_1-Y_2)^2+(Z_1-Z_2)^2}$.
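To make the calculation concrete, here is a minimal sketch in plain Python; the function name and sample points are mine, purely for illustration:

```python
import math

def euclidean_distance(p, q):
    # Sum the squared coordinate differences, then take the square root.
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

# The same function works in 2, 3, or any number of dimensions:
print(euclidean_distance((0, 0), (3, 4)))        # 5.0
print(euclidean_distance((1, 2, 3), (4, 6, 3)))  # 5.0
```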

The first part, visualisation, is much harder and has no single right answer: a plot is simply a tool for checking that the algorithm is doing what you think it is, and for understanding what is going on. If N gets very large, there is no simple way to do this.

For three dimensions, there are a couple of common approaches, with their own pros and cons:

  • 3D chart: You see things as you do in the real world, but you will really need to be able to rotate the image to get a feel for depth.
  • Colour in the points: This is quite a nice approach: use red, say, for the lowest Z value and blue for the highest, with the spectrum in between. The K-Means centre will then have the "average" colour of its cluster. (A sketch combining both ideas follows this list.)
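As a rough sketch of both approaches using matplotlib (the points and centres here are made up purely for illustration; substitute your own data):

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up data: 200 points in 3D forming two blobs, plus their centres.
rng = np.random.default_rng(0)
shift = rng.choice([-3.0, 3.0], size=(200, 1))
points = rng.normal(size=(200, 3)) + shift
centres = np.array([[-3.0, -3.0, -3.0], [3.0, 3.0, 3.0]])

fig = plt.figure()
ax = fig.add_subplot(projection="3d")              # the 3D-chart approach
sc = ax.scatter(points[:, 0], points[:, 1], points[:, 2],
                c=points[:, 2], cmap="coolwarm_r") # colour encodes Z: red = low, blue = high
ax.scatter(*centres.T, color="black", s=100, marker="x")  # the two centres
fig.colorbar(sc, label="Z")
plt.show()
```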

For higher dimensions you have to resort to more approximate techniques:

  • Slices/projections: Drop one or more dimensions and project or slice down to a lower number of dimensions (e.g. 2D). This gives you a feel for what's going on, but you'll need a lot of them to check that the K-Means centres really sit in the middle of their clusters (and the wrong slices/projections might completely miss the interesting structure).
  • Dimension reduction: Starting to get really hard work now (much more complex than K-Means itself). You can attempt to use things like PCA, either locally for each cluster or globally, to find planes that are "interesting" to look at, and just plot those; see the sketch after this list.
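A hedged sketch of the global PCA approach using scikit-learn, again with made-up data standing in for your own points and cluster labels:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Made-up 6-dimensional data: two blobs of 150 points each, with labels.
rng = np.random.default_rng(1)
points = rng.normal(size=(300, 6)) + np.repeat([[-2.0], [2.0]], 150, axis=0)
labels = np.repeat([0, 1], 150)

# Project onto the plane of greatest global variance and plot that plane.
flat = PCA(n_components=2).fit_transform(points)
plt.scatter(flat[:, 0], flat[:, 1], c=labels)
plt.xlabel("PC 1")
plt.ylabel("PC 2")
plt.show()
```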

Specific to K-Means, and particularly useful when K is low (e.g. 2), you can project the points onto the line between a pair of cluster centres and plot the density of the points along that line.

For example, suppose we go back to 2D and have a scatter chart like this:

[Image: K-Means scatter graph with the line through the two centres]

Here the two big blobs are the K-Means centres, and I have added the line that passes through them. If you perpendicularly project each point onto that line, you can view the distribution of the points around each centre like so:

[Image: distributions of the projected points for each cluster]

Here I have marked the locations of the means with the thick lines. This second graph can be drawn regardless of how many dimensions you are working in, and it is a way of seeing how well separated the clusters are.
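Here is one way you might reproduce that projection plot with NumPy and matplotlib; `points`, `labels`, and the two centres are stand-ins for whatever your own K-Means run produced:

```python
import numpy as np
import matplotlib.pyplot as plt

# Stand-in data: two clusters in 2D with known centres and labels.
rng = np.random.default_rng(2)
c0, c1 = np.array([-2.0, 0.0]), np.array([2.0, 1.0])
points = np.vstack([rng.normal(c0, 1.0, size=(150, 2)),
                    rng.normal(c1, 1.0, size=(150, 2))])
labels = np.repeat([0, 1], 150)

# Perpendicular projection: t is each point's position along the unit
# vector running from c0 towards c1.
direction = (c1 - c0) / np.linalg.norm(c1 - c0)
t = (points - c0) @ direction

# Histogram of projected positions per cluster; thick lines mark the means.
for k, colour in [(0, "tab:red"), (1, "tab:blue")]:
    plt.hist(t[labels == k], bins=30, alpha=0.5, color=colour)
    plt.axvline(t[labels == k].mean(), color=colour, linewidth=3)
plt.xlabel("Position along the line joining the two centres")
plt.show()
```

Because the projection collapses each point to a single scalar, exactly the same code works however many columns `points` has.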