Solved – K-Means Clustering – Calculating Euclidean distances in a multiple variable dataset

clustering, data visualization, k-means

I have just completed a simple exercise with two variables (X and Y) to understand how K-Means clustering works. The results look like this:

[Image: simple K-Means clustering result]

My question is: if I have another column Z, how should the scatter plot be drawn to include the new variable? Would a 3-dimensional scatter plot be required?

And how do I calculate the Euclidean distance between points when each has X, Y, and Z coordinates?

To clarify, I am not looking for software like R or Weka to solve the problem; I want to understand the details of how the calculation works.

Best Answer

You are talking about two distinct problems here:

  1. How do I visualise what k-means is doing in N > 2 dimensions?
  2. How do I calculate k-means in N > 2 dimensions?

The second one is much easier than the first to answer.

To calculate the Euclidean distance between two points when you have X, Y and Z coordinates, you take the difference in each coordinate, sum the squares of those differences, and take the square root. This works for any number of dimensions: for points $p$ and $q$,

$D=\sqrt{\sum_i (p_i - q_i)^2}$

so with your three variables, $D=\sqrt{(X_1-X_2)^2+(Y_1-Y_2)^2+(Z_1-Z_2)^2}$.
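To make the calculation concrete, here is a minimal sketch in plain Python; the function name and sample points are mine, purely for illustration:

```python
import math

def euclidean_distance(p, q):
    # Sum the squared coordinate differences, then take the square root.
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

# The same function works in 2, 3, or any number of dimensions:
print(euclidean_distance((0, 0), (3, 4)))        # 5.0
print(euclidean_distance((1, 2, 3), (4, 6, 3)))  # 5.0
```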

The first part, visualisation, is much harder and has no single right answer: a plot is simply a tool for checking that the algorithm is doing what you think it is, and for understanding what is going on. If N gets very large, there is no simple way to do this.

For three dimensions, there are a couple of common approaches, with their own pros and cons:

  • 3D chart: You see things as you do in the real world, but you will really need to be able to rotate the image to get a feel for depth.
  • Colour in the points: This is quite a nice approach: use red, say, for the lowest Z value and blue for the highest, with the spectrum in between. The K-Means centre will then have the "average" colour of its cluster. (A sketch combining both ideas follows this list.)
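As a rough sketch of both approaches using matplotlib (the points and centres here are made up purely for illustration; substitute your own data):

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up data: 200 points in 3D forming two blobs, plus their centres.
rng = np.random.default_rng(0)
shift = rng.choice([-3.0, 3.0], size=(200, 1))
points = rng.normal(size=(200, 3)) + shift
centres = np.array([[-3.0, -3.0, -3.0], [3.0, 3.0, 3.0]])

fig = plt.figure()
ax = fig.add_subplot(projection="3d")              # the 3D-chart approach
sc = ax.scatter(points[:, 0], points[:, 1], points[:, 2],
                c=points[:, 2], cmap="coolwarm_r") # colour encodes Z: red = low, blue = high
ax.scatter(*centres.T, color="black", s=100, marker="x")  # the two centres
fig.colorbar(sc, label="Z")
plt.show()
```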

For higher dimensions you have to resort to more approximate techniques:

  • Slices/projections: Drop one or more dimensions and project or slice down to a lower number of dimensions (e.g. 2D). This gives you a feel for what's going on, but you'll need a lot of them to check that the K-Means centres really sit in the middle of their clusters (and the wrong slices/projections might completely miss the interesting structure).
  • Dimension reduction: Starting to get really hard work now (much more complex than K-Means itself). You can attempt to use things like PCA, either locally for each cluster or globally, to find planes that are "interesting" to look at, and just plot those; see the sketch after this list.
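A hedged sketch of the global PCA approach using scikit-learn, again with made-up data standing in for your own points and cluster labels:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Made-up 6-dimensional data: two blobs of 150 points each, with labels.
rng = np.random.default_rng(1)
points = rng.normal(size=(300, 6)) + np.repeat([[-2.0], [2.0]], 150, axis=0)
labels = np.repeat([0, 1], 150)

# Project onto the plane of greatest global variance and plot that plane.
flat = PCA(n_components=2).fit_transform(points)
plt.scatter(flat[:, 0], flat[:, 1], c=labels)
plt.xlabel("PC 1")
plt.ylabel("PC 2")
plt.show()
```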

Specific to K-Means, and particularly useful when K is low (e.g. 2), you can project the points onto the line between a pair of cluster centres and plot the density of the points along that line.

For example, suppose we go back to 2D and have a scatter chart like this:

[Image: K-Means scatter graph with the line through the two centres]

Here the two big blobs are the K-Means centres, and I have added the line that passes through them. If you perpendicularly project each point onto that line, you can view the distribution of the points around each centre like so:

[Image: distributions of the projected points for each cluster]

Here I have marked the locations of the means with the thick lines. This second graph can be drawn regardless of how many dimensions you are working in, and it is a way of seeing how well separated the clusters are.
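Here is one way you might reproduce that projection plot with NumPy and matplotlib; `points`, `labels`, and the two centres are stand-ins for whatever your own K-Means run produced:

```python
import numpy as np
import matplotlib.pyplot as plt

# Stand-in data: two clusters in 2D with known centres and labels.
rng = np.random.default_rng(2)
c0, c1 = np.array([-2.0, 0.0]), np.array([2.0, 1.0])
points = np.vstack([rng.normal(c0, 1.0, size=(150, 2)),
                    rng.normal(c1, 1.0, size=(150, 2))])
labels = np.repeat([0, 1], 150)

# Perpendicular projection: t is each point's position along the unit
# vector running from c0 towards c1.
direction = (c1 - c0) / np.linalg.norm(c1 - c0)
t = (points - c0) @ direction

# Histogram of projected positions per cluster; thick lines mark the means.
for k, colour in [(0, "tab:red"), (1, "tab:blue")]:
    plt.hist(t[labels == k], bins=30, alpha=0.5, color=colour)
    plt.axvline(t[labels == k].mean(), color=colour, linewidth=3)
plt.xlabel("Position along the line joining the two centres")
plt.show()
```

Because the projection collapses each point to a single scalar, exactly the same code works however many columns `points` has.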