Solved – Ecological mixed-data cluster analysis: Transformations required? K-means or hierarchical methods?

clustering, data transformation, ecology, k-means, mixed type data

I am trying to identify habitat types from 85 plots. I intend to do a cluster analysis to identify these types, and then hope to fit additional plots into the identified clusters.

(For context, I took measures from habitat plots in several different habitat types across a study site, then also measured the same variables at animal locations. I hope to identify differential habitat selection by two different species.)

  1. Do I need to apply transformations to the data before doing cluster analysis?

    • My data set includes categorical (e.g. substrate type: mud, gravel, etc.), Euclidean distance (0–3400 cm), calculated index (0–1.0), and vegetation percent cover (0–100, with lots of zeros) variables. Each of these would require a different transformation to meet the assumptions of other modelling methods, but what about when clustering? Is each variable type considered on its own scale? Also, some of my variables are collinear – should these be removed before cluster analysis, as with other methods?
  2. Hierarchical or K-means methods?

    • I had intended to use a Gower dissimilarity matrix for a hierarchical cluster analysis, but is there an obvious reason to use K-means methods instead? I'm wondering whether my choice of method here will affect my ability to 'fit' additional data points.

I am using R.

Best Answer

You will first need to get a working similarity measure. You can't just throw these attributes together and hope that Euclidean distance on the vector will work. It won't.

K-means is only appropriate for (squared) Euclidean distance: it relies on the arithmetic mean minimizing within-cluster variance, and with other distances it may not converge. It also doesn't work well with many attributes (dimensions). But you might want to look at more modern methods than hierarchical clustering and k-means. Definitely choose an algorithm/implementation that can work with arbitrary distance functions, as you will probably need to spend a lot of time fine-tuning your similarity measure.
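For example, in R both hclust() and the cluster package's pam() (k-medoids) accept a precomputed dissimilarity object, so you can keep refining the measure without changing the clustering code. A minimal sketch with a made-up stand-in for your 85 plots (the variable names, values, and k = 4 are all placeholders):

```r
library(cluster)

## hypothetical stand-in for the 85 plots: numeric variables plus a
## categorical substrate factor (replace with your real data frame)
set.seed(1)
plots <- data.frame(
  dist_water = runif(85, 0, 3400),          # distance variable, cm
  index      = runif(85),                   # calculated index, 0-1
  cover      = pmax(0, rnorm(85, 20, 25)),  # percent cover, many zeros
  substrate  = factor(sample(c("mud", "gravel", "sand"), 85, TRUE))
)

## any dissimilarity object works here; Gower shown as one option
d <- daisy(plots, metric = "gower")

## hierarchical clustering on the precomputed dissimilarities
hc <- hclust(d, method = "average")

## k-medoids (PAM): a k-means-like method that accepts arbitrary
## dissimilarities; k = 4 is an arbitrary placeholder
fit <- pam(d, k = 4)
```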

A common approach (for numerical data) is to use the z-scores of all attributes and then Euclidean distance. But there are many situations where this is nothing but a crude heuristic. You really need to consider how to measure "habitat similarity". The clustering algorithm needs this as input; it does not infer it automagically, because it cannot.
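Continuing the hypothetical `plots` data frame from the sketch above, z-scoring might look like this (note it only applies to the numeric columns; the substrate factor has to be handled separately):

```r
## z-scores of the numeric attributes, then Euclidean distance
num_cols <- c("dist_water", "index", "cover")
z   <- scale(plots[num_cols])   # (a - mean(a)) / sd(a), per column
d_z <- dist(z)                  # Euclidean on the standardized data
```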

An even simpler approach is to rescale every attribute by $\frac{a - a_{\min}}{a_{\max} - a_{\min}}$ to map it into the unit interval $[0,1]$, and then again use Euclidean distance. Gower's similarity coefficient is along these lines (but with Manhattan distance).
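Again with the hypothetical numeric columns, a sketch of the min-max rescaling:

```r
## rescale each numeric attribute into the unit interval [0, 1]
rescale01 <- function(a) (a - min(a)) / (max(a) - min(a))
num_cols  <- c("dist_water", "index", "cover")
unit   <- as.data.frame(lapply(plots[num_cols], rescale01))
d_unit <- dist(unit)            # Euclidean on the rescaled attributes
```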

Essentially, both of these methods try to weight all attributes equally (with different notions of what "equal" means). That is a reasonable heuristic if you do not know what the attributes denote or how they scale. But if you have attributes that scale exponentially or logarithmically (say, "volume" vs. "length"), this heuristic will perform badly.
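So if you suspect an attribute really varies over orders of magnitude, one option is to transform it onto the scale where differences are meaningful before standardizing. A hedged sketch, still assuming the hypothetical `dist_water` column:

```r
## hypothetical: if dist_water varies over orders of magnitude,
## compare it on a log scale before standardizing
plots$log_dist <- log1p(plots$dist_water)   # log(1 + x) tolerates zeros
z_log <- scale(plots[c("log_dist", "index", "cover")])
d_log <- dist(z_log)
```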