Solved – Looking for a hierarchical-clustering method for multiple data types

clustering, model-based-clustering, multivariate-analysis, r

I would like to find a hierarchical-clustering method that I can use to assign each individual in my dataset to one of k groups. I have considered several classic ordination and clustering methods (PCA, NMDS, "mclust", etc.), but three of my variables are categorical (see the data description below). Further, I was wondering whether it is preferable to use a method that reports a posterior probability of group membership for each individual. I am using R.

Data description: I have sampled almost 2000 individual birds (a single species comprising two subspecies or phenotypes) across Sweden. All individuals are adult males. Although this is one species, there is a (migratory) divide in the middle of Sweden: individuals south of the divide presumably migrate to West Africa, while those north of it presumably migrate to East Africa. There is a zone of overlap approximately 300 km wide at the migratory divide.

Variables:

  • Wing (mm) – continuous
  • Tail (mm) – continuous
  • Bill-head (mm) – continuous
  • Tarsus (mm) – continuous
  • Mass (g) – continuous
  • Colour (9 levels) – categorical
  • Stable carbon isotopes (per mil) – continuous
  • Stable nitrogen isotopes (per mil) – continuous
  • SNP WW1 (0, 1, 2) – molecular marker; 0 and 2 are the two homozygous genotypes and 1 is the heterozygote
  • SNP WW2 (0, 1, 2) – molecular marker; 0 and 2 are the two homozygous genotypes and 1 is the heterozygote

Description of the colour variable (an ordinal scale): S+ (brightest yellow), S, S-, M+, M (medium), M-, N+, N, N- (dullest yellow-grey)
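For concreteness, here is a minimal sketch (made-up values, abbreviated column names, and only a few rows) of how variables of these types could be stored in an R data frame so that a mixed-type dissimilarity can be computed later. Storing the colour scale as an ordered factor and the SNPs as nominal factors is my assumption, not a requirement.

    # Hypothetical example rows; column names and values are illustrative only.
    birds <- data.frame(
      wing      = c(72.1, 69.8, 71.5, 68.9),      # mm, continuous
      tail      = c(55.0, 52.3, 54.1, 51.8),      # mm, continuous
      bill_head = c(33.2, 32.5, 33.0, 32.1),      # mm, continuous
      tarsus    = c(21.4, 20.9, 21.2, 20.7),      # mm, continuous
      mass      = c(11.8, 10.9, 11.5, 10.6),      # g, continuous
      colour    = factor(c("S+", "M", "N-", "M+"),
                         levels = c("N-", "N", "N+", "M-", "M", "M+",
                                    "S-", "S", "S+"),
                         ordered = TRUE),          # ordinal: dullest -> brightest
      d13C      = c(-22.4, -19.8, -24.1, -20.5),   # per mil, continuous
      d15N      = c(9.1, 7.8, 9.6, 8.2),           # per mil, continuous
      snp_ww1   = factor(c(0, 1, 2, 0)),           # genotype as a nominal factor
      snp_ww2   = factor(c(2, 1, 0, 2))            # (numeric 0/1/2 allele dosage is another option)
    )
    str(birds)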

Best Answer

Have a look at OPTICS. It will find hierarchical clusters, and you don't need to specify the number of clusters in advance (which doesn't make much sense for hierarchical clusters anyway). You can also customize the distance function to suit your needs, because Euclidean distance is obviously not sensible here: a difference of 1 mm and a difference of 1 g are not comparable. So you would first define an appropriate distance function, then run OPTICS with it to obtain clusters.
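As a rough illustration of that recipe in R (one option among several, not a prescription): the cluster package's daisy() computes a Gower dissimilarity over mixed continuous, ordinal and nominal columns, and the dbscan package's optics() accepts a precomputed dist object. Assuming a data frame birds with properly typed columns, as sketched in the question, the two steps might look like this; minPts is kept tiny only so the toy example runs, and on ~2000 birds a substantially larger value would be more sensible.

    library(cluster)  # daisy(): Gower dissimilarity for mixed variable types
    library(dbscan)   # optics(): density-based ordering / hierarchical clusters

    # 1. A dissimilarity that puts mm, g, per-mil values and categorical codes
    #    on a common scale instead of raw Euclidean distance
    d <- daisy(birds, metric = "gower")

    # 2. Run OPTICS on the precomputed dissimilarity
    #    (minPts = 3 only for the tiny example; tune it on the real data)
    res <- optics(as.dist(d), minPts = 3)
    res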

When OPTICS doesn't find any clusters, that can also indicate that the data set simply doesn't cluster under these parameters (distance, minPts). The results of other algorithms such as k-means can be quite misleading, because they will always force the data set into clusters and may easily return an essentially random partitioning.
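Continuing the sketch above, the reachability plot is the usual way to judge whether any density-based structure is present at all, and flat clusters can then be extracted at a chosen threshold; the eps_cl and xi values below are purely illustrative.

    # Reachability plot: valleys suggest clusters, flat high regions suggest noise
    plot(res)

    # Extract flat clusters by cutting the reachability at an illustrative threshold
    res_cut <- extractDBSCAN(res, eps_cl = 0.1)
    table(res_cut$cluster)   # cluster 0 = points labelled as noise

    # Alternatively, a hierarchical extraction based on steepness of the plot
    res_xi <- extractXi(res, xi = 0.05)
    table(res_xi$cluster)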

Don't use the Weka implementations of OPTICS and DBSCAN. Weka is good for machine learning, but not for clustering: the clusterers are badly integrated, limited in functionality, and essentially unmaintained.