Solved – Robust cluster method for mixed data in R

clustering, mixed type data, model-based-clustering

I'm looking to cluster a small data set (64 observations of 4 interval variables and a single three-level categorical variable). I'm quite new to cluster analysis, but I am aware that there has been considerable progress since the days when hierarchical clustering and k-means were the only available options. In particular, it seems that new model-based clustering methods are available that, as pointed out by chl, enable the use of "goodness-of-fit indices to decide about the number of clusters or classes".

However, the standard R package for model-based clustering, mclust, apparently will not fit models with mixed data types. The fpc package will, but it has trouble fitting a model, I suspect because of the non-Gaussian nature of the continuous variables. Should I continue with the model-based approach? I'd like to continue to use R if possible. As I see it, I have a few options:

  1. Convert the three-level categorical variable into two dummy variables and use mclust. I'm unsure if this will bias the results, but if not this is my preferred option.
  2. Transform the continuous variables somehow and use the fpc package.
  3. Use some other R package I haven't yet encountered.
  4. Create a dissimilarity matrix using Gower's measure and use traditional hierarchical or relocation cluster techniques.

Does the stats.se hivemind have any suggestions here?

Best Answer

I'd recommend using the Gower coefficient followed by hierarchical clustering. Hierarchical clustering remains the most flexible and appropriate method when the number of objects is small (such as 64). If your categorical variable is nominal, Gower will internally recode it into dummy variables and base the Dice similarity (as part of Gower) on them. If your variable is ordinal, you should know that the latest version of the Gower coefficient can accommodate it, too.
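A minimal sketch of this workflow in R, assuming the `cluster` package is installed. The data frame `dat`, its column names, and the simulated values are made up to stand in for the 64-observation data set. Average linkage and the 3-cluster cut are arbitrary choices for illustration, not recommendations.

```r
library(cluster)  # provides daisy() for Gower dissimilarities

set.seed(1)
# Hypothetical stand-in data: 4 skewed or non-Gaussian interval variables
# and one three-level nominal factor, 64 rows.
dat <- data.frame(
  x1  = rexp(64),
  x2  = rnorm(64),
  x3  = runif(64),
  x4  = rlnorm(64),
  grp = factor(sample(c("a", "b", "c"), 64, replace = TRUE))
)

# Gower dissimilarity handles the mixed column types.
# (If grp were ordinal, declaring it with factor(..., ordered = TRUE)
# makes daisy() treat it as an ordinal variable.)
d <- daisy(dat, metric = "gower")

# Hierarchical clustering on the dissimilarities; the linkage is an assumption.
hc <- hclust(d, method = "average")

# Cut the tree at an illustrative number of clusters.
cl <- cutree(hc, k = 3)
table(cl)
```

Calling `plot(hc)` draws the dendrogram, which is often the first thing to inspect with this few objects.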

As for the numerous indices for determining the "best" number of clusters, most of them exist independently of any particular clustering algorithm. You need not look for clustering packages that incorporate such indices, because the indices may be available as separate packages. You can save a range of cluster solutions from a clustering package and then compare them using an index from another package.
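As one possible illustration, the sketch below cuts a single hierarchical tree at several values of k and compares the solutions by average silhouette width. The silhouette is just one such index, picked here as an assumption. The simulated data frame `dat`, the range `ks`, and the linkage method are likewise hypothetical.

```r
library(cluster)  # daisy() and silhouette()

set.seed(1)
# Hypothetical stand-in data with the same shape as in the question.
dat <- data.frame(
  x1  = rexp(64),
  x2  = rnorm(64),
  x3  = runif(64),
  x4  = rlnorm(64),
  grp = factor(sample(c("a", "b", "c"), 64, replace = TRUE))
)

d  <- daisy(dat, metric = "gower")
hc <- hclust(d, method = "average")

# Candidate numbers of clusters (an arbitrary range for illustration).
ks <- 2:6

# For each k, cut the tree and compute the average silhouette width
# on the same Gower dissimilarities.
avg_sil <- sapply(ks, function(k) {
  sil <- silhouette(cutree(hc, k = k), d)
  mean(sil[, "sil_width"])
})
names(avg_sil) <- ks
avg_sil
```

A larger average width suggests better-separated clusters, but it is worth checking the preferred k against the dendrogram and subject-matter sense rather than relying on a single index.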