Solved – Mahalanobis distance in a hierarchical cluster analysis in SPSS

Tags: clustering, distance, hierarchical clustering, multicollinearity, spss

I am conducting a hierarchical cluster analysis in SPSS on my database with several neuropsychological and psychiatric variables. In my database, some of my variables (that is, two pairs of variables) have correlations $r > .80$. My first thought was to eliminate these variables from my cluster analysis.
However, another option is to use the Mahalanobis distance as the distance measure, because this measure takes the correlations between variables into account (according to, e.g., Multivariate Data Analysis by Hair et al.).

My question is whether there is a way to perform the hierarchical cluster analysis in SPSS using the Mahalanobis distance. I am not familiar with R or SAS, so I would prefer a method that uses SPSS.

Best Answer

IBM advises against using the Mahalanobis distance in clustering. See here.

In hierarchical clustering, you need to define the distance between the clusters (as they are formed) and the remaining unclustered data points. So while the Mahalanobis distance is a sensible measure between individual data points, it is hard to generalize it to a measure of the distance between clusters. I think that's the point IBM is trying to make.
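
For reference, the pointwise measure is usually written as below. The notation is mine, not taken from the IBM note: $x$ and $y$ are two cases (vectors of variable values) and $S$ is the sample covariance matrix of the variables.

$$d_M(x, y) = \sqrt{(x - y)^\top S^{-1} (x - y)}$$

With $S = I$ this reduces to the ordinary Euclidean distance. The formula needs a single $S$, and once cases start merging into clusters it is no longer obvious whether $S$ should come from the whole sample or from each cluster's own covariance, which is one way to see the difficulty described above.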

I would add that if you really want to base your analysis on the covariance structure of the data, perhaps something like SEM or factor analysis would be the way to go.

Another approach would be to transform the data into principal component (PC) scores, then run the cluster analysis on the scores. Yes, this is a kludge, and I don't know how you would explain the results to a client, but it would be a way to adjust for the covariance structure. If the data are meaningfully correlated, you might also be able to reduce the dimensionality by keeping only the first few PC scores and basing your clustering on those.
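
To make the PC-score idea concrete, here is a minimal sketch in Python (NumPy and SciPy), not SPSS. It relies on one fact: if you keep all components and rescale each PC score to unit variance, the Euclidean distance between the scores equals the Mahalanobis distance between the original cases. The data are synthetic stand-ins, and the choices of average linkage and three clusters are arbitrary assumptions for illustration only.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)

# Synthetic stand-in data: 60 cases, 4 variables; the first two are
# strongly correlated (r around .96) to mimic the situation in the question.
base = rng.normal(size=(60, 4))
X = np.column_stack([
    base[:, 0],
    base[:, 0] + 0.3 * base[:, 1],
    base[:, 2],
    base[:, 3],
])

# PCA via eigendecomposition of the sample covariance matrix.
Xc = X - X.mean(axis=0)
S = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(S)

# Raw PC scores, then each component rescaled to unit variance.
scores = Xc @ eigvecs
z = scores / np.sqrt(eigvals)

# Euclidean distance on the standardized scores vs. Mahalanobis on the raw data.
d_pc = pdist(z, metric="euclidean")
d_mahal = pdist(X, metric="mahalanobis", VI=np.linalg.inv(S))
same = np.allclose(d_pc, d_mahal)

# Hierarchical clustering on the standardized scores (Euclidean by default).
Z = linkage(z, method="average")
labels = fcluster(Z, t=3, criterion="maxclust")
```

Here `same` confirms that the two sets of pairwise distances agree, so clustering `z` with Euclidean distance is a Mahalanobis-based clustering in disguise. The equivalence only holds when every component is kept and standardized; dropping the later components, as suggested for dimension reduction, gives a different (truncated) distance. In SPSS the analogous route would presumably be to save standardized component scores from a principal components extraction and cluster on those with Euclidean distance, but I have not verified the exact menu steps.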