Solved – The right distance for clustering: maybe Mahalanobis?

clustering, distance-functions

I have to do a cluster analysis and I'm wondering which distance I should use.

I know that the vast majority of clustering is done with the Euclidean distance, but I have heard about the Mahalanobis distance, and it seems better because it takes the covariance matrix of the data into account.

Question: Why isn't the Mahalanobis distance used more often?

For instance, with this data (70% of the variance lies within these 2 dimensions): [image not preserved]

The Euclidean distance doesn't fit, so could the Mahalanobis distance fit better?

Edit: By "the Euclidean distance doesn't fit" I mean that the clusters that become apparent are not circular in shape.
[image not preserved]

Best Answer

The distance measure you use for cluster analysis should depend on your data. For example, in ecology we frequently work with species presence/absence or abundance data for ecological communities, and use distance (i.e., dissimilarity) measures such as the Sørensen and Bray-Curtis measures.
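As a rough illustration of such measures, here is a minimal sketch using SciPy's `pdist`. The `abundance` table is made up for the example, and it assumes that SciPy's `dice` metric on presence/absence data is what is meant by the Sørensen dissimilarity.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Hypothetical abundance table: rows are sites, columns are species counts.
abundance = np.array([
    [10, 0, 5],
    [8, 2, 5],
    [0, 20, 1],
], dtype=float)

# Bray-Curtis dissimilarity on abundances: sum|u - v| / sum|u + v|.
bray_curtis = squareform(pdist(abundance, metric="braycurtis"))

# Sorensen (Dice) dissimilarity on presence/absence data.
presence = abundance > 0
sorensen = squareform(pdist(presence, metric="dice"))

print(np.round(bray_curtis, 3))
print(np.round(sorensen, 3))
```

Both matrices are symmetric with zeros on the diagonal, and either one can be passed (in condensed form) to a hierarchical clustering routine.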

There is nothing inherently wrong with using the Mahalanobis distance. Euclidean distance may be the most intuitive choice, and perhaps it generally works well in your field. However, it does not work well for all datasets. One thing you can do is try different distance measures and different clustering techniques, then compare the cophenetic correlations (for hierarchical clustering) across analyses to see which one best represents the pattern supported by the data. Also look at the resulting clusters to see whether they make sense and can be explained by the existing literature in your field.
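The comparison of cophenetic correlations across distance measures could look something like the sketch below. It assumes hierarchical clustering with average linkage via SciPy, and the data `X` is synthetic (two elongated, non-circular clusters), not the data from the question. With `metric="mahalanobis"`, `pdist` estimates the covariance matrix from all of `X`.

```python
import numpy as np
from scipy.cluster.hierarchy import cophenet, linkage
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)

# Synthetic example data: two elongated clusters with correlated features.
cov = np.array([[4.0, 3.5], [3.5, 4.0]])
X = np.vstack([
    rng.multivariate_normal([0.0, 0.0], cov, size=50),
    rng.multivariate_normal([0.0, 6.0], cov, size=50),
])

results = {}
for metric in ("euclidean", "cityblock", "mahalanobis"):
    d = pdist(X, metric=metric)        # condensed pairwise distances
    Z = linkage(d, method="average")   # hierarchical clustering on those distances
    c, _ = cophenet(Z, d)              # cophenetic correlation coefficient
    results[metric] = c
    print(f"{metric:12s} cophenetic correlation = {c:.3f}")
```

A higher coefficient means the dendrogram preserves the original pairwise distances more faithfully. It does not by itself prove that the clusters are meaningful, so inspecting the resulting groups is still needed.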

There is also a relevant post on Cross Validated here, and a Google search for "non-euclidean distance cluster analysis" brings up some useful results.

Hope that helps a bit!