Solved – K-means Mahalanobis vs Euclidean distance

classification, distance, euclidean, k-means, unsupervised-learning

I am currently trying to cluster "types" of changes in bitemporal multispectral satellite images.

I applied a MAD (Multivariate Alteration Detection) transform to both images, which are 5000 x 5000 pixels with 5 bands each.
Each band is a "variable", as it carries radiance information from a different part of the spectrum.
This transform is roughly equivalent to applying PCA to the difference of the two images.

Naturally, I can get up to 5 MAD components. Now I would like to find these types of change in the components. If I use k-means on the components, I would use a Euclidean distance, but I wanted to know what the gain would be in using a Mahalanobis distance, if there is any.
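For reference, the Euclidean baseline I have in mind looks like the sketch below: flatten the component stack so each pixel is one sample, then run plain Lloyd's k-means. The array shape and the number of clusters are assumptions for illustration; a small random array stands in for the real 5000 x 5000 x 5 image.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for the MAD components: (rows, cols, components).
mad = rng.normal(size=(40, 40, 5))

# Flatten to (n_pixels, n_components) so each pixel is one sample.
X = mad.reshape(-1, mad.shape[-1])

def kmeans_euclidean(X, k, n_iter=50, seed=0):
    """Plain Lloyd's algorithm with (squared) Euclidean distance."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared Euclidean distance from every pixel to every center.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Recompute centers; keep the old one if a cluster goes empty.
        centers = np.array([X[labels == j].mean(axis=0)
                            if np.any(labels == j) else centers[j]
                            for j in range(k)])
    return labels, centers

labels, centers = kmeans_euclidean(X, k=4)
change_map = labels.reshape(mad.shape[:2])  # one change-type label per pixel
```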

Best Answer

I haven't understood the type of transformation you used, so my answer will be a general one. The short answer is: how much you gain from the Mahalanobis distance depends on the shape of the natural groupings (i.e. clusters) in your data.

The choice between Mahalanobis and Euclidean distance in k-means is really a choice between using the full covariances of your clusters or ignoring them. When you use the Euclidean distance, you assume that every cluster has an identity covariance matrix; in 2D, this means your clusters are circular. If the covariances of the natural groupings in your data are not identity matrices, e.g. if in 2D the clusters are elliptical, then the Mahalanobis distance will model your data much better than the Euclidean distance.
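A tiny numerical illustration of this point, using a made-up elongated covariance: two points at the same Euclidean distance from a cluster center can be at very different Mahalanobis distances, because the Mahalanobis distance d(x) = sqrt((x - mu)^T Sigma^-1 (x - mu)) rescales each direction by the cluster's spread.

```python
import numpy as np

# Hypothetical elongated cluster: large variance along x, small along y.
mu = np.array([0.0, 0.0])
cov = np.array([[9.0, 0.0],
                [0.0, 1.0]])
cov_inv = np.linalg.inv(cov)

def mahalanobis(x, mu, cov_inv):
    d = x - mu
    return float(np.sqrt(d @ cov_inv @ d))

p_along = np.array([3.0, 0.0])   # 3 units along the long axis
p_across = np.array([0.0, 3.0])  # 3 units along the short axis

# Euclidean distance is identical for both points...
e1 = np.linalg.norm(p_along - mu)    # 3.0
e2 = np.linalg.norm(p_across - mu)   # 3.0

# ...but Mahalanobis accounts for the cluster's shape:
m1 = mahalanobis(p_along, mu, cov_inv)   # 1.0 (well inside the ellipse)
m2 = mahalanobis(p_across, mu, cov_inv)  # 3.0 (far outside it)
```

With Euclidean distance both points look equally close to the cluster; with Mahalanobis, the point along the elongated axis is clearly a better member.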

You can try both and see whether the Mahalanobis distance gives you a significant gain. It also depends on what you do after clustering: clustering itself is usually not the ultimate goal, and you will probably use the clusters in some subsequent processing. So the choice of Euclidean vs. Mahalanobis may ultimately be determined by the performance of that subsequent processing.
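The "try both" experiment can be sketched as below on synthetic data. The generator, cluster shapes, and iteration counts are all assumptions; the Mahalanobis variant is a rough hard-assignment scheme that re-estimates each cluster's covariance every iteration (it ignores the log-determinant term a full Gaussian mixture would include, so in practice something like a GMM with full covariances is the more standard route).

```python
import numpy as np

rng = np.random.default_rng(1)

# Two synthetic elongated clusters, separated along y but spread along x.
cov = np.array([[10.0, 0.0], [0.0, 0.1]])
a = rng.multivariate_normal([0.0, 0.0], cov, size=300)
b = rng.multivariate_normal([0.0, 1.5], cov, size=300)
X = np.vstack([a, b])
truth = np.repeat([0, 1], 300)

def assign(X, centers, cov_invs):
    # Squared Mahalanobis distance to each center
    # (reduces to Euclidean when cov_invs are identity matrices).
    d = np.stack([np.einsum('ij,jk,ik->i', X - c, ci, X - c)
                  for c, ci in zip(centers, cov_invs)], axis=1)
    return d.argmin(axis=1)

def cluster(X, k, use_mahalanobis, n_iter=30, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    cov_invs = [np.eye(X.shape[1]) for _ in range(k)]
    for _ in range(n_iter):
        labels = assign(X, centers, cov_invs)
        for j in range(k):
            pts = X[labels == j]
            if len(pts) > X.shape[1]:  # enough points to estimate stats
                centers[j] = pts.mean(axis=0)
                if use_mahalanobis:
                    cov_invs[j] = np.linalg.inv(np.cov(pts.T))
    return assign(X, centers, cov_invs)

def accuracy(labels, truth):
    hits = (labels == truth).mean()
    return max(hits, 1 - hits)  # cluster labels are defined up to permutation

acc_euclidean = accuracy(cluster(X, 2, use_mahalanobis=False), truth)
acc_mahalanobis = accuracy(cluster(X, 2, use_mahalanobis=True), truth)
```

On data like this, Euclidean k-means tends to split along the high-variance x axis rather than the true y separation; whether the Mahalanobis variant recovers the right split can still depend on initialization, which is exactly why running both on your own data is the informative test.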