Solved – In clustering, how does Normalized Euclidean Distance represent the number of standard deviations from a cluster

clustering, distance-functions, euclidean, hierarchical-clustering

In clustering, one has to choose a distance metric. I've seen Normalized Euclidean Distance used for two reasons:

1) Because it scales by the variance.

2) Because it quantifies the distance in terms of the number of standard deviations.

However, I have never seen a convincing proof or a good explanation of 2).

How does it quantify the distance in terms of standard deviations along each of the dimensions over which you are clustering?

Best Answer

A general version of this is the Mahalanobis distance. In one dimension you center your variables and then scale them by the standard deviation. With this metric, the distance between two points in one dimension is: $$ d(x,y) = \frac{|x-y|}{s} $$ where $s$ is the standard deviation. So you are scaling by the standard deviation, and one unit in this metric corresponds to one standard deviation in the regular Euclidean metric.
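As a concrete illustration, here is a minimal Python sketch of that one-dimensional definition; the sample values and the function name are invented for the example:

```python
import numpy as np

def normalized_distance_1d(x, y, s):
    """Distance between two scalars measured in units of the standard deviation s."""
    return abs(x - y) / s

# Hypothetical one-dimensional sample (values chosen only for illustration).
values = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
s = values.std(ddof=1)  # sample standard deviation, about 3.16 here

# Two points that are 6 units apart on the raw scale...
d = normalized_distance_1d(4.0, 10.0, s)
print(d)  # ...are about 1.9 standard deviations apart in the normalized metric
```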

This interpretation holds exactly only if the vectors are one-dimensional or if the standard deviation is the same along all coordinates. Otherwise the unit is not exactly interpretable as a number of standard deviations, but because each coordinate is normalized, people tend to talk about it loosely as standard deviations.
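For completeness, the usual way to write this for $p$-dimensional vectors, with a separate sample standard deviation $s_i$ per coordinate, is $$ d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_{i=1}^{p} \left( \frac{x_i - y_i}{s_i} \right)^2 }. $$ If every $s_i$ equals a common $s$, this is just the ordinary Euclidean distance divided by $s$, so one unit of the metric really is one standard deviation; when the $s_i$ differ, each coordinate difference is measured in its own standard deviations and the total is only a normalized combination of them.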
