Why is the Manhattan distance (or block distance) appropriate when I have a discrete data set


Why is the Manhattan distance (or block distance) appropriate when I have a discrete data set and the Euclidean distance is appropriate when I have continuous numerical variables?

Thanks for reply

Best Answer

I wouldn't always agree with that statement, but here is an intuition for why this is probably a reasonable default.

  • Euclidean distance is 'as the crow flies'. It is the length of the shortest path if you linearly interpolate between the two points in a continuous space. The straight line from (0,0) to (1,1) passes through (0.5,0.5), and the distance is $\sqrt{2}$.
  • Manhattan distance assumes independent attributes. In particular, it assumes no continuous linear interpolation between them: you move along one attribute at a time. If you have discrete variables, the Manhattan distance is the number of unit steps you have to take to get from one point to the other.
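To make the two bullets concrete, here is a minimal sketch in plain Python comparing both distances for the points (0,0) and (1,1) from the example. The function names `euclidean` and `manhattan` are just illustrative helpers, not part of any library.

```python
import math

def euclidean(p, q):
    # Straight-line ("as the crow flies") distance between two points.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    # Sum of per-attribute absolute differences: the number of unit
    # steps needed when you can only move along one attribute at a time.
    return sum(abs(a - b) for a, b in zip(p, q))

origin, target = (0, 0), (1, 1)
d_euclid = euclidean(origin, target)      # sqrt(2)
d_manhattan = manhattan(origin, target)   # 2 unit steps

print(d_euclid, d_manhattan)
```

On an integer grid there is no point (0.5,0.5) to pass through, so the diagonal shortcut that Euclidean distance measures does not exist; the two unit steps counted by Manhattan distance are the shortest available path.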

But that is just an intuition, a "rule of thumb".

For example, for a discrete histogram I'd rather use a divergence-based distance than either of the two.
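The answer does not say which divergence it has in mind. As one possible illustration, the sketch below uses the Jensen-Shannon divergence, chosen here only because it is symmetric and stays finite when a bin is empty in one histogram. The helper names are hypothetical, and the code assumes both histograms have the same bins and a non-zero total count.

```python
import math

def kl_divergence(p, q):
    # Kullback-Leibler divergence KL(p || q) in nats.
    # Terms with p_i == 0 contribute nothing and are skipped.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jensen_shannon(hist_a, hist_b):
    # Normalize raw counts to probability distributions.
    total_a, total_b = sum(hist_a), sum(hist_b)
    p = [x / total_a for x in hist_a]
    q = [x / total_b for x in hist_b]
    # Mixture distribution; it is positive wherever p or q is positive,
    # so the logarithms in kl_divergence are always defined.
    m = [(a + b) / 2 for a, b in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

same = jensen_shannon([1, 2, 3], [2, 4, 6])   # same shape, different scale
disjoint = jensen_shannon([5, 0], [0, 5])     # no overlapping mass
print(same, disjoint)
```

Unlike Manhattan or Euclidean distance on the raw counts, this compares the shapes of the two histograms: histograms that differ only in total count have divergence zero, and histograms with no overlapping mass reach the maximum of $\ln 2$.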

Or suppose you treat "money" as discrete (which I would not suggest doing here), alongside the number of products bought. There will usually be some linear dependence between these attributes, so Euclidean distance is likely more appropriate (not that it would make any sense to apply a distance function to quantity and cost without careful feature engineering).