Solved – Which distance to use? E.g., Manhattan, Euclidean, Bray-Curtis, etc.

I am not a community ecologist, but these days I am working on community ecology data.

What I can't work out, apart from the mathematics of these distances, is what the criteria are for choosing each distance and in which situations each one can be applied. For instance, which distance should be used with count data? How do I turn the slope angle at two locations into a distance? Or the temperature or rainfall at two locations? What assumptions does each distance make, and when does each one make sense?

Best Answer

Unfortunately, in most situations there is not a clear-cut answer to your question. That is, for any given application, there are surely many distance metrics which will yield similar and accurate answers. Considering that there are dozens, and probably hundreds, of valid distance metrics actively being used, the notion that you can find the "right" distance is not a productive way to think about the problem of selecting an appropriate distance metric.

I would instead focus on not picking the wrong distance metric. Do you want your distance to reflect "absolute magnitude" (for example, you are interested in using the distance to identify stocks that have similar mean values), or to reflect overall shape of the response (e.g. stock prices that fluctuate similarly over time, but may have entirely different raw values)? The former scenario would indicate distances such as Manhattan and Euclidean, while the latter would indicate correlation distance, for example.
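To make the magnitude-versus-shape distinction concrete, here is a minimal sketch in plain Python. The three price series and the function names are made up for illustration, and correlation distance is taken here to be 1 minus the Pearson correlation, which is one common definition among several.

```python
import math

def manhattan(x, y):
    """Sum of absolute coordinate differences (sensitive to magnitude)."""
    return sum(abs(a - b) for a, b in zip(x, y))

def euclidean(x, y):
    """Straight-line distance (sensitive to magnitude)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def correlation_distance(x, y):
    """1 - Pearson correlation: near 0 for the same shape, near 2 for the opposite shape."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return 1.0 - cov / (sd_x * sd_y)

# Hypothetical series: stock_b has the same shape as stock_a but a much
# higher level; stock_c has a similar level to stock_a but the opposite shape.
stock_a = [1.0, 2.0, 3.0, 4.0, 5.0]
stock_b = [101.0, 102.0, 103.0, 104.0, 105.0]
stock_c = [5.0, 4.0, 3.0, 2.0, 1.0]

for name, other in [("b", stock_b), ("c", stock_c)]:
    print(
        f"a vs {name}:",
        f"manhattan={manhattan(stock_a, other):.2f}",
        f"euclidean={euclidean(stock_a, other):.2f}",
        f"correlation={correlation_distance(stock_a, other):.2f}",
    )
```

Manhattan and Euclidean put stock_a closer to stock_c (similar level), while correlation distance puts it closer to stock_b (similar shape). Neither ranking is wrong; they answer different questions.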

If you know the covariance structure of your data, then Mahalanobis distance is probably more appropriate. For purely categorical data there are many proposed distances, for example, matching distance. For mixed categorical and continuous data, Gower's distance is popular (although somewhat theoretically unsatisfying, in my opinion).
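As a rough sketch of these three distances, assuming the simplest textbook forms: Mahalanobis is restricted to two variables so the covariance matrix can be inverted by hand, matching distance is the fraction of attributes that differ, and the Gower-style distance averages range-scaled absolute differences for numeric fields with 0/1 mismatches for categorical ones. All function names, field layouts, and example records are hypothetical.

```python
import math

def mahalanobis_2d(x, y, cov):
    """Mahalanobis distance for two variables, given a 2x2 covariance matrix."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    if det == 0:
        raise ValueError("covariance matrix is singular")
    # Explicit inverse of a 2x2 matrix.
    inv = [[d / det, -b / det], [-c / det, a / det]]
    dx, dy = x[0] - y[0], x[1] - y[1]
    quad = dx * (inv[0][0] * dx + inv[0][1] * dy) + dy * (inv[1][0] * dx + inv[1][1] * dy)
    return math.sqrt(quad)

def matching_distance(x, y):
    """Fraction of categorical attributes on which two records disagree."""
    return sum(a != b for a, b in zip(x, y)) / len(x)

def gower_distance(x, y, ranges):
    """Gower-style distance for mixed records.

    ranges[i] is the observed range of field i if it is numeric,
    or None if field i is categorical.
    """
    parts = []
    for a, b, r in zip(x, y, ranges):
        if r is None:
            parts.append(0.0 if a == b else 1.0)  # categorical: simple mismatch
        else:
            parts.append(abs(a - b) / r)          # numeric: range-scaled difference
    return sum(parts) / len(parts)

# Variable 1 has variance 4 and variable 2 has variance 1, so a 2-unit step
# along variable 1 counts the same as a 1-unit step along variable 2.
cov = [[4.0, 0.0], [0.0, 1.0]]
print(mahalanobis_2d((2.0, 0.0), (0.0, 0.0), cov))
print(mahalanobis_2d((0.0, 1.0), (0.0, 0.0), cov))

# Hypothetical site descriptions (habitat, aspect, soil).
print(matching_distance(("forest", "north", "clay"), ("forest", "south", "clay")))

# Hypothetical mixed records (slope in degrees, soil type); slope range is 40.
print(gower_distance((10.0, "clay"), (20.0, "sand"), ranges=(40.0, None)))
```

In practice you would estimate the covariance matrix and the numeric ranges from the full data set rather than hard-coding them as done here.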

Finally, in my opinion your analysis will be strengthened if you demonstrate that your results and conclusions are robust to the choice of distance metric (within the subset of appropriate distances, of course). If your analysis changes drastically with subtle changes in the distance metric used, further study should be undertaken to identify the reason for the inconsistency.
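One simple way to probe this kind of robustness, sketched below under my own assumptions, is to check how often each observation keeps the same nearest neighbour when the metric changes. The points and the helper names (`nearest_neighbours`, `agreement`) are invented for the example; a real analysis would compare whatever downstream result you actually care about (clusters, ordination axes, test statistics).

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))

def nearest_neighbours(points, dist):
    """For each point, the index of its closest other point under `dist`."""
    result = []
    for i, p in enumerate(points):
        others = [j for j in range(len(points)) if j != i]
        result.append(min(others, key=lambda j: dist(p, points[j])))
    return result

def agreement(points, dist_a, dist_b):
    """Fraction of points whose nearest neighbour is the same under both metrics."""
    nn_a = nearest_neighbours(points, dist_a)
    nn_b = nearest_neighbours(points, dist_b)
    return sum(a == b for a, b in zip(nn_a, nn_b)) / len(points)

# Hypothetical observations forming two loose groups.
points = [
    (0.0, 0.0), (1.0, 0.2), (0.3, 2.0),
    (10.0, 10.0), (11.0, 10.5), (10.2, 13.0),
]

print("Euclidean neighbours:", nearest_neighbours(points, euclidean))
print("Manhattan neighbours:", nearest_neighbours(points, manhattan))
print("agreement:", agreement(points, euclidean, manhattan))
```

If agreement is low, it is worth looking at which observations flip: here it is the outlying point of each group, which sits at nearly the same distance from its two neighbours, so even a small change of metric reorders them.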
