Distance Calculation – Using Count Data or Discrete Probabilities for Calculating Distances?

distance

I am looking into calculating distances between vectors for some data analysis. One question I have is whether I should use actual count data or convert to discrete probabilities.

For some distances, the method is clear from the underlying theory (e.g. the Hellinger Distance). However, for other distances, which approach to use is not so clear. I have different references that use one or the other approach. It seems to be quite a subjective call.

There are many examples I could provide so, for the sake of space and simplicity, let’s take the Soergel Distance $(d_s)$ here. (I understand this is a generalised version of the Jaccard Distance).

$$d_s(\mathbf{x},\mathbf{y})=1-\frac{\sum_i \min(x_i , y_i)}{\sum_i \max(x_i , y_i)}$$

Firstly, let’s play with the following vectors using count data (taken from survey data): $\mathbf{x}=(5,13,17,14,7)$ and $\mathbf{y}=(12,10,15,41,19)$. Evaluating the formula, we get $d_s = 0.500$.

Now converting the count values to discrete probabilities (or proportions, if one prefers), we have $\mathbf{\hat x}=(0.089,0.232,0.304,0.250,0.125)$ and $\mathbf{\hat y}=(0.124,0.103,0.155,0.423,0.196)$. Evaluating the formula again, we get $d_s = 0.435$.
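For anyone who wants to reproduce the two figures, here is a minimal Python sketch. The function names `soergel` and `to_proportions` are my own, not from any library, and returning 0 for two all-zero vectors is an assumption made only to avoid dividing by zero.

```python
def soergel(x, y):
    """Soergel distance 1 - sum(min) / sum(max) for non-negative vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    numerator = sum(min(a, b) for a, b in zip(x, y))
    denominator = sum(max(a, b) for a, b in zip(x, y))
    if denominator == 0:
        # Assumption: two all-zero vectors are treated as identical.
        return 0.0
    return 1.0 - numerator / denominator


def to_proportions(counts):
    """Divide each count by the vector total so the entries sum to 1."""
    total = sum(counts)
    return [c / total for c in counts]


x = [5, 13, 17, 14, 7]
y = [12, 10, 15, 41, 19]

d_counts = soergel(x, y)
d_props = soergel(to_proportions(x), to_proportions(y))

print(f"counts:      {d_counts:.3f}")  # 0.500
print(f"proportions: {d_props:.3f}")   # 0.435
```

The proportions are computed at full precision here rather than from the three-decimal values quoted above; the result still rounds to 0.435.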

So which is the ‘true’ Soergel distance between the vectors? Or is the respective distance ‘valid’ for each approach, which means stating the context is as critical as stating the distance?

Best Answer

As said in the answer by Jacques, the distance between the vectors will depend on whether counts or proportions are used. I have not come across any reference that explicitly states which approach is the best or, more correctly, which is the most appropriate.

A couple of observations may be able to guide you. Firstly, the distance for counts depends strongly on (is biased by) the total count of observations in each vector. It is possible to have a situation where every element-wise minimum comes from vector $\mathbf{A}$ and every element-wise maximum comes from vector $\mathbf{B}$. For example, let $\mathbf{A}=(5,7,3,6,10)$ and $\mathbf{B}=(7,12,9,12,14)$. [In a (multi)set theoretic sense, $A \subset B$.] The Soergel Distance for the count data between $\mathbf{A}$ and $\mathbf{B}$ is 0.426.
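To make the dependence on the totals explicit, here is the worked step for this example. The lowercase $a_i$ and $b_i$ are my notation for the components of $\mathbf{A}$ and $\mathbf{B}$. Because $a_i \le b_i$ for every $i$, the minimum is always $a_i$ and the maximum is always $b_i$, so the distance reduces to one minus the ratio of the two totals:

$$d_s(\mathbf{A},\mathbf{B})=1-\frac{\sum_i a_i}{\sum_i b_i}=1-\frac{31}{54}\approx 0.426$$

In this situation the count-based distance says nothing about the shape of the two distributions, only about how much larger one total is than the other.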

When converting to proportions, the Soergel Distance comes out to be 0.179. Given that this distance is bounded in $[0,1]$, this is clearly a 'notable' difference. Also, from a set theoretic view, under the proportional approach $A \not\subset B$. Furthermore, it seems counterintuitive that the count vectors, where one is a subset of the other, are 'further apart' than the proportion vectors, which merely overlap.
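The comparison for $\mathbf{A}$ and $\mathbf{B}$ can be checked with a short Python sketch. The helper names (`soergel`, `to_proportions`, `is_dominated`) are hypothetical, and `is_dominated` is just my stand-in for the multiset-subset condition: every element of the first vector is no larger than the matching element of the second.

```python
def soergel(x, y):
    """Soergel distance 1 - sum(min) / sum(max); assumes a non-zero denominator."""
    numerator = sum(min(a, b) for a, b in zip(x, y))
    denominator = sum(max(a, b) for a, b in zip(x, y))
    return 1.0 - numerator / denominator


def to_proportions(counts):
    """Divide each count by the vector total so the entries sum to 1."""
    total = sum(counts)
    return [c / total for c in counts]


def is_dominated(x, y):
    """True when every element of x is <= the matching element of y."""
    return all(a <= b for a, b in zip(x, y))


A = [5, 7, 3, 6, 10]
B = [7, 12, 9, 12, 14]
pA, pB = to_proportions(A), to_proportions(B)

count_dist = soergel(A, B)
prop_dist = soergel(pA, pB)

print(f"counts:      d = {count_dist:.3f}, subset = {is_dominated(A, B)}")
print(f"proportions: d = {prop_dist:.3f}, subset = {is_dominated(pA, pB)}")
# counts:      d = 0.426, subset = True
# proportions: d = 0.179, subset = False
```

After normalising, the first element of $\mathbf{A}$ (5/31, about 0.161) already exceeds the first element of $\mathbf{B}$ (7/54, about 0.130), which is why the subset relation no longer holds.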

In my view, I would choose the proportions approach, as it is less influenced by differences in the total counts of each vector. Proportions, being a proxy for probabilities, also offer better comparability. The count approach is more applicable when the total counts of the two vectors are 'roughly similar'. However, as to what counts as 'roughly similar', that's where subjectivity and a good dose of pragmatism come in, not to mention professional/academic judgement.
