Solved – Euclidean distance score and similarity

distance-functionssimilarities

I'm just working with the book Collective Intelligence (by Toby Segaran) and came across the Euclidean distance score. In the book the author shows how to calculate the similarity between two recommendation arrays (i.e. $\textrm{person} \times \textrm{movie} \mapsto \textrm{score})$ .

He calculates the Euclidean distance for two persons $p_1$ and $p_2$ by
$$d(p_1, p_2) = \sqrt{\sum_{i~\in~\textrm{item}} (s_{p_1} – s_{p_2})^2} $$

This makes completely sense to me. What I don't really understand is why he calculates at the end the following to get a "distance based similarity":

$$ \frac{1}{1 + d(p_1, p_2)} $$

So, I somehow get that this must be the conversion from a distance to a similarity (right?). But why does the formular looks like this? Can someone explain that?

Best Answer

The inverse is to change from distance to similarity.

The 1 in the denominator is to make it so that the maximum value is 1 (if the distance is 0).

The square root - I am not sure. If distance is usually larger than 1, the root will make large distances less important; if distance is less than 1, it will make large distances more important.

Related Solutions

Solved – How to measure distance for features with different scales

A very common solution for this very common problem (ie, over-weighting variables) is to standardize your data.

To do this, you just perform two successive column-wise operations on your data:

subtract the mean and
divide by the standard deviation

The rationale of these two operations is to ensure the values have zero-mean (when subtracting the mean in the numerator) and unit-variance (by dividing with the standard deviation in the denominator).

For instance, in NumPy:

>>> # first create a small data matrix comprised of three variables 
>>> # having three different 'scales' (means and variances)

>>> a = 10*NP.random.rand(6)
>>> b = 50*NP.random.rand(6)
>>> c = 2*NP.random.rand(6)
>>> A = NP.column_stack((a, b, c))
>>> A   # the pre-standardized data
    array([[  1.753,  37.809,   1.181],
           [  1.386,   8.333,   0.235],
           [  2.827,  40.5  ,   0.625],
           [  5.516,  47.202,   0.183],
           [  0.599,  27.017,   1.054],
           [  8.918,  35.398,   1.602]])

>>> # mean center the data (columnwise)
>>> A -= NP.mean(A, axis=0)
>>> A
    array([[ -1.747,   5.099,   0.368],
           [ -2.114, -24.377,  -0.578],
           [ -0.673,   7.79 ,  -0.189],
           [  2.016,  14.493,  -0.631],
           [ -2.901,  -5.693,   0.24 ],
           [  5.418,   2.688,   0.789]])

>>> # divide by the standard deviation
>>> A /= NP.std(A, axis=0)
>>> A
    array([[-0.606,  0.409,  0.716],
           [-0.734, -1.957, -1.125],
           [-0.233,  0.626, -0.367],
           [ 0.7  ,  1.164, -1.228],
           [-1.007, -0.457,  0.468],
           [ 1.881,  0.216,  1.536]])

Solved – Euclidean Distance b/t unit vectors or cosine similarity where vectors are document vectors

Do not compare the raw numbers.

An Euclidean distance of $x$ and a cosine similarity of $y$ and Euclidean distance on normalized vectors of $z$ live on different scales and just cannot be compared. This is like comparing "shoe size" with "mass in lb". There probably is some relationship, but none that you can see by looking at a small sample.

k-means optimizes Euclidean distances (and it shouldn't really be used with other distances, at the mean may not be a least squares estimator of the central object). So the clusters you computed already favor Euclidean neighbors. If you then refine your search using Cosine distance, it will likely be only within the subset that was already Euclidean-similar.

So a lot of the things you observe probably are artifacts from the clustering preprocessing you did and similar aspects of your code.

Note that when using non-normalized vectors, you essentially put a high weight on the document size. This is why TF-IDF is commonly used: a document and the concatenation of two copies of the same document then have the same vector.

Best Answer

Related Solutions

Solved – How to measure distance for features with different scales

Solved – Euclidean Distance b/t unit vectors or cosine similarity where vectors are document vectors

Related Question