Solved – Choosing a measure of similarity to quantify similarity between individuals on a set of personality scales

distance-functionsinformation retrieval

I have a bunch of users. Each user has a number of personality attributes, such as "fitness level" or "eco-consciousness", rated on a scale from 1 to 5. I want to calculate how similar two users are, so I can show each user a sorted list of "most similar users".

This seems to be a classic IR problem, and I've seen three different metrics used, but no discussion of why to choose one over the other:

Simple Arithmetic. The scores are already normalized to the same scale, so I can just add each user's scores up, and compare the sums to see who is most similar.
Cosine Similarity. Treat each user as an n-dimensional vector, where each scale is one dimension. Calculate the cosine of the angle between two users' vectors; cosines closer to 1 (smaller angles) are more similar.
Euclidean Distance. Each user is an n-dimensional vector again, but this time, calculate the distance between endpoints. Users that are close together are similar.

What are the advantages and disadvantages of each method? How does that change if the scores are not normalized to the same scale (i.e. if I add an "age" attribute)?

Best Answer

Align similarity formula with conceptual notion of similarity

I would try to align the mathematical formula used to calculate similarity with what you intuitively or theoretically mean by similarity. A few conceptual issues include:
- Do you standardise within person and thereby focus on the profile of scores rather than the raw differences?
- Do you want to square differences on individual attributes (rather than take absolute differences) and thereby weight a few larger differences more than a larger number of small differences?
- Do you to weight differences on each variable equally or let some count for more; and if some are to count for more, are you going to decide how this operates or will you let something like the standard deviation of the attribute determine this?

Inspect groupings produced by different similarity formulas

It's a good idea to inspect the similarity based groupings produced by different algorithms. See whether they converge or differ. Inspect disagreements and see which algorithm seems to map on to your conceptual definition of similarity.

A starting point

I think euclidean or squared euclidean distance based on standardised variables is often a good option (it would be my starting point for your problem). Of course, if all your variables have a similar standard deviation (as is often the case when comparing likert items), then whether you standardise or not probably will not make much difference.

Related Solutions

Solved – How to measure distance for features with different scales

A very common solution for this very common problem (ie, over-weighting variables) is to standardize your data.

To do this, you just perform two successive column-wise operations on your data:

subtract the mean and
divide by the standard deviation

The rationale of these two operations is to ensure the values have zero-mean (when subtracting the mean in the numerator) and unit-variance (by dividing with the standard deviation in the denominator).

For instance, in NumPy:

>>> # first create a small data matrix comprised of three variables 
>>> # having three different 'scales' (means and variances)

>>> a = 10*NP.random.rand(6)
>>> b = 50*NP.random.rand(6)
>>> c = 2*NP.random.rand(6)
>>> A = NP.column_stack((a, b, c))
>>> A   # the pre-standardized data
    array([[  1.753,  37.809,   1.181],
           [  1.386,   8.333,   0.235],
           [  2.827,  40.5  ,   0.625],
           [  5.516,  47.202,   0.183],
           [  0.599,  27.017,   1.054],
           [  8.918,  35.398,   1.602]])

>>> # mean center the data (columnwise)
>>> A -= NP.mean(A, axis=0)
>>> A
    array([[ -1.747,   5.099,   0.368],
           [ -2.114, -24.377,  -0.578],
           [ -0.673,   7.79 ,  -0.189],
           [  2.016,  14.493,  -0.631],
           [ -2.901,  -5.693,   0.24 ],
           [  5.418,   2.688,   0.789]])

>>> # divide by the standard deviation
>>> A /= NP.std(A, axis=0)
>>> A
    array([[-0.606,  0.409,  0.716],
           [-0.734, -1.957, -1.125],
           [-0.233,  0.626, -0.367],
           [ 0.7  ,  1.164, -1.228],
           [-1.007, -0.457,  0.468],
           [ 1.881,  0.216,  1.536]])

Solved – Euclidean Distance b/t unit vectors or cosine similarity where vectors are document vectors

Do not compare the raw numbers.

An Euclidean distance of $x$ and a cosine similarity of $y$ and Euclidean distance on normalized vectors of $z$ live on different scales and just cannot be compared. This is like comparing "shoe size" with "mass in lb". There probably is some relationship, but none that you can see by looking at a small sample.

k-means optimizes Euclidean distances (and it shouldn't really be used with other distances, at the mean may not be a least squares estimator of the central object). So the clusters you computed already favor Euclidean neighbors. If you then refine your search using Cosine distance, it will likely be only within the subset that was already Euclidean-similar.

So a lot of the things you observe probably are artifacts from the clustering preprocessing you did and similar aspects of your code.

Note that when using non-normalized vectors, you essentially put a high weight on the document size. This is why TF-IDF is commonly used: a document and the concatenation of two copies of the same document then have the same vector.