Cosine Similarity – How Higher Dimensions Impact Performance Compared to Euclidean Distance

Tags: cosine-similarity, euclidean, k-nearest-neighbour, nonparametric

I've generated 100 random vectors (data points) in $n \in \{1, \dots, 50\}$ dimensions. For each dimension I compared every pair of vectors and calculated the mean value over all pairs. I did this for all dimensions, first using Euclidean distance and then using cosine similarity.

It is also worth noting that all vector components are in the range [0, 1).
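
Roughly, the setup looks like the sketch below (assuming NumPy and SciPy; my exact script may differ in the details):

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
n_points = 100

mean_euclidean = []
mean_cosine_similarity = []
for d in range(1, 51):
    X = rng.random((n_points, d))  # 100 random vectors, components in [0, 1)
    # Mean pairwise Euclidean distance.
    mean_euclidean.append(pdist(X, metric="euclidean").mean())
    # SciPy's "cosine" metric is the cosine *distance* (1 - similarity),
    # so convert back to similarity before averaging.
    mean_cosine_similarity.append((1.0 - pdist(X, metric="cosine")).mean())

print(mean_euclidean[-1], mean_cosine_similarity[-1])
```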

This is the graph I get:

[Plot: mean pairwise Euclidean distance and mean pairwise cosine similarity vs. number of dimensions]

From what I've read on the topic, there shouldn't be a significant difference between these two measures, and both should be affected by the curse of dimensionality.

AFAIK cosine similarity lies in the interval [-1, 1], and in my case (all vector components are positive) it will be in [0, 1]. Is that why the cosine similarity is "much" smaller than the Euclidean distance? And also, why doesn't the cosine similarity curve rise with the number of dimensions?

Note: this is a question from my lab exercise: which of these measures would you use in a high-dimensional space?

In my opinion there isn't a clear answer to this.

Best Answer

This makes sense to me. When you add more dimensions, you are, in some sense, giving the points more ways to be far apart. On the number line, $x_0=0$ and $x_1 = 1$ are only $1$ unit apart. However, if we add a second dimension, $(x_0, y_0) = (0, 0)$ and $(x_1, y_1) = (1, 100)$ are a lot more than one unit apart. The angular measurement, by contrast, is always confined to the two-dimensional plane spanned by your two vectors, so it stays bounded no matter how many ambient dimensions you add.
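
To make that concrete, here is a small illustration (my own sketch, assuming NumPy; not part of the original experiment): pad two fixed directions with more and more coordinates. The Euclidean distance between them keeps growing, while the cosine similarity remains bounded.

```python
import numpy as np

def cosine_similarity(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

for d in (2, 10, 50):
    u = np.ones(d)    # (1, 1, ..., 1)
    v = np.zeros(d)   # (1, 0, ..., 0)
    v[0] = 1.0
    # Euclidean distance grows like sqrt(d - 1); cosine similarity is 1/sqrt(d),
    # which shrinks but never leaves [-1, 1].
    print(d, np.linalg.norm(u - v), cosine_similarity(u, v))
```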

We have two questions on here that might interest you.

Why is Euclidean distance not a good metric in high dimensions?

Square loss for "big data"

EDIT

You can decide if this makes you like or dislike cosine distance, but consider the points $(0, 1)\in\mathbb R^2$ and $(1, 0)\in\mathbb R^2$. They have the same cosine distance as $(0, 1)$ and $(2, 0)$, but the Euclidean distances are different.
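
A quick numerical check of that example (assuming NumPy):

```python
import numpy as np

def cosine_distance(u, v):
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

a = np.array([0.0, 1.0])
b = np.array([1.0, 0.0])
c = np.array([2.0, 0.0])

print(cosine_distance(a, b), cosine_distance(a, c))   # both 1.0 (orthogonal pairs)
print(np.linalg.norm(a - b), np.linalg.norm(a - c))   # sqrt(2) vs sqrt(5)
```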
