Cosine Similarity – Does It Depend on Magnitude in Word2Vec?

machine-learning, vector-analysis, vector-spaces, vectors

WAIT! STOP! Before you slap a comment on my post saying that cosine similarity is not influenced by magnitude (I thought the same until recently), I want you to see why I am asking my question, and then course-correct me! I feel like I am just not understanding something correctly, which makes me believe that what I am seeing is cosine similarity changing due to magnitude.

This is a very hard question for me to word, so bear with me as I try to make it easier. I thought it better to explain more than less (probably a bad motto to live by, for some people).

But for those who don't want to spend a while taking a peek into the thought processes in my mind, here's a

TLDR:
Suppose I am a personified model that wants to move a vector $c = (1,4,3,4)$ so that its cosine similarity with a target vector $v = (4,8,6,10)$ becomes exactly $1$, changing $c$ by the minimum amount. I can use the fact that cosine similarity measures how similar two vectors' proportions are to each other. So the closest vector to $c$ that matches the proportions of $v$ is $t = (2,4,3,5)$ (note that $t = \tfrac{1}{2}v$).

So if I am only allowed to change one dimension of $c$ at a time, I need to figure out which dimension is the best to change so that the proportions of $c$ move toward those of $v$ and my cosine similarity rises. You would think that it does not matter which dimension I choose, as long as the change (always by $1$ unit, no matter which dimension) moves me towards the vector $t$: the cosine similarity gain should be the same either way, since $c$ is moving closer to the "golden" proportional vector $t$ regardless. But it doesn't seem to work that way. If $c' = (2,4,3,4)$, then its cosine similarity with $v$ is $0.994$; but if $c' = (1,4,3,5)$, the cosine similarity is less, at $0.990$. So even though both versions of $c'$ are the same distance from matching the exact proportions of $v$ (each requires the same one-unit adjustment in a single dimension), they have different cosine similarities to $v$. Why does relying on cosine similarity make a model decide that the first option for $c'$ is the better path to take? Is there a concrete reason why $c' = (2,4,3,4)$ is better than $c' = (1,4,3,5)$?

This is important to me because if I want to know what decisions my model is making under the hood (from an abstract point of view), I want to know how it would decide between these options for a vector $c'$ that is closer to $t$ (remember, $t$ is the closest point to $c$ that matches the proportions of $v$). This feeds into my larger goal: knowing how a word embedding moves from its initial randomized dimensions towards its final position, training instance by training instance within each epoch, as the model adjusts these cosine similarities between vectors. And at an even more conceptual level, I want to know why people say that magnitude does not matter in cosine similarity when it seems to still matter for decisions like this.

I am sorry if this TLDR was also long; it's so hard for me to be concise on this topic in a way that an answer could actually satisfy my worries!

WHY I AM ASKING:
So I have been delving deep into Word2Vec recently, which has cosine similarity written all over it, and I have been looking at unique ways to represent the word-embedding vectors (in high-dimensional vector space) changing over each training instance in each epoch, since the raw numbers have no inherent meaning to us if we just stare at them. I have done a lot to get to the point I am at right now, but what is important to this question is that, along the way, I developed an understanding that cosine similarity can be thought of as how similar two vectors' proportions are to each other. The cosine of the angle between two vectors becomes larger (closer to $1$) as the two vectors' proportions get closer, and the opposite is true: if the two vectors' proportions drift apart, the cosine of their angle gets smaller (closer to $-1$). This is because the angle itself is determined by how close (similar) the proportions of the vectors are to each other.

THE DISCONNECT: So this is all fine and dandy, because cosine similarity still only cares (allow me to personify cosine similarity) about the proportions between two vectors, NOT their magnitudes. For example, the vectors

$(2,3,4,5)$, $(4,6,8,10)$, and $(8,12,16,20)$

will all still have the same cosine similarity to each other, no matter their magnitude. BUT this is where the tricky part comes in. And the best way I can explain it, is through another example.

Let's say you have a target vector $T = (4,8,6,10)$, and a context vector $C = (1,4,3,4)$ nearby that wants to match its proportions to the target vector (in other words, raise its cosine similarity), starting from a cosine similarity of $0.986$. If the context vector can only change one dimension at a time (by $1$ unit), there are two good options for getting closer to the target vector's proportions (raising the cosine similarity between the two): you can make the vector $Opt1 = (2,4,3,4)$ or $Opt2 = (1,4,3,5)$.

So, which is better? Well, this is the bombshell of my whole argument. If you go by "choose the one that raises the cosine similarity the most", something odd happens. The cosine similarity of $Opt1$ to $T$ is $0.994$; the cosine similarity of $Opt2$ to $T$ is $0.990$. Do you see it? Yes, both options increase the cosine similarity, but $Opt1$ increases it more than $Opt2$! Both options close the proportional gap to $T$ by the same amount (each brings one more dimension into proportion with $T$), but because $Opt1$ increased the dimension with the smaller initial magnitude ($1 \rightarrow 2$) while $Opt2$ increased the dimension with the larger initial magnitude ($4 \rightarrow 5$), $Opt1$ ends up with the better cosine similarity. So even though each option increases a single dimension by just $1$, the option that bumps the smaller dimension wins.
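To make this concrete, here is a minimal sketch in plain Python that reproduces these numbers (only the standard `math` module is used; `cos_sim`, `T`, `C`, `opt1`, and `opt2` are just my names for the quantities above):

```python
import math

def cos_sim(u, v):
    # dot product divided by the product of the two Euclidean norms
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

T = (4, 8, 6, 10)
C = (1, 4, 3, 4)
opt1 = (2, 4, 3, 4)  # first dimension bumped: 1 -> 2
opt2 = (1, 4, 3, 5)  # last dimension bumped:  4 -> 5

print(f"{cos_sim(C, T):.4f}")     # 0.9869
print(f"{cos_sim(opt1, T):.4f}")  # 0.9940
print(f"{cos_sim(opt2, T):.4f}")  # 0.9909
```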

SO WHAT DO I THINK ABOUT THIS? (WHY I THINK IT HAPPENED): Contrary to popular belief, cosine similarity isn't some math magic that takes in two vectors and spits out how similar they are. It uses math just like anything else (I really hope I am using the proper math in this post before making a bold statement like that, haha!). And on that same note, contrary to popular belief, cosine similarity in fact DOES use magnitude: the denominator normalizes each vector into a unit vector (a vector whose Euclidean length is $1$). The catch (and why I think the oddness in the change in cosine similarity happens) is that this normalization is done with the Euclidean norm.
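For reference, here is the formula that all of the "numerator" and "denominator" talk below refers to, for two vectors $u$ and $v$:

$$\cos(\theta) = \frac{u \cdot v}{\|u\|\,\|v\|} = \frac{\sum_i u_i v_i}{\sqrt{\sum_i u_i^2}\,\sqrt{\sum_i v_i^2}}$$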

The Euclidean norm for a vector $v$ with $n$ dimensions is: $$\|v\| = \sqrt{(d_1)^2 + (d_2)^2 + \dots + (d_n)^2}$$

As you can see, the squares of the dimensions are summed BEFORE taking the square root of the sum. And $x^2$ is quadratic. This means that as an input to the Euclidean norm (the denominator) increases linearly ($1 \rightarrow 2$ and $4 \rightarrow 5$), the output of the denominator increases at a faster rate the larger the input already is ($2^2 - 1^2 = 3$ but $5^2 - 4^2 = 9$; in general, $(x+1)^2 - x^2 = 2x + 1$, which grows with $x$). A disproportionate growth.

Let's take a moment and look at the numerator of cosine similarity. The dot product of $T$ with both $Opt1$ and $Opt2$ grows at a linear rate: each time the changing dimension increases by $1$ unit, the output increases by the same fixed amount (math is below).

NUMERATOR INPUT (DOT PRODUCT OF $T$ WITH $Opt2$ AS ITS LAST DIMENSION INCREASES LINEARLY):

$(1,4,3,5) \rightarrow (1,4,3,6) \rightarrow (1,4,3,7) \rightarrow (1,4,3,8)$

NUMERATOR OUTPUT:

$104 \rightarrow 114 \rightarrow 124 \rightarrow 134$

INCREASE IN OUTPUT (FROM PREVIOUS OUTPUT):

$10 \rightarrow 10 \rightarrow 10$

You can see that this is linear. Now let's look at how the denominator of cosine similarity grows with the same linear input.

DENOMINATOR INPUT (EUCLIDEAN NORM OF $T$ MULTIPLIED BY THE EUCLIDEAN NORM OF $Opt2$):

$(1,4,3,5) \rightarrow (1,4,3,6) \rightarrow (1,4,3,7) \rightarrow (1,4,3,8)$

DENOMINATOR OUTPUT:

$14.70 \cdot 7.14 \rightarrow 14.70 \cdot 7.87 \rightarrow 14.70 \cdot 8.66 \rightarrow 14.70 \cdot 9.49$

INCREASE IN OUTPUT (FROM PREVIOUS OUTPUT):

$10.77 \rightarrow 11.56 \rightarrow 12.15$

You can see that the increments themselves keep growing: the denominator's growth is non-linear (however slight the effect).
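Here is a small sketch (plain Python again, with the same $T$ and the same walk of $Opt2$'s last dimension; the variable names are mine) that reproduces both tables in one loop:

```python
import math

T = (4, 8, 6, 10)
norm_T = math.sqrt(sum(t * t for t in T))  # sqrt(216), about 14.70

prev_num, prev_den = None, None
for last in (5, 6, 7, 8):  # walk Opt2's last dimension up by 1 each step
    v = (1, 4, 3, last)
    num = sum(a * b for a, b in zip(v, T))           # numerator: dot product
    den = norm_T * math.sqrt(sum(a * a for a in v))  # denominator: product of norms
    if prev_num is not None:
        # numerator steps are constant; denominator steps keep growing:
        # "numerator +10" every time, denominator +10.77, +11.56, +12.15
        print(f"numerator +{num - prev_num}, denominator +{den - prev_den:.2f}")
    prev_num, prev_den = num, den
```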

So this means that in the cosine similarity equation, you have a numerator that grows linearly being divided by a denominator that grows non-linearly ($2^2 = 4$ but $5^2 = 25$). This means the cosine similarity function exhibits non-linear growth (it is growth in this case because the angle is acute, so the cosine of the angle is positive). With a larger input, your numerator will be linearly larger, but your denominator will be disproportionately larger. So the larger the dimension you increase, the smaller the resulting increase in your cosine similarity (and we want a larger increase).

SO IN THE END:
In the vector $(1,4,3,4)$, if you want a higher cosine similarity with the vector $(4,8,6,10)$ by matching its proportions, going from magnitude $1 \rightarrow 2$ is more beneficial for cosine similarity than going from magnitude $4 \rightarrow 5$. And that is why I think that cosine similarity does depend on magnitude.
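If you want to watch this decision being made explicitly, here is a sketch of the greedy choice under my assumptions: try adding $1$ to each dimension in turn and keep whichever candidate raises the cosine similarity most (`best_single_bump` is a hypothetical helper I made up, not part of any library):

```python
import math

def cos_sim(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def best_single_bump(c, target):
    """Hypothetical helper: add 1 to each dimension of c in turn and
    return the candidate with the highest cosine similarity to target."""
    candidates = [
        tuple(x + 1 if j == i else x for j, x in enumerate(c))
        for i in range(len(c))
    ]
    return max(candidates, key=lambda cand: cos_sim(cand, target))

print(best_single_bump((1, 4, 3, 4), (4, 8, 6, 10)))  # (2, 4, 3, 4)
```

As expected, the greedy choice bumps the smallest dimension, $1 \rightarrow 2$, rather than $4 \rightarrow 5$.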

MY QUESTION:
Why does everyone say that cosine similarity does not depend on magnitude, when it seems that it does (if you are choosing which dimension of a vector to change to increase cosine similarity)? Am I correct in my thinking? Did I make a massive error or misstep? Did I completely overlook a totally obvious answer? Please let me know! I would like to see if I am the scary math magician after all and was doing math magic (incorrect math) from the start.

Best Answer

The key here is that the cosine similarity measure $$c(u,v) = \frac{\langle u,v\rangle}{\|u\|\|v\|}$$ remains the same if only the magnitude of one of the vectors is modified. Let $w = \alpha u$ where $\alpha > 0$, then $$c(w,v) = \frac{\langle w,v\rangle}{\|w\|\|v\|} = \frac{\langle \alpha u,v\rangle}{\|\alpha u\|\|v\|} = \frac{\alpha}{\alpha}\frac{\langle u,v\rangle}{\|u\|\|v\|} = \frac{\langle u,v\rangle}{\|u\|\|v\|} = c(u,v).$$ Notice that every element of $u$ is scaled by the same factor $\alpha$ and the relative sizes of the elements (which determines the direction of the vectors) remain the same.
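A quick numerical check of this identity (plain Python; any $\alpha > 0$ gives a difference of zero up to floating-point error):

```python
import math

def cos_sim(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

u = (2, 3, 4, 5)
v = (4, 8, 6, 10)
for alpha in (0.5, 2.0, 100.0):
    w = tuple(alpha * x for x in u)  # scale every element of u by the same alpha
    # difference is 0 up to floating-point error, regardless of alpha
    print(cos_sim(w, v) - cos_sim(u, v))
```

Note that this scale invariance is exactly what "magnitude does not matter" means: scaling a *whole* vector changes nothing. Changing a *single* dimension, as in the question, changes the direction, not just the magnitude.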
