Solved – Cosine-Similarity vs non-linear measures

Tags: correlation, cosine similarity, natural language, nonlinear, nonlinearity

In NLP, people often use cosine similarity to measure how close two vectors are to each other. However, we know that cosine similarity is the same thing as Pearson correlation for centered vectors (see: Is there any relationship among cosine similarity, pearson correlation, and z-score?). To me, this means that we can view each vector as a random variable, and the entries of the vector as realizations of that random variable's underlying distribution.
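A quick numerical check of that equivalence (a minimal NumPy sketch; the data here are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 0.5 * x + rng.normal(size=100)

def cosine(u, v):
    """Cosine similarity: dot product normalized by the vector norms."""
    return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Cosine similarity of the mean-centered vectors equals Pearson r.
print(cosine(x - x.mean(), y - y.mean()))
print(np.corrcoef(x, y)[0, 1])  # the two values agree up to float error
```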

In that case, we also know that correlation only measures linear dependence. So my question is: why would cosine similarity be preferred over nonlinear measures of association between random variables, such as distance correlation (https://en.wikipedia.org/wiki/Distance_correlation)?

Best Answer

Cosine similarity is not a measure of (the strength of) linear association, as Pearson $r$ is; it is a measure of proportional association, which is a narrower notion. The difference lies in centering: $r$ is the cosine computed on centered data.

Cosine similarity is a measure of proportionality: if the points of a bivariate data cloud lie on a straight line passing through the coordinate origin, then cosine similarity is maximal, $\cos_{xy}=1$. If that straight line does not pass through the origin, or if the points deviate from lying on a straight line, then $\cos_{xy}$ gets smaller. Because Pearson $r$ is the $\cos$ of the cloud centered on both axes, a straight line of points always passes through the origin after centering, so for $r$ only deviations from lying on a straight line can decrease the coefficient: correlation is the extent of linearity. When $\cos$ is $1$, $r$ is also $1$ and full linearity is observed; however, if $r$ is $1$, $\cos$ is not necessarily $1$: full linearity is not enough for $\cos$ to be maximal. $\cos$ is anchored to an "external" point, the origin, while $r$ is anchored only to the data cloud itself, as represented by its mean.
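To see the distinction numerically, here is a minimal sketch (the data are arbitrary) contrasting a line through the origin with the same line shifted by an intercept:

```python
import numpy as np

def cosine(u, v):
    return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

def pearson(u, v):
    return np.corrcoef(u, v)[0, 1]

x = np.array([1.0, 2.0, 3.0, 4.0])

y_origin = 2 * x         # line through the origin
y_shifted = 2 * x + 10   # same slope, but intercept 10

print(cosine(x, y_origin), pearson(x, y_origin))    # 1.0 and 1.0
print(cosine(x, y_shifted), pearson(x, y_shifted))  # ~0.96 and 1.0
```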

From a regression standpoint, both $r$ and $\cos$ equal $R_{regr}=\sqrt{1-SS_{resid}/SS_{tot}}$, but $\cos$ corresponds to regression without an intercept, i.e. with the regression line forced to pass through the origin, and with $SS_{tot}$ computed as deviations from $Y=0$ rather than from $Y=\bar{Y}$.
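A short numeric sketch of that regression reading, reusing the shifted-line data from the previous snippet:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2 * x + 10

# Least-squares slope for a regression forced through the origin.
b = (x @ y) / (x @ x)

ss_resid = np.sum((y - b * x) ** 2)
ss_tot = np.sum(y ** 2)          # deviations from Y = 0, not from the mean
R_no_intercept = np.sqrt(1 - ss_resid / ss_tot)

cos_xy = (x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))
print(R_no_intercept, cos_xy)    # the two values coincide (~0.963)
```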

$\cos$ and $r$ are, respectively, the scalar product and the covariance, with the coefficient's sensitivity to the variables' scale or magnitude normalized away.
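Written out for vectors $x$ and $y$ of length $n$, with $\bar{x}$, $\bar{y}$ the means:

$$\cos_{xy}=\frac{\sum_i x_i y_i}{\sqrt{\sum_i x_i^2}\,\sqrt{\sum_i y_i^2}}, \qquad r_{xy}=\frac{\sum_i (x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_i (x_i-\bar{x})^2}\,\sqrt{\sum_i (y_i-\bar{y})^2}}.$$

The numerators are the scalar product and (up to a factor of $n$) the covariance; the denominators normalize away each variable's scale, and $r$ is exactly $\cos$ applied to the centered vectors $x-\bar{x}$ and $y-\bar{y}$.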

So cosine similarity and Pearson $r$ are not things to mix up when asking "what do they measure", just as covariance and Pearson $r$ are not.

As for distance correlation, the idea behind it is different from both cosine and $r$. It captures a notion of generalized association - linear, nonlinear, curvilinear - and that notion is framed in terms of stochastic independence. For a bivariate normal population, zero Pearson $r$ implies stochastic independence. Distance correlation generalizes this to any distribution, and it does not center the data to their mean (in its "double centering" operation, Euclidean distances are taken, not squared ones).
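For illustration, here is a minimal NumPy implementation of sample distance correlation for univariate data (a sketch following the definition on the linked Wikipedia page), applied to a quadratic relationship where Pearson $r$ is essentially zero but distance correlation is not:

```python
import numpy as np

def distance_correlation(x, y):
    """Sample distance correlation for 1-D samples."""
    x = np.asarray(x, float)[:, None]
    y = np.asarray(y, float)[:, None]
    # Pairwise Euclidean distances -- NOT squared.
    a = np.abs(x - x.T)
    b = np.abs(y - y.T)
    # Double centering of the distance matrices.
    A = a - a.mean(axis=0) - a.mean(axis=1, keepdims=True) + a.mean()
    B = b - b.mean(axis=0) - b.mean(axis=1, keepdims=True) + b.mean()
    dcov2 = (A * B).mean()   # squared sample distance covariance
    dvar_x = (A * A).mean()
    dvar_y = (B * B).mean()
    return np.sqrt(dcov2 / np.sqrt(dvar_x * dvar_y))

x = np.linspace(-1, 1, 201)
y = x ** 2                          # purely nonlinear dependence

print(np.corrcoef(x, y)[0, 1])      # ~0: Pearson r misses it
print(distance_correlation(x, y))   # clearly positive
```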