Correlation – Understanding Pearson Correlation, Covariance, and Cosine Similarity

Tags: correlation, cosine-similarity, descriptive-statistics, mathematical-statistics, self-study

In this post, the best answer gives an excellent mathematical explanation of the relationship among Pearson correlation, covariance, and cosine similarity, which I quote here ($\mathbf A$ is the data matrix); a short numerical check follows the list.

  • If you center columns (variables) of $\bf A$, then $\bf A'A$ is the scatter (or co-scatter, to be rigorous) matrix and $\mathbf {A'A}/(n-1)$ is the covariance matrix.
  • If you z-standardize columns of $\bf A$ (subtract the column mean and divide by the standard deviation), then $\mathbf {A'A}/(n-1)$ is the Pearson correlation matrix: correlation is covariance for standardized variables. The correlation is also called coefficient of linearity.
  • If you unit-scale columns of $\bf A$ (bring their SS, sum-of-squares, to 1), then $\bf A'A$ is the cosine similarity matrix. Cosine is also called coefficient of proportionality.
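
Here is that numerical check of all three statements (a minimal sketch in NumPy; the data matrix is just random illustrative data):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(100, 3))  # illustrative data matrix: n = 100 observations, 3 variables
n = A.shape[0]

# 1. Centered columns: A'A / (n-1) is the covariance matrix.
Ac = A - A.mean(axis=0)
assert np.allclose(Ac.T @ Ac / (n - 1), np.cov(A, rowvar=False))

# 2. z-standardized columns: A'A / (n-1) is the Pearson correlation matrix.
Az = Ac / A.std(axis=0, ddof=1)
assert np.allclose(Az.T @ Az / (n - 1), np.corrcoef(A, rowvar=False))

# 3. Unit-scaled columns (sum of squares brought to 1): A'A is the cosine similarity matrix.
norms = np.linalg.norm(A, axis=0)
Au = A / norms
assert np.allclose(Au.T @ Au, (A.T @ A) / np.outer(norms, norms))
```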

In addition to the mathematical explanation, is there an intuitive plot, like the Pearson correlation examples on Wikipedia (shown below), that shows the relationship between these three "similarity measures", i.e., what kind of shape each similarity metric is able to detect?

[Figure: Wikipedia's scatter-plot examples of data sets with various Pearson correlation coefficients]

Best Answer

We can ignore the matrix formulation, and just consider two vectors $x$ and $y$ (since the matrix formulation is just the vector operation repeated over different pairs of vectors). One intuitive/geometric distinction between covariance/correlation/cosine similarity is their invariance to different transformations of the input. That is, if we transform $x$ and $y$, under what types of transformations will the scores keep the same value?

Covariance subtracts the means before taking the dot product. Therefore, it's invariant to shifts.
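
For instance (a minimal NumPy check with arbitrary numbers):

```python
import numpy as np

x = np.array([1.0, 2.0, 4.0, 7.0])
y = np.array([3.0, 1.0, 5.0, 6.0])

# Shifting either vector by a constant leaves the covariance unchanged ...
assert np.isclose(np.cov(x, y)[0, 1], np.cov(x + 100.0, y - 42.0)[0, 1])
# ... but scaling does change it: covariance is not scale-invariant.
assert not np.isclose(np.cov(x, y)[0, 1], np.cov(2.0 * x, y)[0, 1])
```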

Pearson correlation subtracts the means and divides by the standard deviations before taking the dot product. Therefore, it's invariant to shifts and scaling.
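
The same kind of check for Pearson correlation (note that a negative scale factor would flip the sign of $r$):

```python
import numpy as np

x = np.array([1.0, 2.0, 4.0, 7.0])
y = np.array([3.0, 1.0, 5.0, 6.0])

# Any shift plus positive scaling of either vector leaves r unchanged.
r = np.corrcoef(x, y)[0, 1]
r_affine = np.corrcoef(3.0 * x + 10.0, 0.5 * y - 2.0)[0, 1]
assert np.isclose(r, r_affine)
```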

Cosine similarity divides by the norms before taking the dot product. Therefore it's invariant to scaling, but not shifts. Geometrically, it can be thought of as measuring the size of the angle between the two vectors (as its name suggests, it's the cosine of the angle).
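
And the corresponding check for cosine similarity (same arbitrary vectors):

```python
import numpy as np

def cosine(a, b):
    # Cosine of the angle between two vectors: a.b / (|a| |b|).
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

x = np.array([1.0, 2.0, 4.0, 7.0])
y = np.array([3.0, 1.0, 5.0, 6.0])

# Invariant to (positive) scaling of either vector ...
assert np.isclose(cosine(x, y), cosine(5.0 * x, 0.1 * y))
# ... but not to shifts, since shifting changes the angle.
assert not np.isclose(cosine(x, y), cosine(x + 100.0, y))
```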

All of these quantities depend on the dot product, so they can only detect linear structure. To address a question from the comments: mutual information is fully general and can detect structure for any distribution. However, it's harder to estimate from finite data than the other quantities, and more care must be taken. It also measures dependence without indicating the direction of a relationship (e.g. variables that are correlated or anticorrelated can have the same mutual information), which makes it a valid measure of dependence even when no 'direction of relationship' exists, as with non-monotonic relationships. If the goal is to detect relationships that are nonlinear but monotonic, then Spearman rank correlation and Kendall's tau are good options.
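
For example, a sketch using SciPy's `spearmanr` and `kendalltau` (the fifth-power relationship is just an illustrative monotonic nonlinearity):

```python
import numpy as np
from scipy.stats import kendalltau, pearsonr, spearmanr

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=200)
y = x**5  # nonlinear but strictly monotonic in x

# Rank-based measures see a perfect monotonic relationship ...
print(spearmanr(x, y)[0])   # 1.0
print(kendalltau(x, y)[0])  # 1.0
# ... while Pearson r stays below 1 because the relationship is not linear.
print(pearsonr(x, y)[0])    # roughly 0.8 here
```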
