I am aware of cosine similarity, which measures the angle between "two" vectors. A prototype for cosine similarity would look something like this:
float cosine_similarity(vector<float> a, vector<float> b);
Are there any similarity measures that measure the similarity between "n" vectors? A prototype of such a function would look something like this:
float cosine_similarity(vector<vector<float>> vectors);
PS: I am NOT looking for clustering/classification sort of a thing.
I would appreciate it if someone could point me in the right direction.
Thanks!
Best Answer
The cosine similarity between two column vectors $x_1$ and $x_2$ is simply the dot product between their unit vectors $$\mathrm{CosSim}[x_1,x_2]=\frac{x_1}{\|x_1\|}\bullet\frac{x_2}{\|x_2\|}$$ and varies from -1 to +1, similar to a correlation coefficient $R$ (to which it is closely related).
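As a concrete starting point, here is a minimal C++ sketch of the two-vector case, matching the question's prototype (const references and the non-zero, equal-length input assumption are my additions):

```cpp
#include <cmath>
#include <vector>
using std::vector;

// Dot product of the two vectors divided by the product of their norms.
// Assumes both vectors are non-zero and have equal length.
float cosine_similarity(const vector<float>& a, const vector<float>& b) {
    float dot = 0.0f, na = 0.0f, nb = 0.0f;
    for (size_t i = 0; i < a.size(); ++i) {
        dot += a[i] * b[i];
        na  += a[i] * a[i];
        nb  += b[i] * b[i];
    }
    return dot / (std::sqrt(na) * std::sqrt(nb));
}
```

Parallel vectors give $+1$, orthogonal vectors give $0$, and anti-parallel vectors give $-1$, mirroring the range of a correlation coefficient.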
There are many ways to generalize this idea to a set of $n$ vectors, but as requested in the comments, here I will expand on one possibility, related to the idea of (linear) dimensionality reduction.
First, in the $n=2$ case, we can think of $R^2\in[0,1]$ as the "fraction of the variance explained" by a linear regression. For the squared cosine similarity, a similar interpretation is possible, except the associated regression has no "intercept term" (i.e. the vectors are scaled, but not centered).
For $n$ vectors, we can concatenate them into a matrix $$X=\begin{bmatrix}x_1 \, \ldots \, x_n\end{bmatrix}$$ Analogous to the cosine similarity case, we can consider dot products based on the scaled matrix $$Y=\begin{bmatrix}y_1 \, \ldots \, y_n\end{bmatrix}$$ assembled from the corresponding unit vectors $$y_i=\frac{x_i}{\|x_i\|}$$ In general we could then conduct various analyses on the "correlation matrix" $Y^TY$, whose entries are the pairwise cosine similarities, including dimension reduction via PCA.
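To make the construction concrete, here is a C++ sketch that forms $Y$ by normalizing each column and then computes $Y^TY$; the helper name `cosine_gram` and the convention of storing vectors column-wise (one inner `vector<float>` per $x_i$) are my own choices for illustration:

```cpp
#include <cmath>
#include <vector>
using std::vector;

// Scale each x_i to unit length, giving Y; then G = Y^T Y is the n x n
// matrix of pairwise cosine similarities (a "correlation matrix" without
// centering). vectors[i] holds the column x_i; all x_i assumed non-zero.
vector<vector<float>> cosine_gram(const vector<vector<float>>& vectors) {
    size_t n = vectors.size();
    vector<vector<float>> y = vectors;
    for (auto& col : y) {
        float norm = 0.0f;
        for (float v : col) norm += v * v;
        norm = std::sqrt(norm);
        for (float& v : col) v /= norm;  // y_i = x_i / ||x_i||
    }
    vector<vector<float>> g(n, vector<float>(n, 0.0f));
    for (size_t i = 0; i < n; ++i)
        for (size_t j = 0; j < n; ++j)
            for (size_t k = 0; k < y[i].size(); ++k)
                g[i][j] += y[i][k] * y[j][k];  // (Y^T Y)_{ij} = y_i . y_j
    return g;
}
```

The diagonal entries of $G$ are always $1$, and each off-diagonal entry $G_{ij}$ is exactly the pairwise cosine similarity of $x_i$ and $x_j$.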
The general idea here is to create a low-rank approximation to the matrix $Y$ by truncating its singular value decomposition (SVD). Mathematically, we have $$Y=USV^T=\sum_{k=1}^n\sigma_ku_kv_k^T$$ where the singular values $\sigma_1,\ldots,\sigma_n$ are non-negative and arranged in decreasing order. (For simplicity I am assuming that for $x_i\in\mathbb{R}^m$ the number of vectors is at most the dimension, i.e. $n\leq m$; otherwise the sum will have $m$ terms.) If we truncate the sum at $K<n$, then we get a "rank-$K$ approximation" $\hat{Y}_K$ to the matrix $Y=\hat{Y}_n$. The truncated SVD is optimal in the sense that it minimizes the squared Frobenius norm of the reconstruction error, $\|\hat{Y}_K-Y\|_F^2$.
The squared Frobenius norm is just the sum of the squares of all the entries of a matrix. In particular, since each column of the scaled matrix $Y$ is a unit vector, we will always have $$\|Y\|_F^2=n$$ For any matrix, the squared Frobenius norm is also equal to the sum of the squares of the singular values. In particular, for the rank-$K$ approximation $\hat{Y}_K$ we have $$\|\hat{Y}_K\|_F^2=\sum_{k=1}^K\sigma_k^2$$ So we can define an "order-$K$ coefficient of determination" as $$R_K^2\equiv\frac{\|\hat{Y}_K\|_F^2}{\|Y\|_F^2}\in[0,1]$$
A simple option would be to just use the first singular value, i.e. $$R_1^2\equiv\frac{\sigma_1^2}{n}$$ The maximum similarity would be for $n$ parallel vectors (i.e. scalar multiples of a single vector), giving $R_1^2=1$. The minimum similarity would be for $n$ orthogonal vectors, giving $R_1^2=\frac{1}{n}$.
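The quantity $R_1^2=\sigma_1^2/n$ can be computed without a full SVD, because $\sigma_1^2$ is the largest eigenvalue of the $n\times n$ Gram matrix $G=Y^TY$. Here is a self-contained C++ sketch using power iteration; the function name `cosine_similarity_n` matches the question's requested prototype, and the fixed iteration count is an assumption for illustration (a convergence tolerance would be more robust):

```cpp
#include <cmath>
#include <vector>
using std::vector;

// Similarity of n vectors as R_1^2 = sigma_1^2 / n, where sigma_1 is the
// leading singular value of the column-normalized matrix Y. sigma_1^2 is
// the largest eigenvalue of G = Y^T Y, estimated by power iteration.
float cosine_similarity_n(vector<vector<float>> vectors) {
    size_t n = vectors.size();
    // Normalize each vector to unit length: y_i = x_i / ||x_i||.
    for (auto& x : vectors) {
        float norm = 0.0f;
        for (float v : x) norm += v * v;
        norm = std::sqrt(norm);
        for (float& v : x) v /= norm;
    }
    // Gram matrix G = Y^T Y (pairwise cosine similarities).
    vector<vector<float>> g(n, vector<float>(n, 0.0f));
    for (size_t i = 0; i < n; ++i)
        for (size_t j = 0; j < n; ++j)
            for (size_t k = 0; k < vectors[i].size(); ++k)
                g[i][j] += vectors[i][k] * vectors[j][k];
    // Power iteration for the largest eigenvalue of the PSD matrix G.
    vector<float> u(n, 1.0f);
    float lambda = 0.0f;
    for (int iter = 0; iter < 1000; ++iter) {
        vector<float> w(n, 0.0f);
        for (size_t i = 0; i < n; ++i)
            for (size_t j = 0; j < n; ++j)
                w[i] += g[i][j] * u[j];
        float norm = 0.0f;
        for (float v : w) norm += v * v;
        norm = std::sqrt(norm);
        lambda = norm;  // ||G u|| -> sigma_1^2 once u is a unit vector
        for (size_t i = 0; i < n; ++i) u[i] = w[i] / norm;
    }
    return lambda / static_cast<float>(n);  // R_1^2 in [1/n, 1]
}
```

For $n$ parallel vectors $G$ is the all-ones matrix with $\sigma_1^2=n$, giving $R_1^2=1$; for $n$ mutually orthogonal vectors $G$ is the identity, giving $R_1^2=\frac{1}{n}$, matching the bounds above.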
A few final notes:

- The leading singular value $\sigma_1$ can be computed efficiently via a truncated SVD, e.g. `svds(Y,1)` in Matlab.
- Since $R_1^2$ ranges over $[\frac{1}{n},1]$, it can be rescaled to $[0,1]$ via $\frac{R_1^2-\frac{1}{n}}{1-\frac{1}{n}}=\frac{\sigma_1^2-1}{n-1}$, e.g. `CosSimN = @(Y) (svds(Y,1)^2-1)/(size(Y,2)-1)` in Matlab, if we assume that $Y$ is already available as input.