Is feature normalisation needed prior to computing cosine distance?

cosine distance · cosine similarity · normalization · similarities

I have a dataset of equal-length feature vectors, where each vector contains around 20 features extracted from an audio file (fundamental frequency, BPM, ratios of high to low frequencies, etc.).

I am currently using cosine similarity to measure the distance between vectors, as an indication of sound similarity (e.g. between two files).

I understand that for Euclidean distance it is important to normalise features across the dataset prior to computing distances. Is this also the case with cosine distance?

If not, is there a similarity metric that would be agnostic of the ranges of individual features?

Or, alternatively, is there a "quick and dirty" method for weighting the features (in conjunction with an appropriate similarity measure) that doesn't require access to the entire dataset?

The features have very different ranges, but for technical reasons I'd ideally like to avoid a normalisation step.

Best Answer

The definition of the cosine similarity is:

$$ \text{similarity} = \cos(\theta) = {\mathbf{A} \cdot \mathbf{B} \over \|\mathbf{A}\|_2 \|\mathbf{B}\|_2} = \frac{ \sum\limits_{i=1}^{n}{A_i B_i} }{ \sqrt{\sum\limits_{i=1}^{n}{A_i^2}} \sqrt{\sum\limits_{i=1}^{n}{B_i^2}} } $$
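For reference, here is a minimal NumPy sketch of this formula; the function name `cosine_similarity` and the example vectors are purely illustrative:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Illustrative feature vectors, e.g. (fundamental frequency, BPM, a spectral ratio)
a = np.array([440.0, 120.0, 0.30])
b = np.array([220.0, 128.0, 0.50])
print(cosine_similarity(a, b))
```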

It is sensitive to the mean of the features. To see this, pick some $j \in \{1, \ldots, n\}$ and add a very large positive number $k$ to the $j$th component of every vector. For large $k$, the $j$th terms dominate both the numerator and the denominator, so the similarity tends to $$ \frac{k^2}{\sqrt{k^2}\,\sqrt{k^2}} = 1, $$ no matter how different the remaining features are.
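You can check this numerically. The following sketch (reusing the illustrative `cosine_similarity` helper from above) takes two vectors that point in nearly opposite directions and shows that a large common offset `k` on one feature drives the similarity toward 1:

```python
import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, -2.0, 3.0])
b = np.array([-3.0, 2.0, -1.0])
print(cosine_similarity(a, b))   # about -0.71: the vectors disagree strongly

k = 1e6                          # large positive offset on the first feature
a[0] += k
b[0] += k
print(cosine_similarity(a, b))   # about 1.0: the offset swamps all other features
```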

For this reason, the adjusted cosine similarity is often used. It is simply the cosine similarity applied to mean-removed features.
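A minimal sketch of the adjusted cosine similarity, assuming the feature vectors are stacked into a matrix `X` with one vector per row; the function name and data are illustrative, not a standard API:

```python
import numpy as np

def adjusted_cosine_similarity(X, i, j):
    """Cosine similarity between rows i and j of X after removing per-feature means."""
    Xc = X - X.mean(axis=0)      # subtract each feature's mean across the dataset
    a, b = Xc[i], Xc[j]
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

X = np.array([[440.0, 120.0, 0.30],
              [220.0, 128.0, 0.50],
              [330.0,  90.0, 0.42]])
print(adjusted_cosine_similarity(X, 0, 1))
```

Note that computing the per-feature means requires a pass over the dataset, which the question was hoping to avoid.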
