Cosine similarity between a clean signal and its noisy version

circular-statistics, distributions, mathematical-statistics

Given a $D$-dimensional datum that is an iid sample from a spherical Gaussian distribution, and the noise-corrupted version of that datum generated by adding spherical Gaussian noise, is there a useful way to express the distribution of the cosine similarity between the signal and its noisy version?

Here's what I tried: since cosine similarity depends only on the angle between these two $D$-dimensional vectors, and not their magnitude or absolute angle, without loss of generality we can "rescale" the problem so that the signal has Euclidean norm of 1, and also rotate the problem so that the signal is aligned with the first axis. Then the problem is stated as:

Given a $D$-dimensional datum $X=(1,0,0,\dots)'$ and its noise-corrupted version $Y = X + \mathcal{N}^D(0, \sigma)$ (where $\sigma$ is the standard deviation), what is the distribution of the cosine similarity $\frac{X \cdot Y}{\|X\| \, \|Y\|}$?

The constraints on $X$ mean the expression simplifies to $\frac{y_1}{||Y||}$, where $y_1$ is the first element of $Y$, and clearly $y_1$ is distributed as $\mathcal{N}(1, \sigma)$. The overall expression $\frac{y_1}{||Y||}$ can be expanded, but I don't know of any way to rewrite it that yields, for example, expressions for its mean and variance.

(Some simple results for the extremes: When $\sigma=0$, the cosine similarity is 1. As $\sigma\to\infty$, the distribution of the cosine similarity approaches that of the dot product between two random unit vectors.)
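For what it's worth, the two extremes can be checked numerically. Below is a minimal Monte Carlo sketch using only the Python standard library; the function names, sample size, and seed are my own illustrative choices. It relies on the fact that the dot product between two independent uniformly random unit vectors in $D$ dimensions has mean 0 and variance $1/D$.

```python
import math
import random
import statistics


def sample_cosine(D, sigma, rng):
    """Draw one cosine similarity between X = (1, 0, ..., 0) and Y = X + noise."""
    # Noise: independent N(0, sigma) in every coordinate (sigma is a std. dev.)
    y = [rng.gauss(0.0, sigma) for _ in range(D)]
    y[0] += 1.0  # add the signal X = (1, 0, ..., 0)
    norm_y = math.sqrt(sum(v * v for v in y))
    # ||X|| = 1 and X . Y = y_1, so the cosine reduces to y_1 / ||Y||
    return y[0] / norm_y


def cosine_moments(D, sigma, n_samples=20000, seed=0):
    """Monte Carlo estimate of the mean and variance of the cosine similarity."""
    rng = random.Random(seed)
    draws = [sample_cosine(D, sigma, rng) for _ in range(n_samples)]
    return statistics.fmean(draws), statistics.pvariance(draws)


if __name__ == "__main__":
    for sigma in (0.0, 0.1, 1.0, 1000.0):
        mean, var = cosine_moments(D=10, sigma=sigma)
        print(f"sigma={sigma:g}: mean={mean:.4f}, variance={var:.4f}")
```

With $\sigma=0$ every draw is exactly 1, and for very large $\sigma$ the estimates should sit near mean 0 and variance $1/D$. This only gives numerical moments for a chosen $D$ and $\sigma$, not the closed form I am after.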

Best Answer

Assuming that $\sigma$ is the standard deviation, and that the normal random variables involved are independent, this can be worked up to a point as follows:

$$||Y|| = \left(y_1^2+y_2^2+...+y_D^2\right)^{1/2} = \sigma\left(\frac{y_1^2+y_2^2+...+y_D^2}{\sigma^2}\right)^{1/2}$$

$$=\sigma\left(\left(\frac{y_1}{\sigma}\right)^2+\sum_{i=2}^{D}\left(\frac{y_i}{\sigma}\right)^2 \right)^{1/2} $$

Now $$W\equiv \left(\frac{y_1}{\sigma}\right)^2 \sim \mathcal \chi_{NC}^2(1;1/\sigma^2),\qquad Z\equiv \sum_{i=2}^{D}\left(\frac{y_i}{\sigma}\right)^2 \sim \mathcal \chi^2(D-1;0)$$

i.e. the r.v. $\left(\frac{y_1}{\sigma}\right)^2$ follows a non-central chi-square distribution with one degree of freedom and non-centrality parameter $1/\sigma^2$, while the sum follows a (central) chi-square distribution with $D-1$ degrees of freedom. So

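$$ \frac{y_1}{||Y||} = \frac{y_1}{\sigma\cdot (W+Z)^{1/2}} = \operatorname{sgn}(y_1)\frac {W^{1/2}}{(W+Z)^{1/2}} = \operatorname{sgn}(y_1)\left(\frac {W}{W+Z} \right)^{1/2}$$

The sign factor is needed because $y_1/\sigma = \operatorname{sgn}(y_1)\,W^{1/2}$ can be negative; it matters less for small $\sigma$, where $y_1 > 0$ with high probability.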
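The distributional claims for $W$ and $Z$ can be sanity-checked by simulation. A non-central chi-square variable with one degree of freedom and non-centrality parameter $\lambda$ has mean $1+\lambda$, and a central chi-square variable with $D-1$ degrees of freedom has mean $D-1$. The sketch below (plain Python; the names `sample_w_z` and `mean_w_z` and the parameter values are illustrative assumptions, not part of the derivation) compares sample means against these values.

```python
import random
import statistics


def sample_w_z(D, sigma, rng):
    """Draw one (W, Z) pair as defined above, for X = (1, 0, ..., 0)."""
    y = [rng.gauss(0.0, sigma) for _ in range(D)]
    y[0] += 1.0  # y_1 ~ N(1, sigma); the other coordinates are pure noise
    w = (y[0] / sigma) ** 2                    # claimed: non-central chi-square(1; 1/sigma^2)
    z = sum((v / sigma) ** 2 for v in y[1:])   # claimed: central chi-square(D - 1)
    return w, z


def mean_w_z(D, sigma, n_samples=20000, seed=1):
    """Monte Carlo estimates of E[W] and E[Z]."""
    rng = random.Random(seed)
    pairs = [sample_w_z(D, sigma, rng) for _ in range(n_samples)]
    mean_w = statistics.fmean([p[0] for p in pairs])
    mean_z = statistics.fmean([p[1] for p in pairs])
    return mean_w, mean_z


if __name__ == "__main__":
    D, sigma = 10, 0.5
    mean_w, mean_z = mean_w_z(D, sigma)
    print("E[W] estimate:", mean_w, "theory:", 1 + 1 / sigma**2)
    print("E[Z] estimate:", mean_z, "theory:", D - 1)
```

Matching means do not prove the distributional form, of course; this is only a consistency check.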

The main problem here is that the numerator and the denominator are not independent. At that point one is tempted to define the random variable

$$U\equiv \left(\frac{y_1}{||Y||}\right)^{-2} = 1+ \frac {Z}{W} = 1+(D-1)\frac {[Z/(D-1)]}{W}$$

The variable $\frac {[Z/(D-1)]}{W}$ is the reciprocal of a non-central $F$ variable with $(1, D-1)$ degrees of freedom and non-centrality parameter $1/\sigma^2$, which is a special case of the doubly non-central $F$-distribution. In any case, the matter remains complicated, and even more so because of the need to revert back to the original random variable of interest, the cosine similarity.
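As a rough illustration of what "reverting back" involves, the sketch below checks sample by sample that $U = 1 + Z/W$ reproduces the squared cosine similarity, and that the cosine itself is recovered as $\operatorname{sgn}(y_1)\,U^{-1/2}$. The sign has to be carried separately, since squaring discards it. The function name and parameter values are hypothetical choices for this example.

```python
import math
import random


def check_identities(D, sigma, n_samples=2000, seed=2):
    """Return the largest absolute discrepancy found across the samples."""
    rng = random.Random(seed)
    max_err = 0.0
    for _ in range(n_samples):
        y = [rng.gauss(0.0, sigma) for _ in range(D)]
        y[0] += 1.0  # signal X = (1, 0, ..., 0) plus noise
        cos = y[0] / math.sqrt(sum(v * v for v in y))
        w = (y[0] / sigma) ** 2
        z = sum((v / sigma) ** 2 for v in y[1:])
        u = 1.0 + z / w  # U = (cosine similarity)^(-2)
        # Recover the cosine from U, restoring the sign of y_1 explicitly
        rebuilt = math.copysign(1.0 / math.sqrt(u), y[0])
        max_err = max(max_err,
                      abs(cos - rebuilt),
                      abs(cos * cos - w / (w + z)))
    return max_err


if __name__ == "__main__":
    print("largest discrepancy:", check_identities(D=10, sigma=1.0))
```

The discrepancy should be at the level of floating-point rounding. So a distribution for $U$ pins down the magnitude of the cosine similarity, while the probability of a negative value, $P(y_1 < 0)$, has to be handled on top of that.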
