At the end of the function you take the arccosine of the computed score. According to the definition (see, for example, the Wikipedia page on cosine similarity), you should not.
If you want the dissimilarity, I think you should just do
return (1 - sum0 / ( sqrt(sum1) * sqrt(sum2) ));
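Since the language of the original snippet isn't shown, here is a sketch of the corrected dissimilarity in Python, keeping the `sum0`/`sum1`/`sum2` names from your code:

```python
import math

def cosine_dissimilarity(x, y):
    """1 - cosine similarity; assumes x and y are equal-length, nonzero vectors."""
    sum0 = sum(a * b for a, b in zip(x, y))  # dot product <x, y>
    sum1 = sum(a * a for a in x)             # squared norm of x
    sum2 = sum(b * b for b in y)             # squared norm of y
    return 1 - sum0 / (math.sqrt(sum1) * math.sqrt(sum2))
```

Identical vectors give $0$, orthogonal vectors give $1$, no arccosine involved.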
The similarity score will always be within $[-1,1]$, by direct application of the Cauchy-Schwarz inequality. If you want it to be within $[0,1]$ you can take the square or the absolute value. In fact, given your input, the similarity will always lie in $[0,1]$, because all your values are positive.
By taking the arccosine you get an angle in radians between $0$ and $\pi$. There is nothing to gain by taking the arccosine and dividing by $\pi$, and the result is not what most people call the cosine similarity.
A distance must satisfy the axioms of a metric:
1. $d(x,y)>0$ if $x\neq y$, and $d(x,x)=0$.
2. $d(x,y)=d(y,x)$.
3. $d(x,z)\leq d(x,y)+d(y,z)$.
The third is known as the triangle inequality. A dissimilarity satisfies only 1. and 2.
First of all, in many applications you do not need a distance metric; a dissimilarity will be fine. So make sure the triangle inequality is actually needed.
In mathematics, the triangle inequality is part of the definition of a metric, and "distance" is synonymous with "metric". In the database literature, however, distances are often not required to be metrics.
Second, we cannot recommend a metric for your data without knowing your data.
Third, Cosine is closely related to Euclidean distance. Assuming that all your data is normalized to unit length ($||x||=1=||y||$), then
\begin{align*}
\text{Euclid}^2(x,y)&=\sum_i (x_i-y_i)^2\\
&=\sum_i x_i^2+\sum_i y_i^2-2\sum_i x_iy_i\\
&=1+1-2\cdot x\cdot y\\
&=2(1-x\cdot y)
\end{align*}
Therefore, if your data is normalized to unit length,
$$
\sqrt{1-x\cdot y}
$$
is a metric. Because as just shown, $\sqrt{1-x\cdot y}=\sqrt{\frac{1}{2}}\text{Euclid}(x,y)$.
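A quick numerical check of this identity, sketched in Python (vectors normalized to unit length first, as assumed above):

```python
import math
import random

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

random.seed(0)
x = [random.random() for _ in range(5)]
y = [random.random() for _ in range(5)]

# Normalize both vectors to unit length.
nx = math.sqrt(dot(x, x))
ny = math.sqrt(dot(y, y))
x = [a / nx for a in x]
y = [a / ny for a in y]

euclid = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
metric = math.sqrt(1 - dot(x, y))            # sqrt(1 - x.y)
print(abs(metric - euclid / math.sqrt(2)))   # agrees up to rounding
```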
While this may get you excited that there is a metric based on the dot product, recall that it only holds if all your data lies on the unit sphere, and that it is then just the Euclidean metric. If this is the behaviour you want, normalize your data and use Euclidean distance. Cosine distance is exactly this normalization: the terms for the vector lengths ensure the vectors are effectively of unit length.
If your data is sparse, and you can afford to keep all vector lengths in memory, then this may be a faster way to compute Euclidean distance. If you have a sparsity of $s$, the expected sparsity of the dot product is $s^2$, so this can yield a substantial performance benefit of $1/s$, if you have a good implementation.
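As a sketch of that idea in Python (the dict-based sparse representation and the function names here are my own, not from the question):

```python
import math

def sparse_dot(x, y):
    """Dot product of sparse vectors stored as {index: value} dicts.
    Only indices present in both vectors contribute."""
    if len(x) > len(y):
        x, y = y, x  # iterate over the smaller vector
    return sum(v * y[i] for i, v in x.items() if i in y)

def euclid_from_dot(x, y, sqnorm_x, sqnorm_y):
    """Euclidean distance via ||x||^2 + ||y||^2 - 2<x,y>; the squared
    norms are assumed precomputed and kept in memory."""
    return math.sqrt(max(sqnorm_x + sqnorm_y - 2 * sparse_dot(x, y), 0.0))

x = {0: 1.0, 3: 2.0}  # sparse form of (1, 0, 0, 2, 0, 0)
y = {3: 2.0, 5: 1.0}  # sparse form of (0, 0, 0, 2, 0, 1)
d = euclid_from_dot(x, y, 5.0, 5.0)
print(d)              # same answer as the dense computation: sqrt(2)
```

The dot product only touches indices where both vectors are nonzero, which is where the $s^2$ expected sparsity comes from.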
Update: it was pointed out to me that computing Euclidean this way can suffer from a numerical instability called "catastrophic cancellation".
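A small illustration of that instability (a sketch): for two nearly identical unit vectors, the dot-product route can round $1 - x\cdot y$ to exactly zero, while the direct sum of squared differences retains the distance.

```python
import math

eps = 1e-9
x = [1.0, 0.0]
y = [math.cos(eps), math.sin(eps)]  # unit vector at a tiny angle eps

# Direct Euclidean distance: keeps the small difference.
direct = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Via the dot product: x.y rounds to exactly 1.0 in double precision,
# so 2 * (1 - x.y) cancels catastrophically to 0.
dotxy = sum(a * b for a, b in zip(x, y))
via_dot = math.sqrt(max(2 * (1 - dotxy), 0.0))

print(direct, via_dot)  # direct is about 1e-9, via_dot collapses to 0.0
```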
$$\text{cos-dist}(A, B) = 1 - \text{cos-sim}(A, B)$$ $$\text{cos-sim}(A, B) = \frac{\langle A, B \rangle}{||A|| \cdot ||B||} = \frac{\sum\limits_{i=1}^n A_i \cdot B_i}{\sqrt{\sum\limits_{i=1}^n A_i^2} \cdot \sqrt{\sum\limits_{i=1}^n B_i^2}}$$
The triangle inequality for cosine distance would take the form (as we will see, it does not always hold): $$\text{cos-dist}(A,C) \leq \text{cos-dist}(A, B) + \text{cos-dist}(B, C)$$ which is equivalent to: $$1 - \text{cos-sim}(A,C) \leq 1 - \text{cos-sim}(A, B) + 1 - \text{cos-sim}(B, C)$$ and after simple transformations: $$1 + \text{cos-sim}(A, C) \geq \text{cos-sim}(A, B) + \text{cos-sim}(B, C)$$
Now, you're trying to find three vectors $A$, $B$ and $C$ such that: $$1 + \text{cos-sim}(A, C) < \text{cos-sim}(A, B) + \text{cos-sim}(B, C)$$
Let $A, B, C \in \mathbb{R}^2$, all of unit length: $A = [1, 0]$, $B = \left[\frac{\sqrt{2}}{2}, \frac{\sqrt{2}}{2}\right]$, $C = [0, 1]$. Vectors $A$ and $C$ are orthogonal, so their similarity is simply $0$: $$\text{cos-sim}(A, C) = \frac{0}{\sqrt{1}\sqrt{1}} = 0$$ The pairs $A$ & $B$ and $B$ & $C$ each give the same value: $$ \text{cos-sim}(A, B) = \frac{\frac{\sqrt{2}}{2} + 0}{\sqrt{1}\sqrt{1}} = \frac{\sqrt{2}}{2},~~~ \text{cos-sim}(B, C) = \frac{0+\frac{\sqrt{2}}{2}}{\sqrt{1}\sqrt{1}} = \frac{\sqrt{2}}{2}$$ Finally, the required inequality holds: $$ 1 + 0 < \frac{\sqrt{2}}{2} + \frac{\sqrt{2}}{2}$$ $$ 1 < \sqrt{2} \approx 1.41 \dots$$ so the triangle inequality is violated.
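The counterexample can be verified numerically; a quick Python sketch:

```python
import math

def cos_sim(a, b):
    """Cosine similarity: <a, b> / (||a|| * ||b||)."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))

A = [1.0, 0.0]
B = [math.sqrt(2) / 2, math.sqrt(2) / 2]
C = [0.0, 1.0]

lhs = 1 + cos_sim(A, C)              # 1 + 0 = 1
rhs = cos_sim(A, B) + cos_sim(B, C)  # sqrt(2)/2 + sqrt(2)/2 = sqrt(2)
print(lhs < rhs)                     # True: the triangle inequality fails
```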