In the Spark implementation of word2vec, when the number of iterations or data partitions is greater than one, the cosine similarity returned is, for some reason, greater than 1.
As far as I know, cosine similarity should always satisfy $-1 \le \cos\theta \le 1$. Does anyone know why?
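For context, the bound on cosine similarity follows directly from the Cauchy–Schwarz inequality: for any vectors $u$ and $v$,

$$\cos\theta = \frac{u \cdot v}{\|u\|\,\|v\|}, \qquad |u \cdot v| \le \|u\|\,\|v\| \;\Longrightarrow\; -1 \le \cos\theta \le 1,$$

so any reported value outside $[-1, 1]$ cannot be a true cosine similarity.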
Best Answer
The Spark documentation for this kind of thing doesn't seem very thorough, so I looked at the source. There's a comment there explaining the behavior, and the surrounding code is consistent with it.
So, it seems that `findSynonyms` doesn't actually return cosine similarities, but rather cosine similarities multiplied by the norm of the query vector. The ordering and relative values are consistent with the true cosine similarity, but the actual values are all scaled. I'm not sure why the number of iterations or data partitions should have any bearing on this.
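The described behavior can be sketched in plain NumPy, independent of Spark. This is a minimal illustration, assuming (as the answer suggests) that the stored word vectors are unit-normalized while the query vector is not: the returned scores are then $\cos\theta \cdot \|q\|$, which can exceed 1, and dividing by the query norm recovers true cosine similarities. The vectors and names here are purely illustrative, not taken from the Spark source.

```python
import numpy as np

# Hypothetical word vectors (one per row) and a query vector.
word_vecs = np.array([[1.0, 2.0, 0.5],
                      [0.3, -1.0, 2.0],
                      [2.0, 0.1, 0.1]])
query = np.array([2.0, 3.0, 1.0])   # norm > 1, so scaled scores can exceed 1

# Normalize the stored word vectors but leave the query unnormalized --
# this mimics the behavior described in the answer above.
unit_words = word_vecs / np.linalg.norm(word_vecs, axis=1, keepdims=True)
raw_scores = unit_words @ query          # = cos(theta) * ||query||

# Dividing by the query norm recovers true cosine similarities in [-1, 1].
cosines = raw_scores / np.linalg.norm(query)

print(raw_scores.max())   # exceeds 1
print(cosines)            # all within [-1, 1]
```

Because the scaling factor $\|q\|$ is positive and the same for every word, the ranking of "synonyms" is unaffected; only the absolute values are off.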