Solved – Spark MLlib’s Word2Vec cosine similarity greater than 1

Tags: artificial intelligence, spark-mllib, word embeddings, word2vec

In the Spark implementation of word2vec, when the number of iterations or data partitions is greater than one, the cosine similarity is, for some reason, greater than 1.

To my knowledge, cosine similarity should always satisfy $-1 \le \cos\theta \le 1$. Does anyone know why?
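For reference, the textbook definition of cosine similarity, which is bounded by the Cauchy–Schwarz inequality (the vectors here are made up for illustration):

```python
import numpy as np

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (||a|| * ||b||); always in [-1, 1]
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([3.0, 4.0])
b = np.array([4.0, 3.0])
print(cosine_similarity(a, b))  # 0.96
```

Any value outside [-1, 1] means whatever was computed is not the plain cosine of an angle.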

Best Answer

The Spark documentation for this kind of thing doesn't seem very thorough, so I looked at the source. There's a comment here saying:

// Need not divide with the norm of the given vector since it is constant.

This seems consistent with the code that follows that comment in the source.

So, it seems that findSynonyms doesn't actually return cosine similarities, but rather cosine similarities multiplied by the norm of the query vector. The ordering and relative ranking are consistent with the true cosine similarity, but the actual values are all scaled by that norm, which is why they can exceed 1.
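The scaling described above can be sketched in plain numpy (the word vectors here are hypothetical, not taken from Spark): skipping the division by the query norm yields scores that can exceed 1, yet dividing them back by that norm recovers the true cosine and leaves the ranking unchanged.

```python
import numpy as np

# Hypothetical word vectors, for illustration only.
vocab = {
    "king":  np.array([2.0, 1.0, 0.0]),
    "queen": np.array([1.8, 1.2, 0.1]),
    "apple": np.array([0.1, 0.0, 2.0]),
}
query = np.array([2.0, 1.0, 0.2])

def scaled_score(q, v):
    # Dot product divided only by ||v||, i.e. cos(theta) * ||q|| --
    # the scaled value the answer says findSynonyms returns.
    return float(np.dot(q, v) / np.linalg.norm(v))

def true_cosine(q, v):
    # Dividing by the (constant) query norm recovers the real cosine.
    return scaled_score(q, v) / np.linalg.norm(q)

for w, v in vocab.items():
    print(w, scaled_score(query, v), true_cosine(query, v))
```

Since every score is divided by the same constant, sorting by the scaled score or by the true cosine gives the same synonym ranking; only the absolute values differ.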

Not sure why the number of iterations or data partitions should have any bearing on this.