If the downstream applications only care about the direction of the word vectors (e.g. they only pay attention to the cosine similarity between two words), then normalize and forget about length.

However, if the downstream applications are able to (or need to) take additional aspects into account, such as word *significance* or *consistency* in word usage (see below), then normalization might not be such a good idea.

From Levy et al., 2015 (and, actually, most of the literature on word embeddings):

> Vectors are normalized to unit length before they are used for similarity calculation, making cosine similarity and dot-product equivalent.
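To make that equivalence concrete, here is a small numpy sketch; the two vectors are made-up toy values, not from any trained model:

```python
import numpy as np

# Two toy "word vectors" (made-up values, purely for illustration).
u = np.array([3.0, 4.0])
v = np.array([1.0, 2.0])

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Normalize each vector to unit length.
u_hat = u / np.linalg.norm(u)
v_hat = v / np.linalg.norm(v)

# After normalization, the plain dot product equals the cosine
# similarity of the original vectors; the original lengths
# (5.0 and sqrt(5)) are discarded in the process.
assert np.isclose(np.dot(u_hat, v_hat), cosine(u, v))
```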

Also from Wilson and Schakel, 2015:

> Most applications of word embeddings explore not the word vectors themselves, but relations between them to solve, for example, similarity and word relation tasks. For these tasks, it was found that using normalised word vectors improves performance. Word vector length is therefore typically ignored.

Normalizing is equivalent to losing the notion of length: once you normalize the word vectors, you forget the length (norm, modulus) they had right after the training phase.

However, *sometimes* it's worth taking the original length of the word vectors into consideration.

Schakel and Wilson, 2015 observed some interesting facts regarding the
length of word vectors:

> A word that is consistently used in a similar context will be represented by a longer vector than a word of the same frequency that is used in different contexts.

> Not only the direction, but also the length of word vectors carries important information.

> Word vector length furnishes, in combination with term frequency, a useful measure of word significance.
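As a loose illustration of that last point only: the vectors, the frequencies, and the way `significance` combines them below are all made up, and are *not* the exact measure defined in the paper. The point is simply that a frequent function word with a short vector can score lower than a rarer content word with a long vector:

```python
import numpy as np

# Hypothetical trained vectors and corpus frequencies (invented numbers).
# "the" appears everywhere, so its vector is short; "quantum" is used
# consistently in similar contexts, so its vector is long.
vectors = {
    "the": np.array([0.1, 0.05]),
    "quantum": np.array([2.5, 1.8]),
}
term_freq = {"the": 100_000, "quantum": 150}

def significance(word):
    # Illustrative combination only (NOT Wilson & Schakel's formula):
    # raw vector length, weighted by log term frequency.
    return np.linalg.norm(vectors[word]) * np.log1p(term_freq[word])

# The long-vector content word outscores the short-vector function word
# despite being far less frequent.
assert significance("quantum") > significance("the")
```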

The Spark documentation for this kind of thing doesn't seem very thorough, so I looked at the source. There's a comment here saying:

```scala
// Need not divide with the norm of the given vector since it is constant.
```

This seems consistent with the following code.

So, it seems that `findSynonyms` doesn't actually return cosine distances, but rather cosine distances times the norm of the query vector. The ordering and relative values are consistent with the true cosine distance, but the actual values are all scaled.
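Assuming, as the quoted comment suggests, that the candidate word vectors are unit-normalized while the query vector is not, a toy sketch (all values invented) shows why dividing by the query norm recovers the true cosine and why the ranking is unaffected either way:

```python
import numpy as np

# A made-up query vector and candidate word vectors.
query = np.array([2.0, 1.0, 0.0])
candidates = np.array([[1.0, 0.0, 0.0],
                       [0.0, 1.0, 0.0],
                       [1.0, 1.0, 0.0]])

# Unit-normalize the candidates (as the stored model vectors would be).
cand_hat = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)

# What findSynonyms-style code returns: cosine times the query norm.
scaled = cand_hat @ query

# Dividing by the (constant, positive) query norm gives true cosines.
true_cos = scaled / np.linalg.norm(query)

# Scaling by a positive constant never changes the ranking.
assert (np.argsort(-scaled) == np.argsort(-true_cos)).all()
```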

Not sure why the number of iterations or data partitions should have any bearing on this.

## Best Answer

If you normalize the features, the dot product is the same as cosine similarity. As for the general question, there is usually no single approach that you should always use. Usually the choice is empirical: you try different metrics and compare the results. Sometimes, in practice, the choice is arbitrary: if the candidate metrics give very similar results, you just pick one.