Solved – Should I normalize word2vec’s word vectors before using them

natural language, word embeddings, word2vec

After training word vectors with word2vec, is it better to normalize them before using them for some downstream applications? I.e., what are the pros and cons of normalizing them?

Best Answer

When the downstream applications only care about the direction of the word vectors (e.g. they only pay attention to the cosine similarity of two words), then normalize, and forget about length.

However, if the downstream applications are able to (or need to) take other aspects into account, such as word significance or consistency in word usage (see below), then normalization might not be such a good idea.


From Levy et al., 2015 (and, actually, most of the literature on word embeddings):

Vectors are normalized to unit length before they are used for similarity calculation, making cosine similarity and dot-product equivalent.
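
As a concrete illustration, here is a minimal numpy sketch, with random vectors standing in for trained embeddings (nothing here is specific to any particular word2vec implementation): once the vectors are L2-normalized, cosine similarity and dot product return the same number.

    # After L2-normalization, cosine similarity and dot product coincide.
    import numpy as np

    rng = np.random.default_rng(0)
    vectors = rng.normal(size=(5, 50))          # stand-in for 5 trained word vectors
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)

    v, w = unit[0], unit[1]
    cosine = v @ w / (np.linalg.norm(v) * np.linalg.norm(w))
    dot = v @ w
    assert np.isclose(cosine, dot)              # identical once vectors have unit length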

Also from Wilson and Schakel, 2015:

Most applications of word embeddings explore not the word vectors themselves, but relations between them to solve, for example, similarity and word relation tasks. For these tasks, it was found that using normalised word vectors improves performance. Word vector length is therefore typically ignored.

Normalizing is equivalent to losing the notion of length. That is, once you normalize the word vectors, you discard the length (norm) they had right after the training phase.

However, sometimes it is worth taking the original length of the word vectors into consideration.
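
If you want to keep both options open, one simple approach is to store the original norms alongside the normalized vectors. A minimal sketch, again with a random matrix standing in for a trained (vocab_size, dim) embedding matrix:

    import numpy as np

    rng = np.random.default_rng(1)
    vectors = rng.normal(size=(1000, 100))        # stand-in for a (vocab_size, dim) embedding matrix

    lengths = np.linalg.norm(vectors, axis=1)     # original norm, one per word
    unit_vectors = vectors / lengths[:, None]     # normalized copy, used for cosine-based tasks
    # `lengths` keeps exactly the information that normalization discards.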

Schakel and Wilson, 2015 observed some interesting facts regarding the length of word vectors:

A word that is consistently used in a similar context will be represented by a longer vector than a word of the same frequency that is used in different contexts.

Not only the direction, but also the length of word vectors carries important information.

Word vector length furnishes, in combination with term frequency, a useful measure of word significance.
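
The excerpt does not spell out Schakel and Wilson's exact significance measure, but as a rough, hypothetical sketch of the idea, one could compare a word's vector length against the lengths of other words with a similar corpus frequency (all names and numbers below are made up for illustration):

    # A rough, hypothetical illustration of the idea (not Schakel and Wilson's
    # exact measure): compare a word's vector length against the lengths of
    # other words with a similar corpus frequency.
    import numpy as np

    rng = np.random.default_rng(2)
    vocab = [f"word{i}" for i in range(1000)]             # hypothetical vocabulary
    vectors = rng.normal(size=(len(vocab), 100))          # stand-in trained embeddings
    counts = rng.integers(1, 10_000, size=len(vocab))     # hypothetical term frequencies

    lengths = np.linalg.norm(vectors, axis=1)
    order = np.argsort(counts)                            # words ranked by frequency

    # Within a window of words of comparable frequency, a longer-than-average
    # vector suggests the word is used consistently (one dominant context).
    window = 50
    significance = np.empty(len(vocab))
    for rank, idx in enumerate(order):
        lo, hi = max(0, rank - window), min(len(vocab), rank + window)
        peers = order[lo:hi]
        significance[idx] = lengths[idx] / lengths[peers].mean()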