Solved – How to use multi-word tokens with gensim word2vec

word2vec

I'm using the pre-trained word2vec model lexvec.enwiki+newscrawl.300d.W.pos.vectors with gensim.

this model "knows" a lot of words, but it doesn't know things like this: "great britain" or "star fruit"

How can I use phrases in my case?

Best Answer

The simplest and most effective way to handle this case, provided the expression occurs often enough in your training corpus, is to transform every occurrence in the corpus into something that looks like a single word before training your model.

If we train word2vec on the sentence "I visited Great Britain", it will update vectors for I, visited, Great, and Britain. If we transform this sentence into "I visited Great_Britain", it will update vectors for I, visited, and Great_Britain.
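For illustration, here is a minimal gensim sketch (assuming gensim 4.x and a toy two-sentence corpus) showing that the underscore-joined phrase gets its own vocabulary entry and vector:

```python
from gensim.models import Word2Vec

# Toy corpus with the phrase already joined; real training needs far more data.
sentences = [
    ["I", "visited", "Great_Britain"],
    ["I", "visited", "France"],
]

model = Word2Vec(sentences, vector_size=50, min_count=1)
print("Great_Britain" in model.wv)  # True: the phrase now has its own vector
```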

Popular embedding models such as word2vec, GloVe, and LexVec tokenize on whitespace, so anything between whitespace characters is treated as a single word. In the example above I used an underscore to turn Great Britain into a single token, but you can use any non-whitespace character you like.

The trick is then identifying these special expressions (Great Britain, star fruit) in your training corpus. There are different ways of doing this, with differing complexity. As you seem to be new to the field, I recommend sticking to a dictionary approach: keep a dictionary of phrases for which you'd like individual vector representations, go over your corpus joining the constituent words with underscores (a sketch follows below), and then train your embedding.
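A sketch of that dictionary approach; the phrase list and corpus here are hypothetical stand-ins for your own data:

```python
import re

phrases = ["great britain", "star fruit"]

# Longest phrases first, so longer expressions are joined before any
# sub-phrases they contain. Note IGNORECASE means matched phrases are
# rewritten in the dictionary's lowercase form.
patterns = [
    (re.compile(r"\b" + re.escape(p) + r"\b", re.IGNORECASE), p.replace(" ", "_"))
    for p in sorted(phrases, key=len, reverse=True)
]

def join_phrases(text):
    """Replace each dictionary phrase with an underscore-joined token."""
    for pattern, replacement in patterns:
        text = pattern.sub(replacement, text)
    return text

corpus = ["I visited Great Britain", "star fruit is delicious"]
tokenized = [join_phrases(line).split() for line in corpus]
# [['I', 'visited', 'great_britain'], ['star_fruit', 'is', 'delicious']]
```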

Another approach, which might work for phrases such as "star fruit", is additive composition: to obtain a representation for "star fruit", element-wise add the vectors of "star" and "fruit". I would only try this approach for phrases that are very rarely observed in the training corpus and for which this composition makes sense.
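A minimal sketch of additive composition, assuming the LexVec file from the question is in word2vec text format and is loaded as gensim KeyedVectors:

```python
from gensim.models import KeyedVectors

wv = KeyedVectors.load_word2vec_format(
    "lexvec.enwiki+newscrawl.300d.W.pos.vectors", binary=False
)

# Element-wise sum of the constituent word vectors approximates the phrase.
star_fruit = wv["star"] + wv["fruit"]

# Inspect which vocabulary words sit closest to the composed vector.
print(wv.similar_by_vector(star_fruit, topn=5))
```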
