Solved – Word Vectors in Word2Vec

Tags: deep learning, machine learning, sentiment analysis, word embeddings, word2vec

I am trying to do sentiment analysis. To convert the words to word vectors, I am using the word2vec model from the gensim package. Suppose I have all my sentences in a list named 'sentences' and I pass them to word2vec as follows:

from gensim.models import word2vec

model = word2vec.Word2Vec(sentences, workers=4, min_count=40, size=300, window=5, sample=1e-3)

Since I am new to word vectors, I have two questions.

1. Setting the number of features to 300 defines the dimensionality of each word vector. But what do these features signify? If each word in this model is represented by a 1×300 NumPy array, what do those 300 values mean for that word?

2. What does the downsampling controlled by the 'sample' parameter in the above model actually do?

Thanks in advance.

Best Answer

The individual features don't represent anything in particular: it's not as though you can say, "the third component is high, so this word must be an animal." What is meaningful is similarity: words that appear in similar contexts tend to end up with similar vectors. The vectors also often satisfy the analogy properties you may have heard of, e.g. that (woman - man) + king is closest to queen. But again, you can't read anything off a single feature; you can only characterize a vector by looking at which words' vectors are near it.
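To make that concrete, here is a minimal sketch of how you might probe those similarities with a trained gensim model. It assumes a model trained as in your question; the specific words are just illustrative and must have survived the min_count=40 cutoff, and depending on your gensim version you may need model.most_similar(...) instead of model.wv.most_similar(...):

from gensim.models import word2vec

model = word2vec.Word2Vec(sentences, workers=4, min_count=40, size=300, window=5, sample=1e-3)

# Words used in similar contexts end up with similar vectors,
# so this cosine similarity should be high for near-synonyms.
print(model.wv.similarity('good', 'great'))

# The classic analogy as vector arithmetic: (woman - man) + king.
print(model.wv.most_similar(positive=['woman', 'king'], negative=['man'], topn=1))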


As to the sample parameter, the documentation says:

sample = threshold for configuring which higher-frequency words are randomly downsampled; default is 1e-3, useful range is (0, 1e-5).

(It's interesting that the default value is outside the "useful range"...)

When you train word2vec, it samples a lot of pairs of words, effectively saying "words $A$ and $B$ often appear together" and "words $A$ and $C$ don't." The sample argument makes it choose very frequent words less often during that sampling process, but the behaviour is underdocumented. The relevant bit of source code appears to be here, if you want to see exactly what's going on.
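As far as I can tell, the rule matches the original word2vec C implementation: each occurrence of a word is kept with a probability that shrinks as the word's corpus frequency rises above the sample threshold. Here is a rough sketch of that rule; the function name is mine, and this is my reading of the code rather than an official API:

import math

def keep_probability(word_count, total_words, sample=1e-3):
    # Fraction of the corpus made up by this word.
    f = word_count / total_words
    # Subsampling rule from the original word2vec C code:
    # words with f well above the sample threshold are mostly dropped.
    p = (math.sqrt(f / sample) + 1) * (sample / f)
    return min(p, 1.0)

# A word making up 1% of a 1M-word corpus is kept only ~42% of the time:
print(keep_probability(10000, 1000000))  # ~0.416

Note that any word with frequency at or below the sample threshold gets $p \geq 1$ and is never dropped, so the net effect is purely to thin out very common words like "the".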