Solved – Different models with gensim Word2Vec in Python

natural language, python, word embeddings, word2vec

I am trying to apply the word2vec model implemented in the gensim library in Python. I have a list of sentences (each sentence is a list of words).

For instance let us have:

sentences=[['first','second','third','fourth']]*n

I implement two identical models:

model = gensim.models.Word2Vec(sentences, min_count=1, size=2)
model2 = gensim.models.Word2Vec(sentences, min_count=1, size=2)

I notice that the two models are sometimes the same and sometimes different, depending on the value of n.

For instance, if n=1000 I obtain

print(model['first']==model2['first'])
True

while, for n=5000:

print(model['first']==model2['first'])
False

How is this possible?

Thank you very much!

Best Answer

The training phase of gensim is not fully deterministic. Take a look at the source code on GitHub; you will find many uses of random operations.

For example, gensim has an argument called sample (1/1000 by default), which is the threshold for deciding which higher-frequency words are randomly down-sampled. In your case, this argument suddenly becomes active when n > 1000.
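If you want the two runs to coincide, you can try pinning gensim's sources of randomness. The sketch below assumes the same older-style keywords as your snippet (size rather than the newer vector_size); per the gensim documentation, a fully reproducible run also needs a single worker thread, and reproducibility across interpreter launches additionally requires a fixed PYTHONHASHSEED environment variable:

import gensim

sentences = [['first', 'second', 'third', 'fourth']] * 5000

# seed fixes the random number generator, workers=1 removes thread-ordering
# jitter, and sample=0 disables the frequency-based down-sampling above.
model = gensim.models.Word2Vec(sentences, min_count=1, size=2,
                               seed=42, workers=1, sample=0)
model2 = gensim.models.Word2Vec(sentences, min_count=1, size=2,
                                seed=42, workers=1, sample=0)

print((model['first'] == model2['first']).all())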


The effect of gensim not being fully deterministic is amplified by the following phenomenon (among others):

When you train a model, you learn a set of parameters. These parameters consist of the actual word vectors as well as other weights.

For learning the parameters of the model (in the training phase), gensim / word2vec uses a technique called (stochastic) gradient descent. This optimization technique searches for the parameters that best fit the training data by minimizing a function that quantifies how poorly the current parameters perform. This function is called the cost (objective) function.
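To make the stochastic part concrete, here is a toy sketch (not gensim's actual code) of stochastic gradient descent on a one-parameter model; the random order in which the examples are visited changes the path taken and the final parameter value:

import random

def sgd(data, lr=0.025, epochs=5, seed=None):
    # Minimise the squared error (w*x - y)^2 over the data points.
    rng = random.Random(seed)
    w = 0.0
    for _ in range(epochs):
        rng.shuffle(data)               # stochastic part: random visiting order
        for x, y in data:
            grad = 2 * x * (w * x - y)  # gradient of (w*x - y)^2 w.r.t. w
            w -= lr * grad              # one gradient step
    return w

data = [(1, 3.1), (2, 5.8), (3, 9.2), (4, 11.9)]  # roughly y = 3*x
print(sgd(list(data), seed=1))
print(sgd(list(data), seed=2))  # a slightly different value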

And this function is very complex (besides the fact that we do not know its shape in advance). In the training phase we have to explore it gradually and iteratively by means of the optimization technique. It is hard for the aforementioned gradient descent to find the global minimum of the objective function, which would correspond to finding the best parameters of the model.

At some point during the training phase, the parameters of the model might already be good enough; after some more training they might get worse, and then improve again after seeing further training examples.

See the following image from Wikipedia, which depicts fluctuations in the total objective function as gradient steps with respect to mini-batches are taken:

[Figure: Fluctuations in the total objective function as gradient steps over mini-batches are taken]

The training phase is also quite sensitive to a training hyper-parameter called the learning rate. Gensim / word2vec lets you control the learning rate via the alpha argument, which is 0.025 by default.
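For completeness, here is a hedged sketch of setting it explicitly; gensim decays the rate from alpha towards min_alpha over the course of training (keyword names may vary slightly between gensim versions):

import gensim

sentences = [['first', 'second', 'third', 'fourth']] * 5000

# Start the learning rate at 0.025 (the default) and let gensim decay it
# towards min_alpha as training progresses.
model = gensim.models.Word2Vec(sentences, min_count=1, size=2,
                               alpha=0.025, min_alpha=0.0001)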