Solved – How is the .similarity method in spaCy computed

Tags: natural language, nltk, tf-idf, word2vec

Not sure if this is the right Stack site, but here goes.

How does the .similarity method work?

Wow, spaCy is great! Its tf-idf model could be easier, but w2v with only one line of code?!

In his 10-line tutorial on spaCy, andrazhribernik shows us the .similarity method, which can be run on tokens, sents, word chunks, and docs.

After nlp = spacy.load('en') and doc = nlp(raw_text)
we can do .similarity queries between tokens and chunks.
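
For example, a minimal sketch of such queries (the sentence and token indices here are illustrative, not from the tutorial):

    import spacy

    nlp = spacy.load('en')
    doc = nlp(u'The quick brown fox jumped over the lazy dog.')

    # token-to-token similarity
    fox, dog = doc[3], doc[8]
    print(fox.similarity(dog))

    # the same method works between docs (and spans/chunks)
    other = nlp(u'A cat sat on the mat.')
    print(doc.similarity(other))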
However, what is being calculated behind the scenes in this .similarity method?

spaCy already has the incredibly simple .vector, which returns the w2v vector as trained with the GloVe model (how cool would a .tfidf or .fasttext method be?).
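
For instance (a quick sketch; the 300-dimensional shape assumes the default GloVe vectors shipped with the English model):

    import spacy

    nlp = spacy.load('en')
    token = nlp(u'fox')[0]

    # the GloVe-trained embedding and its L2 norm
    print(token.vector.shape)   # e.g. (300,)
    print(token.vector_norm)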

Is the model simply computing the cosine similarity between these two w2v .vector vectors, or is it comparing some other representation? The specifics aren't clear in the documentation; any help is appreciated!

Best Answer

Found the answer; in short, it's yes:

Link to Source Code

    return numpy.dot(self.vector, other.vector) / (self.vector_norm * other.vector_norm)

This is the formula for cosine similarity, and the vectors appear to be the ones exposed by spaCy's .vector, which the documentation says are trained with the GloVe model.
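
To double-check, here is a small sketch that reproduces .similarity by hand from the .vector and .vector_norm attributes (same assumptions about the installed English model as above):

    import numpy
    import spacy

    nlp = spacy.load('en')
    doc = nlp(u'dog cat')
    dog, cat = doc[0], doc[1]

    # cosine similarity computed directly from the vectors
    manual = numpy.dot(dog.vector, cat.vector) / (dog.vector_norm * cat.vector_norm)

    print(manual)
    print(dog.similarity(cat))  # should match the manual value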
