Why is hierarchical softmax better for infrequent words, while negative sampling is better for frequent words

natural-language, softmax, word-embeddings, word2vec

I wonder why hierarchical softmax is better for infrequent words, while negative sampling is better for frequent words, in word2vec's CBOW and skip-gram models. I have read the claim on https://code.google.com/p/word2vec/.

Best Answer

I'm not an expert in word2vec, but after reading Rong, X. (2014), word2vec Parameter Learning Explained, and from my own NN experience, I'd simplify the reasoning to this:

  • Hierarchical softmax improves training efficiency because the output word is determined by a tree-like traversal rather than by scoring every output unit; a given training sample only has to evaluate/update $O(\log N)$ network units instead of $O(N)$. This essentially spreads the weights so that a large vocabulary can be supported: a given word is related to fewer neurons, and vice versa (see the first sketch after this list).
  • Negative sampling is a way to sample the training data, similar in spirit to stochastic gradient descent, but the key is that you also draw negative training examples. Intuitively, it trains on places where the model might have expected a word but didn't find one, which is faster than training against the entire corpus on every iteration and makes sense for common words (see the second sketch below).
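
To make the $O(\log N)$ point concrete, here is a minimal sketch of the hierarchical-softmax idea, not the word2vec C implementation. The tree layout, node indices, and the 0/1 branch convention are assumptions for illustration; the point is that a word's probability is a product of sigmoid decisions at the roughly $\log_2 V$ inner nodes on its root-to-leaf path, so a training step touches only those nodes.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hs_probability(hidden, path_nodes, path_codes, node_vecs):
    """P(word | context) under hierarchical softmax (toy sketch).

    hidden     : hidden-layer vector h (e.g. averaged context embeddings in CBOW)
    path_nodes : indices of the inner nodes on the root-to-leaf path for this word
    path_codes : 0/1 branch decisions (e.g. a Huffman code) along that path
    node_vecs  : matrix of inner-node vectors (the "output" parameters)
    """
    prob = 1.0
    for node, code in zip(path_nodes, path_codes):
        s = sigmoid(np.dot(node_vecs[node], hidden))
        # Assumed convention: code 0 means "take the branch with probability s".
        prob *= s if code == 0 else (1.0 - s)
    return prob

# Toy usage: an 8-word vocabulary gives paths of length log2(8) = 3,
# and a binary tree over 8 leaves has 7 inner nodes.
rng = np.random.default_rng(0)
dim, n_inner = 10, 7
node_vecs = rng.normal(scale=0.1, size=(n_inner, dim))
h = rng.normal(size=dim)
print(hs_probability(h, path_nodes=[0, 1, 3], path_codes=[0, 1, 0], node_vecs=node_vecs))
```

Only the three node vectors on that path would be updated for this training example, instead of all $N$ output vectors.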
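
And here is a rough sketch of one skip-gram-with-negative-sampling update, again illustrative rather than the reference implementation. The function name `sgns_step`, the learning rate, and the way the noise distribution is passed in (`unigram_probs`, assumed to be the unigram distribution raised to the 0.75 power as in the word2vec paper) are all choices made for this example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_step(center, context, in_vecs, out_vecs, unigram_probs, k=5, lr=0.025,
              rng=np.random.default_rng()):
    """One skip-gram negative-sampling update (illustrative sketch).

    Instead of normalizing over the whole vocabulary, the loss uses the one
    observed (center, context) pair plus k "negative" words drawn from a noise
    distribution, so only k + 1 output vectors are touched per step.
    """
    h = in_vecs[center]
    grad_h = np.zeros_like(h)
    negatives = rng.choice(len(unigram_probs), size=k, p=unigram_probs)
    targets = [(context, 1)] + [(int(w), 0) for w in negatives]
    for word, label in targets:
        score = sigmoid(np.dot(out_vecs[word], h))
        g = score - label                          # gradient of the logistic loss
        grad_h += g * out_vecs[word]               # accumulate before updating
        out_vecs[word] -= lr * g * h               # update only these k+1 output vectors
    in_vecs[center] -= lr * grad_h                 # update the center word's input vector

# Toy usage with a 1000-word vocabulary and 50-dim embeddings.
rng = np.random.default_rng(1)
V, dim = 1000, 50
in_vecs = rng.normal(scale=0.1, size=(V, dim))
out_vecs = np.zeros((V, dim))
counts = rng.integers(1, 100, size=V).astype(float)
unigram_probs = counts**0.75 / np.sum(counts**0.75)   # assumed word2vec-style noise distribution
sgns_step(center=3, context=17, in_vecs=in_vecs, out_vecs=out_vecs,
          unigram_probs=unigram_probs, k=5, rng=rng)
```

Because the noise distribution favors frequent words, common words are both positive and negative targets very often, which fits the claim that negative sampling works well for them.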

The two methods don't seem to be mutually exclusive in theory, but at any rate that seems to be why they'd work better for frequent and infrequent words, respectively.