In word2vec's CBOW and skip-gram models, why is hierarchical softmax better for infrequent words, while negative sampling is better for frequent words? I read this claim on https://code.google.com/p/word2vec/.
Tags: natural-language, softmax, word-embeddings, word2vec
Best Answer
I'm not an expert in word2vec, but after reading Rong, X. (2014), word2vec Parameter Learning Explained, and from my own NN experience, I'd simplify the reasoning to this:

Hierarchical softmax arranges the vocabulary as the leaves of a binary (Huffman) tree and computes a word's probability as a product of binary decisions along its path. The vectors at the internal nodes are shared across many words, so updates triggered by frequent words also train the path nodes that rare words pass through; infrequent words therefore benefit from those shared parameters even though they themselves occur rarely.

Negative sampling, by contrast, updates only the target word's output vector and a handful of negative words drawn from the unigram distribution (raised to the 3/4 power in the paper). Frequent words are drawn often, both as positive targets and as negatives, so their vectors receive many updates and end up well trained, while rare words are touched only occasionally.

The two methods don't seem to be mutually exclusive in theory, but that, roughly, is why they'd be better for infrequent and frequent words, respectively.
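To make the negative-sampling side of this concrete, here is a minimal sketch of word2vec's noise distribution, P(w) ∝ f(w)^(3/4). The word frequencies are made up for illustration; only the 3/4 exponent comes from the word2vec paper.

```python
# Hypothetical word frequencies (illustrative, not from any corpus).
freq = {"the": 50000, "cat": 500, "axolotl": 5}

# word2vec's noise distribution for negative sampling: P(w) ∝ f(w)^(3/4).
weights = {w: f ** 0.75 for w, f in freq.items()}
total = sum(weights.values())
noise_prob = {w: wt / total for w, wt in weights.items()}

# A word's output vector is updated whenever it appears as a positive
# target (proportional to f(w)) or is drawn as a negative (proportional
# to noise_prob[w]); either way, frequent words dominate the updates.
for w in freq:
    print(w, round(noise_prob[w], 4))
```

Note that the 3/4 power flattens the raw frequencies somewhat, so rare words are sampled a bit more often than their raw counts would suggest, but frequent words still receive the overwhelming share of updates.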