Machine Learning – LDA vs Word2Vec

latent-variable, machine-learning, natural-language, self-study, word2vec

I am trying to understand the similarity between Latent Dirichlet Allocation (LDA) and word2vec for calculating word similarity.

As I understand it, LDA maps words to a vector of probabilities over latent topics, while word2vec maps them to a vector of real numbers (related to a singular value decomposition of pointwise mutual information; see O. Levy and Y. Goldberg, "Neural Word Embedding as Implicit Matrix Factorization"; see also How does word2vec work?).
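
For concreteness, here is a minimal sketch of the two representations using gensim (the toy corpus, hyperparameters, and variable names are made up purely for illustration):

```python
import numpy as np
from gensim.corpora import Dictionary
from gensim.models import LdaModel, Word2Vec

texts = [
    ["king", "queen", "palace", "crown"],
    ["man", "woman", "child", "family"],
    ["king", "man", "throne"],
    ["queen", "woman", "crown"],
]

# LDA: a word can be summarized by a probability vector over latent topics.
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2, passes=50, random_state=0)
term_topic = lda.get_topics()                    # shape (num_topics, vocab_size): P(word | topic)
col = term_topic[:, dictionary.token2id["king"]]
print("LDA topic vector for 'king':", col / col.sum())   # normalized over topics, sums to 1

# word2vec: the same word becomes a dense real-valued vector with no probabilistic meaning.
w2v = Word2Vec(sentences=texts, vector_size=10, window=2, min_count=1, sg=1, seed=0)
print("word2vec vector for 'king':", w2v.wv["king"])      # arbitrary real numbers
```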

I am interested both in the theoretical relations (can one be considered a generalization or variation of the other?) and in practical ones (when to use one but not the other).

Best Answer

An answer to Topic models and word co-occurrence methods covers the difference (skip-gram word2vec is a compression of pointwise mutual information (PMI)).
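
A minimal sketch of that view, assuming only numpy (the co-occurrence counts below are made up): build a word–context co-occurrence matrix, convert it to positive PMI, and take a truncated SVD; the resulting rows act as word vectors, which is roughly what skip-gram with negative sampling factorizes implicitly.

```python
import numpy as np

vocab = ["king", "queen", "man", "woman"]
# Toy co-occurrence counts C[i, j] = #(word i appears near word j); illustrative only.
C = np.array([
    [0, 8, 5, 1],
    [8, 0, 1, 5],
    [5, 1, 0, 7],
    [1, 5, 7, 0],
], dtype=float)

total = C.sum()
p_wc = C / total                                  # joint probability P(w, c)
p_w = C.sum(axis=1, keepdims=True) / total
p_c = C.sum(axis=0, keepdims=True) / total
with np.errstate(divide="ignore"):
    pmi = np.log(p_wc / (p_w * p_c))
ppmi = np.maximum(pmi, 0)                         # positive PMI, the usual practical variant

# Rank-k factorization: word vectors from the left singular vectors.
U, S, _ = np.linalg.svd(ppmi)
k = 2
word_vectors = U[:, :k] * S[:k]
for w, v in zip(vocab, word_vectors):
    print(w, v.round(3))
```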

So:

  • neither method is a generalization of the other,
  • word2vec lets us use vector geometry, e.g. the word analogy $v_{king} - v_{man} + v_{woman} \approx v_{queen}$ (I wrote an overview of word2vec; see the sketch after this list),
  • LDA captures higher-order co-occurrences than pairwise (two-element) ones,
  • LDA gives interpretable topics.
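
A minimal sketch of the analogy arithmetic, assuming gensim and its downloadable pretrained GloVe vectors (any reasonably trained word vectors behave the same way):

```python
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-50")      # pretrained KeyedVectors (downloads on first use)
# v_king - v_man + v_woman: the nearest remaining word is typically "queen".
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
```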

Some differences are discussed in the slides word2vec, LDA, and introducing a new hybrid algorithm: lda2vec by Christopher Moody.