Solved – How to train sentence/paragraph/document embeddings

doc2vec, machine learning, natural language, word embeddings, word2vec

I'm well aware of word embeddings (word2vec or GloVe) and I know of four papers treating the subject of more general embeddings:

Distributed Representations of Sentences and Documents – Quoc V. Le, Tomas Mikolov
https://arxiv.org/abs/1405.4053

Document Embedding with Paragraph Vectors – Andrew M. Dai, Christopher Olah, Quoc V. Le
https://arxiv.org/abs/1507.07998

An Empirical Evaluation of doc2vec with Practical Insights into Document Embedding Generation – Jey Han Lau, Timothy Baldwin
https://arxiv.org/abs/1607.05368

which all talk about the same method and

Skip-Thought Vectors – Ryan Kiros, Yukun Zhu, Ruslan Salakhutdinov, Richard S. Zemel, Antonio Torralba, Raquel Urtasun, Sanja Fidler
https://arxiv.org/abs/1506.06726

which maps sentences to their embeddings.

I also know that you can just take the average of the word embeddings, but I am wondering two things:

  • Whether there are other ways to use word embeddings to build sentence/paragraph/document embeddings.
  • Whether there are ways of computing such embeddings without using word embeddings.

In other words, is something like sentence2vec/paragraph2vec/doc2vec possible with techniques other than those in these four papers and the simple averaging process (while still obtaining good results)?

Best Answer

There are dozens of ways to produce sentence embeddings. We can group them into three types:

  • Unordered/Weakly Ordered:
    • things like Bag of Words, Bag of ngrams
    • Dimensionality-reduced versions of the above (take a bag of words etc. of your sentences from a training set and apply PCA; the sentence embeddings are now dense)
    • Sum/element-wise product/mean of word embeddings (see the first sketch after this list)
    • Doc2Vec
      • Doc2Vec is discredited by one of its own authors. See this question.
  • Sequential Models (i.e. RNNs) – a minimal encoder sketch follows the list
  • Structured Models (i.e. Recursive Neural Networks)
    • As with RNNs, these are trained to do some task, and the sentence embedding is read from an extra/final hidden layer
    • See Socher's thesis
    • Generally worse at most things than the more optimized/developed RNNs, but better motivated linguistically than RNNs (a toy recursive composition is sketched below)
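
To make the first group concrete, here is a minimal sketch of the unordered methods: a bag-of-words matrix, a PCA-reduced dense version of it, and mean-pooled word embeddings. The toy sentences and the random word vectors are stand-ins of my own; in practice you would plug in pretrained word2vec/GloVe vectors.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import CountVectorizer

sentences = [
    "the cat sat on the mat",
    "a dog chased the cat",
    "stock prices fell sharply today",
    "the market dropped after the announcement",
]

# 1) Bag of words: each sentence becomes a vector of word counts.
bow = CountVectorizer().fit_transform(sentences).toarray()

# 2) Dimensionality-reduced bag of words: PCA turns the sparse counts
#    into dense, low-dimensional sentence embeddings.
dense = PCA(n_components=2).fit_transform(bow)

# 3) Mean of word embeddings (random vectors stand in for word2vec/GloVe here).
rng = np.random.default_rng(0)
vocab = {w: rng.normal(size=50) for s in sentences for w in s.split()}
mean_pooled = np.stack([np.mean([vocab[w] for w in s.split()], axis=0)
                        for s in sentences])
```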
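For the sequential group, this is roughly what an RNN sentence encoder looks like (PyTorch is my assumption here; skip-thought and similar models are more elaborate versions of the same idea). The last hidden state of the recurrent layer is taken as the sentence vector. The module below is untrained; in practice you would first train it on some task and then reuse the encoder.

```python
import torch
import torch.nn as nn

class RNNSentenceEncoder(nn.Module):
    """GRU encoder whose final hidden state serves as the sentence embedding."""
    def __init__(self, vocab_size, embed_dim=100, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.gru = nn.GRU(embed_dim, hidden_dim, batch_first=True)

    def forward(self, token_ids):                 # token_ids: (batch, seq_len)
        _, h_n = self.gru(self.embed(token_ids))  # h_n: (1, batch, hidden_dim)
        return h_n.squeeze(0)                     # (batch, hidden_dim) sentence vectors

encoder = RNNSentenceEncoder(vocab_size=10_000)
ids = torch.randint(0, 10_000, (2, 7))   # two toy sentences of 7 token ids each
sentence_vectors = encoder(ids)           # shape: (2, 128)
```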
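And for the structured group, a toy forward pass of a Socher-style recursive network: word vectors sit at the leaves of a binarised parse tree and are merged pairwise with a shared composition function until one vector remains for the whole sentence. The parameters and word vectors here are random, purely for illustration; in the actual models they are learned.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 50
# Composition parameters (learned in real recursive models, random here).
W = rng.normal(scale=0.1, size=(dim, 2 * dim))
b = np.zeros(dim)

def compose(left, right):
    """Merge two child vectors into one parent vector."""
    return np.tanh(W @ np.concatenate([left, right]) + b)

def embed_tree(node, word_vectors):
    """Recursively embed a binarised parse tree.
    A node is either a word (str) or a (left, right) pair of nodes."""
    if isinstance(node, str):
        return word_vectors[node]
    left, right = node
    return compose(embed_tree(left, word_vectors),
                   embed_tree(right, word_vectors))

# Toy word vectors and a toy parse of "the cat sat".
word_vectors = {w: rng.normal(size=dim) for w in ["the", "cat", "sat"]}
sentence_vec = embed_tree((("the", "cat"), "sat"), word_vectors)
```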

Technically, many of these methods produce word embeddings as a byproduct.
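
As an illustration, here is a minimal gensim Doc2Vec sketch (assuming gensim ≥ 4): it learns one vector per document, can infer vectors for unseen text, and the word vectors it produces along the way are exactly the byproduct mentioned above. The tiny corpus and hyperparameters are placeholders.

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

corpus = [
    "the cat sat on the mat",
    "a dog chased the cat",
    "stock prices fell sharply today",
]
tagged = [TaggedDocument(words=s.split(), tags=[i]) for i, s in enumerate(corpus)]

model = Doc2Vec(tagged, vector_size=50, window=2, min_count=1, epochs=40)

doc_vec = model.dv[0]                                  # embedding of document 0
new_vec = model.infer_vector("the cat sat".split())    # embedding for unseen text
word_vec = model.wv["cat"]                             # word embeddings as a byproduct
```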

I did a bit of a comparison a few years ago: 2015: How Well Sentence Embeddings Capture Meaning, Lyndon White, Roberto Togneri, Wei Liu, and Mohammed Bennamoun. It is a bit outdated now (it doesn't include skip-thought or any of the other RNN-based methods), and it is just one way to evaluate them; different purposes are suited to different evaluations.

My suggestion would be to start from the simplest possible (Bag of Words), and move up to the most complex only as required (some kind of matrix-vector dependency-tree unfolding recursive auto-encoder).

If one is particularly interested, I wrote a book which includes a chapter discussing many of these methods: 2018: Neural Representations of Natural Language, Lyndon White, Roberto Togneri, Wei Liu, and Mohammed Bennamoun; Springer: Studies in Computational Intelligence.
