Solved – Word embeddings with logistic regression

classification, machine learning, word embeddings

My goal is to classify a set of documents (e.g. 20newsgroups) into one of twenty categories. I can do this with logistic regression, for example, which takes as input a sparse $D \times V$ matrix in which each row is a document and each column is the smoothed tf-idf weight of a word in that document.
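For concreteness, a minimal sketch of that tf-idf baseline using scikit-learn; the dataset loader, vectorizer settings, and solver options here are illustrative assumptions, not part of the setup above:

```python
# A minimal sketch of the tf-idf baseline: sparse D x V matrix into
# logistic regression. Vectorizer/solver settings are illustrative.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

train = fetch_20newsgroups(subset="train")
test = fetch_20newsgroups(subset="test")

# Sparse D x V matrix: one row per document, one tf-idf weight per term.
vectorizer = TfidfVectorizer(sublinear_tf=True, min_df=5)
X_train = vectorizer.fit_transform(train.data)
X_test = vectorizer.transform(test.data)

clf = LogisticRegression(max_iter=1000).fit(X_train, train.target)
print(clf.score(X_test, test.target))  # accuracy on the held-out split
```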

Instead of using the sparse tf-idf matrix, I want to classify based on word embeddings (word2vec or GloVe), where each word is represented by, say, a 300-dimensional vector. My question is: how do you represent a document of word vectors as input to a logistic regression that expects a matrix of size n_samples by n_features? In short, how do you classify a document based on word embeddings?

Best Answer

While it's possible to combine word embeddings using a weighted average or a concatenation of the min/max values across the word vectors, as described in this post, the resulting document vector loses semantic information. A rough sketch of the averaging approach is shown below.
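Here is one way the averaging approach might look, assuming gensim (>= 4) and its pretrained-model downloader; the GloVe model name and the toy corpus are illustrative:

```python
# A rough sketch of document vectors by averaging word embeddings.
# Assumes gensim >= 4; the model name and corpus are illustrative.
import numpy as np
import gensim.downloader as api

wv = api.load("glove-wiki-gigaword-300")  # pretrained 300-d GloVe vectors

def doc_vector(tokens, wv):
    """Mean of the embeddings of in-vocabulary tokens; zeros if none match."""
    vecs = [wv[t] for t in tokens if t in wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(wv.vector_size)

documents = ["the cat sat on the mat", "stocks rallied after the report"]
X = np.vstack([doc_vector(doc.lower().split(), wv) for doc in documents])
# X has shape (n_samples, 300) and plugs straight into LogisticRegression.
```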

A better alternative is to train a doc2vec model, an extension of word2vec that uses paragraph vectors as part of the context during training:

*[figure: doc2vec model architecture]*

The word vectors in doc2vec are shared across all paragraphs, while the paragraph vectors are unique to each paragraph. The doc2vec model is implemented in gensim; see the following IPython notebook for a worked example and this Quora post for additional explanation of the doc2vec model.
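As a sketch of how this might look in gensim (version >= 4 is assumed; the toy corpus and hyperparameters are illustrative, not tuned):

```python
# A minimal doc2vec sketch with gensim: train paragraph vectors, then use
# them as fixed-size features. Corpus and hyperparameters are illustrative.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

documents = ["the cat sat on the mat", "stocks rallied after the report"]
corpus = [TaggedDocument(words=doc.lower().split(), tags=[i])
          for i, doc in enumerate(documents)]

model = Doc2Vec(vector_size=300, min_count=1, epochs=40)
model.build_vocab(corpus)
model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)

train_vec = model.dv[0]                                   # paragraph vector of doc 0
new_vec = model.infer_vector("a dog on the mat".split())  # vector for an unseen doc
# Stacking these vectors gives the n_samples x n_features matrix that
# logistic regression expects.
```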