Solved – Apply word embeddings to an entire document to get a feature vector

classification, natural language, supervised learning, word embeddings, word2vec

How do I use a word embedding to map a document to a feature vector, suitable for use with supervised learning?

A word embedding maps each word $w$ to a vector $v \in \mathbb{R}^d$, where $d$ is some not-too-large number (e.g., 500). Popular word embeddings include word2vec and GloVe.

I want to apply supervised learning to classify documents. I'm currently mapping each document to a feature vector using the bag-of-words representation, then applying an off-the-shelf classifier. I'd like to replace the bag-of-words feature vector with something based on an existing pre-trained word embedding, to take advantage of the semantic knowledge that's contained in the word embedding. Is there a standard way to do that?
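
For reference, my current pipeline looks roughly like the sketch below (the `docs` and `labels` variables are made-up placeholders, and any off-the-shelf classifier would do):

```python
# Hypothetical bag-of-words baseline (placeholder data, scikit-learn for illustration).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

docs = ["the movie was great", "the plot was dull", "great acting and a great plot"]
labels = [1, 0, 1]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)         # one count column per vocabulary word
clf = LogisticRegression().fit(X, labels)  # off-the-shelf classifier on top
```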

I can imagine some possibilities, but I don't know if there's something that makes the most sense. Candidate approaches I've considered:

  • I could compute the vector for each word in the document, and average all of them (sketched after this list). However, this seems like it might lose a lot of information. For instance, with the bag-of-words representation, if there are a few words that are highly relevant to the classification task and most words are irrelevant, the classifier can easily learn that; if I average the vectors for all the words in the document, the classifier has no chance.

  • Concatenating the vectors for all the words doesn't work, because it doesn't lead to a fixed-size feature vector. Also it seems like a bad idea because it will be overly sensitive to the specific placement of a word.

  • I could use the word embedding to cluster the vocabulary of all words into a fixed set of clusters, say, 1000 clusters, where I use cosine similarity on the vectors as a measure of word similarity. Then, instead of a bag-of-words, I could have a bag-of-clusters: the feature vector I supply to the classifier could be a 1000-vector, where the $i$th component counts the number of words in the document that fall in cluster $i$ (also sketched after this list).

  • Given a word $w$, these word embeddings let me compute a set of the top 20 most similar words $w_1,\dots,w_{20}$ and their similarity scores $s_1,\dots,s_{20}$. I could adapt the bag-of-words-like feature vector using this. When I see the word $w$, in addition to incrementing the element corresponding to word $w$ by $1$, I could also increment the element corresponding to word $w_1$ by $s_1$, increment the element corresponding to word $w_2$ by $s_2$, and so on.
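
To make the first and third candidates concrete, here is a rough sketch of what I have in mind. The tiny `embedding` dict, the document, and the cluster count are made-up placeholders; in practice the vectors would come from a pre-trained word2vec or GloVe model, with something like 1000 clusters.

```python
# Sketch of candidate 1 (averaging) and candidate 3 (bag-of-clusters).
import numpy as np
from sklearn.cluster import KMeans

# Placeholder for a pre-trained embedding: word -> vector in R^d (d = 3 for brevity).
embedding = {
    "good":  np.array([0.9, 0.1, 0.0]),
    "great": np.array([0.8, 0.2, 0.1]),
    "bad":   np.array([-0.9, 0.0, 0.1]),
    "movie": np.array([0.0, 1.0, 0.3]),
}

def average_vector(doc_words, emb):
    """Candidate 1: coordinate-wise mean of the document's word vectors."""
    vecs = [emb[w] for w in doc_words if w in emb]
    return np.mean(vecs, axis=0)

def bag_of_clusters(doc_words, emb, kmeans, n_clusters):
    """Candidate 3: count how many of the document's words land in each cluster."""
    counts = np.zeros(n_clusters)
    for w in doc_words:
        if w in emb:
            v = emb[w] / np.linalg.norm(emb[w])          # unit-normalize, as when fitting
            counts[kmeans.predict(v.reshape(1, -1))[0]] += 1
    return counts

# Cluster the vocabulary once, up front. Unit-normalizing the vectors makes
# Euclidean k-means a rough stand-in for cosine-similarity clustering.
vocab = np.stack(list(embedding.values()))
vocab = vocab / np.linalg.norm(vocab, axis=1, keepdims=True)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(vocab)

doc = ["good", "movie", "great"]
print(average_vector(doc, embedding))              # feature vector in R^d
print(bag_of_clusters(doc, embedding, kmeans, 2))  # feature vector in R^{n_clusters}
```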

Is there any specific approach that is likely to work well for document classification?


I'm not looking for paragraph2vec or doc2vec; those require training on a large corpus, and I don't have one. Instead, I want to use an existing pre-trained word embedding.

Best Answer

One simple technique that seems to work reasonably well for short texts (e.g., a sentence or a tweet) is to compute the vector for each word in the document, and then aggregate them using the coordinate-wise mean, min, or max.

Based on results in one recent paper, it seems that using the min and the max works reasonably well. It's not optimal, but it's simple and about as good as, or better than, other simple techniques. In particular, if the vectors for the $n$ words in the document are $v^1,v^2,\dots,v^n \in \mathbb{R}^d$, then you compute $\min(v^1,\dots,v^n)$ and $\max(v^1,\dots,v^n)$. Here we're taking the coordinate-wise minimum, i.e., the minimum is a vector $u$ such that $u_i = \min(v^1_i, \dots, v^n_i)$, and similarly for the max. The feature vector is the concatenation of these two vectors, so we obtain a feature vector in $\mathbb{R}^{2d}$. I don't know whether this beats a bag-of-words representation in general, but for short documents I suspect it might, and it allows using pre-trained word embeddings.
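
A minimal sketch of that min/max aggregation, with a toy `embedding` dict standing in for a real pre-trained model (with gensim, a `KeyedVectors` lookup would play the same role):

```python
import numpy as np

# Placeholder embedding: word -> vector in R^d (d = 3 here just for illustration).
embedding = {
    "good":  np.array([0.9, 0.1, 0.0]),
    "movie": np.array([0.0, 1.0, 0.3]),
    "great": np.array([0.8, 0.2, 0.1]),
}

def min_max_features(doc_words, emb):
    """Concatenate the coordinate-wise min and max of the document's word vectors."""
    vecs = np.stack([emb[w] for w in doc_words if w in emb])
    return np.concatenate([vecs.min(axis=0), vecs.max(axis=0)])  # vector in R^{2d}

print(min_max_features(["good", "movie", "great"], embedding))
# -> [0.  0.1 0.  0.9 1.  0.3]
```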

TL;DR: Surprisingly, the concatenation of the min and max works reasonably well.

Reference:

Cedric De Boom, Steven Van Canneyt, Thomas Demeester, Bart Dhoedt. Representation learning for very short texts using weighted word embedding aggregation. Pattern Recognition Letters; arXiv:1607.00570. See especially Tables 1 and 2.

Credits: Thanks to @user115202 for bringing this paper to my attention.
