One simple technique that seems to work reasonably well for short texts (e.g., a sentence or a tweet) is to compute the vector for each word in the document, and then aggregate them using the coordinate-wise mean, min, or max.
Based on results in one recent paper, it seems that using the min and the max works reasonably well. It's not optimal, but it's simple and about as good as, or better than, other simple techniques. In particular, if the vectors for the $n$ words in the document are $v^1,v^2,\dots,v^n \in \mathbb{R}^d$, then you compute $\min(v^1,\dots,v^n)$ and $\max(v^1,\dots,v^n)$. Here we're taking the coordinate-wise minimum, i.e., the minimum is a vector $u$ such that $u_i = \min(v^1_i, \dots, v^n_i)$, and similarly for the max.
The feature vector is the concatenation of these two vectors, so we obtain a feature vector in $\mathbb{R}^{2d}$. I don't know if this is better or worse than a bag-of-words representation, but for short documents I suspect it might perform better than bag-of-words, and it allows using pre-trained word embeddings.
TL;DR: Surprisingly, the concatenation of the min and max works reasonably well.
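A minimal sketch of this aggregation in NumPy (the function name and toy vectors are illustrative, not from the paper):

```python
import numpy as np

def minmax_embed(word_vectors):
    """Aggregate word vectors (n x d) into a single 2d-dimensional
    feature vector by concatenating the coordinate-wise min and max."""
    V = np.asarray(word_vectors)                           # shape (n, d)
    return np.concatenate([V.min(axis=0), V.max(axis=0)])  # shape (2d,)

# Toy example with d = 3: two "word vectors" for a two-word document.
vecs = [[1.0, -2.0, 0.5],
        [0.0,  3.0, 0.2]]
print(minmax_embed(vecs))  # [ 0.  -2.   0.2  1.   3.   0.5]
```

In practice the rows of `word_vectors` would come from pre-trained embeddings (e.g., word2vec or GloVe lookups for each word in the sentence).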
Reference:
Cedric De Boom, Steven Van Canneyt, Thomas Demeester, Bart Dhoedt. "Representation learning for very short texts using weighted word embedding aggregation." Pattern Recognition Letters; arXiv:1607.00570. See especially Tables 1 and 2.
Credits: Thanks to @user115202 for bringing this paper to my attention.
There are dozens of ways to produce sentence embeddings.
We can group them into 3 types:

- Unordered/Weakly Ordered:
  - things like Bag of Words, Bag of n-grams
  - Dimensionality-reduced versions of the above (take a Bag of Words etc. of your sentences from a training set, apply PCA; now the sentence embeddings are dense)
  - Sum/dot-product/mean of Word Embeddings
  - Doc2Vec
    - Doc2Vec is discredited by one of its own authors. See this question.
- Sequential Models (i.e. RNNs)
- Structured Models (i.e. Recursive Neural Networks)
  - As with RNNs, train them to do some task and take an extra final hidden layer as the embedding.
  - See Socher's thesis.
  - Generally worse at most things than the more optimized/developed RNNs, but better motivated linguistically than RNNs.
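As a sketch of the simplest family above, the dimensionality-reduced Bag of Words: build count vectors and project them with PCA to get dense sentence embeddings (corpus and component count are illustrative):

```python
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import CountVectorizer

sentences = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are animals",
    "logs and mats are objects",
]

# Sparse bag-of-words counts: one row per sentence, one column per word.
bow = CountVectorizer().fit_transform(sentences).toarray()

# Project onto a few principal components: dense sentence embeddings.
embeddings = PCA(n_components=2).fit_transform(bow)
print(embeddings.shape)  # (4, 2)
```

On a real training set you would fit the vectorizer and PCA once, then apply both transforms to new sentences.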
Technically, many of these methods produce word embeddings as a byproduct.
I did a bit of a comparison a few years ago:
2015: How Well Sentence Embeddings Capture Meaning, Lyndon White, Roberto Togneri, Wei Liu, and Mohammed Bennamoun.
It is a bit outdated now (it doesn't include skip-thought, or any of the other RNN-based methods).
And it is just one way to evaluate them.
Different purposes are suited to different evaluations.
My suggestion would be to start from the simplest possible (Bag of Words),
and move up to the most complex only as required (Some kind of matrix-vector dependency-tree unfolding recursive auto-encoder).
I wrote a book which includes a chapter discussing many methods, if one is particularly interested:
2018: Neural Representations of Natural Language, Lyndon White, Roberto Togneri, Wei Liu, and Mohammed Bennamoun;
Springer: Studies in Computational Intelligence
Best Answer
Have you checked doc2vec? Doc2vec returns a fixed-length representation for each document (a sentence, in your case), regardless of its length. Here is the original paper. There is also a Python implementation in the gensim package.
Another usual approach for text classification is to compute the tf-idf matrix and use it as input to a classifier, in which case the columns of the matrix are the features. The tf-idf matrix has the sentences as its rows and the unique terms (words) across all your sentences as its columns. Each element in the matrix is the tf-idf value of that sentence-term pair. You can find more information on the Wikipedia page. scikit-learn has an implementation of tf-idf here.
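A short sketch of this pipeline with scikit-learn (the tiny corpus and labels are made up for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

sentences = ["i loved this movie", "great film and great cast",
             "terrible plot", "i hated the acting"]
labels = [1, 1, 0, 0]  # toy sentiment labels

# Rows = sentences, columns = unique terms, entries = tf-idf weights.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(sentences)

# The tf-idf columns serve directly as features for a classifier.
clf = LogisticRegression().fit(X, labels)
print(clf.predict(vectorizer.transform(["great movie"])))
```

Note that new sentences must go through the same fitted vectorizer, so they land in the same feature space as the training data.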