One simple technique that seems to work reasonably well for short texts (e.g., a sentence or a tweet) is to compute the vector for each word in the document, and then aggregate them using the coordinate-wise mean, min, or max.
Based on results in one recent paper, it seems that using the min and the max works reasonably well. It's not optimal, but it's simple and about as good as or better than other simple techniques. In particular, if the vectors for the $n$ words in the document are $v^1,v^2,\dots,v^n \in \mathbb{R}^d$, then you compute $\min(v^1,\dots,v^n)$ and $\max(v^1,\dots,v^n)$. Here we're taking the coordinate-wise minimum, i.e., the minimum is a vector $u$ such that $u_i = \min(v^1_i, \dots, v^n_i)$, and similarly for the max.
The feature vector is the concatenation of these two vectors, so we obtain a feature vector in $\mathbb{R}^{2d}$. I don't know if this is better or worse than a bag-of-words representation, but for short documents I suspect it might perform better than bag-of-words, and it allows using pre-trained word embeddings.
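For concreteness, here is a minimal NumPy sketch of this idea, assuming the word vectors for one document are already stacked in a 2-D array (the function and variable names are just for illustration):

```python
import numpy as np

def min_max_features(word_vectors: np.ndarray) -> np.ndarray:
    """Build a 2d-dimensional feature vector from n word vectors of dimension d.

    word_vectors: array of shape (n, d), one row per word in the document.
    Returns the concatenation of the coordinate-wise min and max, shape (2d,).
    """
    v_min = word_vectors.min(axis=0)  # u_i = min(v^1_i, ..., v^n_i)
    v_max = word_vectors.max(axis=0)  # coordinate-wise maximum
    return np.concatenate([v_min, v_max])

# Example: 4 words with 5-dimensional embeddings -> a 10-dimensional feature vector
vectors = np.random.randn(4, 5)
features = min_max_features(vectors)
print(features.shape)  # (10,)
```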
TL;DR: Surprisingly, the concatenation of the min and max works reasonably well.
Reference:
Cedric De Boom, Steven Van Canneyt, Thomas Demeester, and Bart Dhoedt. Representation learning for very short texts using weighted word embedding aggregation. Pattern Recognition Letters; arXiv:1607.00570. See especially Tables 1 and 2.
Credits: Thanks to @user115202 for bringing this paper to my attention.
Yes, it does. Here you can find an example of a network that uses multiplication, among other methods, for combining embeddings. As described in my answer, the element-wise product $u*v$ is basically an interaction term: it can capture similarities between values (big * big = bigger; small * small = smaller) or discrepancies (negative * positive = negative) (see example here).
So it is a perfectly reasonable way of combining embeddings, but often, as in the example above, people use several different combination methods in parallel to produce different kinds of features for the model.
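To make this concrete, here is a small illustrative sketch (not any particular network's code) of combining two embedding vectors with the element-wise product alongside other simple features; all names are made up:

```python
import numpy as np

def combine_embeddings(u: np.ndarray, v: np.ndarray) -> np.ndarray:
    """Combine two embedding vectors u, v of the same dimension d.

    The element-wise product acts as an interaction term:
    large * large stays large, small * small stays small,
    and opposite signs produce a negative coordinate.
    Several combinations are concatenated so the model can use them all.
    """
    interaction = u * v          # element-wise product, shape (d,)
    difference = np.abs(u - v)   # another commonly used feature
    return np.concatenate([u, v, interaction, difference])  # shape (4d,)

u = np.random.randn(8)
v = np.random.randn(8)
print(combine_embeddings(u, v).shape)  # (32,)
```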
Best Answer
This quote is clearly talking about sentence embeddings, obtained from word embeddings.
If the sentence $s$ consists of words $(w_1, ..., w_n)$, we'd like to define an embedding vector $Emb_s(s) \in \mathbb{R}^d$ for some $d > 0$.
The authors of this paper propose to compute it from the embeddings of words $w_i$, let's call them $Emb_w(w_i)$, so that $Emb_s(s)$ is a linear combination of $Emb_w(w_i)$ and has the same dimensionality $d$:
$$Emb_s(s) = \sum_{w_i \in s} c_i \cdot Emb_w(w_i)$$
where $c_i \in \mathbb{R}$ are scalar coefficients. Note that $d$ is the same for all word vectors.
In the simplest case, all $c_i = 1$, so $Emb_s(s)$ would be a sum of the constituent vectors. A better approach is averaging, i.e., $c_i = \frac{1}{n}$ (to handle sentences of different lengths). Note that the dimensionality doesn't change; it's still $d$.
Finally, the proposed method is the weighted average, where the weights are the TF-IDF scores. This captures the fact that some words in a sentence are naturally more informative than others. Once again, there's no problem with dimensions, because it's a sum of vectors in $\mathbb{R}^d$ multiplied by scalars.
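As an illustration only, not the authors' exact implementation, here is a minimal NumPy sketch of the plain mean and the TF-IDF-weighted mean, assuming the word vectors and TF-IDF weights are available as dictionaries; all names and values are hypothetical:

```python
import numpy as np

def sentence_embedding(words, word_vectors, tfidf_weights=None):
    """Combine word embeddings into a sentence embedding of the same dimension d.

    words:         list of tokens in the sentence
    word_vectors:  dict mapping a word to its embedding (np.ndarray of shape (d,))
    tfidf_weights: optional dict mapping a word to its TF-IDF weight (scalar c_i)
    """
    vectors = [word_vectors[w] for w in words if w in word_vectors]
    if tfidf_weights is None:
        # c_i = 1/n: the plain average, robust to different sentence lengths
        return np.mean(vectors, axis=0)
    # Weighted average: each vector is scaled by its TF-IDF coefficient,
    # then normalized by the total weight so the result is a weighted mean.
    weights = np.array([tfidf_weights.get(w, 0.0) for w in words if w in word_vectors])
    weighted_sum = np.sum(weights[:, None] * np.stack(vectors), axis=0)
    return weighted_sum / (weights.sum() + 1e-12)

# Example with toy 3-dimensional vectors (hypothetical values)
wv = {"cats": np.array([1.0, 0.0, 2.0]), "sleep": np.array([0.0, 1.0, 1.0])}
tfidf = {"cats": 2.0, "sleep": 0.5}
print(sentence_embedding(["cats", "sleep"], wv))          # plain mean
print(sentence_embedding(["cats", "sleep"], wv, tfidf))   # TF-IDF weighted mean
```

Either way, the output lives in the same $\mathbb{R}^d$ space as the word vectors, which is the point of the quoted passage.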