One simple technique that seems to work reasonably well for short texts (e.g., a sentence or a tweet) is to compute the vector for each word in the document, and then aggregate them using the coordinate-wise mean, min, or max.
Based on results in one recent paper, it seems that using the min and the max works reasonably well. It's not optimal, but it's simple and about as good as or better than other simple techniques. In particular, if the vectors for the $n$ words in the document are $v^1,v^2,\dots,v^n \in \mathbb{R}^d$, then you compute $\min(v^1,\dots,v^n)$ and $\max(v^1,\dots,v^n)$. Here we're taking the coordinate-wise minimum, i.e., the minimum is a vector $u$ such that $u_i = \min(v^1_i, \dots, v^n_i)$, and similarly for the max.
The feature vector is the concatenation of these two vectors, so we obtain a feature vector in $\mathbb{R}^{2d}$. I don't know whether this is better or worse than a bag-of-words representation in general, but for short documents I suspect it might perform better, and it has the advantage of letting you use pre-trained word embeddings.
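For concreteness, here is a minimal sketch of that aggregation in NumPy; the `embeddings` dictionary (mapping each word to its $d$-dimensional pre-trained vector) and the function name are illustrative choices, not anything taken from the paper:

```python
import numpy as np

def min_max_embedding(words, embeddings):
    """Concatenate the coordinate-wise min and max of the word vectors."""
    # Keep only words that actually have a pre-trained vector.
    vectors = np.stack([embeddings[w] for w in words if w in embeddings])  # shape (n, d)
    # Coordinate-wise min and max over the n word vectors, then concatenate.
    return np.concatenate([vectors.min(axis=0), vectors.max(axis=0)])      # shape (2d,)
```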
TL;DR: Surprisingly, the concatenation of the min and max works reasonably well.
Reference:
Representation learning for very short texts using weighted word embedding aggregation. Cedric De Boom, Steven Van Canneyt, Thomas Demeester, Bart Dhoedt. Pattern Recognition Letters; arXiv:1607.00570. See especially Tables 1 and 2.
Credits: Thanks to @user115202 for bringing this paper to my attention.
There are dozens of ways to produce sentence embeddings.
We can group them into 3 types:
- Unordered/Weakly Ordered:
  - Things like Bag of Words, Bag of n-grams
  - Dimensionality-reduced versions of the above (take a Bag of Words etc. of your sentences from a training set and apply PCA; now the sentence embeddings are dense)
  - Sum/dot-product/mean of word embeddings (a sketch of this and the PCA approach follows this list)
  - Doc2Vec
    - Doc2Vec is discredited by one of its own authors. See this question.
- Sequential Models (e.g. RNNs)
- Structured Models (e.g. Recursive Neural Networks)
  - As with RNNs, train them to do some task and take the final hidden layer as the embedding
  - See Socher's thesis
  - Generally worse at most things than the more optimized/developed RNNs, but better motivated linguistically than RNNs
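As a rough illustration of two of the unordered options above, here is a sketch using scikit-learn and NumPy; the toy corpus and the random vectors standing in for pre-trained embeddings are placeholders of my own, not anything prescribed by these methods:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["the cat sat on the mat", "dogs chase cats", "the dog sat"]

# Bag of Words over the corpus, then PCA, giving one dense vector per sentence.
bow = CountVectorizer().fit_transform(corpus).toarray()
dense_sentences = PCA(n_components=2).fit_transform(bow)   # shape (3, 2)

# Mean of word embeddings (random vectors here stand in for pre-trained ones).
embeddings = {w: np.random.randn(50) for w in set(" ".join(corpus).split())}
mean_sentence = np.mean([embeddings[w] for w in corpus[0].split()], axis=0)  # shape (50,)
```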
Technically, many of these methods produce word embeddings as a byproduct.
I did a bit of a comparison a few years ago:
2015: How Well Sentence Embeddings Capture Meaning, Lyndon White, Roberto Togneri, Wei Liu, and Mohammed Bennamoun.
It is a bit outdated now (it doesn't include skip-thought or any of the other RNN-based methods).
And it is just one way to evaluate them.
Different purposes are suited to different evaluations.
My suggestion would be to start from the simplest possible (Bag of Words),
and move up to the most complex only as required (some kind of matrix-vector dependency-tree unfolding recursive auto-encoder).
I wrote a book which includes a chapter discussing many methods, if one is particularly interested:
2018: Neural Representations of Natural Language, Lyndon White, Roberto Togneri, Wei Liu, and Mohammed Bennamoun;
Springer: Studies in Computational Intelligence
Best Answer
A loop is your only option here if you have not saved your word embeddings in any other format, such as a binary file. Just use a list comprehension, which should be fairly quick even with 2M entries. Assuming your dictionary is named `d` and all of its values are vectors of the same length, you could do something like the following:
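```python
import numpy as np

# Sketch: `d` is your existing dictionary mapping words to equal-length vectors,
# e.g. d = {"cat": [0.1, 0.2], "dog": [0.3, 0.4]}  (illustrative structure only).
words = list(d.keys())                      # keep the word order alongside the rows
vectors = np.array([d[w] for w in words])   # shape: (num_words, embedding_dim)
```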
Once you have converted the dictionary values into a NumPy array, you can normalize your data using convenient tools from scikit-learn, for example `MinMaxScaler`:
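```python
from sklearn.preprocessing import MinMaxScaler

# Scale each embedding dimension to the [0, 1] range across all words.
# (A sketch reusing the `vectors` array built above; other scalers would work too.)
vectors_scaled = MinMaxScaler().fit_transform(vectors)
```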