Relation to Word2Vec
==========================================
Word2Vec in a simple picture:
More in-depth explanation:
I believe it's related to the recent Word2Vec innovation in natural language processing. Roughly, Word2Vec means our vocabulary is discrete and we learn a map that embeds each word into a continuous vector space. Using this vector-space representation allows us to have a continuous, distributed representation of our vocabulary words. If, for example, our dataset consists of n-grams, we may now use our continuous word features to create a distributed representation of our n-grams. In the process of training a language model we learn this word embedding map. The hope is that, by using a continuous representation, our embedding will map similar words to similar regions. For example, in the landmark paper Distributed Representations of Words and Phrases and their Compositionality, observe in Tables 6 and 7 that certain phrases have very good nearest-neighbour phrases from a semantic point of view. Transforming into this continuous space allows us to use continuous metric notions of similarity to evaluate the semantic quality of our embedding.
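To make "continuous metric notions of similarity" concrete, here is a minimal numpy sketch (the vocabulary and embedding values are made up purely for illustration) of ranking words by cosine similarity to a query word:

import numpy as np

# Toy example: a hypothetical learned embedding matrix, one row per vocabulary word.
vocab = ["king", "queen", "apple"]
E = np.array([[0.90, 0.80, 0.10],
              [0.85, 0.75, 0.20],
              [0.10, 0.05, 0.90]], dtype='float32')

def cosine_similarity(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Rank every word by similarity to "king"; nearest neighbours are the semantically closest.
query = E[vocab.index("king")]
for word, vec in zip(vocab, E):
    print(word, round(cosine_similarity(query, vec), 3))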
Explanation using Lasagne code
Let's break down the Lasagne code snippet:
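First, note that the snippet assumes the usual imports, which are not shown in the original but are needed to actually run it:

import numpy as np
import theano
import theano.tensor as T
from lasagne.layers import InputLayer, EmbeddingLayer, get_output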
x = T.imatrix()
x is a matrix of integers. Okay, no problem. Each word in the vocabulary can be represented as an integer (or, equivalently, as a one-hot sparse encoding). So if x is 2x2, we have two datapoints, each being a 2-gram.
l_in = InputLayer((3, ))
The input layer. The 3 represents the size of our vocabulary. So we have words $w_0, w_1, w_2$ for example.
W = np.arange(3*5).reshape((3, 5)).astype('float32')
This is our word embedding matrix. It is a 3 row by 5 column matrix with entries 0 to 14.
Up until now we have the following interpretation. Our vocabulary has 3 words and we will embed our words into a 5-dimensional vector space. For example, we may represent the words as one-hot sparse encodings: $w_0 = (1, 0, 0)$, $w_1 = (0, 1, 0)$ and $w_2 = (0, 0, 1)$. We can view the $W$ matrix as embedding these words via matrix multiplication. Therefore the first word $w_0 \rightarrow w_0 W = [0, 1, 2, 3, 4]$. Similarly $w_1 \rightarrow w_1 W = [5, 6, 7, 8, 9]$.
Note that, because we are using a one-hot sparse encoding, you will also see this operation referred to as a table lookup.
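As a quick sanity check, the one-hot-times-$W$ view can be verified directly in numpy (this is just an illustration of the claim above; the embedding layer itself performs the equivalent lookup):

import numpy as np

W = np.arange(3*5).reshape((3, 5)).astype('float32')
w0 = np.array([1, 0, 0], dtype='float32')   # one-hot encoding of word w_0
w1 = np.array([0, 1, 0], dtype='float32')   # one-hot encoding of word w_1

print(w0.dot(W))                     # [ 0.  1.  2.  3.  4.]
print(w1.dot(W))                     # [ 5.  6.  7.  8.  9.]
print(np.allclose(w0.dot(W), W[0]))  # True: the multiplication is just a row lookup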
l1 = EmbeddingLayer(l_in, input_size=3, output_size=5, W=W)
The embedding layer.
output = get_output(l1, x)
Symbolic Theano expression for the embedding.
f = theano.function([x], output)
Theano function which computes the embedding.
x_test = np.array([[0, 2], [1, 2]]).astype('int32')
It's worth pausing here to discuss what exactly x_test means. First notice that all of x_test's entries are in {0, 1, 2}, i.e. range(3). x_test has 2 datapoints: the first datapoint [0, 2] represents the 2-gram $(w_0, w_2)$ and the second datapoint [1, 2] represents the 2-gram $(w_1, w_2)$.
We now wish to embed our 2-grams using our word embedding layer. Before we do that, let's make sure we're clear about what should be returned by our embedding function f. The 2-gram $(w_0, w_2)$ is equivalent to the matrix [[1, 0, 0], [0, 0, 1]]. Applying our embedding matrix $W$ to this sparse matrix should yield [[0, 1, 2, 3, 4], [10, 11, 12, 13, 14]]. Note that, for the matrix multiplication to work out, we have to apply the word embedding matrix $W$ via right multiplication to the sparse matrix representation of our 2-gram.
f(x_test)
returns:
array([[[  0.,   1.,   2.,   3.,   4.],
        [ 10.,  11.,  12.,  13.,  14.]],

       [[  5.,   6.,   7.,   8.,   9.],
        [ 10.,  11.,  12.,  13.,  14.]]], dtype=float32)
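This is exactly what plain integer indexing into W gives, i.e. the layer is effectively computing W[x_test]. A one-line numpy check (not part of the original snippet):

print(np.allclose(f(x_test), W[x_test]))   # True: one row of W is looked up per index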
To convince yourself that the 3 does indeed represent the vocabulary size, try inputting the matrix x_test = [[5, 0], [1, 2]]. You will see that it raises an error, because 5 is not a valid row index into the 3-row embedding matrix W.
One simple technique that seems to work reasonably well for short texts (e.g., a sentence or a tweet) is to compute the vector for each word in the document, and then aggregate them using the coordinate-wise mean, min, or max.
Based on the results in one recent paper, it seems that using the min and the max works reasonably well. It's not optimal, but it's simple and about as good as, or better than, other simple techniques. In particular, if the vectors for the $n$ words in the document are $v^1,v^2,\dots,v^n \in \mathbb{R}^d$, then you compute $\min(v^1,\dots,v^n)$ and $\max(v^1,\dots,v^n)$. Here we're taking the coordinate-wise minimum, i.e., the minimum is a vector $u$ such that $u_i = \min(v^1_i, \dots, v^n_i)$, and similarly for the max.
The feature vector is the concatenation of these two vectors, so we obtain a feature vector in $\mathbb{R}^{2d}$. I don't know if this is better or worse than a bag-of-words representation, but for short documents I suspect it might perform better than bag-of-words, and it allows using pre-trained word embeddings.
TL;DR: Surprisingly, the concatenation of the min and max works reasonably well.
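A minimal numpy sketch of this aggregation (the vocabulary and embedding matrix here are stand-ins; in practice you would use pre-trained word embeddings):

import numpy as np

d = 5                                        # embedding dimension
rng = np.random.RandomState(0)
vocab = {"this": 0, "is": 1, "a": 2, "short": 3, "text": 4}
E = rng.randn(len(vocab), d)                 # stand-in for pre-trained embeddings

def min_max_features(words):
    # Concatenate the coordinate-wise min and max of the word vectors (2d features).
    V = np.stack([E[vocab[w]] for w in words])       # n x d matrix of word vectors
    return np.concatenate([V.min(axis=0), V.max(axis=0)])

features = min_max_features(["this", "is", "a", "short", "text"])
print(features.shape)                        # (10,) i.e. 2*d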
Reference:
Cedric De Boom, Steven Van Canneyt, Thomas Demeester, Bart Dhoedt. Representation learning for very short texts using weighted word embedding aggregation. Pattern Recognition Letters; arXiv:1607.00570. See especially Tables 1 and 2.
Credits: Thanks to @user115202 for bringing this paper to my attention.
Best Answer
One possible approach could be to encode the $i$th data point as follows:
$$x_i = [n_i, w_{i1}, \dots, w_{im}]$$
$n_i$ is a numeric value and $w_{i1}, \dots, w_{im}$ are binary, corresponding to a one-hot encoding of a word from a vocabulary of size $m$. If the $i$th data point is a word, then $n_i$ is set to zero, and each $w_{ij}$ is set to 1 if the word matches the $j$th element of the vocabulary (otherwise 0). If the $i$th data point is numeric, $n_i$ is set to this value, and all $w_{ij}$ are set to 0. The numeric values should probably be normalized after encoding this way.
This approach treats numerically similar values as similar to each other (e.g. the input representation of 9.99 is similar to that of 10, but different from that of 1000). This may or may not be appropriate, depending on your application. You could also imagine applying various transformations (e.g. taking the log to squash large values together).
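A minimal sketch of this encoding (the vocabulary, the tokens, and the normalization choice are illustrative assumptions, not prescribed by the approach itself):

import numpy as np

vocab = ["price", "is", "dollars"]                # toy vocabulary of size m = 3

def encode(token, vocab):
    # Encode a token as [n_i, w_i1, ..., w_im]: one numeric slot plus m one-hot word slots.
    x = np.zeros(1 + len(vocab))
    if isinstance(token, (int, float)):
        x[0] = token                              # numeric value; all word slots stay 0
    else:
        x[1 + vocab.index(token)] = 1.0           # one-hot word; numeric slot stays 0
    return x

sequence = ["price", "is", 9.99, "dollars"]
X = np.stack([encode(t, vocab) for t in sequence])
# Normalize the numeric column afterwards, e.g. to zero mean and unit variance.
if X[:, 0].std() > 0:
    X[:, 0] = (X[:, 0] - X[:, 0].mean()) / X[:, 0].std()
print(X)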
If you're going to treat common numeric values as separate words, you could also map all non-common numeric values to the same word, representing 'uncommon numeric value'. Of course, this would make them indistinguishable.
Another possible approach would be to quantize the numeric values (possibly adaptively, i.e. with unequal bin widths). Then map all values within each bin to the same 'word'. A possible downside of this approach is that the quantization is performed a priori, so it's independent of the context in which a particular numeric value occurs. For example, 9.9 and 10 might mean very similar things in one context, but very different things in another. If they're mapped to the same word, the distinction would be lost.
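A minimal illustration of the quantization idea, with a priori (possibly unequal) bin edges chosen purely for this example; np.digitize maps each value to a bin index, which can then be treated as a 'word':

import numpy as np

bins = np.array([1.0, 100.0, 1000.0])          # a priori, unequal-width bin edges
values = np.array([0.5, 9.9, 10.0, 250.0])

bin_ids = np.digitize(values, bins)            # bin index for each numeric value
numeric_words = ["<NUM_BIN_%d>" % b for b in bin_ids]
print(numeric_words)   # 9.9 and 10.0 map to the same 'word', so their distinction is lost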