Solved – How to embed arbitrary integers and real numbers into the same space as words

lstm, neural networks, recurrent neural network, word embeddings

I'm trying to build a recurrent neural network (specifically a BiLSTM) where at each time step the input could be an integer, a real number, or a word from a large vocabulary. Since low-dimensional word embeddings work well for word-only RNNs, I'd like to be able to embed integers and real numbers in the same embedding space as the words.

I think it would make sense for frequent numeric values (e.g. small integers such as 1 and 10) to be treated exactly like new words in the vocabulary and given their own embeddings. But how should I handle numeric values that occur infrequently (or not at all) in my training corpus? For infrequent words I back off to a character-based RNN to get the embedding, as in Ling et al., but building an equivalent digit-based RNN seems less principled. Do I have any other options?

Best Answer

One possible approach could be to encode the $i$th data point as follows:

$$x_i = [n_i, w_{i1}, \dots, w_{im}]$$

$n_i$ is a numeric value and $w_{i1}, \dots, w_{im}$ are binary, corresponding to a one-hot encoding of a word from a vocabulary of size $m$. If the $i$th data point is a word, $n_i$ is set to zero and $w_{ij}$ is set to 1 if the word matches the $j$th element of the vocabulary (and 0 otherwise). If the $i$th data point is numeric, $n_i$ is set to its value and all $w_{ij}$ are set to 0. After encoding this way, the numeric values should probably be normalized so they are on a scale comparable to the binary word entries.
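
As a concrete illustration, here is a minimal Python sketch of this encoding (the vocabulary and tokens below are hypothetical, not part of the answer):

```python
import numpy as np

vocab = {"the": 0, "cat": 1, "sat": 2}  # toy vocabulary of size m
m = len(vocab)

def encode(token):
    """Encode a token as [n_i, w_i1, ..., w_im]."""
    x = np.zeros(1 + m)
    if isinstance(token, (int, float)):
        x[0] = token               # numeric slot n_i; all w_ij stay 0
    else:
        x[1 + vocab[token]] = 1.0  # one-hot word slot; n_i stays 0
    return x

print(encode("cat"))  # [0.   0.   1.   0.  ]
print(encode(9.99))   # [9.99 0.   0.   0.  ]
```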

This approach treats numerically similar values as similar to each other (e.g. the input representation of 9.99 is similar to that of 10, but different from that of 1000). This may or may not be appropriate, depending on your application. You could also imagine applying various transformations first (e.g. taking the log to squash large values together).
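
For example, a signed log transform (my choice of transformation, not one the answer prescribes) squashes large magnitudes while remaining defined at zero and for negative values:

```python
import numpy as np

def squash(n):
    # sign(n) * log(1 + |n|): monotonic, defined everywhere,
    # compresses large magnitudes toward each other
    return np.sign(n) * np.log1p(np.abs(n))

print(squash(10.0), squash(1000.0))  # ~2.40 vs ~6.91: far closer than the raw values
```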

If you're going to treat common numeric values as separate words, you could also map all non-common numeric values to a single word representing 'uncommon numeric value'. Of course, this would make those values indistinguishable from one another.
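
A minimal sketch of that mapping, assuming a hypothetical frequency threshold `min_count` and an illustrative `<UNK_NUM>` token:

```python
from collections import Counter

def build_numeric_vocab(corpus_numbers, min_count=100):
    # Numeric values seen at least min_count times in training
    # become their own "words"; everything else is lumped together.
    counts = Counter(corpus_numbers)
    return {n for n, c in counts.items() if c >= min_count}

def numeric_to_token(n, common_numbers):
    return str(n) if n in common_numbers else "<UNK_NUM>"

common = build_numeric_vocab([1, 1, 1, 10, 10, 10, 3.7], min_count=3)
print(numeric_to_token(10, common))   # '10'
print(numeric_to_token(3.7, common))  # '<UNK_NUM>'
```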

Another possible approach would be to quantize the numeric values (possibly adaptively, i.e. with unequal bin widths). Then map all values within each bin to the same 'word'. A possible downside of this approach is that the quantization is performed a priori, so it's independent of the context in which a particular numeric value occurs. For example, 9.9 and 10 might mean very similar things in one context, but very different things in another. If they're mapped to the same word, the distinction would be lost.
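
Here is one way such adaptive quantization might look, placing bin edges at empirical quantiles of the training values so that bins are narrow where values are dense and wide where they are sparse (all names here are illustrative):

```python
import numpy as np

def fit_bins(train_values, n_bins=8):
    qs = np.linspace(0, 1, n_bins + 1)[1:-1]  # interior quantiles
    return np.quantile(train_values, qs)      # n_bins - 1 bin edges

def quantize_to_token(n, edges):
    return f"<NUM_BIN_{np.digitize(n, edges)}>"

rng = np.random.default_rng(0)
train = np.concatenate([rng.uniform(0, 10, 900),      # dense region
                        rng.uniform(10, 1000, 100)])  # sparse tail
edges = fit_bins(train)
print(quantize_to_token(9.9, edges), quantize_to_token(10.0, edges))
# 9.9 and 10.0 typically land in the same bin, illustrating the
# context-independence caveat described above.
```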