Solved – Does the word2vec representation used for RNNs turn a word into a scalar or a vector?

deep learning, natural language, word2vec

Based on this post on quora.com and other sources, I got the impression that in the word2vec representation each word is represented by a vector with, e.g., 500 dimensions. However, when looking into the sentiment analysis code from the tutorial at deeplearning.net, I found that each sentence is captured in a single vector of only ~40-80 values. One such example is below:

x = [17, 25, 769, 83, 3, 14, 80, 62, 3221, 5, 928, 3, 1782, 6, 1, 1, 771, 24, 3350, 7, 1112, 228, 5, 3978, 4, 17, 25, 1212, 80, 6, 189, 7, 62, 1293, 5, 514, 4, 2, 131, 10, 1146, 480, 59, 413, 213, 117, 3, 14, 824, 69, 611, 2, 239, 73, 222, 72, 2338, 2, 67, 147, 4, 15, 1164, 123, 17, 10, 6, 100, 111, 23, 45, 228, 25, 4427, 8, 1131, 73, 31, 4]
y = [1]

With a sentence consisting of many words, it is clear that this vector cannot represent each word using 200 or 500 dimensions. Instead, it seems that each word is approximated by a single scalar value. How can this be?

Best Answer

Each word in a given sentence is converted into an index into the vocabulary, i.e. a dictionary of all words in your corpus. This is your variable-length x vector, and it is separate from the word2vec embeddings.
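A minimal sketch of this step, assuming a toy vocabulary (the actual index assignments depend on your corpus and tokenizer; vocab and sentence_to_indices are hypothetical names):

vocab = {"<unk>": 1, "the": 2, "movie": 3, "was": 4, "great": 5}

def sentence_to_indices(sentence, vocab):
    # Map each token to its vocabulary index; unknown words fall back to <unk>.
    return [vocab.get(word, vocab["<unk>"]) for word in sentence.lower().split()]

x = sentence_to_indices("the movie was great", vocab)
print(x)  # [2, 3, 4, 5] -- one integer per word, like the x vector in the question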

This index is also used to look up the associated fixed-length word2vec vector in a lookup table during training and prediction.
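Here is a sketch of that lookup, assuming a 500-dimensional embedding matrix with one row per vocabulary word (the random values stand in for trained word2vec weights):

import numpy as np

vocab_size, emb_dim = 10000, 500
emb_matrix = np.random.randn(vocab_size, emb_dim).astype("float32")

x = [17, 25, 769, 83]     # word indices, as in the question
embedded = emb_matrix[x]  # fancy indexing: shape (4, 500), one 500-d vector per word
print(embedded.shape)

So the scalar per word in x is just an address; the 500-dimensional representation only appears after this lookup.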

This RNN tutorial at deeplearning.net explains it in more detail, with code samples.

In Torch, this word-index lookup is handled by the LookupTable layer.
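For illustration in Python (the answer refers to Lua Torch's nn.LookupTable; the analogous PyTorch layer is nn.Embedding, shown here as a stand-in, not the original API):

import torch
import torch.nn as nn

# One row of trainable weights per vocabulary word, 500 dimensions each.
emb = nn.Embedding(num_embeddings=10000, embedding_dim=500)

x = torch.tensor([17, 25, 769, 83])  # word indices
vectors = emb(x)                     # shape: (4, 500)
print(vectors.shape)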
