Solved – the meaning of the average of all word vectors in a sentence

natural language, text mining, word2vec

Today I came across a sentiment analysis article here. There is one part that is hard to understand:

Next we have to build word vectors for the input text in order to average
the values of all word vectors in the tweet, using the following
function:

# Build a feature vector for each tweet in the training set by averaging the word vectors of its words, then scale
import numpy as np

def buildWordVector(text, size):
    # Average the word2vec vectors of all words in `text` (a tokenized tweet)
    vec = np.zeros(size).reshape((1, size))
    count = 0.
    for word in text:
        try:
            # imdb_w2v is the trained word2vec model; words missing from its vocabulary are skipped
            vec += imdb_w2v[word].reshape((1, size))
            count += 1.
        except KeyError:
            continue
    if count != 0:
        vec /= count
    return vec

Scaling our data set is part of the standardization process, in which we rescale the data to zero mean (and unit variance), meaning that values above the mean will be positive and those below the mean will be negative. Many ML models require scaled datasets to perform effectively, especially those with many features (like text classifiers).
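For context, the article then appears to build one averaged vector per tweet and scale the stacked result. A minimal sketch of that pipeline, reusing buildWordVector above and using a toy dict in place of the trained imdb_w2v model and scikit-learn's scale for the standardization step:

import numpy as np
from sklearn.preprocessing import scale

n_dim = 3  # toy dimensionality; a real model would use a few hundred

# In the article imdb_w2v is a trained word2vec model; a plain dict of
# word -> vector behaves the same way for this sketch.
imdb_w2v = {
    "good":  np.array([0.9, 0.1, 0.0]),
    "movie": np.array([0.2, 0.8, 0.1]),
    "bad":   np.array([-0.7, 0.2, 0.1]),
}

tokenized_tweets = [["good", "movie"], ["bad", "movie", "unknownword"]]

# One averaged vector per tweet, stacked into an (n_tweets, n_dim) matrix
train_vecs = np.concatenate([buildWordVector(t, n_dim) for t in tokenized_tweets])

# Standardize each feature (column) to zero mean and unit variance
train_vecs = scale(train_vecs)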

He just adds up all the word vectors in the sentence (tweet) and divides by the word count, turning them into a new vector for training.

As far as I know, a word vector consists of a hidden layer's parameters. So what does the average of those hidden-layer parameters represent, and why can it be used for training?

Best Answer

You can think of the average of the word embeddings as being a continuous space version of the traditional bag-of-words representation. Bag-of-words (BoW) represents a document with a vector the size of the vocabulary where the entries in the vector contain the count for each word. BoW treats each word independently and ignores the order of the words, but it works quite well for text classification.
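As a toy illustration (hypothetical three-word vocabulary), a BoW vector is just one count per vocabulary word:

import numpy as np

vocab = ["good", "movie", "bad"]      # toy vocabulary
doc = ["good", "movie", "good"]       # toy document, already tokenized

# One count per vocabulary word; word order is discarded
bow = np.array([doc.count(w) for w in vocab])
print(bow)  # [2 1 0]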

If you multiply the BoW vector by the word embedding matrix and divide by the total number of words in the document, then you have the average word2vec representation. This contains mostly the same information as BoW but in a lower-dimensional encoding. You can actually train a model to recover which words were used in the document from the average word2vec vector. So, you aren't losing very much information by compressing the representation like this.
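A quick numpy sketch of that equivalence, with a random matrix standing in for the word2vec embeddings:

import numpy as np

rng = np.random.default_rng(0)
vocab = ["good", "movie", "bad"]
emb = rng.normal(size=(len(vocab), 4))   # toy embedding matrix, one row per vocabulary word

doc = ["good", "movie", "good"]
bow = np.array([doc.count(w) for w in vocab])            # counts: [2, 1, 0]

avg_from_bow = bow @ emb / bow.sum()                     # BoW vector times embedding matrix, over word count
avg_direct = np.mean([emb[vocab.index(w)] for w in doc], axis=0)

print(np.allclose(avg_from_bow, avg_direct))             # True: both give the average word vector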