Word2Vec – Differences Between Two Weight Matrices in Natural Language Processing

Tags: bag-of-words, natural-language, word-embeddings, word2vec

In the Word2Vec algorithm, two weight matrices are learned:
  • W : the input-to-hidden layer matrix
  • W': the hidden-to-output layer matrix

For reference, the CBOW model architecture:
[Figure: CBOW Word2Vec model architecture]

  1. Why is W chosen to represent the word vectors, and not W'?
    They both seem to encode the same information.

  2. What is the interpretation of the W' matrix, in the same way that W is interpreted as the word embeddings?

Best Answer

Both matrices capture word semantics. W is not the only option: W' is also sometimes used as the word vectors, and in some cases the average (W + W')/2 has been used and has given better results on the task in question.
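For instance, here is a minimal sketch of how the two matrices might be extracted from a trained model, assuming the gensim library (4.x) trained with negative sampling; the toy corpus and hyperparameters are made up for illustration. In gensim, the input vectors are stored in model.wv.vectors and the negative-sampling output weights in model.syn1neg.

```python
# Minimal sketch, assuming gensim 4.x with negative sampling.
# Corpus and hyperparameters are illustrative only.
from gensim.models import Word2Vec

toy_corpus = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
]

model = Word2Vec(toy_corpus, vector_size=50, window=2,
                 min_count=1, sg=0, negative=5, epochs=50)

W = model.wv.vectors        # input word vectors, shape (V, N)
W_prime = model.syn1neg     # output word vectors (negative-sampling weights), shape (V, N)

idx = model.wv.key_to_index["cat"]
input_vec = W[idx]
output_vec = W_prime[idx]
avg_vec = (input_vec + output_vec) / 2   # the (W + W')/2 variant mentioned above
```

Whether the averaged vectors actually help is task-dependent, as noted above.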

Another thing to notice is that no activation function is applied after the hidden layer, so the transformation from input to output is simply W[i] * W'^T for any activated input word i. Each word vector is therefore trained to predict the words that most often occur in its vicinity (the context window).

You can think of the two linear transformations as follows (a short numerical sketch appears after the list):

  • A semantics encoder from the n-hot input vector: word list to semantics
  • A semantics decoder that outputs a probability vector: semantics to a probability distribution over words
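To make the encoder/decoder picture (and the W[i] * W'^T expression above) concrete, here is a minimal NumPy sketch of a CBOW-style forward pass with random, untrained weights; the vocabulary size, embedding dimension, and word indices are made up for illustration. It shows that the only operations between the n-hot input and the output probabilities are the two linear maps and a softmax.

```python
import numpy as np

V, N = 10, 4                        # vocabulary size, embedding dimension (illustrative)
rng = np.random.default_rng(0)
W = rng.normal(size=(V, N))         # input word vectors (one row per word)
W_prime = rng.normal(size=(V, N))   # output word vectors (one row per word)

context_ids = [2, 5, 7]             # indices of the context words (the "n-hot" input)

# Encoder: average the input vectors of the context words -> hidden "semantics" vector
h = W[context_ids].mean(axis=0)     # shape (N,), no nonlinearity applied

# Decoder: project onto every output word vector, then softmax
scores = h @ W_prime.T              # shape (V,), i.e. the W[i] * W'^T scores
probs = np.exp(scores - scores.max())
probs /= probs.sum()                # probability distribution over the vocabulary

print(probs.argmax())               # index of the word the model would predict as the center word
```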

Formally, vectors in W and W' are called input and output word vector representations, respectively.