Solved – Difference between non-contextual and contextual word embeddings

natural-language, word-embeddings

It is often stated that word2vec and GloVe produce non-contextual embeddings, while LSTM- and Transformer-based models (e.g. BERT) produce contextual embeddings. The way I understand it, however, all word embeddings are fundamentally non-contextual but can be made contextual by incorporating hidden layers:

  1. The word2vec model is trained to learn embeddings that predict either the probability of a surrounding word occurring given a center word (SkipGram) or vice versa (CBoW). The surrounding words are also called context words because they appear in the context of the center word.

  2. The GloVe model is trained such that the dot product of a pair of embeddings reflects the words' co-occurrence probability, i.e., the fraction of times that a given word $y$ occurs within some context window of word $x$ (both training objectives are written out after this list).

  3. If embeddings are trained from scratch in an encoder / decoder framework involving RNNs (or their variants), then, at the input layer, the embedding that you look up for a given word reflects nothing about the context of the word in that particular sentence.

  4. The same as in (3) applies to Transformer-based architectures.
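
For reference, here are the training objectives behind points (1) and (2) as I understand them (notation is mine, following the original papers): skip-gram models the probability of a surrounding word $o$ given a center word $c$ as
$$p(o \mid c) = \frac{\exp(u_o^\top v_c)}{\sum_{w \in V} \exp(u_w^\top v_c)},$$
while GloVe minimizes
$$J = \sum_{x,y} f(X_{xy}) \left( w_x^\top \tilde{w}_y + b_x + \tilde{b}_y - \log X_{xy} \right)^2,$$
where $X_{xy}$ counts how often word $y$ appears in the context window of word $x$ and $f$ down-weights rare co-occurrences. In both cases the learned vectors are attached to word types, not to occurrences in a particular sentence.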

word2vec and GloVe embeddings can be plugged into any type of neural language model, and contextual embeddings can be derived from them by incorporating hidden layers. These layers extract the meaning of a given word, accounting for the words it is surrounded by in that particular sentence. Similarly, while hidden layers of an LSTM encoder or Transformer do extract information about surrounding words to represent a given word, the embeddings at the input layer do not.
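
To make this concrete, below is a minimal sketch of what I mean (PyTorch assumed; the tiny vocabulary and random "pre-trained" vectors are placeholders for real word2vec/GloVe vectors). The lookup for "bank" is identical in both sentences, while the LSTM hidden state for "bank" depends on its neighbours:

```python
import torch
import torch.nn as nn

vocab = {"the": 0, "bank": 1, "river": 2, "robbed": 3}
# Stand-in for pre-trained word2vec/GloVe vectors (random here, dimension 8).
pretrained = torch.randn(len(vocab), 8)

embedding = nn.Embedding.from_pretrained(pretrained, freeze=True)  # non-contextual lookup table
encoder = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)  # hidden layer that adds context

sent_a = torch.tensor([[vocab["the"], vocab["river"], vocab["bank"]]])
sent_b = torch.tensor([[vocab["the"], vocab["bank"], vocab["robbed"]]])

emb_a, emb_b = embedding(sent_a), embedding(sent_b)
ctx_a, _ = encoder(emb_a)
ctx_b, _ = encoder(emb_b)

# The input-layer vector for "bank" is the same in both sentences ...
print(torch.equal(emb_a[0, 2], emb_b[0, 1]))      # True
# ... but the hidden state for "bank" differs, because it reflects the surrounding words.
print(torch.allclose(ctx_a[0, 2], ctx_b[0, 1]))   # False
```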

Is my understanding of the difference between non-contextual and contextual embeddings correct?

Best Answer

Your understanding is correct. Word embeddings, i.e., the vectors you retrieve from a lookup table, are always non-contextual, no matter which model they are used in. (It is slightly different in ELMo, which uses a character-based network to get a word embedding, but that network also does not consider any context.)

However, when people say contextual embeddings, they don't mean the vectors from the lookup table; they mean the hidden states of the pre-trained model. As you said, these states are contextualized, but it is kind of confusing to call them word embeddings.
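
To see the distinction in a pre-trained Transformer, here is a minimal sketch (assuming the HuggingFace transformers library and the bert-base-uncased checkpoint; the example sentences are mine): the row of the lookup table used for "bank" is identical in both sentences, but the hidden states are not.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

enc_a = tokenizer("he sat on the river bank", return_tensors="pt")
enc_b = tokenizer("he robbed the bank downtown", return_tensors="pt")

# Position of the "bank" token in each tokenized sentence.
bank_id = tokenizer.convert_tokens_to_ids("bank")
pos_a = enc_a.input_ids[0].tolist().index(bank_id)
pos_b = enc_b.input_ids[0].tolist().index(bank_id)

lookup = model.get_input_embeddings()            # the non-contextual embedding table
emb_a = lookup(enc_a.input_ids)[0, pos_a]
emb_b = lookup(enc_b.input_ids)[0, pos_b]

with torch.no_grad():
    hid_a = model(**enc_a).last_hidden_state[0, pos_a]
    hid_b = model(**enc_b).last_hidden_state[0, pos_b]

print(torch.equal(emb_a, emb_b))       # True: same row of the lookup table
print(torch.allclose(hid_a, hid_b))    # False: the contextualized states differ
```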
