Transformers – Understanding BERT MLM with 80% [MASK], 10% Random Words, and 10% Same Word

machine learning, natural language, transformers

I have noticed that, in the MLM training procedure of the original BERT paper, the authors decide to mask 15% of the words in a sentence.

The masked words are distributed as follows (see the code sketch after this list):

  1. 80% are replaced with the [MASK] token (which makes perfect sense: it simply teaches the model to predict a word given its left and right context)
  2. 10% are replaced by some random word. This makes some sense to me (https://stackoverflow.com/questions/64013808/why-bert-model-have-to-keep-10-mask-token-unchanged). My understanding is that this way the model learns to be influenced by the word it is trying to predict. That is, it does not consider only the left and right parts of the sentence, but also the word itself. So replacing with a random word would teach the model to actually consider the word at that position, and since the percentage is very small (10% of 15%, i.e. 1.5% of all tokens), it would not confuse the model much, so it might be beneficial.
  3. 10% of the words are left unchanged. This I don't understand at all. For example, I don't see the difference between {90% masked with [MASK], 10% masked with a random word} and {80% [MASK], 10% random, 10% same word}. The authors state: "The purpose of this is to bias the representation towards the actual observed word." Isn't that exactly the purpose of the random-word replacement? The only interpretation that makes sense to me is that the random-word replacement teaches the model to consider the word at the position itself, and the same-word case counters the effect of the random word so that the model doesn't get confused.
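
For concreteness, here is a minimal Python sketch of the selection procedure described above. It is not the paper's actual code: it works on plain word strings with a made-up vocabulary, whereas BERT applies the same 80/10/10 rule to WordPiece token ids from its tokenizer.

```python
import random

# Toy vocabulary for the 10% random-replacement case; real BERT samples a
# random token id from its full WordPiece vocabulary.
VOCAB = ["the", "cat", "sat", "on", "mat", "dog", "ran", "fast"]
MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=None):
    """Return (inputs, labels). labels[i] is None except at the ~15% of
    positions selected for prediction; those always keep the observed word
    as the target, while the input follows the 80/10/10 rule."""
    rng = random.Random(seed)
    inputs, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue                       # not selected: never predicted
        labels[i] = tok                    # selected: target is the observed word
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = MASK               # 80%: replace with [MASK]
        elif roll < 0.9:
            inputs[i] = rng.choice(VOCAB)  # 10%: replace with a random word
        # else (remaining 10%): keep the observed word unchanged
    return inputs, labels

print(mask_tokens("the cat sat on the mat".split(), seed=0))
```

Note that the loss would be computed only at positions where labels[i] is not None; the other ~85% of tokens are never predicted.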

Best Answer

The answer to your question is in §3.1 of the paper.

First, bear in mind that only the selected ("masked") tokens, about 15% of them, are predicted during training, not all tokens. With that in mind, I would explain it in the reverse order from the paper, since that ordering shows the value of predicting the observed word; a toy example after the list makes the three cases concrete.

  1. It’s a normal, ordinary thing done for decades to predict the word in a position given its context. They do this, too, so that the representations of a word and its context are encouraged to be similar. BERT isn’t autoregressive, so it winds up seeing the word already (sort of like autoencoding), but the value is in relating the word to its context.
  2. For robustness, they also predict the right word when provided the wrong word: a randomly chosen one. This forces the model to lean on context more than on the word itself.
  3. Finally, further increasing the value of context, we train when no word at all is provided, learning to fill the slot only from the context.
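
Here is an illustrative toy example of the three cases (the sentence, the selected position, and the random word are invented, not from the paper). The target is always the observed word, and the loss is taken only at the selected position, so when the model sees a real word there it cannot tell whether it is the original or a random replacement; it has to weigh both the word and its context.

```python
# Three training views of the same sentence, assuming position 2 ("sat")
# was selected for prediction. The target never changes; only the input
# token at the selected position does.
observed = ["the", "cat", "sat", "on", "the", "mat"]

cases = {
    "80% [MASK]":      ["the", "cat", "[MASK]", "on", "the", "mat"],
    "10% random word": ["the", "cat", "dog",    "on", "the", "mat"],
    "10% unchanged":   ["the", "cat", "sat",    "on", "the", "mat"],
}

for name, inputs in cases.items():
    # The loss is computed only at position 2; the other positions are
    # ignored, so the model gets no signal about which case it is in.
    print(f"{name:>16}: input={inputs[2]!r:9} target={observed[2]!r}")
```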