Neural Networks – Why Does BERT Keep Some Masked Tokens Unchanged?

natural language, neural networks

As I understand it, BERT handles the tokens selected for masking as follows:

  1. Replace some with [MASK]; this is required by the masked language modelling (MLM) objective.
  2. Replace some with a random token; this forces the model to produce proper contextual embeddings for every token in the sequence, not only the [MASK] ones, which is consistent with the goal of fine-tuning.

But I don't understand why BERT keeps some of the selected tokens unchanged. Could anyone help me understand this?

Best Answer

We can find the answer in the paper:

• 80% of the time: Replace the word with the [MASK] token, e.g., my dog is hairy → my dog is [MASK]
• 10% of the time: Replace the word with a random word, e.g., my dog is hairy → my dog is apple
• 10% of the time: Keep the word unchanged, e.g., my dog is hairy → my dog is hairy. The purpose of this is to bias the representation towards the actual observed word.
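
A minimal sketch of this 80%/10%/10% rule may make the procedure concrete. The 15% selection rate, the split, and the [MASK] token come from the paper; the function and variable names below are just illustrative, and a real implementation would work on token IDs rather than strings.

```python
import random

MASK_TOKEN = "[MASK]"

def mask_tokens(tokens, vocab, select_prob=0.15):
    """Return the corrupted input and the positions the model must predict."""
    corrupted, targets = list(tokens), {}
    for i, token in enumerate(tokens):
        if random.random() >= select_prob:      # ~85% of tokens are not selected at all
            continue
        targets[i] = token                      # the model must predict the original word
        roll = random.random()
        if roll < 0.8:                          # 80%: replace with [MASK]
            corrupted[i] = MASK_TOKEN
        elif roll < 0.9:                        # 10%: replace with a random word
            corrupted[i] = random.choice(vocab)
        # remaining 10%: keep the observed word unchanged
    return corrupted, targets

# "my dog is hairy" may become "my dog is [MASK]", "my dog is apple",
# or stay "my dog is hairy" -- in every case "hairy" is the prediction target.
```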

The purpose of BERT is to learn a representation for each token, conditioned on the rest of the tokens and on the token itself. The representation therefore depends not only on the surrounding tokens but also on the token at that position, and part of the training task is teaching the model when, and how, to rely on the input embedding at that position (the numerical feature for the token fed into the model). Without the unchanged tokens, the model could simply learn to ignore the embedding at the masked position (the position embedding, and the embedding itself, still let it locate that position), because during training that embedding would carry no information: it would always be either [MASK] or a word drawn at random from a very large vocabulary. The model would then behave the same way at inference time, when we actually use it.
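
This mismatch is easy to see when BERT is used downstream. Below is a small illustration, assuming the Hugging Face `transformers` package and the `bert-base-uncased` checkpoint: at fine-tuning and inference time the input contains only real words and no [MASK], so the encoder must have learned to use the observed token at each position rather than ignore it.

```python
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("my dog is hairy", return_tensors="pt")   # no [MASK] anywhere
outputs = model(**inputs)

# One contextual vector per input token; the vector at the position of "hairy"
# depends both on the surrounding words and on "hairy" itself, which is exactly
# what the keep-unchanged case trains the model to do.
print(outputs.last_hidden_state.shape)   # (1, sequence_length, 768)
```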

And we can continue reading:

The advantage of this procedure is that the Transformer encoder does not know which words it will be asked to predict or which have been replaced by random words, so it is forced to keep a distributional contextual representation of every input token. Additionally, because random replacement only occurs for 1.5% of all tokens (i.e., 10% of 15%), this does not seem to harm the model’s language understanding capability. In Section C.2, we evaluate the impact of this procedure.
