As I understand it, of all the tokens selected for masking in BERT:
- Some are replaced with [MASK]; this is what the masked language modeling (MLM) objective requires.
- Some are replaced with a random token; this forces the model to produce proper contextual embeddings for every token in the sequence, not only the [MASK] ones, which is consistent with the goal of fine-tuning.
But I don't understand why BERT keeps some of the selected tokens unchanged (the third branch in the sketch below). Could anyone help me understand this?
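For reference, here is my mental model of the selection procedure as a minimal sketch (the function `mask_for_mlm`, the `vocab` argument, and the toy usage are my own illustration, not BERT's actual code; the 15% selection rate and the 80/10/10 split are from the paper):

```python
import random

MASK_TOKEN = "[MASK]"

def mask_for_mlm(tokens, vocab, select_prob=0.15):
    """BERT-style MLM corruption: each token is selected with
    probability select_prob; of the selected tokens, 80% become
    [MASK], 10% become a random vocabulary token, and 10% are
    left unchanged. Returns the corrupted sequence and a map of
    position -> original token that the model must predict."""
    corrupted = list(tokens)
    targets = {}
    for i, token in enumerate(tokens):
        if random.random() >= select_prob:
            continue                     # not selected: no prediction loss here
        targets[i] = token               # model must recover the original token
        r = random.random()
        if r < 0.8:
            corrupted[i] = MASK_TOKEN            # 80%: replace with [MASK]
        elif r < 0.9:
            corrupted[i] = random.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token unchanged
    return corrupted, targets

# Toy usage:
corrupted, targets = mask_for_mlm("the cat sat on the mat".split(),
                                  vocab=["dog", "ran", "blue", "mat"])
```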
Best Answer
We can find the answer in the paper:
The purpose of BERT is to learn a representation for each token, conditioned on the rest of the tokens and on the token itself, so one training problem is how the model learns when and how to rely on the input embedding (the numerical features of the token fed into the model) at each position. Without the unchanged tokens, the model would learn during training that the input at a selected position carries no information, since it would always be either [MASK] or a token chosen at random from a very large vocabulary, and it would simply ignore the embedding at the corresponding position (note that the position embedding, and the embedding itself, make it possible for the model to single these positions out). It would then do the same at inference time, when we actually use the model.
And we can continue reading: