Transformers – Understanding BERT MLM with 80% [MASK], 10% Random Words, and 10% Same Word

machine learning, natural language, transformers

I have noticed that, in the MLM training procedure of the original BERT paper, the authors decide to mask 15% of the words in a sentence.

The masked words are distributed as follows (see the code sketch after this list):

  1. 80% are replaced with the [MASK] token (which makes perfect sense: it simply teaches the model to predict a word given its left and right context)
  2. 10% are replaced by some random word. This makes some sense to me (https://stackoverflow.com/questions/64013808/why-bert-model-have-to-keep-10-mask-token-unchanged). My understanding is that this way the model learns to be influenced by the word it is trying to predict. That is, it does not consider only the left and right parts of the sentence, but also the word itself. So replacing with a random word would teach the model to actually consider the word at that position, and since the percentage is very small (10% of 15%, i.e. 1.5% of all tokens), it would not confuse the model much, so it might be beneficial.
  3. 10% of the words are left unchanged. This I don't understand at all. For example, I don't see the difference between {90% masked with [MASK], 10% masked with a random word} and {80% [MASK], 10% random, 10% same word}. The authors state: "The purpose of this is to bias the representation towards the actual observed word." Isn't that exactly the purpose of the random-word replacement? The only interpretation that makes sense to me is that the random-word replacement teaches the model to consider the word at the position itself, and the same-word case counters the effect of the random word so that the model doesn't get confused.
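
For concreteness, here is a minimal Python sketch of the selection procedure described above. It is not the paper's actual code: it works on plain word strings with a made-up vocabulary, whereas BERT applies the same 80/10/10 rule to WordPiece token ids from its tokenizer.

```python
import random

# Toy vocabulary for the 10% random-replacement case; real BERT samples a
# random token id from its full WordPiece vocabulary.
VOCAB = ["the", "cat", "sat", "on", "mat", "dog", "ran", "fast"]
MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=None):
    """Return (inputs, labels). labels[i] is None except at the ~15% of
    positions selected for prediction; those always keep the observed word
    as the target, while the input follows the 80/10/10 rule."""
    rng = random.Random(seed)
    inputs, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue                       # not selected: never predicted
        labels[i] = tok                    # selected: target is the observed word
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = MASK               # 80%: replace with [MASK]
        elif roll < 0.9:
            inputs[i] = rng.choice(VOCAB)  # 10%: replace with a random word
        # else (remaining 10%): keep the observed word unchanged
    return inputs, labels

print(mask_tokens("the cat sat on the mat".split(), seed=0))
```

Note that the loss would be computed only at positions where labels[i] is not None; the other ~85% of tokens are never predicted.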

Best Answer

The answer to your question is in §3.1 of the paper.

First, bear in mind that only the selected ("masked") tokens, about 15% of them, are predicted during training, not all tokens. With that in mind, I would explain it in the reverse order from the paper, since that ordering shows the value of predicting the observed word; a toy example after the list makes the three cases concrete.

  1. It’s a normal, ordinary thing done for decades to predict the word in a position given its context. They do this, too, so that the representations of a word and its context are encouraged to be similar. BERT isn’t autoregressive, so it winds up seeing the word already (sort of like autoencoding), but the value is in relating the word to its context.
  2. For robustness, they also predict the right word when provided the wrong word: a randomly chosen one. This forces the model to lean on context more than on the word itself.
  3. Finally, further increasing the value of context, we train when no word at all is provided, learning to fill the slot only from the context.
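
Here is an illustrative toy example of the three cases (the sentence, the selected position, and the random word are invented, not from the paper). The target is always the observed word, and the loss is taken only at the selected position, so when the model sees a real word there it cannot tell whether it is the original or a random replacement; it has to weigh both the word and its context.

```python
# Three training views of the same sentence, assuming position 2 ("sat")
# was selected for prediction. The target never changes; only the input
# token at the selected position does.
observed = ["the", "cat", "sat", "on", "the", "mat"]

cases = {
    "80% [MASK]":      ["the", "cat", "[MASK]", "on", "the", "mat"],
    "10% random word": ["the", "cat", "dog",    "on", "the", "mat"],
    "10% unchanged":   ["the", "cat", "sat",    "on", "the", "mat"],
}

for name, inputs in cases.items():
    # The loss is computed only at position 2; the other positions are
    # ignored, so the model gets no signal about which case it is in.
    print(f"{name:>16}: input={inputs[2]!r:9} target={observed[2]!r}")
```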