Neural Networks – Why Use Masking for Padding in the Transformer’s Encoder?

natural-language, neural-networks

I'm currently trying to implement a PyTorch version of the Transformer and had a question.

I've noticed that many implementations apply a mask not just to the decoder but also to the encoder. The official TensorFlow tutorial for the Transformer also states that the Transformer uses something called "MultiHead Attention (with padding masking)."

I'm just confused: why are masks applied to the padding in the encoder sequence?

Best Answer

I hadn't realized this question was unanswered. To answer my own question: we apply masks to the source data because the padding positions still produce attention scores and output values as the data passes through the encoder sublayers. We don't need or want the model to attend to these padding tokens, so we mask them out.
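Here's a minimal PyTorch sketch of what such a padding mask can look like; the `pad_id`, shapes, and dummy score tensor are illustrative assumptions rather than part of any particular implementation.

```python
import torch

# Hypothetical toy batch: two sequences padded to length 5 with pad_id = 0.
pad_id = 0
src = torch.tensor([
    [5, 7, 9, 0, 0],   # real length 3
    [4, 2, 8, 6, 0],   # real length 4
])

# Padding mask: True where the token is padding.
# Shape (batch, 1, 1, seq_len) so it broadcasts over heads and query positions.
pad_mask = (src == pad_id).unsqueeze(1).unsqueeze(2)

# Inside scaled dot-product attention, the mask is applied to the score matrix
# before the softmax, so padded key positions receive ~zero attention weight.
scores = torch.randn(2, 1, 5, 5)                      # (batch, heads, query, key) dummy scores
scores = scores.masked_fill(pad_mask, float("-inf"))
weights = torch.softmax(scores, dim=-1)
print(weights[0, 0])                                   # columns for the padded positions are 0
```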

It's slightly different from masking in the decoder, in the sense that the decoder takes the additional step of applying a "no peeking" (causal) mask so that the model can't look at future tokens.
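For contrast, here is a sketch of how the decoder's self-attention mask might combine the padding mask with that causal "no peeking" mask; again, the tensors and `pad_id` are made up purely for illustration.

```python
import torch

seq_len = 5
pad_id = 0
tgt = torch.tensor([[3, 1, 4, 0, 0]])                 # hypothetical decoder input, padded

# Padding mask: True where the token is padding, shape (1, 1, 1, 5).
tgt_pad_mask = (tgt == pad_id).unsqueeze(1).unsqueeze(2)

# "No peeking" (causal) mask: True above the diagonal, i.e. at future positions.
causal_mask = torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()

# The decoder's self-attention mask is the union of the two,
# blocking both padding tokens and future tokens.
dec_mask = tgt_pad_mask | causal_mask                 # broadcasts to (1, 1, 5, 5)
print(dec_mask[0, 0].int())
```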
