Solved – Masking in the "Attention Is All You Need" paper

attention, natural language, neural networks, recurrent neural network, tensorflow

I am a newbie to NLP and, specifically, to the "Attention Is All You Need" paper. I can understand the encoder part of the paper.

However, I am baffled by the decoder part. In the figure below, where does the decoder get the output embeddings from? Why does it need positional encoding? What is masked multi-head attention? Why isn't the output of the encoder connected to the input of the decoder, but instead to the multi-head attention block? Two arrows represent the output of the encoder; what do they describe?

I need a layman's understanding of the above questions.

[Figure: the Transformer encoder-decoder architecture diagram from the paper]

Best Answer

In the figure below, where does the decoder get the output embeddings from?

From the embedding layer (which is shared with the input embeddings). During training, the target sequence is shifted right by one position and looked up in the same embedding table as the source tokens.
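For concreteness, here is a minimal NumPy sketch of that idea; the names (`vocab_size`, `bos_id`) and the toy token ids are made up for illustration, but the point is that one embedding table is reused for both the source tokens and the shifted-right target tokens:

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size, d_model = 10, 4                          # toy sizes
embedding = rng.normal(size=(vocab_size, d_model))   # one table, reused on both sides

src_tokens = np.array([5, 2, 7])                     # encoder input ids
tgt_tokens = np.array([1, 3, 8])                     # target ids the decoder must predict

bos_id = 0                                           # assumed "start of sequence" id
decoder_input = np.concatenate(([bos_id], tgt_tokens[:-1]))  # targets shifted right by one

src_embed = embedding[src_tokens]        # "input embedding"
tgt_embed = embedding[decoder_input]     # "output embedding" -- same table
print(src_embed.shape, tgt_embed.shape)  # (3, 4) (3, 4)
```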

Why does it need positional encoding?

Otherwise the attention mechanism wouldn't be able to distinguish the sentence "The dog ate the apple" from "The apple ate the dog".
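As a concrete illustration, here is a small NumPy sketch of the sinusoidal positional encoding described in the paper. The function name and toy sizes are my own; the sin/cos construction follows the paper's formula:

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Each position gets a unique pattern of sines and cosines that the
    attention layers can use to tell word order apart."""
    positions = np.arange(seq_len)[:, None]                  # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]                 # even embedding dimensions
    angles = positions / np.power(10000.0, dims / d_model)   # (seq_len, d_model/2)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_positional_encoding(seq_len=5, d_model=8)
print(pe.shape)  # (5, 8) -- added element-wise to the token embeddings
```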

What is Masked Multi-head attention?

An autoregressive density model's job is to learn $P(x_i \mid x_{j<i}; \theta)$. For efficiency, though, we'd like to feed the entire sequence $x_{1\ldots n}$ into the model at once and implement things so that, when computing $P(x_i \mid x_{j<i}; \theta)$, the model doesn't look at any $x_j$ with $j \geq i$. That's what the masking does.
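Here is a toy NumPy sketch of causal masking in single-head self-attention (no learned projection matrices, so queries, keys, and values are all just `x`; all names are illustrative). Entries above the diagonal of the score matrix are set to a large negative number before the softmax, so each position can only attend to itself and earlier positions:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def masked_self_attention(x):
    """Single-head self-attention with a causal mask: position i may only
    attend to positions j <= i."""
    seq_len, d = x.shape
    scores = x @ x.T / np.sqrt(d)                        # (seq_len, seq_len)
    causal_mask = np.triu(np.ones((seq_len, seq_len)), k=1).astype(bool)
    scores = np.where(causal_mask, -1e9, scores)         # block "future" positions
    weights = softmax(scores, axis=-1)                   # rows sum to 1
    return weights @ x, weights

x = np.random.default_rng(0).normal(size=(4, 8))
out, w = masked_self_attention(x)
print(np.round(w, 2))  # upper triangle is (numerically) zero
```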

Why isn't the output of the encoder connected to the input of the decoder, but instead to the multi-head attention block?

That's also a valid "input" to the decoder. It's not drawn at the bottom of the decoder block, but that doesn't make it any less of an input.

Two arrows represent the output of the encoder; what do they describe?

Attention is a function that takes three arguments: values, keys, and queries. The two arrows just show that the same encoder output is passed as two of those arguments, the keys and the values.
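A minimal sketch of that encoder-decoder ("cross") attention call, assuming toy shapes and names: the encoder output is passed as both the keys and the values, while the queries come from the decoder side:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(queries, keys, values):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = queries.shape[-1]
    weights = softmax(queries @ keys.T / np.sqrt(d), axis=-1)
    return weights @ values

rng = np.random.default_rng(0)
encoder_output = rng.normal(size=(6, 8))   # 6 source positions, d_model = 8
decoder_state  = rng.normal(size=(3, 8))   # 3 target positions so far

# The "two arrows" from the encoder: the same tensor serves as keys and values;
# the queries come from the decoder's previous sub-layer.
context = attention(queries=decoder_state, keys=encoder_output, values=encoder_output)
print(context.shape)  # (3, 8)
```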
