Solved – Understanding the output layer of the transformer

deep learning, natural language, neural networks

I'm trying to understand the transformer from the paper "Attention Is All You Need". I'm puzzled by the final linear -> softmax block on the decoder output, and I wasn't able to find more information about it.

From the original paper, in section 3.4 Embeddings and Softmax, the authors state that:

Similarly to other sequence transduction models, we use learned embeddings to convert the input tokens and output tokens to vectors of dimension $d_{model}$. We also use the usual learned linear transformation and softmax function to convert the decoder output to predicted next-token probabilities. In our model, we share the same weight matrix between the two embedding layers and the pre-softmax linear transformation, similar to [30].

So it seems that the output of the decoder is a word vector, for example of shape $(d_{model}, 1)$. Here, I think $d_{model}$ is the dimension of the dense projected vector, but I might be wrong. That vector is then multiplied by a matrix of shape $(N_w, d_{model})$ to produce a large $(N_w, 1)$ vector, which is sent to the softmax. Here $N_w$ is the vocabulary size.
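To make the shapes concrete, here is a minimal PyTorch sketch of what I believe this step does for a single position (the sizes $d_{model} = 512$ and $N_w = 10000$ are just placeholders):

```python
import torch

d_model, N_w = 512, 10000

decoder_out = torch.randn(d_model)     # decoder output for one position, shape (d_model,)
W = torch.randn(N_w, d_model)          # pre-softmax linear weights, shape (N_w, d_model)

logits = W @ decoder_out               # shape (N_w,)
probs = torch.softmax(logits, dim=-1)  # probability distribution over the vocabulary
print(probs.shape)                     # torch.Size([10000]); entries sum to 1
```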

So how did they derive this vector from the decoder? My understanding, probably wrong, is that each multi-head attention layer generates a $(maxlen, d_{model})$ matrix, where $maxlen$ is the maximum length of all input sentences, padded with zeros. So the output of the decoder would be a matrix of size $(maxlen_{source} + maxlen_{target}, d_{model})$ instead of a single word vector.

Thank you!

Best Answer

The decoder behaves differently in the training and inference stages. In the training stage, all the inputs to the decoder are known, so the decoder can generate all the outputs in a single forward pass. The shape of the decoder's output is then $(maxlen_{target}, d_{model})$. Multiplying it by the pre-softmax linear layer, whose weight matrix has shape $(N_{w}, d_{model})$, gives the predicted distribution over the output vocabulary for every target position. The equation is as follows:

$$P_{(N_{w},maxlen_{target})}=W_{(N_{w}, d_{model})}X_{(maxlen_{target}, d_{model})}^{T}$$
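For concreteness, here is a minimal PyTorch sketch of this equation during training (the sizes are arbitrary placeholders, not values from the paper):

```python
import torch

d_model, N_w, maxlen_target = 512, 10000, 20

X = torch.randn(maxlen_target, d_model)  # decoder output for the whole target sequence
W = torch.randn(N_w, d_model)            # pre-softmax linear weights

P = W @ X.T                              # shape (N_w, maxlen_target), as in the equation above
probs = torch.softmax(P, dim=0)          # softmax over the vocabulary axis for each position
print(P.shape, probs.sum(dim=0))         # each column sums to 1
```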

As described in [1], the pre-softmax linear layer can also be treated as a word embedding, whose parameters can be shared with the input embedding layer.
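As a sketch of that weight sharing in PyTorch (again with placeholder sizes), the pre-softmax linear layer can simply point at the embedding matrix, since both have shape $(N_w, d_{model})$:

```python
import torch
import torch.nn as nn

d_model, N_w = 512, 10000

embedding = nn.Embedding(N_w, d_model)      # token embedding, weight shape (N_w, d_model)
proj = nn.Linear(d_model, N_w, bias=False)  # pre-softmax linear, weight also (N_w, d_model)

proj.weight = embedding.weight              # tie the two: one shared parameter matrix

tokens = torch.tensor([3, 17, 42])          # hypothetical token ids
logits = proj(embedding(tokens))            # shape (3, N_w): logits over the vocabulary
```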

  1. Press, Ofir, and Lior Wolf. "Using the Output Embedding to Improve Language Models." Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers. 2017.