I am trying to understand the transformer model from "Attention Is All You Need", following The Annotated Transformer.
The architecture looks like this: [encoder-decoder diagram from the paper]
Everything is essentially clear, save for the output embedding on the bottom right. During training, I understand that one can use the actual target as input; all one needs to do is
- shift the target by one position to the right
- use a mask to prevent using, say, the $(n+k)$-th word of the output to predict the $n$-th one (see the sketch after this list)
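
For concreteness, here is a minimal PyTorch sketch of both steps. It assumes `tgt` is a batch of target token ids that already includes start and end tokens; the `subsequent_mask` helper follows the pattern used in The Annotated Transformer, but the exact names are illustrative:

```python
import torch

def subsequent_mask(size):
    # Causal mask of shape (1, size, size): entry (i, j) is True iff j <= i,
    # so position i may attend only to positions up to and including itself.
    upper = torch.triu(torch.ones(1, size, size), diagonal=1)
    return upper == 0

# Hypothetical batch of target token ids: <s> w1 w2 w3 </s>
tgt = torch.tensor([[1, 5, 9, 4, 2]])

decoder_input = tgt[:, :-1]     # "shifted right" input:       <s> w1 w2 w3
decoder_target = tgt[:, 1:]     # what the model must predict:  w1 w2 w3 </s>
tgt_mask = subsequent_mask(decoder_input.size(1))  # (1, 4, 4) causal mask
```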
What is not clear to me is how to use the model at inference time. When doing inference, one of course does not yet have the output, so what goes there?
Best Answer
At inference time the decoder is used autoregressively: you start from a start-of-sequence token, predict the next token, append it to the "output" fed to the decoder, and repeat until an end-of-sequence token is produced (or a length limit is reached). A popular method for such sequence generation tasks is beam search, which keeps the K best partial sequences generated so far as candidate output sequences.
In the original paper, different beam sizes were used for different tasks. With a beam size of K = 1, this reduces to the greedy method described in the blog post you mentioned.
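
For illustration, here is a minimal greedy-decoding sketch (i.e. beam size K = 1). It assumes a trained model exposing `encode`, `decode`, and `generator` in the style of The Annotated Transformer, plus the `subsequent_mask` helper sketched above; the names are assumptions for the sake of the example, not a fixed API:

```python
import torch

def greedy_decode(model, src, src_mask, max_len, start_symbol):
    # Encode the source sentence once; the result is reused at every step.
    memory = model.encode(src, src_mask)
    # The "output" fed to the decoder starts as just the start-of-sequence token.
    ys = torch.full((1, 1), start_symbol, dtype=torch.long)
    for _ in range(max_len - 1):
        # Decode conditioned on everything generated so far (with a causal mask).
        out = model.decode(memory, src_mask, ys, subsequent_mask(ys.size(1)))
        # Distribution over the vocabulary for the last position only.
        prob = model.generator(out[:, -1])
        next_word = prob.argmax(dim=1)
        # Append the prediction and feed it back in on the next iteration.
        ys = torch.cat([ys, next_word.unsqueeze(0)], dim=1)
        # (A real implementation would also stop early on an end-of-sequence token.)
    return ys
```

Beam search generalizes this loop by keeping the K highest-scoring partial sequences at each step instead of only the single best continuation.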