Inference – How to Use the Transformer for Inference?

attention, natural language, neural networks

I am trying to understand the Transformer model from "Attention Is All You Need", following The Annotated Transformer.

The architecture looks like this:

[figure: the Transformer encoder-decoder architecture from the paper]

Everything is essentially clear, save for the output embedding at the bottom right. During training, I understand that one can use the actual target as the decoder input; all one needs to do is

  • shift the target one position to the right
  • use a mask to prevent, say, the $(n+k)$-th word of the output from being used when predicting the $n$-th one (a minimal sketch of such a mask follows below)
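
For concreteness, here is a minimal sketch of such a mask in PyTorch; the name `subsequent_mask` follows The Annotated Transformer, but the details of your implementation may differ:

```python
import torch

def subsequent_mask(size):
    # Boolean lower-triangular matrix: position i may attend only to
    # positions j <= i, so each word is predicted without seeing the future.
    return torch.tril(torch.ones(size, size, dtype=torch.bool))

# During training the decoder receives the shifted target together
# with this mask:
print(subsequent_mask(4))
# tensor([[ True, False, False, False],
#         [ True,  True, False, False],
#         [ True,  True,  True, False],
#         [ True,  True,  True,  True]])
```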

What is not clear to me is how to use the model at inference time. When doing inference, one of course does not have the target output, so what goes there?

Best Answer

At inference time, the decoder input is built up autoregressively: one starts from a begin-of-sequence token, runs the decoder, appends the predicted word, and repeats until an end-of-sequence token is produced. A popular method for such sequence generation tasks is beam search, which keeps the $K$ best sequences generated so far as the candidate "output" sequences.
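
As a rough, framework-agnostic illustration, here is a beam search sketch in Python; `step_fn` is a hypothetical wrapper around one decoder step, and refinements such as length normalization are omitted:

```python
def beam_search(step_fn, start_token, end_token, beam_size=4, max_len=50):
    # step_fn(prefix) -> list of (token, log_prob) candidates for the next
    # position; a hypothetical interface wrapping one decoder forward pass.
    beams = [([start_token], 0.0)]          # (sequence, cumulative log-prob)
    finished = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            for token, logp in step_fn(seq):
                candidates.append((seq + [token], score + logp))
        # Keep the K best expansions; set completed sequences aside.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates[:beam_size]:
            (finished if seq[-1] == end_token else beams).append((seq, score))
        if not beams:                       # every surviving beam has ended
            break
    return max(finished + beams, key=lambda c: c[1])[0]
```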

In the original paper, different beam sizes were used for different tasks. With a beam size of $K=1$, this reduces to the greedy decoding used in the blog you mentioned.
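
For reference, a greedy ($K=1$) decoding loop in PyTorch might look like the following; the `model.encode` / `model.decode` / `model.generator` interface follows The Annotated Transformer, so adapt the names to your own code:

```python
import torch

def greedy_decode(model, src, src_mask, max_len, start_symbol, end_symbol):
    # Encode the source once, then grow the output one token at a time.
    memory = model.encode(src, src_mask)
    ys = torch.full((1, 1), start_symbol, dtype=torch.long)
    for _ in range(max_len - 1):
        # Hide future positions, exactly as during training.
        tgt_mask = torch.tril(
            torch.ones(ys.size(1), ys.size(1), dtype=torch.bool))
        out = model.decode(memory, src_mask, ys, tgt_mask)
        # Pick the most likely word for the last position only.
        next_word = model.generator(out[:, -1]).argmax(dim=-1)
        ys = torch.cat([ys, next_word.view(1, 1)], dim=1)
        if next_word.item() == end_symbol:
            break
    return ys
```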
