Solved – Deciding between Decoder-only or Encoder-only Transformers (BERT, GPT)

attention, natural language, neural networks, transformers

I just started learning about transformers and looked into the following 3 variants

  1. The original one from Attention Is All You Need (Encoder & Decoder)

  2. BERT (Encoder only)

  3. GPT-2 (Decoder only)

How does one generally decide whether their transformer model should include encoders only, decoders only, or both encoders and decoders?

As an example, if I want to train a transformer to read a sequence of images of my backyard then predict whether it will rain in an hour (2 classes "rain" or "not rain"), should this transformer model generally have only decoders?

Best Answer

It is true that BERT only needs the encoder part of the Transformer, but its masking works differently from the original Transformer's: you mask individual words (tokens) and train the model to predict them from both sides of the context. This lets you, for instance, spell-check text by predicting whether "word" is more plausible than "wrd" in a sentence like the one below.

My next <mask> will be different.
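A minimal sketch of how BERT-style masked language modeling sets up its training pairs, using plain Python and made-up token IDs (the vocabulary, mask ID, and masking rate here are illustrative, not BERT's real values):

```python
import random

# Hypothetical IDs, for illustration only (not BERT's real vocabulary).
MASK_ID = 0
tokens = [5, 12, 7, 9, 3]  # pretend these encode "My next word will be different"

def mask_tokens(tokens, mask_prob=0.15, seed=42):
    """BERT-style masking: hide a fraction of tokens; the model must recover them."""
    rng = random.Random(seed)
    inputs, labels = [], []
    for t in tokens:
        if rng.random() < mask_prob:
            inputs.append(MASK_ID)  # replace the token with [MASK]
            labels.append(t)        # the model is trained to predict the original
        else:
            inputs.append(t)
            labels.append(-100)     # convention: position ignored by the loss
    return inputs, labels
```

The key point is that the target at a masked position is the original token, and the model sees the full (bidirectional) context around the mask.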

You are right again that GPT-2 is very close to a decoder-only Transformer, though not exactly the same. Both are primarily text models, but since you mentioned images: conceptually, BERT can be viewed as something like a (variational) autoencoder, reconstructing corrupted input, while GPT-2 is autoregressive, predicting the next token from what came before.

So you could use a BERT-like model: it produces a hidden state $h$ that you can feed to a small classification head to predict the weather.
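A minimal NumPy sketch of that classification head, assuming $h$ is the pooled hidden state from some BERT-like encoder (here just a random stand-in, since no trained model is involved):

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size, num_classes = 8, 2

# Stand-in for the encoder's pooled hidden state h (e.g. the [CLS] vector).
h = rng.normal(size=hidden_size)

# Linear classification head: hidden state -> 2 logits ("rain" vs "not rain").
W = rng.normal(size=(num_classes, hidden_size))
b = np.zeros(num_classes)

logits = W @ h + b
probs = np.exp(logits) / np.exp(logits).sum()   # softmax over the 2 classes
pred = ["rain", "not rain"][int(np.argmax(probs))]
```

In practice `W` and `b` would be trained jointly with (or on top of) the encoder; only the head's shape matters for the argument here.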

I would use GPT-2 or similar autoregressive models to generate new images from some starting pixels.
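The generation loop behind that idea can be sketched with a toy stand-in for the model: a hypothetical decoder that, given a prefix of pixel values, returns logits for the next value, extended greedily the way GPT-2 extends a text prompt token by token.

```python
import numpy as np

def next_value_logits(prefix):
    """Stand-in for a trained decoder-only model over 4 pixel intensities.
    This toy version simply favors repeating the last value."""
    logits = np.full(4, -1.0)
    logits[prefix[-1]] = 1.0
    return logits

def generate(prefix, steps):
    """Greedy autoregressive decoding: append the argmax value at each step."""
    seq = list(prefix)
    for _ in range(steps):
        seq.append(int(np.argmax(next_value_logits(seq))))
    return seq

# generate([2], 3) -> [2, 2, 2, 2] with this toy model
```

A real image model would condition on the whole 2-D context and sample from the distribution rather than taking the argmax, but the left-to-right loop is the same.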

However, for your task you would want both the encoder and the decoder of the Transformer, because you would like to encode the backyard images into a latent state and then decode it into the text "rain".
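The mechanism that connects the two halves is cross-attention: the decoder queries the encoder's output. A minimal NumPy sketch with random stand-in weights (no trained model, and the shapes are arbitrary illustrative choices):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(1)
d = 8

# Encoder output: 5 latent vectors summarizing the input (e.g. image features).
memory = rng.normal(size=(5, d))
# Decoder query for one output position.
query = rng.normal(size=(1, d))

# Cross-attention: the decoder query attends over the encoder memory.
attn = softmax(query @ memory.T / np.sqrt(d))    # (1, 5) attention weights
context = attn @ memory                          # (1, d) attended summary
logits = context @ rng.normal(size=(d, 2))       # project to "rain"/"not rain"
```

In a full encoder-decoder Transformer this block is repeated per layer with learned query/key/value projections; the sketch keeps only the attention arithmetic.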

Such networks exist, and they can annotate images. But you don't strictly need a Transformer; a simple text-and-image VAE can work.
