Solved – Where is dropout placed in the original Transformer?

dropout · machine-learning · neural-networks · transformers

I wanted to know where dropout is placed in the original Transformer. The original paper (https://papers.nips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf) says:

Residual Dropout We apply dropout [27] to the output of each sub-layer, before it is added to the
sub-layer input and normalized. In addition, we apply dropout to the sums of the embeddings and the
positional encodings in both the encoder and decoder stacks. For the base model, we use a rate of
P_drop = 0.1.

which makes me think they do the following:

assert SubLayer is FullyConnected or MultiHeadedSelfAttention (not the output of LN+Add)
x = SubLayer(x)
x = torch.nn.functional.dropout(x, p=0.1)
x = nn.LayerNorm(x) + x
x = nn.ReLU(x)

Does this sound right? Basically, I guess it's unclear what "sub-layer" means.

As for "we apply dropout to the sums of the embeddings and the positional encodings in both the encoder and decoder stacks", I assume they mean that the input to both the encoder and the decoder has a dropout right after the positional encoding is added, something like this for both:

def forward(self, token_embedding: Tensor):
    return self.dropout(token_embedding + self.pos_embedding[:token_embedding.size(0), :])
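
For context, here is a minimal, self-contained sketch of such a module. The class name PositionalEncoding, the pos_embedding buffer, and the (seq_len, batch, d_model) layout are my own assumptions (in the style of The Annotated Transformer), not code from the paper; the point is just that the dropout wraps the sum of the token embeddings and the positional encodings:

import math
import torch
import torch.nn as nn
from torch import Tensor

class PositionalEncoding(nn.Module):
    """Sinusoidal positional encoding; dropout is applied to the sum of the
    token embeddings and the positional encodings, as the paper describes."""

    def __init__(self, d_model: int, dropout: float = 0.1, max_len: int = 5000):
        super().__init__()
        self.dropout = nn.Dropout(p=dropout)
        position = torch.arange(max_len).unsqueeze(1)                 # (max_len, 1)
        div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
        pe = torch.zeros(max_len, 1, d_model)                         # (max_len, 1, d_model)
        pe[:, 0, 0::2] = torch.sin(position * div_term)
        pe[:, 0, 1::2] = torch.cos(position * div_term)
        self.register_buffer("pos_embedding", pe)

    def forward(self, token_embedding: Tensor) -> Tensor:
        # token_embedding: (seq_len, batch, d_model)
        return self.dropout(token_embedding + self.pos_embedding[:token_embedding.size(0), :])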

Related: Where should I place dropout layers in a neural network?

Best Answer

The sub-layers refer to the self/cross multi-head attention layers, as well as the position-wise feed-forward networks.

Your code is mostly correct, but:

  • Your pseudocode accidentally overwrites the value of the original x, so there is nothing left to add back in the residual connection.
  • The layer norm is applied after the residual addition, not before it.
  • There's no ReLU at that point in the Transformer (the only ReLUs are inside the position-wise feed-forward networks).

So it should be

x2 = SubLayer(x)
x2 = torch.nn.functional.dropout(x2, p=0.1)
x = layer_norm(x2 + x)  # layer_norm is an nn.LayerNorm(d_model) instance; norm after the residual add
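
If you want this as a reusable module, a rough sketch of the paper's post-layer-norm arrangement could look like the following. The name ResidualSublayer and the callable-style sublayer argument are my own; treat it as an illustration of the dropout placement, not the paper's reference code:

import torch.nn as nn

class ResidualSublayer(nn.Module):
    """Dropout -> residual add -> LayerNorm around a sub-layer,
    i.e. the post-layer-norm arrangement described in the quoted paragraph."""

    def __init__(self, d_model: int, dropout: float = 0.1):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(p=dropout)

    def forward(self, x, sublayer):
        # sublayer is a callable: the multi-head attention or the position-wise FFN
        return self.norm(x + self.dropout(sublayer(x)))

An encoder layer would then use two such wrappers, one around the self-attention and one around the position-wise feed-forward network. Note that The Annotated Transformer's SublayerConnection applies the norm to the sub-layer input (a pre-norm variant) rather than after the addition, which it notes is a simplification relative to the paper's wording.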

You can find a good writeup at The Annotated Transformer.
