Neural Networks – Masked Multi-Head Attention and Layer Normalization in Transformer Models

Tags: attention, deep learning, neural networks, normalization

I recently read Attention Is All You Need by Vaswani et al. Two questions came up while reading:

1. How is it possible to mask out illegal connections in decoder multi-head attention?

The paper says that by setting certain values to negative infinity, they prevent leftward information flow. Are they masking out the attention weights or the hidden states from the previous layer?

2. Is it alright to set some arbitrary max_length for layer normalization?

Let's say I set max_len to 200. Whenever a shorter sentence comes in, LayerNorm does whitening (i.e. subtracts the mean and divides by the standard deviation) followed by a linear mapping. The problem, I think, is that zero padding greatly affects the whitening. If a batch is composed of short sentences, say of length 10 or 20, then zero padding makes up almost 80% of the whole batch, and the whitening pulls the data toward zero norm. Is this the orthodox method, or is there some other common practice?
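For instance, here is a toy sketch of my concern (the shapes are made up, and I'm assuming the mean and standard deviation were computed over the whole padded sequence):

```python
import numpy as np

# Hypothetical example: a sentence of 10 real tokens padded to max_len = 200.
max_len, d_model = 200, 4
rng = np.random.default_rng(0)
x = np.zeros((max_len, d_model))
x[:10] = rng.normal(loc=1.0, scale=1.0, size=(10, d_model))  # 10 real tokens, 190 zero rows

# If the statistics were taken over all 200 positions, the padding drags them toward zero:
print(x.mean(), x.std())            # dominated by the 190 rows of zeros
print(x[:10].mean(), x[:10].std())  # statistics of the real tokens only
```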

Best Answer

This is answered in the Attention Is All You Need paper by Vaswani et al. (see also the recording of the talk by one of the co-authors, and these three blog posts: here, here, and here).

  1. How is it possible to mask out illegal connections in decoder multi-head attention?

This is pretty simple. Attention can be defined as

$$ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\Big(\frac{QK^T}{\sqrt{d_k}}\Big)V $$

where $Q$ are queries, $K$ are keys, $V$ are values and $\sqrt{d_k}$ is the scaling constant, equal to the square root of the dimension of the keys. The role of the product $QK^T$ is to calculate the similarity matrix between the words in $Q$ and the words in $K$ (where each word is a row encoded using embeddings). In the encoder, each of $Q, K, V$ comes from the same document. In the decoder's self-attention they all come from the target document, while in the decoder's encoder-decoder attention $Q$ comes from the target document and $K, V$ come from the source document.
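As a minimal illustration, here is a NumPy sketch of that formula (the shapes and names are my own, not taken from the paper's code):

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # similarity matrix between queries and keys
    weights = softmax(scores, axis=-1)  # each row is a distribution over the keys
    return weights @ V

# Example: 5 query words, 7 key/value words, d_k = d_v = 8
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(5, 8)), rng.normal(size=(7, 8)), rng.normal(size=(7, 8))
print(attention(Q, K, V).shape)  # (5, 8)
```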

In the Transformer network (and similar ones), there is no direct mechanism that records time dependence. It is recorded indirectly in the embeddings (by summing word embeddings and position embeddings), but at the cost of leaking "future" values when making predictions. Notice that $QK^T$ compares each word in $Q$ with each word in $K$. To prevent this future leak, we use masking: the entries of $QK^T$ that correspond to illegal ("future") connections are set to $-\infty$ before the softmax, so their attention weights become zero. Equivalently, you can think of it as a pointwise product of the attention matrix with a triangular matrix of ones that keeps only the positions at or before each query word (illustrated below; image source).

[figure: illustration of the triangular mask applied to the attention matrix]

This zeroes out the similarities between each word and the words that appear after it ("in the future"), so a prediction cannot depend on knowing the answer before it is predicted. Since this information is removed, it cannot be used by the model, and we guarantee that only similarity to the preceding words is considered.
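Here is a short sketch of how such a mask can be applied in practice, using the additive $-\infty$ mask before the softmax that the question mentions (names and shapes are illustrative, not the paper's code):

```python
import numpy as np

def softmax(x, axis=-1):
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def masked_attention(Q, K, V):
    """Decoder self-attention: position i may only attend to positions <= i."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # (len, len) similarities
    future = np.triu(np.ones_like(scores), k=1)      # ones strictly above the diagonal
    scores = np.where(future == 1, -np.inf, scores)  # -inf -> zero weight after softmax
    weights = softmax(scores, axis=-1)
    return weights @ V, weights

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))            # 4 target words, d_model = 8
out, w = masked_attention(x, x, x)
print(np.round(w, 2))                  # the upper triangle of the weight matrix is exactly 0
```

So what gets masked is the (scaled) similarity matrix $QK^T$, not the hidden states from the previous layer; the $-\infty$ entries simply become zero attention weights after the softmax.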

  2. Is it alright to set some arbitrary max_length for layer normalization?

In the paper, all the inputs and outputs have a fixed size of $d_\text{model}$, if this is what you are asking. However, I can't see why this would be a problem, since what the normalization does is make the features have the same mean and standard deviation between the layers, so something that was relatively large locally will be mapped to something that is considered large globally. See the Layer Normalization paper by Ba et al. for details. Moreover, layer normalization is applied independently at each position (over the feature dimension), so the padded positions have no impact on the statistics of the real tokens.
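To sketch this last point, here is a simplified layer norm (the learned gain and bias are omitted); it shows that padded positions do not change the normalization of the real tokens:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each position over its feature dimension (learned gain/bias omitted)."""
    mean = x.mean(axis=-1, keepdims=True)  # one mean per position
    std = x.std(axis=-1, keepdims=True)    # one std per position
    return (x - mean) / (std + eps)

rng = np.random.default_rng(0)
tokens = rng.normal(size=(10, 8))                 # 10 real tokens, d_model = 8
padded = np.vstack([tokens, np.zeros((190, 8))])  # same tokens padded to max_len = 200

# The normalized real tokens are identical with or without padding,
# because each position is normalized using only its own features.
print(np.allclose(layer_norm(tokens), layer_norm(padded)[:10]))  # True
```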
