Neural Networks – Why Do Attention Models Need to Choose a Maximum Sentence Length?

attention · natural language · neural networks · recurrent neural network

I was going through the seq2seq-translation tutorial on PyTorch and found the following passage:

Because there are sentences of all sizes in the training data, to actually create and train this layer we have to choose a maximum sentence length (input length, for encoder outputs) that it can apply to. Sentences of the maximum length will use all the attention weights, while shorter sentences will only use the first few.

which didn't really make sense to me. My understanding is that attention is computed as follows (according to the Pointer Networks paper) at time step $t$:

$$ u^{<t,j>} = v^\top \tanh( W_1 e_j + W_2 d_t ) = NN_u(e_j, d_t) $$
$$ \alpha^{<t,j>} = \mathrm{softmax}( u^{<t,j>} ) = \frac{\exp(u^{<t,j>})}{Z^{<t>}} = \frac{\exp(u^{<t,j>})}{\sum^{T_x}_{k=1} \exp( u^{<t,k>} )} $$
$$ d'_{<t+1>} = \sum^{T_x}_{j=1} \alpha^{<t,j>} e_j $$

which basically means that a given attention weight does not depend on the length of the encoder sequence: $T_x$ can vary from sentence to sentence and the equations above are unaffected.
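To make this concrete, here is a minimal PyTorch sketch of the additive attention above (the class name, dimensions, and toy loop are mine, purely for illustration). Nothing in it requires choosing a maximum length:

```python
import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    """Pointer-Network-style additive attention: scores depend only on the
    content of each encoder state e_j and the decoder state d_t, never on
    a fixed maximum length."""
    def __init__(self, enc_dim, dec_dim, attn_dim):
        super().__init__()
        self.W1 = nn.Linear(enc_dim, attn_dim, bias=False)
        self.W2 = nn.Linear(dec_dim, attn_dim, bias=False)
        self.v = nn.Linear(attn_dim, 1, bias=False)

    def forward(self, enc_outputs, dec_state):
        # enc_outputs: (T_x, enc_dim), dec_state: (dec_dim,)
        u = self.v(torch.tanh(self.W1(enc_outputs) + self.W2(dec_state))).squeeze(-1)  # (T_x,)
        alpha = torch.softmax(u, dim=-1)   # (T_x,) attention weights
        context = alpha @ enc_outputs      # (enc_dim,) weighted sum of encoder states
        return context, alpha

attn = AdditiveAttention(enc_dim=8, dec_dim=8, attn_dim=16)
for T_x in (3, 7, 50):  # the same module handles any encoder length
    ctx, alpha = attn(torch.randn(T_x, 8), torch.randn(8))
    print(T_x, alpha.shape)  # alpha always has exactly T_x weights
```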

If that is true, then why does the tutorial insist on a maximum sentence length?

They also say:

There are other forms of attention that work around the length limitation by using a relative position approach. Read about “local attention” in Effective Approaches to Attention-based Neural Machine Translation.

which also confused me. Any clarification?


Perhaps related:

https://discuss.pytorch.org/t/attentiondecoderrnn-without-max-length/13473


Crossposted:

https://discuss.pytorch.org/t/why-do-attention-models-need-to-choose-a-maximum-sentence-length/47201

https://www.reddit.com/r/deeplearning/comments/bxbypj/why_do_attention_models_need_to_choose_a_maximum/?

Best Answer

A "typical" attention mechanism might assign the weight $w_i$ to one of the source vectors as $w_i \propto \exp(u_i^Tv)$ where $u_i$ is the $i$th "source" vector and $v$ is the query vector. The attention mechanism described in OP from "Pointer Networks" opts for something slightly more involved: $w_i \propto \exp(q^T \tanh(W_1u_i + W_2v))$, but the main ideas are the same -- you can read my answer here for a more comprehensive exploration of different attention mechanisms.


The tutorial mentioned in the question appears to use the peculiar mechanism

$$w_i \propto \exp(a_i^Tv)$$

where $a_i$ is the $i$th row of a learned weight matrix $A$. I say that it is peculiar because the weight on the $i$th input element does not actually depend on any of the $u_i$ at all! In fact, we can view this mechanism as attention over word slots -- how much attention to put on the first word, the second word, the third word, etc. -- which pays no attention to which words actually occupy those slots.

Since $A$, a learned weight matrix, must be fixed in size, the number of word slots must also be fixed, which means the input sequence length must be capped (shorter inputs can be padded). Of course, this peculiar attention mechanism doesn't really make sense at all, so I wouldn't read too much into it.
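For contrast with the sketch above, here is a rough rendering of that word-slot mechanism (a paraphrase of the idea, not the tutorial's exact code; the names SlotAttention and max_length are mine). Because the scores come from a fixed-size learned matrix, a maximum sentence length has to be chosen up front:

```python
import torch
import torch.nn as nn

class SlotAttention(nn.Module):
    """'Word slot' attention: scores come from a learned matrix A applied to
    the query alone, so the number of slots -- and hence the maximum input
    length -- is fixed at construction time."""
    def __init__(self, query_dim, max_length):
        super().__init__()
        self.A = nn.Linear(query_dim, max_length, bias=False)  # one row a_i per slot

    def forward(self, enc_outputs, query):
        # enc_outputs: (T_x, enc_dim) with T_x <= max_length, query: (query_dim,)
        scores = self.A(query)                  # (max_length,) -- ignores enc_outputs entirely!
        scores = scores[: enc_outputs.size(0)]  # shorter inputs use only the first T_x weights
        alpha = torch.softmax(scores, dim=-1)
        context = alpha @ enc_outputs
        return context, alpha
```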


Regarding length limitations in general, the only real limitation on attention mechanisms is a soft one: longer sequences require more memory, and memory usage scales quadratically with sequence length (compare this to the linear memory usage of vanilla RNNs).

I skimmed the "Effective Approaches to Attention-based Neural Machine Translation" paper mentioned in the question, and from what I can tell they propose a two-stage attention mechanism: the decoder first selects a fixed-size window of the encoder outputs to focus on, and attention is then applied only to the source vectors within that window. This is more efficient than typical "global" attention mechanisms.
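A minimal sketch of that idea in PyTorch (the hard window and dot-product scoring are my simplifications, and the names are mine; the paper additionally predicts the window centre and applies a Gaussian weighting around it):

```python
import torch

def local_attention(enc_outputs, query, center, window=5):
    """Rough sketch of 'local' attention: score only a fixed-size window of
    encoder outputs around a chosen position, then attend within that window."""
    T_x = enc_outputs.size(0)
    lo, hi = max(0, center - window), min(T_x, center + window + 1)
    window_states = enc_outputs[lo:hi]     # at most 2*window + 1 source vectors
    scores = window_states @ query         # dot-product scores inside the window only
    alpha = torch.softmax(scores, dim=-1)  # weights over the window positions
    return alpha @ window_states           # context vector
```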
