A "typical" attention mechanism might assign the weight $w_i$ to one of the source vectors as $w_i \propto \exp(u_i^Tv)$ where $u_i$ is the $i$th "source" vector and $v$ is the query vector. The attention mechanism described in OP from "Pointer Networks" opts for something slightly more involved: $w_i \propto \exp(q^T \tanh(W_1u_i + W_2v))$, but the main ideas are the same -- you can read my answer here for a more comprehensive exploration of different attention mechanisms.
The tutorial mentioned in the question appears to have the peculiar mechanism
$$w_i \propto \exp(a_i^Tv)$$
where $a_i$ is the $i$th row of a learned weight matrix $A$. I say that it is peculiar because the weight on the $i$th input element does not depend on any of the $u_i$ at all! In fact, we can view this mechanism as attention over word slots -- how much attention to pay to the first word, the second word, the third word, and so on -- without any regard for which words are occupying which slots.
Since $A$ is a learned weight matrix, it must be fixed in size, so the number of word slots must also be fixed, which means the input sequence length must be constant (shorter inputs can be padded). Of course, this peculiar attention mechanism doesn't really make much sense, so I wouldn't read too much into it.
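To see the input-independence concretely, here is a minimal sketch of that mechanism (the function name is my own):

```python
import numpy as np

# "Slot" attention from the tutorial: w_i ∝ exp(a_i^T v).
# Note the inputs u_i never appear -- the weights depend only on position i.
def slot_attention_weights(A, v):
    scores = A @ v                      # (n_slots,)
    w = np.exp(scores - scores.max())
    return w / w.sum()
```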
Regarding length limitations in general: attention mechanisms impose no hard limit on sequence length, only a soft one: longer sequences require more memory, and memory usage scales quadratically with sequence length (compare this to the linear memory usage of vanilla RNNs).
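As a back-of-the-envelope calculation (assuming one float32 score per token pair):

```python
# An n x n attention score matrix at 4 bytes per entry:
for n in (1_000, 10_000, 100_000):
    print(f"n={n:>7}: {n * n * 4 / 1e9:.3f} GB")   # 0.004, 0.400, 40.000 GB
```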
I skimmed the "Effective Approaches to Attention-based Neural Machine Translation" paper mentioned in the question, and from what I can tell they propose a two-stage attention mechanism: in the decoder, the network selects a fixed sized window of the input of the encoder outputs to focus on. Then, attention is applied across only those source vectors within the fixed sized window. This is more efficient than typical "global" attention mechanisms.
The key/value/query formulation of attention is from the paper Attention Is All You Need.
How should one understand the queries, keys, and values
The key/value/query concept is analogous to retrieval systems. For example, when you search for videos on YouTube, the search engine maps your query (the text in the search bar) against a set of keys (video title, description, etc.) associated with candidate videos in its database, then presents you with the best-matched videos (values).
The attention operation can be thought of as a retrieval process as well.
As mentioned in the paper you referenced (Neural Machine Translation by Jointly Learning to Align and Translate), attention by definition is just a weighted average of values,
$$c=\sum_{j}\alpha_jh_j$$
where $\sum \alpha_j=1$.
If we restrict $\alpha$ to be a one-hot vector, this operation becomes the same as retrieving from a set of elements $h$ with index $\alpha$. With the restriction removed, the attention operation can be thought of as doing "proportional retrieval" according to the probability vector $\alpha$.
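A tiny NumPy illustration of this "proportional retrieval" view:

```python
import numpy as np

h = np.array([[1., 0.], [0., 1.], [2., 2.]])   # three value vectors
alpha_onehot = np.array([0., 1., 0.])          # "retrieve element 1"
print(alpha_onehot @ h)                        # [0. 1.] == h[1], exact retrieval

alpha_soft = np.array([0.1, 0.7, 0.2])         # proportional retrieval
print(alpha_soft @ h)                          # [0.5 1.1], a convex blend of the rows
```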
It should be clear that $h$ in this context is the value. The difference between the two papers lies in how the probability vector $\alpha$ is calculated. The first paper (Bahdanau et al. 2015) computes the score through a neural network $$e_{ij}=a(s_i,h_j), \qquad \alpha_{i,j}=\frac{\exp(e_{ij})}{\sum_k\exp(e_{ik})}$$
where $h_j$ is from the encoder sequence and $s_i$ is from the decoder sequence. One problem with this approach: if the encoder sequence has length $m$ and the decoder sequence has length $n$, we have to run the network $m \times n$ times to acquire all the attention scores $e_{ij}$.
A more efficient model would be to first project $s$ and $h$ onto a common space, then choose a similarity measure (e.g. dot product) as the attention score, like
$$e_{ij}=f(s_i)g(h_j)^T$$
so we only have to compute $g(h_j)$ $m$ times and $f(s_i)$ $n$ times to get the projection vectors, and $e_{ij}$ can be computed efficiently by matrix multiplication.
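A minimal sketch with linear projections standing in for $f$ and $g$ (the specific shapes are arbitrary):

```python
import numpy as np

m, n, d, p = 6, 4, 8, 5                 # made-up sizes
rng = np.random.default_rng(0)
H = rng.normal(size=(m, d))             # encoder states h_j
S = rng.normal(size=(n, d))             # decoder states s_i
Wg, Wf = rng.normal(size=(d, p)), rng.normal(size=(d, p))

K = H @ Wg      # g(h_j): m projections, computed once ("keys")
Q = S @ Wf      # f(s_i): n projections, computed once ("queries")
E = Q @ K.T     # all n*m scores e_ij in a single matrix multiplication
```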
This is essentially the approach proposed by the second paper (Vaswani et al. 2017), where the two projection vectors are called query (for decoder) and key (for encoder), which is well aligned with the concepts in retrieval systems. (There are later techniques to further reduce the computational complexity, for example Reformer, Linformer.)
How are the queries, keys, and values obtained
The proposed multihead attention alone doesn't say much about how the queries, keys, and values are obtained; they can come from different sources depending on the application scenario.
$$
\begin{align}
\text{MultiHead}(Q, K, V) & = \text{Concat}(\text{head}_1, \dots, \text{head}_h) W^{O} \\
\text{where head}_i & = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)
\end{align}$$
where the projections are parameter matrices:
$$
\begin{align}
W_i^Q & \in \mathbb{R}^{d_\text{model} \times d_k}, \\
W_i^K & \in \mathbb{R}^{d_\text{model} \times d_k}, \\
W_i^V & \in \mathbb{R}^{d_\text{model} \times d_v}, \\
W^O & \in \mathbb{R}^{hd_v \times d_{\text{model}}}.
\end{align}$$
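To make the equations concrete, here is a minimal NumPy sketch of scaled dot-product attention and the multihead wrapper (the dimensions and toy inputs are my own choices):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d_k)) @ V

def multihead(Q, K, V, WQ, WK, WV, WO):
    """MultiHead(Q,K,V) = Concat(head_1..head_h) W^O,
    with head_i = Attention(Q WQ[i], K WK[i], V WV[i])."""
    heads = [attention(Q @ WQ[i], K @ WK[i], V @ WV[i])
             for i in range(len(WQ))]
    return np.concatenate(heads, axis=-1) @ WO

rng = np.random.default_rng(0)
h, d_model, d_k, d_v = 2, 16, 8, 8
WQ = rng.normal(size=(h, d_model, d_k))
WK = rng.normal(size=(h, d_model, d_k))
WV = rng.normal(size=(h, d_model, d_v))
WO = rng.normal(size=(h * d_v, d_model))
X = rng.normal(size=(5, d_model))                # a toy sequence of 5 tokens
print(multihead(X, X, X, WQ, WK, WV, WO).shape)  # (5, 16)
```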
For unsupervised language model training like GPT, $Q$, $K$, and $V$ usually come from the same source, so this operation is also called self-attention.
For the machine translation task in the second paper, self-attention is first applied separately to the source and target sequences; then, on top of that, another attention is applied in which $Q$ comes from the target sequence and $K, V$ come from the source sequence.
For recommendation systems, $Q$ can be from the target items, $K, V$ can be from the user profile and history.
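Continuing the sketch above, these scenarios differ only in which tensors are passed in as $Q$, $K$, and $V$ (reusing `multihead` and the weights from the previous snippet):

```python
# Self-attention (GPT-style): Q, K, V all come from the same sequence X.
out_self = multihead(X, X, X, WQ, WK, WV, WO)

# Encoder-decoder (cross) attention: Q from the target, K and V from the source.
src = rng.normal(size=(7, d_model))                    # source sequence, 7 tokens
tgt = rng.normal(size=(5, d_model))                    # target sequence, 5 tokens
out_cross = multihead(tgt, src, src, WQ, WK, WV, WO)   # shape (5, d_model)
```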
Best Answer
I think the paper you quote is just wrong.
Luong only generalizes Bahdanau's equations by replacing the single-layer MLP with a general score function (and shows that the dot product can work equally well as the MLP), but it still scores the "similarity" of the decoder state and exactly one encoder state.
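For reference, the score functions Luong et al. consider are (in their notation, with decoder state $h_t$ and encoder state $\bar{h}_s$):

$$\text{score}(h_t, \bar{h}_s) =
\begin{cases}
h_t^\top \bar{h}_s & \text{dot} \\
h_t^\top W_a \bar{h}_s & \text{general} \\
v_a^\top \tanh\left(W_a [h_t; \bar{h}_s]\right) & \text{concat}
\end{cases}$$

Each variant compares the decoder state against exactly one encoder state $\bar{h}_s$, just as Bahdanau's MLP does.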