Attention is a method for aggregating a set of vectors $v_i$ into just one vector, often via a lookup vector $u$. Usually, $v_i$ is either the inputs to the model or the hidden states of previous time-steps, or the hidden states one level down (in the case of stacked LSTMs).
The result is often called the context vector $c$, since it contains the context relevant to the current time-step.
This additional context vector $c$ is then fed into the RNN/LSTM as well (it can be simply concatenated with the original input). Therefore, the context can be used to help with prediction.
The simplest way to do this is to compute probability vector $p = \text{softmax}(V^Tu)$ and $c = \sum_i p_i v_i$ where $V$ is the concatenation of all previous $v_i$. A common lookup vector $u$ is the current hidden state $h_t$.
There are many variations on this, and you can make things as complicated as you want. For example, instead using $v_i^T u$ as the logits, one may choose $f(v_i, u)$ instead, where $f$ is an arbitrary neural network.
A common attention mechanism for sequence-to-sequence models uses $p = \text{softmax}(q^T \tanh(W_1 v_i + W_2 h_t))$, where $v$ are the hidden states of the encoder, and $h_t$ is the current hidden state of the decoder. $q$ and both $W$s are parameters.
Some papers which show off different variations on the attention idea:
Pointer Networks use attention to reference inputs in order to solve combinatorial optimization problems.
Recurrent Entity Networks maintain separate memory states for different entities (people/objects) while reading text, and update the correct memory state using attention.
Transformer models also make extensive use of attention. Their formulation of attention is slightly more general and also involves key vectors $k_i$: the attention weights $p$ are actually computed between the keys and the lookup, and the context is then constructed with the $v_i$.
Here is a quick implementation of one form of attention, although I can't guarantee correctness beyond the fact that it passed some simple tests.
Basic RNN:
def rnn(inputs_split):
bias = tf.get_variable('bias', shape = [hidden_dim, 1])
weight_hidden = tf.tile(tf.get_variable('hidden', shape = [1, hidden_dim, hidden_dim]), [batch, 1, 1])
weight_input = tf.tile(tf.get_variable('input', shape = [1, hidden_dim, in_dim]), [batch, 1, 1])
hidden_states = [tf.zeros((batch, hidden_dim, 1), tf.float32)]
for i, input in enumerate(inputs_split):
input = tf.reshape(input, (batch, in_dim, 1))
last_state = hidden_states[-1]
hidden = tf.nn.tanh( tf.matmul(weight_input, input) + tf.matmul(weight_hidden, last_state) + bias )
hidden_states.append(hidden)
return hidden_states[-1]
With attention, we add only a few lines before the new hidden state is computed:
if len(hidden_states) > 1:
logits = tf.transpose(tf.reduce_mean(last_state * hidden_states[:-1], axis = [2, 3]))
probs = tf.nn.softmax(logits)
probs = tf.reshape(probs, (batch, -1, 1, 1))
context = tf.add_n([v * prob for (v, prob) in zip(hidden_states[:-1], tf.unstack(probs, axis = 1))])
else:
context = tf.zeros_like(last_state)
last_state = tf.concat([last_state, context], axis = 1)
hidden = tf.nn.tanh( tf.matmul(weight_input, input) + tf.matmul(weight_hidden, last_state) + bias )
the full code
Best Answer
The key/value/query formulation of attention is from the paper Attention Is All You Need.
The key/value/query concept is analogous to retrieval systems. For example, when you search for videos on Youtube, the search engine will map your query (text in the search bar) against a set of keys (video title, description, etc.) associated with candidate videos in their database, then present you the best matched videos (values).
The attention operation can be thought of as a retrieval process as well.
As mentioned in the paper you referenced (Neural Machine Translation by Jointly Learning to Align and Translate), attention by definition is just a weighted average of values,
$$c=\sum_{j}\alpha_jh_j$$ where $\sum \alpha_j=1$.
If we restrict $\alpha$ to be a one-hot vector, this operation becomes the same as retrieving from a set of elements $h$ with index $\alpha$. With the restriction removed, the attention operation can be thought of as doing "proportional retrieval" according to the probability vector $\alpha$.
It should be clear that $h$ in this context is the value. The difference between the two papers lies in how the probability vector $\alpha$ is calculated. The first paper (Bahdanau et al. 2015) computes the score through a neural network $$e_{ij}=a(s_i,h_j), \qquad \alpha_{i,j}=\frac{\exp(e_{ij})}{\sum_k\exp(e_{ik})}$$ where $h_j$ is from the encoder sequence, and $s_i$ is from the decoder sequence. One problem of this approach is, say the encoder sequence is of length $m$ and the decoding sequence is of length $n$, we have to go through the network $m*n$ times to acquire all the attention scores $e_{ij}$.
A more efficient model would be to first project $s$ and $h$ onto a common space, then choose a similarity measure (e.g. dot product) as the attention score, like $$e_{ij}=f(s_i)g(h_j)^T$$ so we only have to compute $g(h_j)$ $m$ times and $f(s_i)$ $n$ times to get the projection vectors and $e_{ij}$ can be computed efficiently by matrix multiplication.
This is essentially the approach proposed by the second paper (Vaswani et al. 2017), where the two projection vectors are called query (for decoder) and key (for encoder), which is well aligned with the concepts in retrieval systems. (There are later techniques to further reduce the computational complexity, for example Reformer, Linformer.)
The proposed multihead attention alone doesn't say much about how the queries, keys, and values are obtained, they can come from different sources depending on the application scenario.
For unsupervised language model training like GPT, $Q, K, V$ are usually from the same source, so such operation is also called self-attention.
For the machine translation task in the second paper, it first applies self-attention separately to source and target sequences, then on top of that it applies another attention where $Q$ is from the target sequence and $K, V$ are from the source sequence.
For recommendation systems, $Q$ can be from the target items, $K, V$ can be from the user profile and history.