Understanding Keys, Queries, and Values in Attention Mechanisms

attention, machine-translation, natural-language, neural-networks

How should one understand the keys, queries, and values that are often mentioned in attention mechanisms?

I've tried searching online, but all the resources I find only speak of them as if the reader already knows what they are.

Judging by the paper by Bahdanau et al. (Neural Machine Translation by Jointly Learning to Align and Translate), it seems that the values are the annotation vectors $h$, but it's not clear what is meant by "query" and "key."

The paper that I mentioned states that attention is calculated by

$$c_i = \sum^{T_x}_{j = 1} \alpha_{ij} h_j$$

with

$$
\begin{align}
\alpha_{ij} & = \frac{e^{e_{ij}}}{\sum^{T_x}_{k = 1} e^{e_{ik}}} \\
e_{ij} & = a(s_{i-1}, h_j)
\end{align}
$$

Where are people getting the key, query, and value from these equations?

Thank you.

Best Answer

The key/value/query formulation of attention is from the paper Attention Is All You Need.

How should one understand the queries, keys, and values

The key/value/query concept is analogous to retrieval systems. For example, when you search for videos on YouTube, the search engine maps your query (the text in the search bar) against a set of keys (video title, description, etc.) associated with candidate videos in its database, then presents you with the best-matched videos (values).

The attention operation can be thought of as a retrieval process as well.

As mentioned in the paper you referenced (Neural Machine Translation by Jointly Learning to Align and Translate), attention by definition is just a weighted average of values,

$$c=\sum_{j}\alpha_jh_j$$ where $\sum \alpha_j=1$.

If we restrict $\alpha$ to be a one-hot vector, this operation becomes the same as retrieving from a set of elements $h$ with index $\alpha$. With the restriction removed, the attention operation can be thought of as doing "proportional retrieval" according to the probability vector $\alpha$.
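To make the "proportional retrieval" view concrete, here is a minimal NumPy sketch (the vectors and scores are made-up illustrations): with a one-hot $\alpha$ the operation is plain indexing into $h$, and with a softmax $\alpha$ it is a weighted average.

```python
# Attention as "proportional retrieval": one-hot weights = hard indexing,
# softmax weights = weighted average of the value vectors.
import numpy as np

h = np.array([[1.0, 0.0],      # value vectors h_1 ... h_3 (rows)
              [0.0, 1.0],
              [2.0, 2.0]])

# One-hot weights: the result is exactly h_2 (retrieval by index).
alpha_onehot = np.array([0.0, 1.0, 0.0])
print(alpha_onehot @ h)        # -> [0. 1.]

# Softmax weights: the result is a convex combination of all h_j.
scores = np.array([0.1, 2.0, -1.0])            # arbitrary example scores e_j
alpha = np.exp(scores) / np.exp(scores).sum()  # softmax, so alpha sums to 1
print(alpha @ h)               # a weighted average of the rows of h
```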

It should be clear that $h$ in this context is the value. The difference between the two papers lies in how the probability vector $\alpha$ is calculated. The first paper (Bahdanau et al. 2015) computes the score through a neural network $$e_{ij}=a(s_i,h_j), \qquad \alpha_{ij}=\frac{\exp(e_{ij})}{\sum_k\exp(e_{ik})}$$ where $h_j$ is from the encoder sequence and $s_i$ is from the decoder sequence. One problem with this approach is that if the encoder sequence has length $m$ and the decoder sequence has length $n$, we have to run the score network $m \times n$ times to obtain all the attention scores $e_{ij}$.
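As an illustration of that cost argument, here is a small sketch assuming the score function $a$ is a one-hidden-layer MLP (as in Bahdanau et al.); the parameter names `W_s`, `W_h`, `v` and all dimensions are made up. The double loop makes the $m \times n$ score evaluations explicit.

```python
# Additive (Bahdanau-style) scoring e_ij = a(s_i, h_j): the score network is
# evaluated once per (decoder step i, encoder step j) pair, i.e. m * n times.
import numpy as np

rng = np.random.default_rng(0)
m, n, d, d_a = 5, 3, 4, 8                 # encoder len, decoder len, state dim, attention dim
H = rng.normal(size=(m, d))               # encoder states h_1 .. h_m
S = rng.normal(size=(n, d))               # decoder states s_1 .. s_n
W_s = rng.normal(size=(d, d_a))           # illustrative parameters of the score MLP
W_h = rng.normal(size=(d, d_a))
v = rng.normal(size=(d_a,))

E = np.empty((n, m))
for i in range(n):                        # m * n evaluations of a(s_i, h_j)
    for j in range(m):
        E[i, j] = v @ np.tanh(S[i] @ W_s + H[j] @ W_h)

A = np.exp(E) / np.exp(E).sum(axis=1, keepdims=True)  # softmax over encoder positions
C = A @ H                                 # context vectors c_i, one per decoder step
```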

A more efficient approach is to first project $s$ and $h$ onto a common space and then use a similarity measure (e.g. the dot product) as the attention score: $$e_{ij}=f(s_i)g(h_j)^T.$$ This way we only have to compute $g(h_j)$ $m$ times and $f(s_i)$ $n$ times to get the projection vectors, and all the scores $e_{ij}$ can then be obtained efficiently by a single matrix multiplication.
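Here is a sketch of the same computation with linear projections taken for $f$ and $g$ (the names `W_q`, `W_k` and the dimensions are illustrative); all scores then come from one matrix product.

```python
# "Project then dot-product": each sequence is projected once, and all n*m
# scores e_ij are produced by a single matrix multiplication.
import numpy as np

rng = np.random.default_rng(0)
m, n, d, d_k = 5, 3, 4, 8
H = rng.normal(size=(m, d))               # encoder states
S = rng.normal(size=(n, d))               # decoder states
W_q = rng.normal(size=(d, d_k))           # projection for queries: f(s_i) = s_i W_q
W_k = rng.normal(size=(d, d_k))           # projection for keys:    g(h_j) = h_j W_k

Q = S @ W_q                               # n projections, computed once
K = H @ W_k                               # m projections, computed once
E = Q @ K.T                               # all n*m scores e_ij in one matmul
A = np.exp(E) / np.exp(E).sum(axis=1, keepdims=True)
C = A @ H                                 # contexts: weighted averages of the values h_j
```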

This is essentially the approach proposed by the second paper (Vaswani et al. 2017), where the two projection vectors are called query (for decoder) and key (for encoder), which is well aligned with the concepts in retrieval systems. (There are later techniques to further reduce the computational complexity, for example Reformer, Linformer.)

How are the queries, keys, and values obtained

The proposed multi-head attention by itself doesn't say much about how the queries, keys, and values are obtained; they can come from different sources depending on the application scenario.

$$
\begin{align}
\text{MultiHead}(Q, K, V) & = \text{Concat}(\text{head}_1, \dots, \text{head}_h) W^{O} \\
\text{where head}_i & = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)
\end{align}
$$

where the projections are parameter matrices:

$$
\begin{align}
W_i^Q & \in \mathbb{R}^{d_\text{model} \times d_k}, \\
W_i^K & \in \mathbb{R}^{d_\text{model} \times d_k}, \\
W_i^V & \in \mathbb{R}^{d_\text{model} \times d_v}, \\
W^O & \in \mathbb{R}^{hd_v \times d_\text{model}}.
\end{align}
$$
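For concreteness, here is a minimal NumPy sketch of the formula above, assuming scaled dot-product attention inside each head as in the paper; the dimensions and random inputs are purely illustrative.

```python
# Multi-head attention: per-head projections W_i^Q, W_i^K, W_i^V, scaled
# dot-product attention per head, concatenation, and a final projection W^O.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d_k)) @ V

def multi_head(Q, K, V, Wq, Wk, Wv, Wo):
    # Wq, Wk, Wv: lists of per-head projection matrices; Wo: output projection
    heads = [attention(Q @ Wq[i], K @ Wk[i], V @ Wv[i]) for i in range(len(Wq))]
    return np.concatenate(heads, axis=-1) @ Wo

rng = np.random.default_rng(0)
h, d_model, d_k, d_v, L = 2, 8, 4, 4, 5
Wq = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
Wk = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
Wv = [rng.normal(size=(d_model, d_v)) for _ in range(h)]
Wo = rng.normal(size=(h * d_v, d_model))

X = rng.normal(size=(L, d_model))          # a sequence of L token representations
out = multi_head(X, X, X, Wq, Wk, Wv, Wo)  # self-attention: Q, K, V all come from X
print(out.shape)                           # (L, d_model)
```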

For unsupervised language-model training such as GPT, $Q$, $K$, and $V$ usually come from the same source, so this operation is also called self-attention.

For the machine-translation task in the second paper, the model first applies self-attention separately to the source and target sequences, and then on top of that applies another attention in which $Q$ comes from the target sequence and $K, V$ come from the source sequence.
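A short sketch of these two wirings, using a bare scaled dot-product attention without learned projections, just to show where $Q$, $K$, and $V$ come from in each case; the sequence lengths and dimensions are made up.

```python
# Self-attention vs. encoder-decoder (cross-) attention: same operation,
# different sources for Q, K, V.
import numpy as np

def attention(Q, K, V):
    E = Q @ K.T / np.sqrt(Q.shape[-1])
    A = np.exp(E) / np.exp(E).sum(axis=-1, keepdims=True)
    return A @ V

rng = np.random.default_rng(0)
src = rng.normal(size=(7, 8))    # source-sequence representations (length 7, dim 8)
tgt = rng.normal(size=(4, 8))    # target-sequence representations (length 4, dim 8)

self_attn  = attention(tgt, tgt, tgt)     # self-attention: Q, K, V from the same sequence
cross_attn = attention(tgt, src, src)     # cross-attention: Q from target, K, V from source
print(self_attn.shape, cross_attn.shape)  # (4, 8) (4, 8)
```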

For recommendation systems, $Q$ can come from the target items, while $K$ and $V$ can come from the user profile and history.
