We observe this kind of redundancy in virtually all neural network architectures, starting with simple fully-connected networks (see diagram below), where the same inputs are mapped to multiple hidden units. Nothing prevents the network from ending up with the same weights here as well.
We fight this with random initialization of the weights. You usually need to initialize all the weights randomly, except in some special cases where initializing with zeros or other values has proved to work better. The optimization algorithms are deterministic, so if all the initial conditions were the same, there would be no reason whatsoever for the same inputs to lead to different outputs.
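To make the symmetry argument concrete, here is a minimal numpy sketch (all sizes and values are illustrative): a tiny two-layer network whose hidden units start with identical weights computes identical activations and receives identical gradients, so the units never differentiate.

```python
import numpy as np

# A tiny network: 2 inputs -> 2 hidden units -> 1 output.
# If both hidden units start with identical weights, they compute the
# same activation and receive the same gradient, so they stay identical
# forever: the network has effectively lost one hidden unit.

def forward(W1, W2, x):
    h = np.tanh(W1 @ x)                    # hidden activations
    return W2 @ h, h

x = np.array([0.5, -1.0])
y = np.array([1.0])

# Symmetric init: both rows of W1 (one per hidden unit) are the same.
W1 = np.array([[0.3, 0.3], [0.3, 0.3]])
W2 = np.array([[0.2, 0.2]])

out, h = forward(W1, W2, x)
assert h[0] == h[1]                        # identical activations

# One gradient step (squared loss) keeps the rows identical.
grad_out = out - y
grad_h = (W2.T @ grad_out) * (1 - h**2)
grad_W1 = np.outer(grad_h, x)
W1 -= 0.1 * grad_W1
assert np.allclose(W1[0], W1[1])           # rows still equal after update
```

Random initialization breaks this symmetry, which is why the hidden units (and, in a Transformer, the attention heads) can learn different features despite being wired identically.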
The same seems to be true for the original attention paper, but to convince yourself you can also check this great "annotated" paper with PyTorch code (or the Keras implementation if you prefer) and this blog post. Unless I missed something in the paper and the implementations, the weights are treated the same in each case, so there are no extra measures to prevent redundancy. In fact, if you look at the code in the "annotated Transformer" post, in the MultiHeadedAttention class you can see that all the weights in the multi-head attention layer are generated using the same kind of nn.Linear layers.
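A hedged numpy sketch of that point (the matrix names and sizes here are my own, not the annotated Transformer's): each head's projections are structurally identical randomly initialized linear maps, and only the random draw distinguishes them.

```python
import numpy as np

# Sketch: every head's W_Q, W_K, W_V is the same kind of randomly
# initialized linear map -- only the random draw tells heads apart.
rng = np.random.default_rng(0)
d_model, n_heads = 8, 2
d_k = d_model // n_heads

# One independently initialized (d_model, d_k) matrix per head, per role.
heads = [{role: rng.normal(scale=d_model**-0.5, size=(d_model, d_k))
          for role in ("W_Q", "W_K", "W_V")}
         for _ in range(n_heads)]

# Structurally identical ...
assert all(h["W_Q"].shape == (d_model, d_k) for h in heads)
# ... but random initialization breaks the symmetry between heads.
assert not np.allclose(heads[0]["W_Q"], heads[1]["W_Q"])
```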
The key/value/query formulation of attention is from the paper Attention Is All You Need.
How should one understand the queries, keys, and values
The key/value/query concept is analogous to retrieval systems. For example, when you search for videos on YouTube, the search engine will map your query (the text in the search bar) against a set of keys (video title, description, etc.) associated with candidate videos in its database, then present you the best-matched videos (values).
The attention operation can be thought of as a retrieval process as well.
As mentioned in the paper you referenced (Neural Machine Translation by Jointly Learning to Align and Translate), attention by definition is just a weighted average of values,
$$c=\sum_{j}\alpha_jh_j$$
where $\sum_j \alpha_j=1$.
If we restrict $\alpha$ to be a one-hot vector, this operation becomes the same as retrieving from a set of elements $h$ with index $\alpha$. With the restriction removed, the attention operation can be thought of as doing "proportional retrieval" according to the probability vector $\alpha$.
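A minimal numpy sketch of this "proportional retrieval" view (the values here are made up for illustration): a one-hot $\alpha$ reduces attention to plain indexing, while a general probability vector gives a weighted average.

```python
import numpy as np

# Values: three 2-d vectors h_j, stored as rows.
H = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [3.0, 3.0]])

# One-hot alpha: attention degenerates into exact retrieval by index.
alpha_hard = np.array([0.0, 1.0, 0.0])
assert np.allclose(alpha_hard @ H, H[1])    # retrieves h_1 exactly

# General alpha (non-negative, sums to 1): "proportional retrieval".
alpha_soft = np.array([0.5, 0.25, 0.25])
c = alpha_soft @ H                          # weighted average of rows
assert np.allclose(c, 0.5 * H[0] + 0.25 * H[1] + 0.25 * H[2])
```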
It should be clear that $h$ in this context is the value. The difference between the two papers lies in how the probability vector $\alpha$ is calculated. The first paper (Bahdanau et al. 2015) computes the score through a neural network $$e_{ij}=a(s_i,h_j), \qquad \alpha_{i,j}=\frac{\exp(e_{ij})}{\sum_k\exp(e_{ik})}$$
where $h_j$ is from the encoder sequence and $s_i$ is from the decoder sequence. One problem with this approach: if the encoder sequence is of length $m$ and the decoder sequence is of length $n$, we have to run the network $m \times n$ times to acquire all the attention scores $e_{ij}$.
A more efficient model would be to first project $s$ and $h$ onto a common space, then choose a similarity measure (e.g. dot product) as the attention score, like
$$e_{ij}=f(s_i)g(h_j)^T$$
so we only have to compute $g(h_j)$ $m$ times and $f(s_i)$ $n$ times to get the projection vectors, and $e_{ij}$ can be computed efficiently by matrix multiplication.
This is essentially the approach proposed by the second paper (Vaswani et al. 2017), where the two projection vectors are called query (for decoder) and key (for encoder), which is well aligned with the concepts in retrieval systems. (There are later techniques to further reduce the computational complexity, for example Reformer, Linformer.)
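A numpy sketch of this efficiency argument, with hypothetical linear projections standing in for $f$ and $g$: after projecting once per position, all $m \times n$ scores come from a single matrix multiplication.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, d_enc, d_dec, d_proj = 5, 4, 8, 6, 3

H = rng.normal(size=(m, d_enc))    # encoder states h_j (rows)
S = rng.normal(size=(n, d_dec))    # decoder states s_i (rows)

# Linear projections f and g into a common d_proj-dimensional space.
Wf = rng.normal(size=(d_dec, d_proj))
Wg = rng.normal(size=(d_enc, d_proj))

Q = S @ Wf    # f(s_i): computed n times -> the "queries"
K = H @ Wg    # g(h_j): computed m times -> the "keys"

# All m*n scores in one matrix multiplication: E[i, j] = f(s_i) . g(h_j)
E = Q @ K.T
assert E.shape == (n, m)
assert np.allclose(E[2, 3], Q[2] @ K[3])
```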
How are the queries, keys, and values obtained
The proposed multi-head attention alone doesn't say much about how the queries, keys, and values are obtained; they can come from different sources depending on the application scenario.
$$
\begin{align}\text{MultiHead($Q$, $K$, $V$)} & = \text{Concat}(\text{head}_1, \dots, \text{head}_h) W^{O} \\
\text{where head$_i$} & = \text{Attention($QW_i^Q$, $KW_i^K$, $VW_i^V$)}
\end{align}$$
Where the projections are parameter matrices:
$$
\begin{align}
W_i^Q & \in \mathbb{R}^{d_\text{model} \times d_k}, \\
W_i^K & \in \mathbb{R}^{d_\text{model} \times d_k}, \\
W_i^V & \in \mathbb{R}^{d_\text{model} \times d_v}, \\
W_i^O & \in \mathbb{R}^{hd_v \times d_{\text{model}}}.
\end{align}$$
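The formulas above can be sketched directly in numpy (a minimal illustration, not the paper's implementation; I use scaled dot-product attention inside each head, as Vaswani et al. do, and per-head projection lists rather than fused matrices):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d_k)) @ V

def multi_head(Q, K, V, WQ, WK, WV, WO):
    # WQ/WK/WV: one (d_model, d_k or d_v) matrix per head; WO: (h*d_v, d_model).
    heads = [attention(Q @ wq, K @ wk, V @ wv)
             for wq, wk, wv in zip(WQ, WK, WV)]
    return np.concatenate(heads, axis=-1) @ WO   # Concat(head_1..head_h) W^O

rng = np.random.default_rng(0)
d_model, h, n = 8, 2, 5
d_k = d_v = d_model // h

X = rng.normal(size=(n, d_model))      # self-attention: Q = K = V = X
WQ = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
WK = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
WV = [rng.normal(size=(d_model, d_v)) for _ in range(h)]
WO = rng.normal(size=(h * d_v, d_model))

out = multi_head(X, X, X, WQ, WK, WV, WO)
assert out.shape == (n, d_model)
```

For cross-attention (the translation case below), one would simply pass the target states as `Q` and the source states as `K` and `V`.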
For unsupervised language model training like GPT, $Q$, $K$, and $V$ usually come from the same source, so such an operation is also called self-attention.
For the machine translation task in the second paper, the model first applies self-attention separately to the source and target sequences, then on top of that applies another attention layer where $Q$ is from the target sequence and $K, V$ are from the source sequence.
For recommendation systems, $Q$ can be from the target items, $K, V$ can be from the user profile and history.
Best Answer
Attention is a method for aggregating a set of vectors $v_i$ into just one vector, often via a lookup vector $u$. Usually, the $v_i$ are either the inputs to the model, the hidden states of previous time steps, or the hidden states one level down (in the case of stacked LSTMs).
The result is often called the context vector $c$, since it contains the context relevant to the current time-step.
This additional context vector $c$ is then fed into the RNN/LSTM as well (it can be simply concatenated with the original input). Therefore, the context can be used to help with prediction.
The simplest way to do this is to compute a probability vector $p = \text{softmax}(V^Tu)$ and $c = \sum_i p_i v_i$, where $V$ is the concatenation of all previous $v_i$. A common lookup vector $u$ is the current hidden state $h_t$.
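This simplest form fits in a few lines of numpy (sizes are illustrative; $V$ holds the previous vectors as columns so that $V^Tu$ gives one logit per vector):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
d, T = 4, 6
V = rng.normal(size=(d, T))   # columns are the previous vectors v_i
u = rng.normal(size=d)        # lookup vector, e.g. current hidden state h_t

p = softmax(V.T @ u)          # p_i proportional to exp(v_i . u)
c = V @ p                     # context vector: sum_i p_i v_i
assert np.isclose(p.sum(), 1.0)
assert np.allclose(c, sum(p[i] * V[:, i] for i in range(T)))
```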
There are many variations on this, and you can make things as complicated as you want. For example, instead of using $v_i^T u$ as the logits, one may choose $f(v_i, u)$, where $f$ is an arbitrary neural network.
A common attention mechanism for sequence-to-sequence models uses $p = \text{softmax}(q^T \tanh(W_1 v_i + W_2 h_t))$, where $v$ are the hidden states of the encoder, and $h_t$ is the current hidden state of the decoder. $q$ and both $W$s are parameters.
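A numpy sketch of that additive scoring function (parameter shapes are my own choices for illustration): each score is $q^T \tanh(W_1 v_i + W_2 h_t)$, and the softmax over them gives the weights.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
T, d_enc, d_dec, d_att = 5, 4, 3, 6

V   = rng.normal(size=(T, d_enc))   # encoder hidden states v_i (rows)
h_t = rng.normal(size=d_dec)        # current decoder hidden state
W1  = rng.normal(size=(d_att, d_enc))
W2  = rng.normal(size=(d_att, d_dec))
q   = rng.normal(size=d_att)

# score_i = q . tanh(W1 v_i + W2 h_t); softmax turns scores into weights
scores = np.array([q @ np.tanh(W1 @ v + W2 @ h_t) for v in V])
p = softmax(scores)
c = p @ V                           # context vector
assert np.isclose(p.sum(), 1.0)
assert c.shape == (d_enc,)
```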
Some papers which show off different variations on the attention idea:
Pointer Networks use attention to reference inputs in order to solve combinatorial optimization problems.
Recurrent Entity Networks maintain separate memory states for different entities (people/objects) while reading text, and update the correct memory state using attention.
Transformer models also make extensive use of attention. Their formulation of attention is slightly more general and also involves key vectors $k_i$: the attention weights $p$ are actually computed between the keys and the lookup, and the context is then constructed with the $v_i$.
Here is a quick implementation of one form of attention, although I can't guarantee correctness beyond the fact that it passed some simple tests.
Basic RNN:
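A minimal numpy sketch of a vanilla RNN step (names and sizes are illustrative, not the original answer's code):

```python
import numpy as np

def rnn_step(x, h_prev, Wx, Wh, b):
    # One vanilla RNN step: new hidden state from input and previous state.
    return np.tanh(Wx @ x + Wh @ h_prev + b)

rng = np.random.default_rng(0)
d_in, d_h = 3, 4
Wx = rng.normal(size=(d_h, d_in))
Wh = rng.normal(size=(d_h, d_h))
b  = np.zeros(d_h)

h = np.zeros(d_h)
for x in rng.normal(size=(5, d_in)):   # run over a length-5 sequence
    h = rnn_step(x, h, Wx, Wh, b)
assert h.shape == (d_h,)
```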
With attention, we add only a few lines before the new hidden state is computed:
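A hedged sketch of those few extra lines (again with made-up names; here the context enters through an additional weight matrix `Wc`, which is one of several reasonable ways to feed it in):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def rnn_step_with_attention(x, h_prev, V, Wx, Wh, Wc, b):
    # The few extra lines: build a context vector from the stored
    # vectors V (columns v_i), using the previous hidden state as lookup.
    p = softmax(V.T @ h_prev)          # attention weights
    c = V @ p                          # context vector
    # Feed the context in alongside the input when computing the new state.
    return np.tanh(Wx @ x + Wh @ h_prev + Wc @ c + b)

rng = np.random.default_rng(0)
d_in, d_h, T = 3, 4, 6
V  = rng.normal(size=(d_h, T))         # e.g. previous hidden states
Wx = rng.normal(size=(d_h, d_in))
Wh = rng.normal(size=(d_h, d_h))
Wc = rng.normal(size=(d_h, d_h))
b  = np.zeros(d_h)

h = rnn_step_with_attention(rng.normal(size=d_in), np.zeros(d_h),
                            V, Wx, Wh, Wc, b)
assert h.shape == (d_h,)
```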