A "typical" attention mechanism might assign the weight $w_i$ to one of the source vectors as $w_i \propto \exp(u_i^Tv)$ where $u_i$ is the $i$th "source" vector and $v$ is the query vector. The attention mechanism described in OP from "Pointer Networks" opts for something slightly more involved: $w_i \propto \exp(q^T \tanh(W_1u_i + W_2v))$, but the main ideas are the same -- you can read my answer here for a more comprehensive exploration of different attention mechanisms.
The tutorial mentioned in the question appears to have the peculiar mechanism
$$w_i \propto \exp(a_i^Tv)$$
where $a_i$ is the $i$th row of a learned weight matrix $A$. I say that it is peculiar because the weight on the $i$th input element does not depend on any of the $u_i$ at all! In fact, we can view this mechanism as attention over word slots -- how much attention to pay to the first word, the second word, the third word, and so on -- without any regard for which words are occupying which slots.
Since $A$ is a learned weight matrix, it must be fixed in size, so the number of word slots must also be fixed, which means the input sequence length must be constant (shorter inputs can be padded). Of course, this peculiar attention mechanism doesn't really make much sense, so I wouldn't read too much into it.
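To see the input-independence concretely, here is a minimal sketch of that mechanism (the function name is my own):

```python
import numpy as np

# "Slot" attention from the tutorial: w_i ∝ exp(a_i^T v).
# Note the inputs u_i never appear -- the weights depend only on position i.
def slot_attention_weights(A, v):
    scores = A @ v                      # (n_slots,)
    w = np.exp(scores - scores.max())
    return w / w.sum()
```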
Regarding length limitations in general: attention mechanisms impose no hard limit on sequence length, only a soft one: longer sequences require more memory, and memory usage scales quadratically with sequence length (compare this to the linear memory usage of vanilla RNNs).
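As a back-of-the-envelope calculation (assuming one float32 score per token pair):

```python
# An n x n attention score matrix at 4 bytes per entry:
for n in (1_000, 10_000, 100_000):
    print(f"n={n:>7}: {n * n * 4 / 1e9:.3f} GB")   # 0.004, 0.400, 40.000 GB
```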
I skimmed the "Effective Approaches to Attention-based Neural Machine Translation" paper mentioned in the question, and from what I can tell they propose a two-stage attention mechanism: in the decoder, the network selects a fixed sized window of the input of the encoder outputs to focus on. Then, attention is applied across only those source vectors within the fixed sized window. This is more efficient than typical "global" attention mechanisms.
The key/value/query formulation of attention is from the paper Attention Is All You Need.
How should one understand the queries, keys, and values
The key/value/query concept is analogous to retrieval systems. For example, when you search for videos on YouTube, the search engine maps your query (the text in the search bar) against a set of keys (video title, description, etc.) associated with candidate videos in its database, then presents you with the best-matched videos (values).
The attention operation can be thought of as a retrieval process as well.
As mentioned in the paper you referenced (Neural Machine Translation by Jointly Learning to Align and Translate), attention by definition is just a weighted average of values,
$$c=\sum_{j}\alpha_jh_j$$
where $\sum \alpha_j=1$.
If we restrict $\alpha$ to be a one-hot vector, this operation becomes the same as retrieving from a set of elements $h$ with index $\alpha$. With the restriction removed, the attention operation can be thought of as doing "proportional retrieval" according to the probability vector $\alpha$.
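A tiny NumPy illustration of this "proportional retrieval" view:

```python
import numpy as np

h = np.array([[1., 0.], [0., 1.], [2., 2.]])   # three value vectors
alpha_onehot = np.array([0., 1., 0.])          # "retrieve element 1"
print(alpha_onehot @ h)                        # [0. 1.] == h[1], exact retrieval

alpha_soft = np.array([0.1, 0.7, 0.2])         # proportional retrieval
print(alpha_soft @ h)                          # [0.5 1.1], a convex blend of the rows
```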
It should be clear that $h$ in this context is the value. The difference between the two papers lies in how the probability vector $\alpha$ is calculated. The first paper (Bahdanau et al. 2015) computes the score through a neural network $$e_{ij}=a(s_i,h_j), \qquad \alpha_{i,j}=\frac{\exp(e_{ij})}{\sum_k\exp(e_{ik})}$$
where $h_j$ is from the encoder sequence and $s_i$ is from the decoder sequence. One problem with this approach: if the encoder sequence has length $m$ and the decoder sequence has length $n$, we have to run the network $m \times n$ times to acquire all the attention scores $e_{ij}$.
A more efficient model would be to first project $s$ and $h$ onto a common space, then choose a similarity measure (e.g. dot product) as the attention score, like
$$e_{ij}=f(s_i)g(h_j)^T$$
so we only have to compute $g(h_j)$ $m$ times and $f(s_i)$ $n$ times to get the projection vectors, and $e_{ij}$ can be computed efficiently by matrix multiplication.
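A minimal sketch with linear projections standing in for $f$ and $g$ (the specific shapes are arbitrary):

```python
import numpy as np

m, n, d, p = 6, 4, 8, 5                 # made-up sizes
rng = np.random.default_rng(0)
H = rng.normal(size=(m, d))             # encoder states h_j
S = rng.normal(size=(n, d))             # decoder states s_i
Wg, Wf = rng.normal(size=(d, p)), rng.normal(size=(d, p))

K = H @ Wg      # g(h_j): m projections, computed once ("keys")
Q = S @ Wf      # f(s_i): n projections, computed once ("queries")
E = Q @ K.T     # all n*m scores e_ij in a single matrix multiplication
```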
This is essentially the approach proposed by the second paper (Vaswani et al. 2017), where the two projection vectors are called query (for decoder) and key (for encoder), which is well aligned with the concepts in retrieval systems. (There are later techniques to further reduce the computational complexity, for example Reformer, Linformer.)
How are the queries, keys, and values obtained
The proposed multihead attention alone doesn't say much about how the queries, keys, and values are obtained; they can come from different sources depending on the application scenario.
$$
\begin{align}
\text{MultiHead}(Q, K, V) & = \text{Concat}(\text{head}_1, \dots, \text{head}_h) W^{O} \\
\text{where head}_i & = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)
\end{align}$$
where the projections are parameter matrices:
$$
\begin{align}
W_i^Q & \in \mathbb{R}^{d_\text{model} \times d_k}, \\
W_i^K & \in \mathbb{R}^{d_\text{model} \times d_k}, \\
W_i^V & \in \mathbb{R}^{d_\text{model} \times d_v}, \\
W^O & \in \mathbb{R}^{hd_v \times d_{\text{model}}}.
\end{align}$$
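To make the equations concrete, here is a minimal NumPy sketch of scaled dot-product attention and the multihead wrapper (the dimensions and toy inputs are my own choices):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d_k)) @ V

def multihead(Q, K, V, WQ, WK, WV, WO):
    """MultiHead(Q,K,V) = Concat(head_1..head_h) W^O,
    with head_i = Attention(Q WQ[i], K WK[i], V WV[i])."""
    heads = [attention(Q @ WQ[i], K @ WK[i], V @ WV[i])
             for i in range(len(WQ))]
    return np.concatenate(heads, axis=-1) @ WO

rng = np.random.default_rng(0)
h, d_model, d_k, d_v = 2, 16, 8, 8
WQ = rng.normal(size=(h, d_model, d_k))
WK = rng.normal(size=(h, d_model, d_k))
WV = rng.normal(size=(h, d_model, d_v))
WO = rng.normal(size=(h * d_v, d_model))
X = rng.normal(size=(5, d_model))                # a toy sequence of 5 tokens
print(multihead(X, X, X, WQ, WK, WV, WO).shape)  # (5, 16)
```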
For unsupervised language model training like GPT, $Q$, $K$, and $V$ usually come from the same source, so this operation is also called self-attention.
For the machine translation task in the second paper, self-attention is first applied separately to the source and target sequences; then, on top of that, another attention is applied in which $Q$ comes from the target sequence and $K, V$ come from the source sequence.
For recommendation systems, $Q$ can be from the target items, $K, V$ can be from the user profile and history.
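Continuing the sketch above, these scenarios differ only in which tensors are passed in as $Q$, $K$, and $V$ (reusing `multihead` and the weights from the previous snippet):

```python
# Self-attention (GPT-style): Q, K, V all come from the same sequence X.
out_self = multihead(X, X, X, WQ, WK, WV, WO)

# Encoder-decoder (cross) attention: Q from the target, K and V from the source.
src = rng.normal(size=(7, d_model))                    # source sequence, 7 tokens
tgt = rng.normal(size=(5, d_model))                    # target sequence, 5 tokens
out_cross = multihead(tgt, src, src, WQ, WK, WV, WO)   # shape (5, d_model)
```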
Best Answer
I think the paper you quote is just wrong.
Luong only generalizes Bahdanau's equations by replacing the single-layer MLP with a general score function (and shows that the dot product can work equally well as the MLP), but it still scores the "similarity" of the decoder state and exactly one encoder state.
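For reference, the score functions Luong et al. consider are (in their notation, with decoder state $h_t$ and encoder state $\bar{h}_s$):

$$\text{score}(h_t, \bar{h}_s) =
\begin{cases}
h_t^\top \bar{h}_s & \text{dot} \\
h_t^\top W_a \bar{h}_s & \text{general} \\
v_a^\top \tanh\left(W_a [h_t; \bar{h}_s]\right) & \text{concat}
\end{cases}$$

Each variant compares the decoder state against exactly one encoder state $\bar{h}_s$, just as Bahdanau's MLP does.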