Transformers – Understanding the Meaning of the Value Matrix in Self-Attention Mechanisms

attention, transformers

I'm trying to understand how the self-attention mechanism of the transformer architecture (as proposed by Vaswani et al.) works in detail. I get that self-attention is attention from a token of a sequence to the tokens of the same sequence.

The paper uses the concepts of query, key and value, which are apparently derived from retrieval systems. I don't really understand the use of the value. I found this thread, but I don't really get the answer there either.

So let's take an example. Let's say the input sequence is "This forum is awesome". To calculate the query vector, I linearly transform the current token (e.g. "This") with a weight matrix $W_Q$ that is learned during training. In practice, this is apparently bundled into a query matrix $Q$ containing the queries of all tokens. I do the same for every token with another matrix $W_K$, which gives me the key matrix $K$.

With the scaled dot product I calculate the similarity between my query $\mathrm{embedding}(\text{"This"})\cdot W_Q$ and the keys $\mathrm{embedding}(\text{token}) \cdot W_K$ for each token, and see which tokens are relevant for "This". (<- is that right?) Now, why do I need to multiply this by the value matrix, and where does that matrix come from? What's the difference between key and value?
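To check my understanding, here is roughly what I mean as a small NumPy sketch (the shapes and the random embeddings/weights are just stand-ins for illustration, not the real trained model):

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, d_k = 512, 64                  # hidden size and query/key size (per head)
tokens = ["This", "forum", "is", "awesome"]

# stand-in embeddings, one row per token (random here, learned in the real model)
E = rng.normal(size=(len(tokens), d_model))

W_Q = rng.normal(size=(d_model, d_k))   # learned during training
W_K = rng.normal(size=(d_model, d_k))   # learned during training

q = E[0] @ W_Q                          # query for "This"
K = E @ W_K                             # keys for all tokens

scores = q @ K.T / np.sqrt(d_k)         # scaled dot products: how relevant is each token for "This"?
print(scores)                           # one score per token in the sequence
```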

Thanks in advance!

Best Answer

Because Transformers are black-box models, it is hard to say what the keys and values really are, but the motivation is that you might want to retrieve something other than what you are searching by.

Imagine an SQL-like query: get the phone numbers of people whose name is similar to "Jindrich". "Jindrich" is the query, the criterion for the search. But you do not want similar names back from the database, you want the phone numbers. The phone numbers are the values in this case. The keys are the names already stored in the phonebook.
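A plain Python dictionary makes the key/value distinction obvious (the names and numbers below are made up purely for illustration): the key is what you match against, the value is what you get back.

```python
# Purely illustrative: made-up names and phone numbers.
phonebook = {
    "Jindrich": "+420 111 111 111",
    "Jindra":   "+420 222 222 222",
    "Anna":     "+420 333 333 333",
}

# Hard retrieval: an exact key match returns the value, never the key itself.
print(phonebook["Jindrich"])        # -> "+420 111 111 111"

# Attention is the soft version of this: every key gets a similarity weight
# with respect to the query, and the result is a weighted mix of all values.
```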

The projections for the keys and values in the Transformer model can be understood as extracting a relevant piece of information from the hidden states. E.g., in the Transformer Base architecture, the hidden states are 512-dimensional, but the "extracted" keys and values are only 64-dimensional (per attention head).

Regarding the multiplication: For simplicity, let's assume we have just one query vector $q$ (and not the full matrix $Q$). First, you compute a similarity score for each of the keys:

$$ \alpha = \mathrm{softmax}\!\left(\frac{qK^\top}{\sqrt{d}}\right) = \mathrm{softmax}\!\left( \frac{(q \cdot k_0,\; q \cdot k_1,\; \ldots,\; q \cdot k_n)}{\sqrt d} \right) $$

The distribution $\alpha$ is a single vector of weights that only tells you how relevant each key $k_i \in K$ is for the query $q$. In other words, it tells you at which positions to retrieve, but you still need something to retrieve, and that is what the values are for:

$$\alpha V = \sum_{i=0}^{n} \alpha_i \cdot v_i $$
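To make these two formulas concrete, here is a minimal NumPy sketch (the dimensions and random vectors are made up; in the real model $q$, $K$ and $V$ come from the learned projections): $\alpha$ says where to look, and $\alpha V$ is what you actually retrieve.

```python
import numpy as np

def softmax(x):
    # numerically stable softmax
    e = np.exp(x - x.max())
    return e / e.sum()

d, n = 64, 4                           # key/value size and number of positions
rng = np.random.default_rng(0)

q = rng.normal(size=(d,))              # one query vector
K = rng.normal(size=(n, d))            # one key per position (rows)
V = rng.normal(size=(n, d))            # one value per position (rows)

alpha = softmax(q @ K.T / np.sqrt(d))  # attention weights, sum to 1
output = alpha @ V                     # weighted sum of the values: what is retrieved

print(alpha)                           # how much each position contributes
print(output.shape)                    # (64,) -- same size as a single value vector
```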
