Solved – Why K and V are not the same in Transformer attention

attention · natural language · neural networks

My understanding is that for a translation task K should be the same as V, but in the Transformer K and V are generated by two different (randomly initialized) matrices $W^K$ and $W^V$, and are therefore not the same. Can anyone tell me why?

Best Answer

I guess the specific terms "query", "key" and "value" were chosen because this attention mechanism resembles memory access. The query is the element for which we seek a representation, the keys respond more or less strongly to the query, and the values are combined to compose the answer. Keys and values are necessarily related, but they do not play the same role.
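
For reference, this is the scaled dot-product attention from *Attention Is All You Need*, with queries, keys and values produced by separate learned projections (written here for self-attention over an input matrix $X$):

$$\mathrm{Attention}(Q, K, V) = \operatorname{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V, \qquad Q = XW^Q,\quad K = XW^K,\quad V = XW^V$$

The keys only enter through the dot products that set the weights; the values are what actually get averaged into the output.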

For example, given the query word "network", you might want the key words "neural" and "social" to receive high weights, since "neural network" and "social network" are common terms. This means that the dot products between the query and these two keys are high, so the two key vectors are similar. Nevertheless, the values for "neural" and "social" should be dissimilar, since they relate to different topics. Using the same representation for keys and values would not allow this.
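
To make that concrete, here is a minimal NumPy sketch of scaled dot-product attention with separate $W^Q$, $W^K$, $W^V$ projections. The embeddings, dimensions and random weights are made up for illustration; the point is only that the weights come from query–key dot products, while what gets mixed into the output are the independently projected values.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_k, d_v = 8, 4, 4

# Toy (hypothetical) embeddings: one query token ("network")
# and two context tokens ("neural", "social").
x_query = rng.normal(size=(1, d_model))
x_context = rng.normal(size=(2, d_model))

# Separate, independently initialized projections, as in the Transformer.
W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_v))

Q = x_query @ W_Q        # (1, d_k)
K = x_context @ W_K      # (2, d_k)
V = x_context @ W_V      # (2, d_v)

# Scaled dot-product attention.
scores = Q @ K.T / np.sqrt(d_k)                                   # query-key similarities
weights = np.exp(scores) / np.exp(scores).sum(-1, keepdims=True)  # softmax over keys
output = weights @ V                                              # weighted mix of values

print(weights)  # both keys can receive high weight ...
print(output)   # ... while the value vectors they contribute stay distinct
```

Because K and V come from different projections, two tokens can have nearly identical keys (both match the query) while carrying very different values.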

Using the same transformation for keys and values might still work to some extent, but you would lose a lot of expressiveness and might need many more parameters to achieve similar performance.
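
Continuing the toy sketch above, the tied variant would amount to reusing a single projection, so keys and values are forced to be the same vectors:

```python
# Hypothetical tied variant: one shared projection plays both roles.
W_KV = rng.normal(size=(d_model, d_k))
K = V = x_context @ W_KV  # whatever matches the query is also what gets returned
```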

EDIT: I just found a better explanation of the query, key and value terms in this post.