Transformers – How Different Attention Weights Occur for the Same Word in Various Sentences

natural language, neural networks, transformers

I'm trying to understand the transformer architecture for NLP.
My main question is about the attention weights: the same word can have different attention weights in different sentences, right?

Best Answer

Yes. This holds not only for the Transformer but for nearly any deep learning NLP model: how a word is treated depends on the words around it. Only a bag-of-words model treats a sentence as a collection of independent words. That is how, for example, a naive Bayes classifier sees a sentence. In many cases this is enough, and it is not a bad model.

In natural language, however, the context and order of words matter a lot, and different models account for context in different ways. Recurrent neural networks combine the current word with a summary of the previously observed words (recurrence). Architectures like the Transformer look at the whole sentence (assuming a sentence-level model) together with the positions of the words in it (position embeddings), and then weight the words using attention weights. Those weights are computed from the input itself, so the same word gets different weights when the surrounding words or its position change. If these models didn't do this, they would be as "dumb" as naive Bayes.
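To see why the same word ends up with different attention weights in different sentences, here is a minimal sketch of single-head scaled dot-product attention in plain Python. Everything in it is an assumption made for illustration: the six-word vocabulary, the random (untrained) embeddings and projection matrices `W_Q` and `W_K`, the 4-dimensional vectors, and the two example sentences. A real Transformer learns these parameters and uses many heads and layers, but the mechanism is the same: the weights come from a softmax over query-key dot products, so they depend on every token in the input.

```python
import math
import random

random.seed(0)  # fixed seed so the illustrative parameters are reproducible
DIM = 4         # toy embedding size (assumption)
VOCAB = ["the", "bank", "river", "flooded", "approved", "loan"]


def rand_vec(n):
    """Random vector standing in for learned parameters."""
    return [random.gauss(0, 1) for _ in range(n)]


# Hypothetical, untrained parameters: one embedding per word,
# plus query and key projection matrices.
EMBED = {word: rand_vec(DIM) for word in VOCAB}
W_Q = [rand_vec(DIM) for _ in range(DIM)]
W_K = [rand_vec(DIM) for _ in range(DIM)]


def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]


def position_encoding(pos, dim=DIM):
    """Sinusoidal position encoding: sin on even indices, cos on odd ones."""
    return [
        math.sin(pos / 10000 ** (i / dim)) if i % 2 == 0
        else math.cos(pos / 10000 ** ((i - 1) / dim))
        for i in range(dim)
    ]


def softmax(scores):
    """Numerically stable softmax."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(sentence, word):
    """Weights that `word` (as the query) assigns to each token of `sentence`."""
    tokens = sentence.split()
    # Input to attention = word embedding + position encoding.
    inputs = [
        [e + p for e, p in zip(EMBED[tok], position_encoding(i))]
        for i, tok in enumerate(tokens)
    ]
    query = matvec(W_Q, inputs[tokens.index(word)])
    keys = [matvec(W_K, x) for x in inputs]
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(DIM)
        for key in keys
    ]
    return list(zip(tokens, softmax(scores)))


for sentence in ("the river bank flooded", "the bank approved loan"):
    print(sentence)
    for token, weight in attention_weights(sentence, "bank"):
        print(f"  bank -> {token}: {weight:.3f}")
```

The embedding of "bank" is identical in both calls. What changes is its position and the set of keys it is compared against, and that is enough to change the whole weight distribution. A bag-of-words model has no counterpart to this step, which is why it treats "bank" the same way in every sentence.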