About basic understanding of attention mechanism and model weights

attention, image processing, neural networks

In the image domain, suppose we want to decide whether a given image contains a bird (label 1 = bird, label 0 = no bird). The attention mechanism helps the model pay more attention to the bird and, in effect, blur out (give less attention to) the rest of the image. What I do not understand is this: isn't that the same logic as the weights we learn in a deep model? The system already learns what should be given more importance in order to predict the labels for the input. I can't see the difference between attention and that weight logic.

Best Answer

Backpropagation

Deep neural networks (DNNs) learn (in 99.99% of the cases) via backpropagation combined with stochastic gradient descent. That is, the model "sees" that it could reduce the cost on the current batch a little by changing the weights a little, so it does. Repeating this millions of times leads you to a local minimum of the cost. This always happens, including when your network uses attention mechanisms.
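To make that loop concrete, here is a minimal sketch (my own construction, not part of the question or answer) of a single backpropagation-plus-gradient-descent step on a tiny logistic-regression "model"; the toy data, variable names, and learning rate are all made up for illustration:

```python
# Minimal sketch: one gradient-descent step, i.e. "change the weights a little
# to reduce the cost of the current batch a little". All data here is random.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 4))          # a small batch of 8 inputs
y = rng.integers(0, 2, size=(8, 1))  # binary labels (bird / no bird)

W = rng.normal(scale=0.1, size=(4, 1))  # the model's weights
b = np.zeros((1, 1))
lr = 0.1                                # learning rate

# forward pass: a single sigmoid unit as the simplest possible "model"
z = X @ W + b
p = 1.0 / (1.0 + np.exp(-z))
loss = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# backward pass: gradients of the loss w.r.t. W and b
dz = (p - y) / len(X)
dW = X.T @ dz
db = dz.sum(axis=0, keepdims=True)

# the gradient-descent step: nudge the weights against the gradient
W -= lr * dW
b -= lr * db
print(f"loss before step: {loss:.4f}")
```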

Inductive Bias

Now, people learned the hard way that simply building very deep feedforward NNs (FNNs), even with the very useful weight-initialization ideas of e.g. Glorot or He, would not get them very far: deep networks still didn't perform too well. What was needed was some help from humans, who substituted ordinary FNNs with special-architecture DNNs that they figured would make it easier for backpropagation to learn well. Those new architectures are essentially assumptions about what is actually needed and what can be removed from the unrestricted FNN. Applying human insight to restrict the model architecture in this way is also called "providing inductive bias". The first and most famous example of inductive bias for DNNs was the convolutional neural network (CNN): Yann LeCun figured that a DNN learns better if some layers are restricted to convolutions.
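As a rough illustration of what "restricting" the unrestricted FNN means, here is a small sketch (my own, with made-up image and kernel sizes) comparing the parameter count of a fully connected layer over an image with that of a single shared 3x3 convolution kernel:

```python
# Minimal sketch: a convolution is a fully connected layer restricted to
# local, weight-shared connections -- far fewer free parameters.
import numpy as np

H, W = 28, 28            # a small grayscale image
image = np.random.default_rng(1).normal(size=(H, W))

# Unrestricted FNN layer: every input pixel connects to every output pixel.
dense_params = (H * W) * (H * W)

# Convolutional layer: one shared 3x3 kernel slides over the image.
kernel = np.random.default_rng(2).normal(size=(3, 3))
conv_params = kernel.size

def conv2d_valid(x, k):
    """Naive 'valid' 2D convolution (cross-correlation), for illustration only."""
    kh, kw = k.shape
    out = np.empty((x.shape[0] - kh + 1, x.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

feature_map = conv2d_valid(image, kernel)
print(f"dense layer parameters: {dense_params}")      # 614656
print(f"conv layer parameters:  {conv_params}")       # 9
print(f"feature map shape:      {feature_map.shape}") # (26, 26)
```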

And the same is true for the inductive bias of all the other architectures like RNNs, LSTMs, WNNs, ResNets...

Many variants of attention mechanisms

The inductive bias called "attention mechanism" is no different. Once again, ingenious humans (i.e. not an AI) figured that a certain architecture would improve performance. There are many kinds of attention mechanisms, including rather sophisticated versions such as the self-attention used in transformers. And it is not immediately evident that those special structures improve performance: it takes a lot of intuition, and even more plain trial and error, to come up with a new architecture that works.
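For concreteness, here is a minimal single-head sketch of the scaled dot-product self-attention used in transformers; the dimensions, the random projections, and the omission of masking and multiple heads are my own simplifications:

```python
# Minimal sketch of single-head scaled dot-product self-attention.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
n, d_model, d_k = 5, 16, 8          # 5 input tokens/patches, model dim 16

X = rng.normal(size=(n, d_model))   # input representations
W_q = rng.normal(scale=0.1, size=(d_model, d_k))  # learned projections
W_k = rng.normal(scale=0.1, size=(d_model, d_k))
W_v = rng.normal(scale=0.1, size=(d_model, d_k))

Q, K, V = X @ W_q, X @ W_k, X @ W_v
scores = Q @ K.T / np.sqrt(d_k)     # how much each element attends to each other
weights = softmax(scores, axis=-1)  # each row sums to 1
output = weights @ V                # context-aware mixture of the values

print(weights.round(2))             # the "attention" pattern
print(output.shape)                 # (5, 8)
```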

But they are all the same in that they are simply special DNN architectures that make it easier for backpropagation to learn the right answers.

Fundamental idea of attention mechanisms

As mentioned, attention mechanisms can get very complicated, and intuition is soon lost. But, roughly, for a new architecture to be called an "attention mechanism", it should provide some kind of context and some attention module that uses this context to control the inference. One famous simple approach is to take the scalar product of the context with each of the input vectors, normalize the resulting scores, and use them as weights for those input vectors. The inductive bias is thus everything you provide in this special architecture: the extra context neurons, the scalar products per input partition, and the weighting based on them. And once you have provided all of this new architecture, you just keep your fingers crossed that backpropagation will actually figure out how to use it to improve performance.
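A minimal sketch of that simple scheme could look like the following; the variable names, the softmax as the normalization step, and the final sum-pooling are my own illustrative choices:

```python
# Minimal sketch: a context vector is dot-multiplied with each input vector,
# the scores are normalized, and the result re-weights the input vectors.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
parts = rng.normal(size=(6, 4))    # 6 input partitions (e.g. image regions)
context = rng.normal(size=(4,))    # the extra "context" vector

scores = parts @ context           # scalar product per input partition
alpha = softmax(scores)            # normalize into attention weights
attended = alpha[:, None] * parts  # re-weight each partition
pooled = attended.sum(axis=0)      # e.g. a pooled representation fed downstream

print(alpha.round(3))              # larger weight = more "attention"
print(pooled.shape)                # (4,)
```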
