Solved – Why global average pooling is able to work correctly

classification, deep learning, neural networks, pooling

Section 3.2 of Network in Network talks about global average pooling (GAP).

As I understand it, GAP averages all the (x, y) values of one feature map into a single value, and this value is then sent to the softmax function for classification.
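
Concretely, I picture it something like this (a minimal NumPy sketch; the shapes and the one-feature-map-per-class layout are my reading of how the paper sets it up):

import numpy as np

# Hypothetical last-layer output: one H x W feature map per class,
# shape (num_classes, H, W).
feature_maps = np.random.rand(3, 7, 7)

# Global average pooling: collapse each H x W map into one scalar.
gap = feature_maps.mean(axis=(1, 2))        # shape: (num_classes,)

# Softmax over the pooled values gives the class probabilities.
probs = np.exp(gap) / np.exp(gap).sum()
print(probs)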

Why does this work? I can easily construct counter-examples whose feature maps differ but produce identical (or, in more practical cases, similar) GAP outputs.

For example, if we have the last layer feature map of

0.9 0.1 0.9
0.1 0.5 0.1
0.9 0.1 0.9

which represents a dog, then the GAP output is ((0.9 + 0.1) * 4 + 0.5) / 9 = 0.5.

Suppose another feature map, for a cat, is

0.1 0.9 0.1
0.9 0.5 0.9
0.1 0.9 0.1

Maybe there is yet another animal, neither a cat nor a dog, whose feature map is

0.5 0.5 0.5
0.5 0.5 0.5
0.5 0.5 0.5

In all three cases, the GAP output is 0.5.
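
A quick check of the arithmetic (NumPy, values copied from the maps above):

import numpy as np

dog   = np.array([[0.9, 0.1, 0.9],
                  [0.1, 0.5, 0.1],
                  [0.9, 0.1, 0.9]])
cat   = np.array([[0.1, 0.9, 0.1],
                  [0.9, 0.5, 0.9],
                  [0.1, 0.9, 0.1]])
other = np.full((3, 3), 0.5)

# GAP reduces each map to its mean -- all three give 0.5.
print(dog.mean(), cat.mean(), other.mean())  # 0.5 0.5 0.5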

Say 0.9 means the cat/dog feature is strongly present, 0.1 means it is unlikely, and 0.5 means uncertain.

A mix of strong and weak activations can have the same GAP average as a map of uniformly uncertain activations. With only one GAP output value, how can we tell what this 0.5 represents?

Is there anything wrong with my understanding or example?
Why is GAP able to classify correctly?

Best Answer

You are correct with your counter-example: a network outputting such feature maps would not work with GAP. However, if the network did output such feature maps, it would also have a high training error. I expect that networks trained with GAP must therefore learn to output feature maps that do not suffer from this issue.

That means: if there is no dog in the picture, the dog feature map must be close to zero everywhere. On the other hand, if there is a cat in the picture, it should be detected at as many locations as possible (this depends mainly on the receptive field of the convolutions in each layer, i.e. how large a portion of the original input image is used to compute each point in the feature map).
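
Here is a toy illustration of that point (my own made-up numbers, not from the paper): once the maps behave this way, GAP followed by softmax separates the classes even though each map is reduced to a single number.

import numpy as np

# Toy maps a trained GAP network might produce for a cat image:
# the cat map fires at many locations, the dog map stays near zero.
cat_map = np.array([[0.8, 0.9, 0.7],
                    [0.9, 1.0, 0.9],
                    [0.7, 0.9, 0.8]])
dog_map = np.array([[0.0, 0.1, 0.0],
                    [0.1, 0.0, 0.1],
                    [0.0, 0.1, 0.0]])

gap = np.array([cat_map.mean(), dog_map.mean()])  # roughly [0.84, 0.04]
probs = np.exp(gap) / np.exp(gap).sum()
print(probs)  # the cat class gets the higher probability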

Generally, I have not seen GAP being used very often these days (the paper dates back to 2013).
