Understanding max-pooling and loss of information

conv-neural-network, pooling

In reading a blog post, I encountered the following paragraph:

This is a classical convolutional neural network with three
convolutional layers, followed by two fully connected layers. People
familiar with object recognition networks may notice that there are no
pooling layers. But if you really think about that, then pooling
layers buy you a translation invariance – the network becomes
insensitive to the location of an object in the image. That makes
perfectly sense for a classification task like ImageNet, but for games
the location of the ball is crucial in determining the potential
reward and we wouldn’t want to discard this information!

But their architecture uses convolution with stride, so the data is still being downsampled. My intuition about max pooling is that it happens after the learned filters are applied to the data, so the network learns what information is useful to pass to the next layer; if downsampling has to be performed anyway, it might as well be pooling, so that we get the invariance benefits at the same time. What am I missing? How does convolution with stride > 1 preserve spatial information better than stride = 1 with pooling?

Best Answer

Max pooling loses information in the sense that, within each pooling window, it tells you whether a filtered feature was encountered or not, but forgets where in the window it occurred, how many times, and so on.

Suppose your filter is looking for vertical stripes in the image. Without max pooling, its output marks every place where a stripe was found. With max pooling, it only tells you whether there were stripes in each pooled region of the filter output: pretty much a zero or a one per region, as opposed to the whole image with the stripes marked on it with ones. In this regard, max pooling can be viewed as a very crude form of compression.
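
As a minimal sketch of this loss, assuming non-overlapping 2x2 max pooling over a small made-up response map (the helper `max_pool_2x2` and the arrays below are illustrative, not taken from the blog post or any particular library), three different filter outputs collapse to the same pooled result:

```python
import numpy as np

def max_pool_2x2(x):
    """Non-overlapping 2x2 max pooling on a 2D array with even height and width."""
    h, w = x.shape
    # Split each axis into (block index, position inside block), then take
    # the maximum over the two "inside block" axes.
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

# Hypothetical 4x4 filter responses (1.0 = "stripe detected here").
a = np.zeros((4, 4))
a[0, 0] = 1.0            # feature in the top-left corner of the first window

b = np.zeros((4, 4))
b[1, 1] = 1.0            # same feature, shifted inside the same window

c = np.zeros((4, 4))
c[0, 0] = c[1, 1] = 1.0  # feature detected twice inside the same window

pooled_a = max_pool_2x2(a)
pooled_b = max_pool_2x2(b)
pooled_c = max_pool_2x2(c)

print(pooled_a)
# Position inside the window and number of detections are both gone.
print(np.array_equal(pooled_a, pooled_b), np.array_equal(pooled_a, pooled_c))
```

The pooled maps still record which 2x2 window fired, so coarse location survives; what is discarded is everything finer than the window.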

It's quite surprising that max pooling actually works, given how crude it is. One reason it does work is that you usually run a battery of filters. For instance, you may run filters for vertical stripes, horizontal stripes, and stripes at -45 and +45 degrees, then max pool their outputs. If you're looking for a rectangular box in the image, getting a ONE from the -45 and +45 degree stripe filters and a ZERO from the vertical and horizontal stripe filters after max pooling may suggest that the box is tilted in the image.
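
The battery-of-filters idea can be sketched in NumPy. Everything here is an assumption made for illustration: the four hand-written 3x3 line detectors stand in for learned filters, the pooling is a single global max per filter, and the two 7x7 box outlines are toy inputs. The names `diag_down` and `diag_up` label the two diagonal directions without committing to which one is "+45".

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

# Hand-written 3x3 line detectors (hypothetical stand-ins for learned filters).
FILTERS = {
    "vertical":   np.array([[-1, 2, -1], [-1, 2, -1], [-1, 2, -1]]),
    "horizontal": np.array([[-1, -1, -1], [2, 2, 2], [-1, -1, -1]]),
    "diag_down":  np.array([[2, -1, -1], [-1, 2, -1], [-1, -1, 2]]),
    "diag_up":    np.array([[-1, -1, 2], [-1, 2, -1], [2, -1, -1]]),
}

def global_max_responses(img):
    """Apply every filter (valid cross-correlation), then global max pool."""
    windows = sliding_window_view(img, (3, 3))  # shape (H-2, W-2, 3, 3)
    return {
        name: float((windows * kernel).sum(axis=(2, 3)).max())
        for name, kernel in FILTERS.items()
    }

# Toy input 1: outline of a box rotated by 45 degrees (a diamond).
rows, cols = np.indices((7, 7))
tilted_box = (np.abs(rows - 3) + np.abs(cols - 3) == 3).astype(float)

# Toy input 2: outline of an axis-aligned box.
upright_box = np.zeros((7, 7))
upright_box[1, 1:6] = upright_box[5, 1:6] = 1.0
upright_box[1:6, 1] = upright_box[1:6, 5] = 1.0

tilted = global_max_responses(tilted_box)
upright = global_max_responses(upright_box)
print("tilted: ", tilted)
print("upright:", upright)
```

For the tilted outline the two diagonal detectors give the strongest pooled responses, and for the upright outline the vertical and horizontal ones do. Thresholding those pooled values would give the ONE/ZERO pattern described above, even though each pooled number says nothing about where in the image the stripes were.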
