Solved – Why does overlapping pooling help reduce overfitting in conv nets?

conv-neural-network, deep-learning, overfitting

In the seminal paper on ImageNet classification with deep convolutional neural networks by Krizhevsky et al. (2012), the authors discuss overlapping pooling in Section 3.4:

> Pooling layers in CNNs summarize the outputs of neighboring groups of neurons in the same kernel map. Traditionally, the neighborhoods summarized by adjacent pooling units do not overlap (e.g., [17, 11, 4]). To be more precise, a pooling layer can be thought of as consisting of a grid of pooling units spaced $s$ pixels apart, each summarizing a neighborhood of size $z \times z$ centered at the location of the pooling unit. If we set $s = z$, we obtain traditional local pooling as commonly employed in CNNs. If we set $s < z$, we obtain overlapping pooling. This is what we use throughout our network, with $s = 2$ and $z = 3$. This scheme reduces the top-1 and top-5 error rates by 0.4% and 0.3%, respectively, as compared with the non-overlapping scheme $s = 2$, $z = 2$, which produces output of equivalent dimensions. We generally observe during training that models with overlapping pooling find it slightly more difficult to overfit.
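
(As a side note on the "equivalent dimensions" claim: using the standard output-size formula $\lfloor (W - z)/s \rfloor + 1$, which the paper does not spell out, a $55 \times 55$ feature map — the size after the first convolution in that network — pools to $\lfloor (55 - 3)/2 \rfloor + 1 = 27$ with $s = 2$, $z = 3$ and to $\lfloor (55 - 2)/2 \rfloor + 1 = 27$ with $s = 2$, $z = 2$, so both schemes indeed give $27 \times 27$ outputs.)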

What is the intuition for why overlapping pooling helps reduce overfitting in conv nets?

Best Answer

I think it's just that larger pooling windows have lower capacity. For example, consider the 1D case, where a feature map might look like this:

[0 0 5 0 0 6 0 0 3 0 0 4 0 0]

perhaps generated by some regular grid-like pattern in the original image space. With $z=2$ and $s = 2$ the pooled result is

[0, 5, 6, 0, 3, 4, 0]

and it is still apparent that there is some alternation between high and low values. But when we increase the window size to $z = 3$ (keeping $s = 2$), we get

[5, 5, 6, 3, 3, 4, 0]

and the grid-like pattern is completely smoothed out and lost.
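
If you want to reproduce this, here is a minimal NumPy sketch of 1D max pooling (the `max_pool_1d` helper and the toy feature vector are just for illustration, not anything from the paper):

```python
import numpy as np

def max_pool_1d(x, z, s):
    """Max-pool a 1D array with window size z and stride s.

    Windows at the right edge are allowed to be truncated, so the z=2
    and z=3 cases below produce outputs of the same length, mirroring
    the paper's 'equivalent dimensions' setup.
    """
    return np.array([x[i:i + z].max() for i in range(0, len(x), s)])

x = np.array([0, 0, 5, 0, 0, 6, 0, 0, 3, 0, 0, 4, 0, 0])

print(max_pool_1d(x, z=2, s=2))  # [0 5 6 0 3 4 0] -- non-overlapping: alternation preserved
print(max_pool_1d(x, z=3, s=2))  # [5 5 6 3 3 4 0] -- overlapping (larger window): pattern smoothed out
```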

This is just a contrived example, but a good way to think about it in general is that any large value in a feature map will dominate and mask out all other information within a $z \times z$ window after max pooling, so the larger $z$ is, the more information is lost.