Solved – Why does VGG16 double the number of features after each max-pooling layer?

conv-neural-network, neural-networks

VGG16 is a Convolutional Neural Network architecture, presented in the paper Very Deep Convolutional Networks for Large-Scale Image Recognition (ILSVRC-2014 conference presentation here). As explained in the paper,

The width of convolutional layers (the number of channels) is rather small, starting from 64 in the first layer and then increasing by a factor of 2 after each max-pooling layer, until it reaches 512.

Why is the number of channels doubled after each max-pooling layer?

Jeremy Howard says in the fast.ai course that it is done so as not to lose information. Is it important to maintain the same number of weights on every layer of a convolutional neural network? Can it be beneficial to lose information?

Best Answer

You should really ask in the course forum :) or contact Jeremy on Twitter; he's a great guy. Having said that, the idea is this: subsampling, a.k.a. pooling (max pooling, mean pooling, etc.; max pooling is currently the most common choice in CNNs), has three main advantages:

  1. it makes your net more robust to noise: if you slightly perturb each neighborhood of the input, the mean of each neighborhood won't change much (the smoothing effect of the sample mean). The max doesn't have this smoothing effect; however, since it is the largest activation in the neighborhood, its relative variation due to noise is (on average) smaller than for the other pixels.
  2. it introduces some degree of translation invariance: because the output layer has fewer features, if you shift the input image slightly, chances are the output of the subsampling layer won't change, or will change only a little. See here for a nice picture, and the small sketch after this list.
  3. by reducing the number of features, it also reduces the computational effort of training and prediction, and it makes overfitting less likely.
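
To make point 2 concrete, here is a minimal sketch (PyTorch is my assumption here, not something taken from the course or the paper): 2x2 max pooling halves each spatial dimension, and a one-pixel shift of a toy input changes the pooled output noticeably less than it changes the raw pixels.

```python
# Minimal sketch (assuming PyTorch): 2x2 max pooling halves H and W, and the
# pooled output is fairly insensitive to a one-pixel shift of the input.
import torch
import torch.nn.functional as F

x = torch.rand(1, 1, 8, 8)                 # toy 8x8 single-channel "image"
shifted = torch.roll(x, shifts=1, dims=3)  # shift it one pixel to the right

pooled = F.max_pool2d(x, kernel_size=2, stride=2)
pooled_shifted = F.max_pool2d(shifted, kernel_size=2, stride=2)

print(pooled.shape)                            # torch.Size([1, 1, 4, 4])
print((x - shifted).abs().mean())              # raw pixels change a lot
print((pooled - pooled_shifted).abs().mean())  # pooled outputs change less
```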

However, not everyone agrees with point 3. In the famous AlexNet paper, which can be considered the "rebirth" of CNNs, the authors used overlapping pooling neighborhoods (i.e., strides along x and y smaller than the extent of the pooling window along x and y, respectively), which discards less information in the subsampling step. This makes the model more flexible, which is what Jeremy was hinting at: you get a more flexible model, at the risk of more overfitting, but you can use other deep learning tools to fight overfitting. It's really a design choice - you'll typically need validation sets to try different architectures and see what works best.
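
For reference, here is what the two pooling styles look like in code (a sketch assuming PyTorch; the feature-map sizes are just the familiar AlexNet/VGG ones, not taken from either paper's code):

```python
# A small sketch (PyTorch assumed) of the two pooling styles discussed here.
# AlexNet pools with a 3x3 window and stride 2 (stride < window, so windows
# overlap); VGG16 pools with a 2x2 window and stride 2 (no overlap).
import torch
import torch.nn as nn

alexnet_pool = nn.MaxPool2d(kernel_size=3, stride=2)  # overlapping
vgg_pool = nn.MaxPool2d(kernel_size=2, stride=2)      # non-overlapping

x = torch.rand(1, 96, 55, 55)    # an AlexNet-sized feature map
print(alexnet_pool(x).shape)     # torch.Size([1, 96, 27, 27])

y = torch.rand(1, 64, 224, 224)  # a VGG16-sized feature map
print(vgg_pool(y).shape)         # torch.Size([1, 64, 112, 112])
```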

EDIT: it just occurred to me that I misunderstood what you were asking. VGG16, unlike AlexNet, uses non-overlapping max pooling (see section 2.1 of the paper I linked, right at the end of the first paragraph). Thus each spatial dimension of the feature maps is halved after each pooling layer (so each channel keeps a quarter of its pixels). This is compensated by doubling the width of the convolutional layers.
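
As a sketch of how this plays out in one VGG16 stage (PyTorch assumed; this is an illustration, not the authors' code): the first 3x3 convolution of a stage doubles the channel count, and the 2x2, stride-2 max pool at the end of the stage halves each spatial dimension.

```python
# A hedged sketch (PyTorch assumed) of VGG16's second stage: 3x3 convolutions
# that double the width from 64 to 128, then a non-overlapping 2x2 max pool
# that halves the spatial size.
import torch
import torch.nn as nn

block = nn.Sequential(
    nn.Conv2d(64, 128, kernel_size=3, padding=1),  # width 64 -> 128
    nn.ReLU(inplace=True),
    nn.Conv2d(128, 128, kernel_size=3, padding=1),
    nn.ReLU(inplace=True),
    nn.MaxPool2d(kernel_size=2, stride=2),         # 112x112 -> 56x56
)

x = torch.rand(1, 64, 112, 112)  # output of the first stage's pooling layer
print(block(x).shape)            # torch.Size([1, 128, 56, 56])
```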

Actually, it doesn't always happen: after the penultimate max pool, the width remains 512, the same as before pooling. Again, this is a design choice; it's not set in stone, as confirmed by the fact that they don't follow this rule for the last convolutional stage. It is, however, by far the most common design choice: for example, both LeNet and AlexNet follow this rule, even though LeNet uses non-overlapping pooling (each spatial dimension is halved, as in VGG16) while AlexNet uses overlapping pooling.

The idea is simple: you introduce max pooling to add robustness to noise and to help make the CNN translation invariant, as I said before. However, you also don't want to throw away the information contained in the image together with the noise. To limit that, after each pooling layer you double the number of channels of the following convolutional layers. This means you have twice as many "high-level features", so to speak, even if each of them contains a quarter as many pixels (half along each spatial dimension). If your input image activates one of these high-level features, its activation will be passed on to the following layers.
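
To put numbers on this, here is a back-of-the-envelope check (plain Python; the spatial sizes and widths are the ones from the VGG16 configuration): halving both spatial dimensions while doubling the width still halves the total number of activations at each step, except for the last stage, where the width stays at 512 and the total drops by 4x.

```python
# Conv-stage outputs of VGG16 just before each pooling layer: (spatial size, width).
stages = [(224, 64), (112, 128), (56, 256), (28, 512), (14, 512)]
for size, channels in stages:
    print(f"{size:3d} x {size:3d} x {channels:3d} = {size * size * channels:,} activations")
# 3,211,264 -> 1,605,632 -> 802,816 -> 401,408 -> 100,352:
# each pool + width-doubling step halves the total, and the last step
# (width kept at 512) divides it by four.
```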

Granted, this added flexibility increases the risk of overfitting, which they combat with the usual techniques (see section 3.1): $L_2$ regularization (weight decay), dropout for the first two fully-connected layers, and learning rate decay for the whole net.
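
For completeness, here is roughly what that recipe looks like (a sketch assuming PyTorch; the values - weight decay $5 \cdot 10^{-4}$, dropout 0.5, momentum 0.9, initial learning rate $10^{-2}$ - are the ones reported in section 3.1 of the paper, but the code itself is mine, not the authors'):

```python
# Hedged sketch (PyTorch assumed) of the regularisation recipe: L2 weight decay
# via the optimiser, and dropout on the first two fully-connected layers.
import torch.nn as nn
import torch.optim as optim

classifier = nn.Sequential(
    nn.Linear(512 * 7 * 7, 4096), nn.ReLU(inplace=True), nn.Dropout(p=0.5),
    nn.Linear(4096, 4096), nn.ReLU(inplace=True), nn.Dropout(p=0.5),
    nn.Linear(4096, 1000),
)

optimizer = optim.SGD(classifier.parameters(), lr=0.01,
                      momentum=0.9, weight_decay=5e-4)  # L2 regularisation
```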
