VGG16 is a Convolutional Neural Network architecture, presented in the paper Very Deep Convolutional Networks for Large-Scale Image Recognition (ILSVRC-2014 conference presentation here). As explained in the paper,
The width of convolutional layers (the number of channels) is rather small, starting from 64 in the first layer and then increasing by a factor of 2 after each max-pooling layer, until it reaches 512.
Why is the number of channels doubled after each max-pooling layer?
Jeremy Howard in the fast.ai course says this is done so as not to lose information. Is it important to maintain the same number of weights on every layer of a convolutional neural network? Can it be beneficial to lose information?
Best Answer
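You should really ask in the course forum :) or contact Jeremy on Twitter, he's a great guy. Having said that, the idea is this: subsampling, aka pooling (max pooling, mean pooling, etc.; currently max pooling is the most common choice in CNNs), has three main advantages:
1. it makes the net more robust to noise;
2. it helps make the CNN translation equivariant;
3. it reduces the number of features, and with it the risk of overfitting.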
However, not everyone agrees with point 3. In the famous AlexNet paper, which can be considered the "rebirth" of CNNs, the authors used overlapping neighborhoods (i.e., strides along x and y smaller than the extent of the pooling window along x and y respectively), so that adjacent windows share inputs and the output is subsampled less aggressively than the window size alone would imply. This makes the model more flexible, which is what Jeremy was hinting at. You get a more flexible model, at the risk of more overfitting, but you can use other deep learning tools to fight overfitting. It's really a design choice: you'll typically need validation data sets to try different architectures and see what works best.
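To make the difference between overlapping and nonoverlapping pooling concrete, here is a minimal sketch of the usual output-size formula for pooling without padding. The function name `pool_output_size` is made up for illustration; the 2x2, stride 2 window is the VGG16 setting, the 3x3, stride 2 window is the one reported in the AlexNet paper, and the input sizes (224 and 55) are only illustrative assumptions.

```python
def pool_output_size(n: int, kernel: int, stride: int) -> int:
    """Output length along one axis for pooling with no padding (floor mode)."""
    return (n - kernel) // stride + 1


# Nonoverlapping pooling: window 2, stride 2 (stride equals the window size).
# Applied five times to a 224-pixel side, halving it each time.
side = 224
vgg_sides = [side]
for _ in range(5):
    side = pool_output_size(side, kernel=2, stride=2)
    vgg_sides.append(side)
print(vgg_sides)  # [224, 112, 56, 28, 14, 7]

# Overlapping pooling: window 3, stride 2 (stride smaller than the window).
# Adjacent windows share one row/column of inputs.
print(pool_output_size(55, kernel=3, stride=2))  # 27
```

Note that in both cases the stride, not the window size, is what mainly determines how much the feature map shrinks.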
EDIT: It just occurred to me that I misunderstood what you were asking. VGG16, unlike AlexNet, uses nonoverlapping max pooling (see Section 2.1 of the paper I linked, right at the end of the first paragraph). Thus the height and width of each channel are halved after each pooling layer. This is compensated, at least in part, by doubling the width of the convolutional layers.
Actually, this doesn't always happen: after the penultimate max pooling, the width remains 512, the same as before pooling. Again, this is a design choice: it's not set in stone, as confirmed by the fact that they don't follow this rule for the last block of convolutional layers. However, it's by far the most common design choice: for example, both LeNet and AlexNet increase the number of channels after pooling, even though LeNet uses nonoverlapping pooling (the height and width of each channel are halved, as for VGG16), while AlexNet uses overlapping pooling. The idea is simple: you introduce max pooling to add robustness to noise and to help make the CNN translation equivariant, as I said before. However, you also don't want to throw away information contained in the image together with the noise. To avoid that, after each pooling layer you double the number of channels. This means that you have twice as many "high-level features", so to speak, even if each of them contains a quarter as many pixels (half along each axis). If your input image activates one of these high-level features, its activation will be passed on to the following layers.
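The trade-off between channel count and spatial size can be checked with a bit of arithmetic. The sketch below assumes a 224x224 input and the five VGG16 block widths (64, 128, 256, 512, 512), with convolutions that preserve height and width and a 2x2, stride 2 max pooling after each block; the helper name `vgg16_block_shapes` is hypothetical.

```python
def vgg16_block_shapes(input_side: int = 224):
    """Return (channels, side, activations) for the output of each conv block,
    measured just before the max pooling that follows the block."""
    widths = [64, 128, 256, 512, 512]  # channels per block
    side = input_side
    shapes = []
    for channels in widths:
        shapes.append((channels, side, channels * side * side))
        side //= 2  # 2x2 pooling with stride 2 halves height and width
    return shapes


for channels, side, activations in vgg16_block_shapes():
    print(f"{channels:4d} channels x {side:3d}x{side:<3d} = {activations:>9,d} activations")
```

Under these assumptions each pooling quarters the number of pixels per channel, so doubling the channels halves the total number of activations rather than keeping it constant; in the last block, where the width stays at 512, the total drops by a factor of four.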
Granted, this added flexibility adds a risk of overfitting, which they combat with the usual techniques (see Section 3.1): $L_2$ regularization (weight decay), dropout for the first two fully connected layers, and learning rate decay for the whole net.