Solved – Convolutional network – how to choose output channels number, stride and padding

classification, conv-neural-network, convolution

I am trying to create a convolutional network for an image classification problem. I am using PyTorch, but I am having trouble understanding its 2D convolutional layer. I understand that in my case there will be 3 input channels (RGB), but what about the output channels? How do I know how many there should be? Is there any correlation between the number of output channels and the number of input channels or the kernel size?

I also found some code where the convolution is implemented with the following parameters and a comment:

nn.Conv2d(3, 20, (5,5))
# img 3 * 32 * 32 -> img 20 * 28 * 28

Is there any reason they chose 20? What's with the 28? I don't get how they came up with this number.

Another problem is with other hyperparameters like stride or padding. Is there any way or maybe some best practices to choose these parameters "correctly"?

Any help would be much appreciated.

Best Answer

The 28 is the new height and width of the image. It is 28 because there is no padding and the kernel is 5x5, so you lose 2 pixels on each side (left, right, top and bottom): 32 - 2*2 = 28. To keep the width and height the same, you would add a padding of 2. The 20 is simply the number of output channels they chose: each of the 20 filters spans all 3 input channels and produces one output channel, so there are now 20 channels instead of 3.
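To make the arithmetic concrete, here is a minimal sketch in plain Python of the output-size formula given in the PyTorch `nn.Conv2d` documentation, plus the parameter count of such a layer. The helper names `conv_output_size` and `conv2d_param_count` are made up for illustration and are not part of PyTorch; square kernels and the default `groups=1` are assumed.

```python
def conv_output_size(size, kernel, stride=1, padding=0, dilation=1):
    """Output height (or width) of a convolution along one dimension.

    Formula: floor((size + 2*padding - dilation*(kernel - 1) - 1) / stride) + 1
    """
    return (size + 2 * padding - dilation * (kernel - 1) - 1) // stride + 1


def conv2d_param_count(in_channels, out_channels, kernel, bias=True):
    """Learnable parameters of a square-kernel 2D convolution (groups=1 assumed)."""
    weights = out_channels * in_channels * kernel * kernel
    return weights + (out_channels if bias else 0)


# The example from the question: 32x32 input, 5x5 kernel.
no_padding = conv_output_size(32, kernel=5)             # 28: two pixels lost per side
same_size = conv_output_size(32, kernel=5, padding=2)   # 32: padding of 2 keeps the size
strided = conv_output_size(32, kernel=5, stride=2)      # 14: stride 2 roughly halves it
params = conv2d_param_count(3, 20, kernel=5)            # 1520 = 20*3*5*5 weights + 20 biases
print(no_padding, same_size, strided, params)
```

Note that the number of output channels does not appear in the size formula at all: it only changes how many filters (and therefore parameters) the layer has, which is why it can be chosen freely.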

In deep learning in general:

  1. Some parameters are highly important (e.g. the learning rate), while others are sometimes almost irrelevant. The exact number of channels often falls into the latter group: 32 channels vs. 28 is not really a big deal.

but

  2. When paper $i+1$ claims to improve over papers $\{1,\dots,i\}$ by marginally improving some performance metric, thorough parameter tuning of the earlier methods would often have helped more.

So, yes, these variables matter, but they kind of don't. As long as you don't choose really wild/extreme values for these parameters you'll probably be fine.

In particular, the number of "channels" / "filters", the stride and the padding are usually chosen by: 1) following existing settings used in other papers, because they are claimed to work well, 2) trying a few settings and comparing them via some final performance metric, or 3) when possible, not forcing yourself to choose one single value, i.e. randomizing over some range of values.
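Options 2) and 3) can be combined into a simple random search. The sketch below is only an illustration under stated assumptions: the candidate ranges are arbitrary, the names `sample_conv_config` and `random_search` are hypothetical, and `evaluate` stands in for whatever training-and-validation routine you already have (it is assumed to return a score where higher is better).

```python
import random


def sample_conv_config(rng):
    """Draw one convolution setting from modest, non-extreme ranges (assumed values)."""
    kernel = rng.choice([3, 5, 7])
    return {
        "out_channels": rng.choice([16, 20, 32, 64]),
        "kernel": kernel,
        "stride": rng.choice([1, 2]),
        # Either no padding, or the padding that preserves the size at stride 1.
        "padding": rng.choice([0, kernel // 2]),
    }


def random_search(evaluate, n_trials=10, seed=0):
    """Return (best_score, best_config) over n_trials randomly sampled settings.

    `evaluate` is supplied by the caller: it should train a model with the
    given config and return a validation metric (higher is better).
    """
    rng = random.Random(seed)
    best = None
    for _ in range(n_trials):
        config = sample_conv_config(rng)
        score = evaluate(config)
        if best is None or score > best[0]:
            best = (score, config)
    return best
```

A call might look like `best_score, best_config = random_search(my_train_and_validate, n_trials=20)`, where `my_train_and_validate` is your own function. Comparing on a held-out validation set rather than the test set keeps the final performance estimate honest.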