Solved – kernel size and stride value for fully convolutional network for semantic segmentation

computer visionconv-neural-networkdeep learningmachine learningtensorflow

I am not very clear about some technical details in implementing Fully Convolutional Networks
for Semantic Segmentation
. The paper discusses three models: fcn32, fcn16 and fcn18. According to this description, for fcn16, looks like the last deconvolutional layer has stride 16. But what is the stride for the skip layer from pool 4.
Similarly, for fcn8, looks like the last deconvolutional layer has stride 8. But what is the stride for the skip layer from pool3 and poo4?

In the Tensorflow implementation of this model, author uses stride=2 for all these skip level cases? Are there any justifications for this?

Moreover, for deconvolutional kernel, we also need to know the kernel size. The paper does not mention that. The above implementation using “kernel size = 4”, which can be found from the following definition, where ksize=4 is setup in defining the _upscore_layer. What should be the criteria for setting up this kernel size.

enter image description here

Best Answer

fcn32 uses a stride of 32 because after pool5 the spatial resolution is 2^5=32 times smaller, similarly pool4 should use a stride of 2^4=16 and pool3 should use a stride of 2^3=8.

That tensorflow model first uses stride 2 to upsample pool5 to the same size as pool4, then uses another stride 2 to upsample these two to be the same size as pool3, and finally upsamples these three all together with stride 8. So pool5 gets enlarged 2*2*8=32 times, pool4 gets enlarged 2*8=16 times and pool3 8 times, which is correct.

The reason of doing it this way instead of using strides of 8 16 and 32 separately for each layer is to save the amount of computations. As in the end we are summing them together it's more efficient to sum before upsampling than after.

The kernel size seems more of a design choice, though it is not mentioned in the paper directly, the paper provides a link to their Caffe implementation, which uses a kernel size of 4 for stride 2 layers.