I am not very clear about some technical details in implementing Fully Convolutional Networks
for Semantic Segmentation. The paper discusses three models: fcn32, fcn16 and fcn18. According to this description, for fcn16, looks like the last deconvolutional layer has stride 16. But what is the stride for the skip layer from pool 4.
Similarly, for fcn8, looks like the last deconvolutional layer has stride 8. But what is the stride for the skip layer from pool3 and poo4?
In the Tensorflow implementation of this model, author uses stride=2
for all these skip level cases? Are there any justifications for this?
Moreover, for deconvolutional kernel, we also need to know the kernel size. The paper does not mention that. The above implementation using “kernel size = 4”, which can be found from the following definition, where ksize=4 is setup in defining the _upscore_layer. What should be the criteria for setting up this kernel size.
Best Answer
fcn32
uses a stride of 32 because afterpool5
the spatial resolution is2^5=32
times smaller, similarlypool4
should use a stride of2^4=16
andpool3
should use a stride of2^3=8
.That tensorflow model first uses stride 2 to upsample
pool5
to the same size aspool4
, then uses another stride 2 to upsample these two to be the same size aspool3
, and finally upsamples these three all together with stride 8. Sopool5
gets enlarged2*2*8=32
times,pool4
gets enlarged2*8=16
times andpool3
8
times, which is correct.The reason of doing it this way instead of using strides of
8
16
and32
separately for each layer is to save the amount of computations. As in the end we are summing them together it's more efficient to sum before upsampling than after.The kernel size seems more of a design choice, though it is not mentioned in the paper directly, the paper provides a link to their Caffe implementation, which uses a kernel size of 4 for stride 2 layers.