Solved – a simplified version of fully convolutional network

computer-vision, conv-neural-network, deep-learning, machine-learning

In the paper on fully convolutional networks for semantic segmentation, the authors adopt up-sampling (a de-convolutional network) to recover feature maps, whose dimensions were reduced by multiple layers of down-sampling, back to the original size.

If we do not do any down-sampling, i.e., we use stride 1 in the convolutional and pooling layers and thus keep the image size across multiple layers of convolutions, can we do pixel-wise semantic segmentation directly on the feature map of the final layer, without resorting to the up-sampling proposed in the paper?
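To check that such an architecture is at least dimensionally consistent, here is a small sketch of the standard convolution output-size formula (function names are mine, not from the paper): with stride 1 and "same" padding, every layer preserves the spatial size, so per-pixel class scores can be read off the last feature map with no up-sampling.

```python
def conv_out_size(in_size: int, kernel: int, stride: int = 1, padding: int = 0) -> int:
    """Standard convolution output-size formula: floor((n + 2p - k) / s) + 1."""
    return (in_size + 2 * padding - kernel) // stride + 1

def same_padding(kernel: int) -> int:
    """Padding that preserves spatial size at stride 1 (odd kernel sizes)."""
    return (kernel - 1) // 2

# A stack of sixteen stride-1, 3x3 convolutions with "same" padding
# keeps a 224x224 input at 224x224 throughout.
size = 224
for _ in range(16):
    size = conv_out_size(size, kernel=3, stride=1, padding=same_padding(3))
print(size)  # 224
```

For comparison, a single stride-2 layer already halves the map: `conv_out_size(224, 3, stride=2, padding=1)` gives 112, which is exactly what the up-sampling in the paper has to undo.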

I can imagine the performance may be worse, but I want to know whether this architecture really makes sense. I think this question is also related to the importance of having stride > 1, which is supposed to help extract abstract (high-level) features.
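One way to make the role of stride concrete: the receptive field of a stack of convolutions grows only linearly with depth at stride 1, but geometrically once strides multiply. A sketch of the standard receptive-field recurrence (the helper below is my own, for illustration):

```python
def receptive_field(layers):
    """Receptive field of a stack of conv layers, each given as (kernel, stride).

    Recurrence: rf += (k - 1) * jump; jump *= s,
    where `jump` is the distance (in input pixels) between adjacent output units.
    """
    rf, jump = 1, 1
    for kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
    return rf

# Ten 3x3 layers at stride 1: receptive field grows linearly with depth.
print(receptive_field([(3, 1)] * 10))  # 21
# Ten 3x3 layers at stride 2: receptive field grows geometrically.
print(receptive_field([(3, 2)] * 10))  # 2047
```

So without stride (or some substitute for it), each output pixel only ever "sees" a small neighborhood, which is one reason high-level features are usually associated with down-sampling.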

Best Answer

Yes, it does make sense. But in order to "mix" features from distant parts of the image and account for context, the network should either be very deep (like ResNet), have wide convolution kernels (like dilated convolutions), or use some kind of post-processing (like a fully-connected CRF, although that is not very "semantic").

The dilated convolutions paper specifically targets increasing the receptive field of each output unit of an intermediate layer. To keep this tractable, the stride is effectively pushed into the convolution kernels: the kernel is applied with gaps, so the receptive field grows without reducing resolution. It seems like a powerful idea for capturing global context, in particular for semantic segmentation.
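As a rough illustration (my own sketch, not code from the paper): a 3x3 kernel with dilation d covers an effective extent of 3 + 2(d - 1) pixels, so doubling the dilation at each stride-1 layer grows the receptive field exponentially while the feature map stays at full resolution, which is exactly what the question asks for.

```python
def effective_kernel(kernel: int, dilation: int) -> int:
    """Effective extent of a dilated kernel: k + (k - 1) * (d - 1)."""
    return kernel + (kernel - 1) * (dilation - 1)

def receptive_field_dilated(kernel: int, dilations) -> int:
    """Receptive field of a stack of stride-1 dilated conv layers."""
    rf = 1
    for d in dilations:
        rf += effective_kernel(kernel, d) - 1  # jump stays 1 at stride 1
    return rf

# Dilations doubling per layer, as in the context module of the
# dilated convolutions paper (Yu & Koltun):
print(receptive_field_dilated(3, [1, 2, 4, 8]))  # 31
```

Four stride-1 layers with plain 3x3 kernels would only reach a receptive field of 9, so the dilation buys the geometric growth that strides normally provide, without any loss of resolution.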