Solved – a simplified version of fully convolutional network

computer-vision, conv-neural-network, deep-learning, machine-learning

In the paper on fully convolutional networks for semantic segmentation, the authors adopt up-sampling (a de-convolutional network) to recover feature maps, whose dimensions were reduced by multiple layers of down-sampling, back to the original size.

If we do not do any down-sampling, i.e., we use stride 1 in the convolutional and pooling layers and thus keep the image size across multiple layers of convolutions, can we do pixel-wise semantic segmentation directly on the feature map of the final layer, without resorting to the up-sampling proposed in the paper?
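To check that such an architecture is at least dimensionally consistent, here is a small sketch of the standard convolution output-size formula (function names are mine, not from the paper): with stride 1 and "same" padding, every layer preserves the spatial size, so per-pixel class scores can be read off the last feature map with no up-sampling.

```python
def conv_out_size(in_size: int, kernel: int, stride: int = 1, padding: int = 0) -> int:
    """Standard convolution output-size formula: floor((n + 2p - k) / s) + 1."""
    return (in_size + 2 * padding - kernel) // stride + 1

def same_padding(kernel: int) -> int:
    """Padding that preserves spatial size at stride 1 (odd kernel sizes)."""
    return (kernel - 1) // 2

# A stack of sixteen stride-1, 3x3 convolutions with "same" padding
# keeps a 224x224 input at 224x224 throughout.
size = 224
for _ in range(16):
    size = conv_out_size(size, kernel=3, stride=1, padding=same_padding(3))
print(size)  # 224
```

For comparison, a single stride-2 layer already halves the map: `conv_out_size(224, 3, stride=2, padding=1)` gives 112, which is exactly what the up-sampling in the paper has to undo.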

I can imagine the performance may be worse, but I want to know whether this architecture really makes sense. I think this question is also related to the importance of having stride > 1, which is supposed to help extract abstract (high-level) features.
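One way to make the role of stride concrete: the receptive field of a stack of convolutions grows only linearly with depth at stride 1, but geometrically once strides multiply. A sketch of the standard receptive-field recurrence (the helper below is my own, for illustration):

```python
def receptive_field(layers):
    """Receptive field of a stack of conv layers, each given as (kernel, stride).

    Recurrence: rf += (k - 1) * jump; jump *= s,
    where `jump` is the distance (in input pixels) between adjacent output units.
    """
    rf, jump = 1, 1
    for kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
    return rf

# Ten 3x3 layers at stride 1: receptive field grows linearly with depth.
print(receptive_field([(3, 1)] * 10))  # 21
# Ten 3x3 layers at stride 2: receptive field grows geometrically.
print(receptive_field([(3, 2)] * 10))  # 2047
```

So without stride (or some substitute for it), each output pixel only ever "sees" a small neighborhood, which is one reason high-level features are usually associated with down-sampling.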

Best Answer

Yes, it does make sense. But in order to "mix" features from distant parts of the image and account for context, the network should either be very deep (like ResNet), have wide convolution kernels (like dilated convolutions), or use some kind of post-processing (like a fully-connected CRF, although that is not very "semantic").

The dilated convolutions paper specifically targets increasing the receptive field of each output unit of an intermediate layer. To keep this tractable, the stride is effectively pushed into the convolution kernels: the kernel is applied with gaps, so the receptive field grows without reducing resolution. It seems like a powerful idea for capturing global context, in particular for semantic segmentation.
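As a rough illustration (my own sketch, not code from the paper): a 3x3 kernel with dilation d covers an effective extent of 3 + 2(d - 1) pixels, so doubling the dilation at each stride-1 layer grows the receptive field exponentially while the feature map stays at full resolution, which is exactly what the question asks for.

```python
def effective_kernel(kernel: int, dilation: int) -> int:
    """Effective extent of a dilated kernel: k + (k - 1) * (d - 1)."""
    return kernel + (kernel - 1) * (dilation - 1)

def receptive_field_dilated(kernel: int, dilations) -> int:
    """Receptive field of a stack of stride-1 dilated conv layers."""
    rf = 1
    for d in dilations:
        rf += effective_kernel(kernel, d) - 1  # jump stays 1 at stride 1
    return rf

# Dilations doubling per layer, as in the context module of the
# dilated convolutions paper (Yu & Koltun):
print(receptive_field_dilated(3, [1, 2, 4, 8]))  # 31
```

Four stride-1 layers with plain 3x3 kernels would only reach a receptive field of 9, so the dilation buys the geometric growth that strides normally provide, without any loss of resolution.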