Solved – why the last 1×1 convolution layer in segmentation networks is a linear transformation of the features

conv-neural-network, convolution, image-segmentation, neural-networks

Semantic segmentation networks use a final 1×1 convolution layer at the very end of the network to reduce the number of feature maps to the number of classes in the dataset. Since this 1×1 convolution is never followed by an activation function such as ReLU, the final operation is a linear transformation of the data.

I know why a kernel size of 1 is used: 1×1 convolutions are great dimensionality reducers and efficiently bring the number of channels down to the number of classes. My question is why a linear transformation is used here, whereas every other convolution in the network is followed by an activation function that makes it non-linear.
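For concreteness, a minimal PyTorch sketch of such a final layer (the channel counts of 256 features and 21 classes are my own assumptions for illustration):

import torch
import torch.nn as nn

num_features, num_classes = 256, 21   # assumed values, not from the question

# Final 1x1 convolution: maps each pixel's 256-dimensional feature vector
# to 21 class scores. No activation follows it, so it is a linear map.
final_conv = nn.Conv2d(num_features, num_classes, kernel_size=1)

features = torch.randn(1, num_features, 64, 64)   # backbone feature maps
logits = final_conv(features)                     # shape: (1, 21, 64, 64)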

I thought neural networks learn nothing from stacking linear transformations (the top answer here), and that the purpose of the final 1×1 convolution layer is only dimensionality reduction. But in a network I am playing with, I could not get a good result until I stacked multiple convolution blocks at the end, i.e.

conv(3x3) -> conv(3x3) -> conv(1x1) -> output

Doing so greatly improved the final output and helped the network train. No activation function was used between these layers, so I'm just stacking linear transformations.
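For reference, a sketch of the head described above (the layer widths are my own guesses, and no activations are placed between the convolutions, mirroring the question):

import torch.nn as nn

num_features, num_classes = 256, 21   # assumed values

# conv(3x3) -> conv(3x3) -> conv(1x1), with no activations in between,
# i.e. a stack of purely linear transformations.
head = nn.Sequential(
    nn.Conv2d(num_features, num_features, kernel_size=3, padding=1),
    nn.Conv2d(num_features, num_features, kernel_size=3, padding=1),
    nn.Conv2d(num_features, num_classes, kernel_size=1),
)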

Why did the network result improve?

Best Answer

Neural networks for semantic segmentation use a 1×1 convolution followed by a softmax at the end for classification, not for dimensionality reduction. It works just like a fully connected layer in a classification convnet, replicated for each pixel. That is exactly what segmentation is: classification of every pixel.
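To make the "fully connected layer replicated for each pixel" point concrete, here is a small illustration (my own, with toy sizes assumed) showing that a 1×1 convolution computes exactly the same thing as one linear layer applied independently at every pixel, with a softmax over the channel dimension then giving per-pixel class probabilities:

import torch
import torch.nn as nn

num_features, num_classes = 8, 3          # small assumed sizes
conv = nn.Conv2d(num_features, num_classes, kernel_size=1)
fc = nn.Linear(num_features, num_classes)
fc.weight.data = conv.weight.data.view(num_classes, num_features)  # same weights
fc.bias.data = conv.bias.data.clone()

x = torch.randn(2, num_features, 5, 5)

logits_conv = conv(x)                                      # (2, 3, 5, 5)
logits_fc = fc(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)  # per-pixel linear layer

print(torch.allclose(logits_conv, logits_fc, atol=1e-6))   # True

probs = torch.softmax(logits_conv, dim=1)  # softmax over classes -> per-pixel probabilities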

My question is why a linear transformation is used here, whereas every other convolution in the network is followed by an activation function that makes it non-linear.

My guess is that you overlooked a softmax following the last layer; it makes no sense to use linear outputs for segmentation.

Why did the network result improve?

The fact that stacking linear functions does not change the overall expressiveness does not mean that the learning behavior is the same. Multi-layered linear networks have different learning dynamics than directly optimizing a single linear transformation, so it is only to be expected that you observe different outcomes.
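A minimal sketch of that point (toy dimensions assumed, not from the answer): two stacked linear maps represent exactly the same set of functions as one, yet gradient descent sees a different optimization problem, because the gradient with respect to each factor depends on the current value of the other factor:

import torch

torch.manual_seed(0)
x = torch.randn(4, 8)
target = torch.randn(4, 2)

# single linear map: y = x @ W
W = torch.randn(8, 2, requires_grad=True)

# two stacked linear maps: y = x @ W1 @ W2, same overall expressiveness
W1 = torch.randn(8, 8, requires_grad=True)
W2 = torch.randn(8, 2, requires_grad=True)

((x @ W - target) ** 2).mean().backward()
((x @ W1 @ W2 - target) ** 2).mean().backward()

# W.grad depends only on x and the error, while W1.grad and W2.grad each
# also depend on the other factor, so the two parameterizations follow
# different trajectories under gradient descent even though they can
# represent the same functions.
print(W.grad.shape, W1.grad.shape, W2.grad.shape)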
