Solved – Loss function for semantic segmentation

Tags: conv-neural-network, image processing, image segmentation

Apologies for any misuse of technical terms. I am working on a semantic segmentation project using convolutional neural networks (CNNs), trying to implement an encoder-decoder type architecture, so the output is the same size as the input.

How do you design the labels? What loss function should one apply? Especially in the situation of heavy class imbalance (where the ratio between the classes varies from image to image).

The problem deals with two classes (objects of interest and background). I am using Keras with the TensorFlow backend.

So far, I have designed the expected outputs to have the same dimensions as the input images, applying pixel-wise labeling. The final layer of the model has either a softmax activation (for 2 classes) or a sigmoid activation (to express the probability that a pixel belongs to the object class). I am having trouble designing a suitable objective function for such a task, of the form:

function(y_true, y_pred),

in agreement with Keras.

Please try to be specific about the dimensions of the tensors involved (input/output of the model). Any thoughts and suggestions are much appreciated.
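To make the dimensions concrete, here is a minimal numpy sketch of the two label designs described above, assuming input images of shape (batch, H, W, channels); the mask values below are illustrative, not from any real dataset:

```python
import numpy as np

# Hypothetical example: a batch of 2 binary masks, 4x4 pixels each.
masks = np.random.randint(0, 2, size=(2, 4, 4))  # raw 0/1 ground-truth masks

# Sigmoid formulation: target shape (batch, H, W, 1), one probability
# per pixel that it belongs to the object class.
y_true_sigmoid = masks[..., np.newaxis].astype("float32")  # (2, 4, 4, 1)

# 2-class softmax formulation: one-hot target per pixel, shape
# (batch, H, W, 2): channel 0 = background, channel 1 = object.
y_true_softmax = np.stack([1 - masks, masks], axis=-1).astype("float32")  # (2, 4, 4, 2)

print(y_true_sigmoid.shape)  # (2, 4, 4, 1)
print(y_true_softmax.shape)  # (2, 4, 4, 2)
```

In both cases the model's output tensor has the same shape as the target, so a pixel-wise loss can be averaged over all pixels of the batch.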

Best Answer

Cross entropy is definitely the way to go. I don't know Keras but TF has this: https://www.tensorflow.org/api_docs/python/tf/nn/sigmoid_cross_entropy_with_logits
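For reference, the TF function linked above computes, per pixel, the numerically stable form of sigmoid cross entropy from raw logits. A plain numpy sketch of that same formula (helper name is mine, not TF's):

```python
import numpy as np

def sigmoid_cross_entropy_with_logits(logits, labels):
    """Numerically stable per-element cross entropy, the same formula
    TF documents: max(x, 0) - x*z + log(1 + exp(-|x|))
    for logits x and binary labels z."""
    return np.maximum(logits, 0) - logits * labels + np.log1p(np.exp(-np.abs(logits)))

# Pixel-wise logits and binary labels for one 2x2 image.
logits = np.array([[2.0, -1.0], [0.0, 3.0]])
labels = np.array([[1.0, 0.0], [1.0, 1.0]])

loss_map = sigmoid_cross_entropy_with_logits(logits, labels)  # per-pixel loss
loss = loss_map.mean()  # scalar objective to minimize
```

Averaging over all pixels gives the scalar that Keras expects a loss function to return.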

Here is a paper directly implementing this: Fully Convolutional Networks for Semantic Segmentation by Shelhamer et al.

The U-Net paper is also a very successful implementation of the idea, using skip connections to avoid loss of spatial resolution. You can find many implementations of it online.

From my personal experience, you might want to start with a simple encoder-decoder network first, but do not use strides (i.e. keep strides=1); otherwise you lose a lot of resolution, because the upsampling is not perfect. Go with small kernel sizes. I don't know your specific application, but even a network with 2-3 hidden layers will give very good results. Use 32-64 channels at each layer. Start simple: 2 hidden layers, 32 channels each, 3x3 kernels, stride=1, and experiment with parameters in an isolated manner to see their effect. Keep the dimensions always equal to the input dimension for starters, to avoid resolution loss. Afterwards you can switch on strides and upsampling and implement ideas like U-Net. U-Net works extremely well for medical image segmentation.
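To see why stride=1 with small kernels keeps the resolution while strides cost it, here is a small sketch using the standard convolution output-size formula (helper name is mine, for illustration only):

```python
def conv_out(n, kernel, stride, pad):
    """Output length of one spatial dimension: floor((n + 2p - k) / s) + 1."""
    return (n + 2 * pad - kernel) // stride + 1

# stride=1 with 'same' padding (p = (k-1)//2 for odd k) preserves the size:
print(conv_out(101, kernel=3, stride=1, pad=1))   # 101

# stride=2 roughly halves it, and a naive 2x upsampling cannot restore
# an odd input size exactly:
down = conv_out(101, kernel=3, stride=2, pad=1)   # 51
print(down * 2)                                   # 102, not 101
```

This mismatch is one reason strided encoder-decoder models need care (cropping, skip connections as in U-Net) to get the output back to the exact input size.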

For class imbalance, see https://swarbrickjones.wordpress.com/2017/03/28/cross-entropy-and-training-test-class-imbalance/ The idea there is to weight the different classes with $\alpha$ and $\beta$ parameters.
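A minimal numpy sketch of that weighting idea, where $\alpha$ scales the positive (object) term and $\beta$ the negative (background) term of the cross entropy (the function name and the values 0.75/0.25 are illustrative, not prescribed):

```python
import numpy as np

def weighted_binary_cross_entropy(y_true, y_pred, alpha=0.75, beta=0.25, eps=1e-7):
    """Per-pixel cross entropy with class weights:
    -(alpha * z * log(p) + beta * (1 - z) * log(1 - p))."""
    y_pred = np.clip(y_pred, eps, 1 - eps)  # avoid log(0)
    return -(alpha * y_true * np.log(y_pred)
             + beta * (1 - y_true) * np.log(1 - y_pred))

# With a 9:1 background:object imbalance, up-weighting the rare class
# keeps it from being drowned out in the average.
y_true = np.array([1.0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
y_pred = np.full(10, 0.5)  # an uninformative prediction
loss = weighted_binary_cross_entropy(y_true, y_pred).mean()
```

In practice one often sets the weights from the (inverse) class frequencies; since your ratio varies per image, they can also be computed per batch from `y_true`.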
