Solved – how can the loss suddenly increase while training a CNN for image segmentation

image-processing, keras, loss-functions, neural-networks, tensorflow

I'm working with Keras 1.2.2 on a TensorFlow 1.4.0 backend.

I'm using a U-Net architecture. I have 708 images of 650×650 pixels with 6 channels each, and I augmented the dataset with mirrorings and rotations, for a total of 4248 images.

I have 2 classes, and my loss function is the following:

from keras import backend as K

def jaccard_coef_loss(y_true, y_pred):
    smooth = 1e-12  # avoids division by zero when a batch contains no positive pixels
    intersection = K.sum(y_true * y_pred, axis=[0, -1, -2])
    sum_ = K.sum(y_true + y_pred, axis=[0, -1, -2])
    jac = (intersection + smooth) / (sum_ - intersection + smooth)
    return 1 - K.mean(jac)  # 1 - soft Jaccard index, so lower is better
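
For reference, here is a minimal sanity check of this loss on toy tensors (the shapes are just illustrative, not my real data): if the network collapses to predicting all zeros, the intersection vanishes and the loss saturates at exactly 1.

import numpy as np
from keras import backend as K

y_true = K.variable(np.array([[[0., 1.], [1., 1.]]]))  # toy ground-truth mask
y_pred = K.variable(np.zeros((1, 2, 2)))               # collapsed, all-zero prediction
print(K.eval(jaccard_coef_loss(y_true, y_pred)))       # prints ~1.0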

My optimizer:

from keras.optimizers import SGD
optimizer = SGD(lr=0.01, momentum=0.9, nesterov=True)

I hold out about 30% of the images as a validation set, use a batch_size of 4, and set shuffle to True. The model goes through every training image at each epoch. 200 epochs are scheduled, but training stops early if there is no improvement on the validation set for 10 epochs.
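
To be concrete, the training setup looks roughly like this (a minimal sketch using the Keras 1.x API; build_unet, X_train and y_train are placeholder names, not my actual code):

from keras.callbacks import EarlyStopping

model = build_unet(input_shape=(650, 650, 6))  # hypothetical U-Net builder
model.compile(optimizer=optimizer, loss=jaccard_coef_loss, metrics=['accuracy'])

early_stop = EarlyStopping(monitor='val_loss', patience=10)  # stop after 10 epochs without improvement
model.fit(X_train, y_train,
          batch_size=4,
          nb_epoch=200,           # 'epochs=200' in Keras 2
          shuffle=True,
          validation_split=0.3,   # ~30% of the images held out for validation
          callbacks=[early_stop])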

Here are the training logs for the final epochs:

Epoch 10/200
4248/4248 [==============================] - 3192s - loss: 0.1388 - acc: 0.0868 - jaccard_coef: 0.8612 - jaccard_coef_int: 0.8613 - val_loss: 0.2957 - val_acc: 0.0536 - val_jaccard_coef: 0.7043 - val_jaccard_coef_int: 0.7043
Epoch 11/200
4248/4248 [==============================] - 3167s - loss: 0.1375 - acc: 0.0901 - jaccard_coef: 0.8625 - jaccard_coef_int: 0.8626 - val_loss: 0.2968 - val_acc: 0.0632 - val_jaccard_coef: 0.7032 - val_jaccard_coef_int: 0.7033
Epoch 12/200
4248/4248 [==============================] - 3272s - loss: 0.1964 - acc: 0.1084 - jaccard_coef: 0.8036 - jaccard_coef_int: 0.8037 - val_loss: 1.0000 - val_acc: 0.5066 - val_jaccard_coef: 1.2793e-15 - val_jaccard_coef_int: 4.7833e-18
Epoch 13/200
4248/4248 [==============================] - 3112s - loss: 1.0000 - acc: 0.5089 - jaccard_coef: 4.6290e-15 - jaccard_coef_int: 5.5532e-18 - val_loss: 1.0000 - val_acc: 0.5066 - val_jaccard_coef: 1.2659e-15 - val_jaccard_coef_int: 4.7833e-18
Epoch 14/200
4248/4248 [==============================] - 2032s - loss: 1.0000 - acc: 0.5089 - jaccard_coef: 2.5857e-15 - jaccard_coef_int: 5.1207e-18 - val_loss: 1.0000 - val_acc: 0.5066 - val_jaccard_coef: 1.2659e-15 - val_jaccard_coef_int: 4.7833e-18
Epoch 15/200
4248/4248 [==============================] - 2260s - loss: 1.0000 - acc: 0.5089 - jaccard_coef: 2.6600e-15 - jaccard_coef_int: 5.0932e-18 - val_loss: 1.0000 - val_acc: 0.5066 - val_jaccard_coef: 1.2659e-15 - val_jaccard_coef_int: 4.7833e-18
Epoch 16/200
4248/4248 [==============================] - 2914s - loss: 1.0000 - acc: 0.5089 - jaccard_coef: 2.3220e-15 - jaccard_coef_int: 4.8916e-18 - val_loss: 1.0000 - val_acc: 0.5066 - val_jaccard_coef: 1.2659e-15 - val_jaccard_coef_int: 4.7833e-18
Epoch 17/200
4248/4248 [==============================] - 2928s - loss: 1.0000 - acc: 0.5089 - jaccard_coef: 2.6034e-15 - jaccard_coef_int: 6.3645e-18 - val_loss: 1.0000 - val_acc: 0.5066 - val_jaccard_coef: 1.2659e-15 - val_jaccard_coef_int: 4.7833e-18
Epoch 18/200
4248/4248 [==============================] - 2738s - loss: 1.0000 - acc: 0.5089 - jaccard_coef: 2.3913e-15 - jaccard_coef_int: 4.7182e-18 - val_loss: 1.0000 - val_acc: 0.5066 - val_jaccard_coef: 1.2659e-15 - val_jaccard_coef_int: 4.7833e-18
Epoch 19/200
4248/4248 [==============================] - 2922s - loss: 1.0000 - acc: 0.5089 - jaccard_coef: 6.2745e-15 - jaccard_coef_int: 5.0041e-18 - val_loss: 1.0000 - val_acc: 0.5066 - val_jaccard_coef: 1.2659e-15 - val_jaccard_coef_int: 4.7833e-18

I don't know what happened between epochs 12 and 13. Is it my fault, or is there a known bug that would be fixed by upgrading to a newer version of Keras/TensorFlow?

Best Answer

It seems that the gradient exploded for some reason. Consider using gradient clipping to prevent that, by adding the clipnorm or clipvalue parameter to your optimizer definition:

optimizer = SGD(lr=0.01, momentum=0.9, nesterov=True, clipnorm=1.)

or with

optimizer = SGD(lr=0.01, momentum=0.9, nesterov=True, clipvalue=1.)
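
With clipnorm, the whole gradient vector is rescaled whenever its L2 norm exceeds the given threshold; with clipvalue, each gradient component is clipped to [-value, value] independently. Either way, recompile the model with the new optimizer before resuming training.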