Solved – How to interpret validation and training loss curves when there is a large gap between the two that closes sharply

conv-neural-network, convergence, loss-functions, neural-networks, optimization

My CNN is meant to classify an image as one of around 30 categories. I am training on 6400 samples with a batch size of 128, using Keras/TensorFlow. The architecture is Conv + Batch Normalization + Conv + Batch Normalization + MaxPooling2D, repeated 4 times.
Using an Adam optimizer gives me the following loss curves (orange is validation, blue is training):

Adam optimizer

There is a huge gap between validation and training loss which eventually closes. I then tried an SGD optimizer:

SGD optimizer
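(For reference, a rough sketch of the two optimizer setups in Keras; the SGD learning rate here is just the Keras default, since the exact value isn't stated, and the momentum of 0.5 comes from the note further down:)

from keras.optimizers import Adam, SGD

adam_opt = Adam()                      # Keras defaults (lr=0.001)
sgd_opt = SGD(lr=0.01, momentum=0.5)   # lr assumed; momentum 0.5 per the note below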

This looked better, but I still didn't understand why the training loss drops so quickly while the validation loss only decreases after a while. I thought maybe my learning rate was too large, causing the training loss to plunge initially, so I tried reducing the learning rate.

SGD with smaller learning rate 0.005
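(In Keras terms, a sketch of this change; arguments other than the learning rate are assumptions:)

from keras.optimizers import SGD

opt = SGD(lr=0.005, momentum=0.5)  # learning rate lowered to 0.005; momentum still 0.5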

But that doesn't change much. I thought maybe it was due to initial overfitting and that I was "lucky" the validation loss eventually found its way lower, so I tried adding dropout to my model.

Dropout doesn't really help either
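(A sketch of the kind of dropout meant here; the rate and placement are illustrative only, since the exact configuration isn't shown:)

from keras.layers import Dropout

# Illustrative placement: dropout after the dense layer; 0.5 is an assumed rate
encoder_dense1_output = Dropout(0.5)(encoder_dense1_output)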

I'm still getting the same issue. I also tried changing the momentum from 0.5 to 0.9

higher momentum at 0.9
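(Sketched as a one-line change to the optimizer:)

from keras.optimizers import SGD

opt = SGD(lr=0.005, momentum=0.9)  # momentum raised from 0.5 to 0.9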

What is causing my loss curves to have this shape, and how do I tweak things so the validation loss drops in sync with the training loss? I don't understand why there is such a large gap between the two that only closes after a few epochs.

I'm reluctant to enlarge my dataset, since I've seen similar projects done with data of a similar size. The input image pixels have already been scaled to between 0 and 1.
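(That scaling amounts to something like the following, with `images` as a placeholder name for the raw 8-bit pixel array:)

# Scale 8-bit pixel values from [0, 255] into [0, 1]
images = images.astype('float32') / 255.0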

Edit: I'm posting my architecture below for more info, in case it helps; I hope someone takes a look, because to me it's a real head-scratcher. Using Python and Keras.

from keras.layers import Input, Convolution2D, BatchNormalization, MaxPooling2D, Flatten, Dense
from keras.models import Model

feeder_input = Input(shape=(50, 55, 3))

# Block 1: two 3x3 convolutions (64 filters) with batch norm, then 2x2 max pooling
layer_1_CovNet = Convolution2D(64, (3, 3), activation="relu", padding="same")(feeder_input)
layer_1_CovNet = BatchNormalization()(layer_1_CovNet)
layer_2_CovNet = Convolution2D(64, (3, 3), activation="relu", padding="same")(layer_1_CovNet)
layer_2_CovNet = BatchNormalization()(layer_2_CovNet)
layer_3_CovNetPl = MaxPooling2D()(layer_2_CovNet)

# Block 2: same pattern with 128 filters
layer_4_CovNet = Convolution2D(128, (3, 3), activation="relu", padding="same")(layer_3_CovNetPl)
layer_4_CovNet = BatchNormalization()(layer_4_CovNet)
layer_5_CovNet = Convolution2D(128, (3, 3), activation="relu", padding="same")(layer_4_CovNet)
layer_5_CovNet = BatchNormalization()(layer_5_CovNet)
layer_6_CovNetPl = MaxPooling2D()(layer_5_CovNet)

# Block 3: 256 filters
layer_7_CovNet = Convolution2D(256, (3, 3), activation="relu", padding="same")(layer_6_CovNetPl)
layer_7_CovNet = BatchNormalization()(layer_7_CovNet)
layer_8_CovNet = Convolution2D(256, (3, 3), activation="relu", padding="same")(layer_7_CovNet)
layer_8_CovNet = BatchNormalization()(layer_8_CovNet)
layer_9_CovNetPl = MaxPooling2D()(layer_8_CovNet)

# Block 4: 512 filters
layer_10_CovNet = Convolution2D(512, (3, 3), activation="relu", padding="same")(layer_9_CovNetPl)
layer_10_CovNet = BatchNormalization()(layer_10_CovNet)
layer_11_CovNet = Convolution2D(512, (3, 3), activation="relu", padding="same")(layer_10_CovNet)
layer_11_CovNet = BatchNormalization()(layer_11_CovNet)
layer_12_CovNetPl = MaxPooling2D()(layer_11_CovNet)

# Flatten and classify
ConvNetFlatten = Flatten()(layer_12_CovNetPl)

encoder_dense1 = Dense(512, activation='relu', name='encoder_dense1')
encoder_dense1_output = encoder_dense1(ConvNetFlatten)

# char_dict holds the ~30 class labels; it is defined elsewhere
totext_dense1 = Dense(len(char_dict), activation='softmax', name='totext_dense1')
totext_output1 = totext_dense1(encoder_dense1_output)

model_train = Model(feeder_input, totext_output1)
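
For completeness, a sketch of how this model might be compiled and trained; only the batch size of 128 comes from the question, while the loss, metrics, epoch count, and data variables are assumptions:

# x_train/y_train and x_val/y_val are hypothetical arrays
model_train.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
history = model_train.fit(x_train, y_train,
                          batch_size=128,    # batch size stated in the question
                          epochs=50,         # assumed
                          validation_data=(x_val, y_val))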

Best Answer

Validation loss is indeed expected to decrease as the model learns and to increase later as the model begins to overfit on the training set. One reason your training and validation sets behave so differently could be that they were partitioned in a way that left their underlying distributions different. Did you shuffle before partitioning? If the validation set has too few samples, its behaviour may naturally be erratic.
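A minimal sketch of a shuffled, stratified split, assuming scikit-learn is available (the split fraction and variable names are placeholders):

from sklearn.model_selection import train_test_split

# x: images, y: integer class labels (~30 classes) -- placeholder names
# shuffle=True randomizes the order before splitting; stratify=y keeps the
# class proportions similar between the training and validation sets
x_train, x_val, y_train, y_val = train_test_split(
    x, y, test_size=0.2, shuffle=True, stratify=y, random_state=42)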