Solved – Best validation accuracy occurs early in the training process

deep learning, machine learning, time series, validation

I am working with time series and exploring two-dimensional representations of them so that I can train a CNN. The current 2D representation is 256 × 256. So far I have used ResNet-50, InceptionResNetV2, and a custom network (built from custom residual blocks), all in Keras.
It's a two-class classification problem, and I am using binary cross-entropy as the loss function with a sigmoid activation in the last layer.
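In outline, the head and the compile step look like this (a minimal sketch: the convolutional body is a stand-in for the custom network, and I assume single-channel 256 × 256 inputs):

```python
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense
from keras.optimizers import RMSprop

# Stand-in body: the real custom network uses residual blocks.
# Input shape assumes single-channel 256 x 256 representations.
model = Sequential([
    Conv2D(32, (3, 3), activation='relu', input_shape=(256, 256, 1)),
    MaxPooling2D(),
    Flatten(),
    Dense(1, activation='sigmoid'),  # single sigmoid unit for the two classes
])

# Binary cross-entropy loss, RMSprop with lr = 1e-4 (as in the run below)
model.compile(optimizer=RMSprop(lr=1e-4),
              loss='binary_crossentropy',
              metrics=['acc'])
```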

During training I save the weights with the best validation accuracy, but surprisingly the best value always occurs in the first 10 epochs, when the network's training accuracy is still between 70% and 80%.
The best validation accuracy is sometimes slightly lower than the corresponding training accuracy and sometimes slightly higher (as in the output below).
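The checkpointing uses a ModelCheckpoint callback along these lines (a sketch: the filename pattern is inferred from the checkpoint names in the log, and x_train/y_train/x_val/y_val are placeholders for my actual arrays):

```python
from keras.callbacks import ModelCheckpoint

# Keep only the weights that improve validation accuracy; the filename
# pattern is inferred from the checkpoint names in the log below.
checkpoint = ModelCheckpoint(
    './checkpoints/weights-bc-rms-msi-{epoch:02d}-{val_acc:.2f}.hdf5',
    monitor='val_acc',
    save_best_only=True,
    verbose=1)

# x_train, y_train, x_val, y_val are placeholders for my actual arrays.
model.fit(x_train, y_train,
          batch_size=100,
          epochs=80,
          validation_data=(x_val, y_val),
          callbacks=[checkpoint])
```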
Here is a sample output for the custom network:

Batch size: 100, Epochs: 80, Optimizer: RMSprop (lr = 1e-4)

Train on 18096 samples, validate on 3037 samples
Epoch 1/80
18096/18096 [==============================] - 88s 5ms/step - loss: 0.6931 - acc: 0.6574 - val_loss: 0.6878 - val_acc: 0.5924

Epoch 00001: val_acc improved from -inf to 0.59236, saving model to ./checkpoints/weights-bc-rms-msi-01-0.59.hdf5
Epoch 2/80
18096/18096 [==============================] - 79s 4ms/step - loss: 0.6331 - acc: 0.6901 - val_loss: 0.7515 - val_acc: 0.5150

Epoch 00002: val_acc did not improve
Epoch 3/80
18096/18096 [==============================] - 79s 4ms/step - loss: 0.6017 - acc: 0.7031 - val_loss: 0.5292 - val_acc: 0.7600

Epoch 00003: val_acc improved from 0.59236 to 0.75996, saving model to ./checkpoints/weights-bc-rms-msi-03-0.76.hdf5
Epoch 4/80
18096/18096 [==============================] - 79s 4ms/step - loss: 0.5757 - acc: 0.7257 - val_loss: 1.0358 - val_acc: 0.4771

Epoch 00004: val_acc did not improve
Epoch 5/80
18096/18096 [==============================] - 79s 4ms/step - loss: 0.5416 - acc: 0.7445 - val_loss: 0.6376 - val_acc: 0.6638

Epoch 00005: val_acc did not improve
Epoch 6/80
18096/18096 [==============================] - 79s 4ms/step - loss: 0.5114 - acc: 0.7633 - val_loss: 0.5858 - val_acc: 0.7116

Epoch 00006: val_acc did not improve
Epoch 7/80
18096/18096 [==============================] - 79s 4ms/step - loss: 0.4832 - acc: 0.7819 - val_loss: 0.5902 - val_acc: 0.7023

Epoch 00007: val_acc did not improve
Epoch 8/80
18096/18096 [==============================] - 79s 4ms/step - loss: 0.4599 - acc: 0.7950 - val_loss: 0.6082 - val_acc: 0.6918

Epoch 00008: val_acc did not improve
Epoch 9/80
18096/18096 [==============================] - 79s 4ms/step - loss: 0.4314 - acc: 0.8139 - val_loss: 0.6166 - val_acc: 0.6984

Epoch 00009: val_acc did not improve
Epoch 10/80
18096/18096 [==============================] - 79s 4ms/step - loss: 0.4139 - acc: 0.8210 - val_loss: 0.6083 - val_acc: 0.7116

Epoch 00010: val_acc did not improve
Epoch 11/80
18096/18096 [==============================] - 79s 4ms/step - loss: 0.3915 - acc: 0.8333 - val_loss: 0.6815 - val_acc: 0.6898

Epoch 00011: val_acc did not improve
Epoch 12/80
18096/18096 [==============================] - 79s 4ms/step - loss: 0.3706 - acc: 0.8433 - val_loss: 1.1184 - val_acc: 0.5555

Epoch 00012: val_acc did not improve
Epoch 13/80
18096/18096 [==============================] - 79s 4ms/step - loss: 0.3672 - acc: 0.8450 - val_loss: 0.6673 - val_acc: 0.6678

Epoch 00013: val_acc did not improve
Epoch 14/80
18096/18096 [==============================] - 79s 4ms/step - loss: 0.3333 - acc: 0.8636 - val_loss: 0.7352 - val_acc: 0.6997

Epoch 00014: val_acc did not improve
Epoch 15/80
18096/18096 [==============================] - 79s 4ms/step - loss: 0.3310 - acc: 0.8644 - val_loss: 0.8746 - val_acc: 0.6118

Epoch 00015: val_acc did not improve
Epoch 16/80
18096/18096 [==============================] - 79s 4ms/step - loss: 0.3188 - acc: 0.8687 - val_loss: 0.6981 - val_acc: 0.7040

Epoch 00016: val_acc did not improve
Epoch 17/80
18096/18096 [==============================] - 79s 4ms/step - loss: 0.3076 - acc: 0.8764 - val_loss: 0.6983 - val_acc: 0.6964

Epoch 00017: val_acc did not improve
Epoch 18/80
18096/18096 [==============================] - 79s 4ms/step - loss: 0.2797 - acc: 0.8876 - val_loss: 0.7346 - val_acc: 0.7053

Epoch 00018: val_acc did not improve
Epoch 19/80
18096/18096 [==============================] - 79s 4ms/step - loss: 0.2805 - acc: 0.8867 - val_loss: 0.7565 - val_acc: 0.7089

Epoch 00019: val_acc did not improve
Epoch 20/80
18096/18096 [==============================] - 79s 4ms/step - loss: 0.2649 - acc: 0.8930 - val_loss: 1.6109 - val_acc: 0.5393

Epoch 00020: val_acc did not improve
Epoch 21/80
18096/18096 [==============================] - 79s 4ms/step - loss: 0.2559 - acc: 0.9000 - val_loss: 1.7837 - val_acc: 0.5100

Epoch 00021: val_acc did not improve
Epoch 22/80
18096/18096 [==============================] - 79s 4ms/step - loss: 0.2537 - acc: 0.8979 - val_loss: 0.7572 - val_acc: 0.7066

Epoch 00022: val_acc did not improve
Epoch 23/80
18096/18096 [==============================] - 79s 4ms/step - loss: 0.2433 - acc: 0.9063 - val_loss: 1.0345 - val_acc: 0.6016

Epoch 00023: val_acc did not improve
Epoch 24/80
18096/18096 [==============================] - 79s 4ms/step - loss: 0.2369 - acc: 0.9083 - val_loss: 0.9925 - val_acc: 0.6450

Epoch 00024: val_acc did not improve
Epoch 25/80
18096/18096 [==============================] - 79s 4ms/step - loss: 0.2226 - acc: 0.9125 - val_loss: 2.0245 - val_acc: 0.4988

Epoch 00025: val_acc did not improve
Epoch 26/80
18096/18096 [==============================] - 79s 4ms/step - loss: 0.2160 - acc: 0.9161 - val_loss: 0.9964 - val_acc: 0.6780

Epoch 00026: val_acc did not improve
Epoch 27/80
18096/18096 [==============================] - 79s 4ms/step - loss: 0.2123 - acc: 0.9161 - val_loss: 1.1838 - val_acc: 0.6082

Epoch 00027: val_acc did not improve
Epoch 28/80
18096/18096 [==============================] - 79s 4ms/step - loss: 0.2043 - acc: 0.9228 - val_loss: 1.2759 - val_acc: 0.5960

Epoch 00028: val_acc did not improve
Epoch 29/80
18096/18096 [==============================] - 79s 4ms/step - loss: 0.1952 - acc: 0.9270 - val_loss: 0.8388 - val_acc: 0.6958

Epoch 00029: val_acc did not improve
Epoch 30/80
18096/18096 [==============================] - 79s 4ms/step - loss: 0.1964 - acc: 0.9249 - val_loss: 1.8330 - val_acc: 0.5815

Epoch 00030: val_acc did not improve
Epoch 31/80
18096/18096 [==============================] - 79s 4ms/step - loss: 0.1922 - acc: 0.9256 - val_loss: 0.9603 - val_acc: 0.6678

Epoch 00031: val_acc did not improve
Epoch 32/80
18096/18096 [==============================] - 79s 4ms/step - loss: 0.1874 - acc: 0.9296 - val_loss: 0.8144 - val_acc: 0.7244

Epoch 00032: val_acc did not improve
Epoch 33/80
18096/18096 [==============================] - 79s 4ms/step - loss: 0.1794 - acc: 0.9326 - val_loss: 1.2532 - val_acc: 0.6078

Epoch 00033: val_acc did not improve
Epoch 34/80
18096/18096 [==============================] - 79s 4ms/step - loss: 0.1758 - acc: 0.9364 - val_loss: 0.9097 - val_acc: 0.6329

Epoch 00034: val_acc did not improve
Epoch 35/80
18096/18096 [==============================] - 79s 4ms/step - loss: 0.1673 - acc: 0.9365 - val_loss: 2.5517 - val_acc: 0.5094

Epoch 00035: val_acc did not improve
Epoch 36/80
18096/18096 [==============================] - 79s 4ms/step - loss: 0.1725 - acc: 0.9368 - val_loss: 1.9261 - val_acc: 0.5660

Epoch 00036: val_acc did not improve
Epoch 37/80
18096/18096 [==============================] - 79s 4ms/step - loss: 0.1622 - acc: 0.9395 - val_loss: 1.0303 - val_acc: 0.6651

Epoch 00037: val_acc did not improve
Epoch 38/80
18096/18096 [==============================] - 79s 4ms/step - loss: 0.1570 - acc: 0.9431 - val_loss: 0.9959 - val_acc: 0.6780

Epoch 00038: val_acc did not improve
Epoch 39/80
18096/18096 [==============================] - 79s 4ms/step - loss: 0.1519 - acc: 0.9447 - val_loss: 1.7931 - val_acc: 0.6154

Epoch 00039: val_acc did not improve
Epoch 40/80
18096/18096 [==============================] - 79s 4ms/step - loss: 0.1532 - acc: 0.9460 - val_loss: 1.1030 - val_acc: 0.6964

Epoch 00040: val_acc did not improve
Epoch 41/80
18096/18096 [==============================] - 79s 4ms/step - loss: 0.1507 - acc: 0.9466 - val_loss: 1.5248 - val_acc: 0.5687

Epoch 00041: val_acc did not improve
Epoch 42/80
18096/18096 [==============================] - 79s 4ms/step - loss: 0.1437 - acc: 0.9488 - val_loss: 1.7067 - val_acc: 0.5706

Epoch 00042: val_acc did not improve
Epoch 43/80
18096/18096 [==============================] - 79s 4ms/step - loss: 0.1407 - acc: 0.9503 - val_loss: 1.2817 - val_acc: 0.6414

Epoch 00043: val_acc did not improve
Epoch 44/80
18096/18096 [==============================] - 79s 4ms/step - loss: 0.1442 - acc: 0.9500 - val_loss: 1.1254 - val_acc: 0.6391

Epoch 00044: val_acc did not improve
Epoch 45/80
18096/18096 [==============================] - 79s 4ms/step - loss: 0.1335 - acc: 0.9542 - val_loss: 1.0077 - val_acc: 0.6529

Epoch 00045: val_acc did not improve
Epoch 46/80
18096/18096 [==============================] - 79s 4ms/step - loss: 0.1425 - acc: 0.9507 - val_loss: 1.3959 - val_acc: 0.6355

Epoch 00046: val_acc did not improve
Epoch 47/80
18096/18096 [==============================] - 79s 4ms/step - loss: 0.1299 - acc: 0.9558 - val_loss: 1.4609 - val_acc: 0.6266

Epoch 00047: val_acc did not improve
Epoch 48/80
18096/18096 [==============================] - 79s 4ms/step - loss: 0.1282 - acc: 0.9555 - val_loss: 1.2340 - val_acc: 0.6315

Epoch 00048: val_acc did not improve
Epoch 49/80
18096/18096 [==============================] - 79s 4ms/step - loss: 0.1308 - acc: 0.9547 - val_loss: 1.0206 - val_acc: 0.6694

Epoch 00049: val_acc did not improve
Epoch 50/80
18096/18096 [==============================] - 79s 4ms/step - loss: 0.1262 - acc: 0.9568 - val_loss: 1.9561 - val_acc: 0.5505

Epoch 00050: val_acc did not improve
Epoch 51/80
18096/18096 [==============================] - 79s 4ms/step - loss: 0.1248 - acc: 0.9570 - val_loss: 1.1792 - val_acc: 0.6717

Epoch 00051: val_acc did not improve
Epoch 52/80
18096/18096 [==============================] - 79s 4ms/step - loss: 0.1142 - acc: 0.9613 - val_loss: 1.1790 - val_acc: 0.6421

Epoch 00052: val_acc did not improve
Epoch 53/80
18096/18096 [==============================] - 79s 4ms/step - loss: 0.1191 - acc: 0.9605 - val_loss: 1.1972 - val_acc: 0.6332

Epoch 00053: val_acc did not improve
Epoch 54/80
18096/18096 [==============================] - 79s 4ms/step - loss: 0.1215 - acc: 0.9591 - val_loss: 1.3862 - val_acc: 0.6329

Epoch 00054: val_acc did not improve
Epoch 55/80
18096/18096 [==============================] - 79s 4ms/step - loss: 0.1113 - acc: 0.9616 - val_loss: 2.2359 - val_acc: 0.5420

Epoch 00055: val_acc did not improve
Epoch 56/80
18096/18096 [==============================] - 79s 4ms/step - loss: 0.1112 - acc: 0.9630 - val_loss: 2.2695 - val_acc: 0.5920

Epoch 00056: val_acc did not improve
Epoch 57/80
18096/18096 [==============================] - 79s 4ms/step - loss: 0.1125 - acc: 0.9619 - val_loss: 1.1923 - val_acc: 0.6796

Epoch 00057: val_acc did not improve
Epoch 58/80
18096/18096 [==============================] - 79s 4ms/step - loss: 0.1071 - acc: 0.9663 - val_loss: 1.4360 - val_acc: 0.6240

Epoch 00058: val_acc did not improve
Epoch 59/80
18096/18096 [==============================] - 79s 4ms/step - loss: 0.1103 - acc: 0.9644 - val_loss: 1.2005 - val_acc: 0.6790

Epoch 00059: val_acc did not improve
Epoch 60/80
18096/18096 [==============================] - 79s 4ms/step - loss: 0.1025 - acc: 0.9676 - val_loss: 1.7558 - val_acc: 0.5657

Epoch 00060: val_acc did not improve
Epoch 61/80
18096/18096 [==============================] - 79s 4ms/step - loss: 0.1011 - acc: 0.9672 - val_loss: 1.4701 - val_acc: 0.6197

Epoch 00061: val_acc did not improve
Epoch 62/80
18096/18096 [==============================] - 79s 4ms/step - loss: 0.1059 - acc: 0.9655 - val_loss: 1.6352 - val_acc: 0.6072

Epoch 00062: val_acc did not improve
Epoch 63/80
18096/18096 [==============================] - 79s 4ms/step - loss: 0.0992 - acc: 0.9682 - val_loss: 1.1573 - val_acc: 0.7069

Epoch 00063: val_acc did not improve
Epoch 64/80
18096/18096 [==============================] - 79s 4ms/step - loss: 0.0991 - acc: 0.9687 - val_loss: 1.4538 - val_acc: 0.5854

Epoch 00064: val_acc did not improve
Epoch 65/80
18096/18096 [==============================] - 79s 4ms/step - loss: 0.0966 - acc: 0.9694 - val_loss: 1.1501 - val_acc: 0.6958

Epoch 00065: val_acc did not improve
Epoch 66/80
18096/18096 [==============================] - 79s 4ms/step - loss: 0.0940 - acc: 0.9700 - val_loss: 1.8419 - val_acc: 0.5825

Epoch 00066: val_acc did not improve
Epoch 67/80
18096/18096 [==============================] - 80s 4ms/step - loss: 0.0909 - acc: 0.9707 - val_loss: 1.4666 - val_acc: 0.6984

Epoch 00067: val_acc did not improve
Epoch 68/80
18096/18096 [==============================] - 80s 4ms/step - loss: 0.0915 - acc: 0.9712 - val_loss: 1.3062 - val_acc: 0.6783

Epoch 00068: val_acc did not improve
Epoch 69/80
18096/18096 [==============================] - 79s 4ms/step - loss: 0.0888 - acc: 0.9723 - val_loss: 3.8147 - val_acc: 0.4992

Epoch 00069: val_acc did not improve
Epoch 70/80
18096/18096 [==============================] - 80s 4ms/step - loss: 0.0923 - acc: 0.9717 - val_loss: 1.6840 - val_acc: 0.6572

Epoch 00070: val_acc did not improve
Epoch 71/80
18096/18096 [==============================] - 79s 4ms/step - loss: 0.0851 - acc: 0.9735 - val_loss: 1.5529 - val_acc: 0.6543

Epoch 00071: val_acc did not improve
Epoch 72/80
18096/18096 [==============================] - 79s 4ms/step - loss: 0.0894 - acc: 0.9719 - val_loss: 2.2659 - val_acc: 0.5400

Epoch 00072: val_acc did not improve
Epoch 73/80
18096/18096 [==============================] - 79s 4ms/step - loss: 0.0919 - acc: 0.9735 - val_loss: 1.4676 - val_acc: 0.6115

Epoch 00073: val_acc did not improve
Epoch 74/80
18096/18096 [==============================] - 79s 4ms/step - loss: 0.0892 - acc: 0.9724 - val_loss: 1.7677 - val_acc: 0.6342

Epoch 00074: val_acc did not improve
Epoch 75/80
18096/18096 [==============================] - 79s 4ms/step - loss: 0.0877 - acc: 0.9726 - val_loss: 2.5043 - val_acc: 0.5578

Epoch 00075: val_acc did not improve
Epoch 76/80
18096/18096 [==============================] - 80s 4ms/step - loss: 0.0843 - acc: 0.9738 - val_loss: 1.9680 - val_acc: 0.5673

Epoch 00076: val_acc did not improve
Epoch 77/80
18096/18096 [==============================] - 80s 4ms/step - loss: 0.0800 - acc: 0.9756 - val_loss: 1.4121 - val_acc: 0.6888

Epoch 00077: val_acc did not improve
Epoch 78/80
18096/18096 [==============================] - 80s 4ms/step - loss: 0.0820 - acc: 0.9751 - val_loss: 1.7393 - val_acc: 0.6572

Epoch 00078: val_acc did not improve
Epoch 79/80
18096/18096 [==============================] - 80s 4ms/step - loss: 0.0829 - acc: 0.9744 - val_loss: 1.4969 - val_acc: 0.6602

Epoch 00079: val_acc did not improve
Epoch 80/80
18096/18096 [==============================] - 80s 4ms/step - loss: 0.0808 - acc: 0.9755 - val_loss: 1.9161 - val_acc: 0.6154

Epoch 00080: val_acc did not improve
CPU times: user 4h 57min, sys: 27min 26s, total: 5h 24min 27s
Wall time: 1h 45min 39s

The test accuracy after 80 epochs is 61%, while the test accuracy using the weights from the best validation accuracy is 41.1%.
My questions:

  1. Is it okay to use the weights corresponding to the best validation accuracy even if it occurs very early in the training process? Or is my network underfitting?
  2. Any suggestions on how to correct this issue?

Best Answer

You didn't mention your training strategy. Assuming that you use pretrained ImageNet weights:

  1. First freeze everything but your newly added dense layer and train it for some epochs (say 10).
  2. Then you might want to train only the last 1/2 or 1/4 of the layers, counting from the output, especially if your data consists of natural images. The goal is to fine-tune the abstract layers without overfitting the network. A sketch of both stages follows this list.
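
A minimal sketch of that two-stage schedule with Keras' ResNet-50 (the unfrozen fraction, epoch counts, and learning rates are illustrative, and the data arrays are placeholders):

```python
from keras.applications import ResNet50
from keras.layers import Dense, GlobalAveragePooling2D
from keras.models import Model
from keras.optimizers import RMSprop

# ImageNet weights require 3-channel input, so single-channel
# representations would need to be replicated across channels.
base = ResNet50(weights='imagenet', include_top=False,
                input_shape=(256, 256, 3))
x = GlobalAveragePooling2D()(base.output)
out = Dense(1, activation='sigmoid')(x)
model = Model(inputs=base.input, outputs=out)

# Stage 1: freeze the whole pretrained body, train only the new head.
for layer in base.layers:
    layer.trainable = False
model.compile(optimizer=RMSprop(lr=1e-4),
              loss='binary_crossentropy', metrics=['acc'])
model.fit(x_train, y_train, validation_data=(x_val, y_val), epochs=10)

# Stage 2: unfreeze the last quarter of the body and fine-tune with a
# lower learning rate. Recompile so the new trainable flags take effect.
n_unfreeze = len(base.layers) // 4
for layer in base.layers[-n_unfreeze:]:
    layer.trainable = True
model.compile(optimizer=RMSprop(lr=1e-5),
              loss='binary_crossentropy', metrics=['acc'])
model.fit(x_train, y_train, validation_data=(x_val, y_val), epochs=20)
```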

Even if you train it from scratch, it is quite clear from the provided log that your model simply overfits the training data without learning generalizable features: training accuracy climbs to 97% while validation accuracy oscillates around 60-70% and validation loss rises steadily.

To prevent that:

  1. Use heavy data augmentation (see the sketch below).
  2. Add weight decay to all layers and set it to roughly 1e-4 or 1e-5.
  3. As mentioned above, decrease the number of trainable layers if you use pretrained weights, freezing them from the input side toward the output.
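
A sketch of points 1 and 2 (the augmentation parameters are placeholders; pick only transforms that are label-preserving for your 2D time-series representations):

```python
from keras.preprocessing.image import ImageDataGenerator
from keras.layers import Conv2D
from keras.regularizers import l2

# 1. Augmentation: parameter values are placeholders. Shifts along the
#    time axis are usually safe; flips and rotations may not be,
#    depending on what the 2D representation encodes.
datagen = ImageDataGenerator(width_shift_range=0.1,
                             height_shift_range=0.05,
                             zoom_range=0.1)

model.fit_generator(datagen.flow(x_train, y_train, batch_size=100),
                    steps_per_epoch=len(x_train) // 100,
                    epochs=80,
                    validation_data=(x_val, y_val))

# 2. Weight decay: attach an L2 kernel regularizer to every layer with
#    weights when building the model, e.g.:
conv = Conv2D(64, (3, 3), activation='relu',
              kernel_regularizer=l2(1e-4))
```

Note that in Keras, "weight decay" is expressed as an L2 penalty via kernel_regularizer rather than as an optimizer option.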