Solved – Is this a sign of a (bad) local minimum encountered by this RNN?

overfitting, recurrent neural network

On a binary classification task, I can get perfect performance on the training set, but no matter how strongly I regularize the recurrent neural network (dropout of 0.99, L2 weight penalty of 0.01), the generalization performance on a validation set remains poor.
What strikes me as unusual is that the accuracy on the validation set always seems to get stuck at the same value of 76%. No matter which architecture or regularization alternative I try, this is a threshold my model does not seem able to overcome.

Can this be a (bad) local minimum that is being reached, even though the training set performance is pretty much perfect (loss ~0, accuracy 99%)?

Below is one example of several similar-looking plots of training/validation loss and accuracy. Note that a plateau is reached around 0.7, after which the training set performance (blue) jumps while the validation set performance (red) worsens again (overfitting).

[Plot: training/validation loss and accuracy over epochs]

On the network: for the plotted run, a single-layer RNN with 100 LSTM units is used. Targets are binary labels {0, 1}, class-balanced. The loss function is softmax_cross_entropy_with_logits, with Adam as the optimization algorithm.

Network:

from keras.models import Sequential
from keras.layers import Bidirectional, LSTM, Dropout, Dense, Activation
from keras.optimizers import Adam

# Single bidirectional LSTM layer over windows of shape (window_size, num_input)
model = Sequential()
model.add(Bidirectional(LSTM(800, return_sequences=False),
                        input_shape=(window_size, num_input)))
model.add(Dropout(0.6))
model.add(Dense(2))                 # two output logits for the binary labels
model.add(Activation('softmax'))
adam = Adam(lr=5e-6, beta_1=0.9, beta_2=0.999, epsilon=None, decay=0.0, amsgrad=False)
model.compile(loss='categorical_crossentropy', optimizer=adam, metrics=['accuracy'])
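
For context, a minimal sketch of how this model might be fit, assuming hypothetical arrays X of shape (num_samples, window_size, num_input) and integer labels y in {0, 1} (both names are my own, not from the original post):

from keras.utils import to_categorical

# categorical_crossentropy with a 2-unit softmax expects one-hot targets
y_onehot = to_categorical(y, num_classes=2)
history = model.fit(X, y_onehot,
                    batch_size=64, epochs=100,
                    validation_split=0.2, shuffle=True)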

[Plot: Training with Bidirectional LSTM in Keras. Hidden_Units = 200, Dropout = 0.95]

Best Answer

I think the validation loss diverging at 500 epochs in your plot is more noticeable than the validation accuracy plateauing. I would recommend reducing the number of hidden units and seeing if that changes anything, in case you have not tried this already. I would set my first objective as reaching similar loss and accuracy on training and validation, and then try to improve both together. If none of this helps, I would check the train/validation split. Are you shuffling the data before splitting? Another reason for the performance above could be a different distribution between the training and validation sets.
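
If the shuffling is in doubt, here is a quick sketch of a shuffled, stratified split (scikit-learn here is my addition; X and y are the raw data arrays) that keeps the class balance identical on both sides:

from sklearn.model_selection import train_test_split

# Shuffle before splitting and stratify on the labels so that train and
# validation share the same class distribution; a mismatch here can cap
# validation accuracy regardless of the architecture
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, shuffle=True, stratify=y, random_state=42)

One caveat: if the windows are cut from one long series with overlap, a purely random split can leak nearly identical windows into both sets, so a split along the time axis may be the safer check.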

Edit (based on new results):

I think this is a better starting point now. The dropout values I have usually seen are 0.2-0.5; 0.95 seems too high. If a lower dropout rate overfits, reduce the hidden units to 100.

Here is how I might approach this. Fix the number of epochs to maybe 100, then reduce the hidden units so that after 100 epochs you get the same accuracy on training and validation, even if this is as low as 65%. This is what I would call a good start.
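
A sketch of such a reduced-capacity baseline with a moderate dropout rate; all numbers are illustrative, not tuned:

from keras.models import Sequential
from keras.layers import Bidirectional, LSTM, Dropout, Dense, Activation
from keras.optimizers import Adam

# Fewer hidden units and dropout in the usual 0.2-0.5 range
model = Sequential()
model.add(Bidirectional(LSTM(100, return_sequences=False),
                        input_shape=(window_size, num_input)))
model.add(Dropout(0.3))
model.add(Dense(2))
model.add(Activation('softmax'))
model.compile(loss='categorical_crossentropy', optimizer=Adam(lr=1e-4),
              metrics=['accuracy'])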

From here, I would try the following:

  1. Start increasing the hidden units. (We know from your data that this starts overfitting, so go to option 2.)
  2. Add more layers, starting with a few hidden units, maybe 10 or so per layer (see the sketch after this list).
  3. Try other forms of RNNs, such as the GRU (also used in the sketch below).
  4. Tweak the number of observations per sequence (the window size): lower values may not carry enough information, while higher values may be expensive to run and still fail to capture the long-term dependencies.
  5. If you are dealing with time-series data rather than sequences like text, try applying preprocessing techniques such as a spectrogram and see if that helps (a sketch follows the list).
  6. Take a look at attention layers and see if you can leverage those with your model.
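
To make options 2 and 3 concrete, here is a sketch of a stacked bidirectional GRU with a few units per layer (sizes are illustrative):

from keras.models import Sequential
from keras.layers import Bidirectional, GRU, Dropout, Dense, Activation

# Two stacked recurrent layers with ~10 units each; the first layer must
# return the full sequence so the second layer sees one vector per timestep
model = Sequential()
model.add(Bidirectional(GRU(10, return_sequences=True),
                        input_shape=(window_size, num_input)))
model.add(Bidirectional(GRU(10, return_sequences=False)))
model.add(Dropout(0.3))
model.add(Dense(2))
model.add(Activation('softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam',
              metrics=['accuracy'])

And for option 5, a spectrogram per input channel can be computed with SciPy before feeding the network; the window parameters (nperseg, noverlap) and the sampling rate fs are assumptions to adapt to your data:

import numpy as np
from scipy.signal import spectrogram

# x: 1-D time series for one channel; fs: sampling rate in Hz (assumed known)
freqs, times, Sxx = spectrogram(x, fs=fs, nperseg=128, noverlap=64)
# Log-scale the power and use (time, frequency) as the (timestep, feature) axes
features = np.log(Sxx + 1e-10).T   # shape: (num_timesteps, num_freq_bins)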