LSTM Models – Effective Strategies for Dealing with LSTM Overfitting in Neural Networks

Tags: lstm, model-evaluation, neural-networks, overfitting, recurrent-neural-network

I'm working on a project to predict time series data with an LSTM.
I ran the experiment three times with randomly sampled data (about 920,000 rows each).

I've stacked 3 layers of LSTM cells,
used L1 (0.01) regularization,
used dropout,
tried shuffling the dataset every epoch,
and used the Adam optimizer.
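
Roughly, the setup looks like this (a simplified Keras-style sketch; the layer sizes, window length, and feature count below are placeholders, not my exact values):

    import tensorflow as tf
    from tensorflow.keras import layers, regularizers

    TIMESTEPS, FEATURES = 50, 10   # placeholder input shape

    model = tf.keras.Sequential([
        layers.Input(shape=(TIMESTEPS, FEATURES)),
        # 3 stacked LSTM layers with L1 regularization on the kernels
        layers.LSTM(64, return_sequences=True, kernel_regularizer=regularizers.l1(0.01)),
        layers.LSTM(64, return_sequences=True, kernel_regularizer=regularizers.l1(0.01)),
        layers.LSTM(64, kernel_regularizer=regularizers.l1(0.01)),
        layers.Dropout(0.2),
        layers.Dense(1),   # single-value regression output
    ])

    model.compile(optimizer="adam", loss="mse")

    # X_train: (samples, TIMESTEPS, FEATURES), y_train: (samples, 1)
    # shuffle=True reshuffles the training samples every epoch
    # model.fit(X_train, y_train, validation_data=(X_val, y_val),
    #           epochs=50, batch_size=256, shuffle=True)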

But I still get error curves like the following, which seem to indicate overfitting:

x-axis: epochs
y-axis: error (mean squared error)

The blue line is the test set and the orange line is the training set.

[Error curves for the 3 experiments]

Can somebody suggest what I should try?
Or could it be an issue with the dataset itself?

Best Answer

Your NN is not necessarily overfitting. Usually, when a network overfits, the validation loss goes up as the network memorizes the training set, and your graph is definitely not doing that. A mere gap between training and validation loss could just mean that the validation set is harder or has a different distribution (unseen data). Also, I don't know what scale your error is on, but maybe 0.15 is not a big difference and it is just a matter of scaling.

As a suggestion, you could try a few things that worked for me:

  1. Add a small amount of dropout to your NN (start with 0.1, for example);
  2. You can also add dropout to your RNN layers, but it is trickier: you have to use the same mask at every time step instead of a new random mask at each step (a recurrent-dropout sketch follows this list);
  3. You could experiment with the network size; maybe the answer is not making it smaller but making it bigger, so your NN can learn more complex functions. To tell whether it is underfitting or overfitting, try plotting predictions vs. real values;
  4. You could do feature selection/engineering -- try adding more features, or removing the ones that you think are just adding noise;
  5. If your NN is simply input -> RNN layers -> output, try adding a few fully connected layers before/after the RNN, and use Mish as the activation function instead of ReLU (a Mish sketch follows this list);
  6. For the optimizer, instead of Adam, try using Ranger;
  7. The problem could also be the loss function. Maybe your labels are very sparse (a lot of zeros), and the model learns to predict all zeros (the sudden drop at the beginning) and can't progress further after that. To handle situations like that you can try a different loss, such as BCE with pos_weight, Dice loss, focal loss, etc. (a weighted-BCE sketch follows this list).
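
For point 2, if you are using Keras, the LSTM layer already exposes this through its dropout and recurrent_dropout arguments, which (as I understand it) reuse the same mask across time steps, so you don't have to implement the masking yourself. A minimal sketch, assuming Keras (layer sizes and input shape are placeholders):

    from tensorflow.keras import layers, models

    model = models.Sequential([
        layers.Input(shape=(50, 10)),   # placeholder (timesteps, features)
        # dropout is applied to the layer inputs, recurrent_dropout to the
        # recurrent state; the same mask is reused at every time step
        layers.LSTM(64, return_sequences=True, dropout=0.1, recurrent_dropout=0.1),
        layers.LSTM(64, dropout=0.1, recurrent_dropout=0.1),
        layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")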
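
For point 5, Mish is just x * tanh(softplus(x)), so you can define it yourself if your framework does not ship it. A rough sketch of the input -> dense -> RNN -> dense -> output idea, again assuming Keras (layer sizes are placeholders):

    import tensorflow as tf
    from tensorflow.keras import layers, models

    def mish(x):
        # Mish activation: x * tanh(softplus(x))
        return x * tf.math.tanh(tf.math.softplus(x))

    model = models.Sequential([
        layers.Input(shape=(50, 10)),        # placeholder (timesteps, features)
        layers.Dense(64, activation=mish),   # fully connected layer before the RNN
        layers.LSTM(64),
        layers.Dense(64, activation=mish),   # fully connected layer after the RNN
        layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")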
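
For point 7, one way to get a pos_weight-style BCE in TensorFlow is tf.nn.weighted_cross_entropy_with_logits. A sketch (the weight value is only illustrative, and this applies to a classification-style target rather than plain MSE regression):

    import tensorflow as tf

    POS_WEIGHT = 10.0   # illustrative: up-weight the rare positive labels

    def weighted_bce(y_true, y_pred_logits):
        # binary cross-entropy with a pos_weight, for sparse/imbalanced labels
        return tf.reduce_mean(
            tf.nn.weighted_cross_entropy_with_logits(
                labels=y_true, logits=y_pred_logits, pos_weight=POS_WEIGHT))

    # model.compile(optimizer="adam", loss=weighted_bce)  # model must output logits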

Good luck!
