Solved – Sequence lengths in LSTM / BiLSTMs and overfitting

Tags: deep learning, LSTM, machine learning, natural language, neural networks

I'm currently working with LSTMs and BiLSTMs, using Keras (TF backend) as my library. Following tutorials and reading some papers, I've found that the sequences used are mostly quite short.

What I have, on the other hand, are really long pairs consisting of a sentence and a related document (or more than one). For each pair I have a binary label. I want to learn whether the sentence makes an assumption about the document.

Since the documents are quite long, I checked how many documents I have on average per sentence (around 4) and how long these docs usually are. Based on this, I found that a good trade-off between document sequence length and losing as little information as possible is around 2500 tokens. The sentence part is at most 100 tokens long. If I have more than one document, I use 2500/#docs tokens for each document and concatenate them, so I end up with sequences of 2600 tokens for each sample.
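Roughly, the truncation and concatenation look like this (a simplified sketch, not my exact code; the function and variable names are just for illustration):

    # Sketch of the per-sample document truncation/concatenation described above.
    # `docs` is a list of token lists for one sample; names are illustrative.
    DOC_BUDGET = 2500   # total tokens reserved for the document side
    SENT_BUDGET = 100   # maximum tokens for the sentence side

    def build_doc_sequence(docs, budget=DOC_BUDGET):
        """Give each related document an equal share of the token budget
        and concatenate the truncated documents into one sequence."""
        per_doc = budget // max(len(docs), 1)
        tokens = []
        for doc in docs:
            tokens.extend(doc[:per_doc])
        return tokens  # later padded/truncated to `budget` before feeding the model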

I then create two BiLSTMs, one for the sentence and one for the doc(s), and concatenate their outputs. I use GloVe embeddings (100d, 400k vocabulary). I pad and mask the sequences. I make the embedding layer trainable, since otherwise there is almost no change in accuracy over the epochs (maybe I just need to run the model for many more epochs, which I can't do).
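To make the setup concrete, here is a simplified Keras sketch of the model (the layer widths, the optimizer and sharing one embedding layer between both branches are simplifications rather than my exact code; `embedding_matrix` and `vocab_size` come from loading GloVe, which is not shown):

    from keras.layers import Input, Embedding, LSTM, Bidirectional, Dropout, Dense, concatenate
    from keras.models import Model

    # Two inputs: 100 tokens for the sentence, 2500 tokens for the concatenated docs.
    sent_in = Input(shape=(100,), name="sentence_tokens")
    doc_in = Input(shape=(2500,), name="document_tokens")

    # GloVe-initialised embedding, padded positions masked out, weights trainable.
    embed = Embedding(input_dim=vocab_size, output_dim=100,
                      weights=[embedding_matrix],
                      mask_zero=True,
                      trainable=True)

    # One BiLSTM per branch, each followed by dropout of 0.2.
    sent_branch = Dropout(0.2)(Bidirectional(LSTM(64))(embed(sent_in)))
    doc_branch = Dropout(0.2)(Bidirectional(LSTM(64))(embed(doc_in)))

    # Concatenate both representations and predict the binary label.
    merged = concatenate([sent_branch, doc_branch])
    output = Dense(1, activation="sigmoid")(merged)

    model = Model(inputs=[sent_in, doc_in], outputs=output)
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])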

Here is my problem: a batch_size of 50 already takes all the memory on the GPU, and running 3400 samples takes 12 minutes for a single epoch. This makes playing around with hyperparameters and overall testing nearly impossible. I shortened the sequences by a factor of 2 (to 1300 tokens) and increased the batch size to 150, but each epoch still takes a bit more than 3 minutes. I noticed the results get a bit worse, but not by much, though I can't really compare results over more epochs because of the computation time.

I have 4 questions about this problem:

  1. What is the usual sequence length you use for LSTMs/BiLSTMs? Are my sequences just too long? I could split the pairs into many (shorter) pairs, but there is a relation between the documents which would be lost (I guess), and this would of course increase the number of samples, so it might not make a difference in computation time.
  2. I'm already filtering out stopwords, and due to the GloVe embeddings a lot of words just get zero vectors and are masked out (though this might not decrease the computation time(?)). What other options do I have for my problem in order to lose as little information as possible? Do you have any suggestions for me?
  3. I get the best results if I stop after only 2 or 3 epochs. The results vary a lot from epoch to epoch, but they are almost always best (i.e. for each fold in my 10-fold cross-validation) within the first 2-5 epochs, and then they usually drop very fast, most probably because I start to overfit on the training data. Still, at this point my model seems quite unstable because of the variations in accuracy/F1. Does this mean I can't really make reliable predictions after so few epochs? I see a lot of examples of LSTMs running for hundreds or thousands of epochs.
  4. The data is quite unbalanced (almost 80% in one class). I calculate class weights and use them for training, so the bigger class gets penalized (see the sketch after this list). Does the training accuracy of 1.0 after only 6 epochs (see the output below) actually mean I predict the training labels perfectly and have already overfitted? It seems odd that it overfits that fast with that many parameters to learn. (Because of the unbalanced data, I output macro F1 on the validation data after each epoch.) I already added a dropout of 0.2 after each BiLSTM.
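Here is a simplified sketch of the class-weight and validation-F1 setup from point 4, combined with early stopping as in point 3. The variable names (`X_sent_train`, `X_doc_train`, `y_train`, `X_sent_val`, `X_doc_val`, `y_val`, and the `model` from the sketch above) and the patience value are placeholders, not my exact code:

    import numpy as np
    from sklearn.metrics import f1_score
    from sklearn.utils.class_weight import compute_class_weight
    from keras.callbacks import Callback, EarlyStopping

    # Class weights for the ~80/20 label imbalance; the bigger class gets the smaller weight.
    weights = compute_class_weight("balanced", classes=np.unique(y_train), y=y_train)
    class_weight = dict(enumerate(weights))

    class MacroF1(Callback):
        """Print macro F1 on the validation data after each epoch."""
        def __init__(self, val_inputs, val_labels):
            super().__init__()
            self.val_inputs = val_inputs
            self.val_labels = val_labels

        def on_epoch_end(self, epoch, logs=None):
            preds = (self.model.predict(self.val_inputs) > 0.5).astype(int).ravel()
            print(" - f1 (macro):", f1_score(self.val_labels, preds, average="macro"))

    model.fit([X_sent_train, X_doc_train], y_train,
              batch_size=50, epochs=10,
              validation_data=([X_sent_val, X_doc_val], y_val),
              class_weight=class_weight,
              callbacks=[MacroF1([X_sent_val, X_doc_val], y_val),
                         EarlyStopping(monitor="val_loss", patience=2)])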

(output of one fold showing 6 epochs and possible overfitting)

Epoch1: 3405/3405 [==============================] - 184s - loss: 0.6863 - acc: 0.6767 - val_loss: 0.6969 - val_acc: 0.4670 - f1 (macro): 0.453781392694
Epoch2: 3405/3405 [==============================] - 168s - loss: 0.3600 - acc: 0.8394 - val_loss: 0.5692 - val_acc: 0.7625 - f1 (macro): 0.633288896534
Epoch3: 3405/3405 [==============================] - 172s - loss: 0.0423 - acc: 0.9897 - val_loss: 0.7367 - val_acc: 0.7704 - f1 (macro): 0.612433442646
Epoch4: 3405/3405 [==============================] - 173s - loss: 0.0094 - acc: 0.9982 - val_loss: 0.8715 - val_acc: 0.7757 - f1 (macro): 0.582041335288
Epoch5: 3405/3405 [==============================] - 174s - loss: 0.0020 - acc: 0.9997 - val_loss: 0.8572 - val_acc: 0.7836 - f1 (macro): 0.599469017424
Epoch6: 3405/3405 [==============================] - 175s - loss: 9.6727e-04 - acc: 1.0000 - val_loss: 0.9114 - val_acc: 0.7810 - f1 (macro): 0.586315277285
Epoch7: 3405/3405 [==============================] - 173s - loss: 8.9306e-04 - acc: 1.0000 - val_loss: 0.9645 - val_acc: 0.7863 - f1 (macro): 0.590663626545

I'm thankful for any advice!

Best Answer

  1. 250-500 timesteps is the 'norm' if you are not using any tricks to retain memory over very long sequences.

  2. No.

  3. Early stopping is a thing, but your problem seems to be in... how you've formulated your problem and posed it to your network.

  4. 20% target class is likely not an issue.

The way you've posed your problem seems odd, and it also seems like you're trying to do something very high level (identifying assumptions) with a single-layer BiLSTM, which may not be appropriate. Perfect accuracy after a short training time generally means there is some sort of a problem in how the question is being posed or how the data are structured (e.g., leakage). How does your model perform on unseen documents?

I would think through what it is you're trying to force your network to do, and if that is even appropriate. I would also look into document summarization (e.g., doc2vec and using a normal NN, or reducing the document to a sequence of sentence-level meaning vectors, rather than word-level).
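For instance, the doc2vec route can look roughly like this with gensim (the hyperparameters are placeholders, and `tokenized_docs` stands in for your tokenised documents):

    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    # `tokenized_docs` is a list of token lists, one per document; parameters are illustrative.
    tagged = [TaggedDocument(words=tokens, tags=[i]) for i, tokens in enumerate(tokenized_docs)]
    d2v = Doc2Vec(tagged, vector_size=100, window=5, min_count=2, workers=4, epochs=20)

    # Each document collapses to a single fixed-size vector, which you can concatenate
    # with a sentence representation and feed to a normal feed-forward network.
    doc_vec = d2v.infer_vector(tokenized_docs[0])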

To reduce your documents to sequences of sentence vectors, you train a standard encoder-decoder to reconstruct sentences: you feed it your word vectors, then have the decoder reconstruct those words. Once it's well trained, you can feed it all your training data (all sentences) and extract the final encoder hidden state for each sentence. Then each document becomes a far, far shorter sequence of fixed-length 'thought vectors'. You may also use an approach called 'skip-thought'.
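A rough Keras sketch of that sentence-autoencoder idea (the dimensions, layer sizes, and MSE reconstruction loss are placeholder choices; skip-thought itself predicts neighbouring sentences rather than reconstructing the input):

    from keras.layers import Input, LSTM, RepeatVector, TimeDistributed, Dense
    from keras.models import Model

    # Encode a sentence of word vectors into one state, then reconstruct those same vectors.
    MAX_SENT_LEN, EMBED_DIM, ENC_DIM = 50, 100, 256  # illustrative sizes

    sent_in = Input(shape=(MAX_SENT_LEN, EMBED_DIM))
    encoded = LSTM(ENC_DIM)(sent_in)                    # final hidden state = sentence vector
    decoded = LSTM(ENC_DIM, return_sequences=True)(RepeatVector(MAX_SENT_LEN)(encoded))
    reconstructed = TimeDistributed(Dense(EMBED_DIM))(decoded)

    autoencoder = Model(sent_in, reconstructed)
    autoencoder.compile(optimizer="adam", loss="mse")
    # autoencoder.fit(sentence_word_vectors, sentence_word_vectors, ...)

    # After training, keep only the encoder: each sentence becomes one ENC_DIM vector,
    # so a document becomes a much shorter sequence of sentence-level 'thought vectors'.
    encoder = Model(sent_in, encoded)
    # sentence_vectors = encoder.predict(sentence_word_vectors)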