Time Series – Stopping Exploding Gradients in Keras

lstm, theano, time-series

I have an LSTM (Long Short-Term Memory) neural network with this structure:

from keras.models import Sequential
from keras.layers import Masking, LSTM, Dense
from keras.optimizers import Adam

model = Sequential()
model.add(Masking(mask_value=0.0, input_shape=(271, 2)))   # 271 time steps, 2 features
adam = Adam(lr=.0000001, clipnorm=.001)                    # tiny learning rate + gradient norm clipping
model.add(LSTM(271, activation='linear', input_shape=(271, 2)))  # return_sequences=True
#model.add(LSTM(3, activation='hard_sigmoid', inner_activation='hard_sigmoid'))
#model.add(Dense(1, activation='linear'))
model.compile(loss='mean_squared_logarithmic_error', optimizer=adam)
model.fit(maskingreg, maskingresp, nb_epoch=50, batch_size=500, verbose=2)  # maskingreg: (52, 271, 2)

As you can see, I am clipping gradients and using a minuscule learning rate. However, when the activation is linear, the network returns NaN for the loss at every training epoch after the first. Does anyone have other ideas about why this is happening, or know of more in-depth troubleshooting tools in Keras for figuring out why the loss is NaN?
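For reference, the closest thing to a troubleshooter I have come up with so far is a custom callback that halts training as soon as the per-batch loss goes NaN. This is just a minimal sketch, not something I have tested thoroughly:

import numpy as np
from keras.callbacks import Callback

class StopOnNaN(Callback):
    # Halt training the moment the per-batch loss becomes NaN.
    def on_batch_end(self, batch, logs=None):
        loss = (logs or {}).get('loss')
        if loss is not None and np.isnan(loss):
            print('NaN loss at batch {}, stopping training.'.format(batch))
            self.model.stop_training = True

# used as: model.fit(maskingreg, maskingresp, ..., callbacks=[StopOnNaN()])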

Additionally, when I use a hard sigmoid activation so that the minimum and maximum outputs are capped, the network doesn't spit out NaNs, but it also doesn't perform well. Capping outputs is obviously not the main use of a hard sigmoid, which is why I would prefer to stick with a linear activation.

One problem I considered is that I'm doing the masking incorrectly and that this produces a very low-rank tensor. I have 52 samples, 271 time steps, and 2 features. My assumption was that if I set both features at a given time step of a given sample to 0.0 (with mask_value=0.0), that time step would be skipped for that sample, while other samples at the same time step, and earlier or later time steps of the same sample, would still be read. I think this is right, but masking seemed pretty confusing, so I'm not positive.
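To make that concrete, here is a toy, shrunken version of the padding (the real array is 52 × 271 × 2), plus a quick NumPy check of which time steps should survive the mask if my understanding is right:

import numpy as np

# 2 samples, 4 time steps, 2 features (shapes shrunk for illustration)
toy = np.array([
    [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0], [0.0, 0.0]],   # last two steps fully zero -> padded
    [[5.0, 6.0], [0.0, 7.0], [8.0, 9.0], [0.0, 0.0]],   # step 2 has one zero feature -> should NOT be masked
])

# My understanding: a time step is masked for a sample only when ALL of its
# features equal mask_value; other samples and other time steps are unaffected.
keep = np.any(toy != 0.0, axis=-1)
print(keep)
# [[ True  True False False]
#  [ True  True  True False]]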

Best Answer

I had this problem before, and one solution that worked for me was weight and activity regularization, specifically L2 regularization. Even with very small alpha values (0.000001), it helped a lot without compromising accuracy too much.

For your code, add the following import statement:

from keras.regularizers import l2

And change the LSTM layer in your code to the following:

model.add(LSTM(271, activation='linear', input_shape=(271, 2),
               kernel_regularizer=l2(0.0000001),
               activity_regularizer=l2(0.0000001)))

Another thing I've noticed from experience is that this solution becomes less robust if you stack many LSTM layers with linear or ReLU activations. You may need to play around with the regularization constants, the learning rate, and the norm clipping to get things working properly for your particular problem.
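As a very rough starting point, a stacked version with regularization on every recurrent layer plus norm clipping could look like the sketch below. The layer sizes, alpha constants, learning rate, and clipnorm are placeholders to tune, and the Dense(1) output is just a stand-in for whatever output layer matches your targets:

from keras.models import Sequential
from keras.layers import Masking, LSTM, Dense
from keras.optimizers import Adam
from keras.regularizers import l2

model = Sequential()
model.add(Masking(mask_value=0.0, input_shape=(271, 2)))
model.add(LSTM(64, activation='linear', return_sequences=True,
               kernel_regularizer=l2(1e-6), activity_regularizer=l2(1e-6)))
model.add(LSTM(64, activation='linear',
               kernel_regularizer=l2(1e-6), activity_regularizer=l2(1e-6)))
model.add(Dense(1, activation='linear'))   # placeholder output layer

# Tune these three together: regularization strength, learning rate, clipnorm.
adam = Adam(lr=1e-5, clipnorm=1.0)
model.compile(loss='mean_squared_logarithmic_error', optimizer=adam)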