Time Series – Stopping Exploding Gradients in Keras

lstm, theano, time-series

I have an LSTM (Long Short-Term Memory) neural network with this structure:

from keras.models import Sequential
from keras.layers import Masking, LSTM, Dense
from keras.optimizers import Adam

model = Sequential()
model.add(Masking(mask_value=0.0, input_shape=(271, 2)))   # 271 time steps, 2 features
adam = Adam(lr=.0000001, clipnorm=.001)                    # tiny learning rate + gradient norm clipping
model.add(LSTM(271, activation='linear', input_shape=(271, 2)))  # return_sequences=True
#model.add(LSTM(3, activation='hard_sigmoid', inner_activation='hard_sigmoid'))
#model.add(Dense(1, activation='linear'))
model.compile(loss='mean_squared_logarithmic_error', optimizer=adam)
model.fit(maskingreg, maskingresp, nb_epoch=50, batch_size=500, verbose=2)  # maskingreg: (52, 271, 2)

As you can see, I am clipping gradients and using a minuscule learning rate. However, when the activation is linear, the network returns NaN for the loss at every training epoch after the first. Does anyone have other ideas about why this is happening, or know of more in-depth troubleshooting tools in Keras for figuring out why the loss is NaN?
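For reference, the closest thing to a troubleshooter I have come up with so far is a custom callback that halts training as soon as the per-batch loss goes NaN. This is just a minimal sketch, not something I have tested thoroughly:

import numpy as np
from keras.callbacks import Callback

class StopOnNaN(Callback):
    # Halt training the moment the per-batch loss becomes NaN.
    def on_batch_end(self, batch, logs=None):
        loss = (logs or {}).get('loss')
        if loss is not None and np.isnan(loss):
            print('NaN loss at batch {}, stopping training.'.format(batch))
            self.model.stop_training = True

# used as: model.fit(maskingreg, maskingresp, ..., callbacks=[StopOnNaN()])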

Additionally, when I use a hard sigmoid activation so that the minimum and maximum outputs are capped, the network doesn't spit out NaNs, but it also doesn't perform well. Capping outputs is obviously not the main use of a hard sigmoid, which is why I would prefer to stick with a linear activation.

One problem I considered is that I'm doing the masking incorrectly and that this produces a very low-rank tensor. I have 52 samples, 271 time steps, and 2 features. My assumption was that if I set both features at a given time step of a given sample to 0.0 (with mask_value=0.0), that time step would be skipped for that sample, while other samples at the same time step, and earlier or later time steps of the same sample, would still be read. I think this is right, but masking seemed pretty confusing, so I'm not positive.
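To make that concrete, here is a toy, shrunken version of the padding (the real array is 52 × 271 × 2), plus a quick NumPy check of which time steps should survive the mask if my understanding is right:

import numpy as np

# 2 samples, 4 time steps, 2 features (shapes shrunk for illustration)
toy = np.array([
    [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0], [0.0, 0.0]],   # last two steps fully zero -> padded
    [[5.0, 6.0], [0.0, 7.0], [8.0, 9.0], [0.0, 0.0]],   # step 2 has one zero feature -> should NOT be masked
])

# My understanding: a time step is masked for a sample only when ALL of its
# features equal mask_value; other samples and other time steps are unaffected.
keep = np.any(toy != 0.0, axis=-1)
print(keep)
# [[ True  True False False]
#  [ True  True  True False]]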

Best Answer

I had this problem before, and one solution that worked for me was weight and activity regularization, specifically L2 regularization. Even with very small alpha values (0.000001), it helped a lot without compromising accuracy too much.

For your code, add the following import statement:

from keras.regularizers import l2

And change the LSTM layer in your code to the following:

model.add(LSTM(271, activation='linear', input_shape=(271, 2),
               kernel_regularizer=l2(0.0000001),
               activity_regularizer=l2(0.0000001)))

Another thing I've noticed from experience is that this solution becomes less robust if you stack many LSTM layers with linear or ReLU activations. You may need to play around with the regularization constants, the learning rate, and the norm clipping to get things working properly for your particular problem.
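As a very rough starting point, a stacked version with regularization on every recurrent layer plus norm clipping could look like the sketch below. The layer sizes, alpha constants, learning rate, and clipnorm are placeholders to tune, and the Dense(1) output is just a stand-in for whatever output layer matches your targets:

from keras.models import Sequential
from keras.layers import Masking, LSTM, Dense
from keras.optimizers import Adam
from keras.regularizers import l2

model = Sequential()
model.add(Masking(mask_value=0.0, input_shape=(271, 2)))
model.add(LSTM(64, activation='linear', return_sequences=True,
               kernel_regularizer=l2(1e-6), activity_regularizer=l2(1e-6)))
model.add(LSTM(64, activation='linear',
               kernel_regularizer=l2(1e-6), activity_regularizer=l2(1e-6)))
model.add(Dense(1, activation='linear'))   # placeholder output layer

# Tune these three together: regularization strength, learning rate, clipnorm.
adam = Adam(lr=1e-5, clipnorm=1.0)
model.compile(loss='mean_squared_logarithmic_error', optimizer=adam)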