Solved – Activation function between LSTM layers

keras, lstm, neural-networks, recurrent-neural-network, tensorflow

I'm aware the LSTM cell uses both sigmoid and tanh activation functions internally; however, when creating a stacked LSTM architecture, does it make sense to pass their outputs through an activation function (e.g. ReLU)?

So do we prefer this:

inputs = Input(shape=(timesteps, n_features))
x = LSTM(100, activation="relu", return_sequences=True)(inputs)
x = LSTM(50, activation="relu", return_sequences=True)(x)
...

over this?

inputs = Input(shape=(timesteps, n_features))
x = LSTM(100, return_sequences=True)(inputs)
x = LSTM(50, return_sequences=True)(x)
...

From my empirical results when building an LSTM autoencoder, I've found the two approaches to perform quite similarly.

Thanks!

Best Answer

The purpose of the Rectified Linear Activation Function (or ReLU for short) is to allow the neural network to learn nonlinear dependencies.

Specifically, ReLU returns its input directly if the value is greater than 0; if the value is less than 0, it returns 0.0. The idea is to allow the network to approximate a linear function when necessary, while retaining the flexibility to account for nonlinearity. This article from Machine Learning Mastery goes into more detail.
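
As a quick illustration (not part of the original answer), ReLU can be written in a couple of lines of NumPy:

import numpy as np

def relu(x):
    # Positive values pass through unchanged; negative values are clipped to 0
    return np.maximum(0.0, x)

print(relu(np.array([-2.0, -0.5, 0.0, 1.5, 3.0])))
# -> [0.  0.  0.  1.5 3. ]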

As for whether adding such an activation function makes much difference to the analysis, much depends on the data. Because ReLU is unbounded above and can produce quite large outputs, it has traditionally been regarded as inappropriate for use with LSTMs.

Let’s consider the following example. Suppose an LSTM is being used as a time series tool to forecast weekly fluctuations in hotel cancellations (all values in the time series are positive, as the number of cancellations cannot be negative). The network structure is as follows:

import tensorflow as tf
from tensorflow.keras.layers import LSTM, Dense

# Generate LSTM network
model = tf.keras.Sequential()
model.add(LSTM(4, input_shape=(1, previous)))  # 'previous' = number of lag observations per sample
model.add(Dense(1))
model.compile(loss='mean_squared_error', optimizer='adam')
history = model.fit(X_train, Y_train, validation_split=0.2, epochs=500, batch_size=1, verbose=2)
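
X_train and Y_train are not defined in the excerpt above. As a sketch only, one common way to build them from a univariate series is a sliding lag window of length previous, reshaped into the (samples, timesteps, features) layout the LSTM above expects (here 1 timestep with previous features); make_supervised and weekly_cancellations_train are illustrative names, not from the original post:

import numpy as np

def make_supervised(series, previous):
    # Each sample uses the last `previous` observations to predict the next value
    X, y = [], []
    for i in range(len(series) - previous):
        X.append(series[i:i + previous])
        y.append(series[i + previous])
    X = np.array(X).reshape(-1, 1, previous)  # (samples, timesteps=1, features=previous)
    return X, np.array(y)

# e.g. X_train, Y_train = make_supervised(weekly_cancellations_train, previous)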

When the predictions are compared with the test data, the following readings are obtained:

  • Mean Directional Accuracy: 80%
  • Root Mean Squared Error: 92
  • Mean Forecast Error: 29
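
The answer does not show how these metrics are computed. For reference, a minimal sketch of the three (sign conventions for the forecast error vary between sources) might look like this:

import numpy as np

def mean_directional_accuracy(actual, predicted):
    # Share of steps where the predicted direction of change matches the actual one
    return np.mean(np.sign(np.diff(actual)) == np.sign(np.diff(predicted)))

def rmse(actual, predicted):
    return np.sqrt(np.mean((np.asarray(actual) - np.asarray(predicted)) ** 2))

def mean_forecast_error(actual, predicted):
    # Average signed error (forecast bias)
    return np.mean(np.asarray(actual) - np.asarray(predicted))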

Now, suppose that a ReLU activation function is invoked:

# Generate LSTM network, this time with a ReLU activation on the LSTM layer
model = tf.keras.Sequential()
model.add(LSTM(4, activation="relu", input_shape=(1, previous)))
model.add(Dense(1))
model.compile(loss='mean_squared_error', optimizer='adam')
history = model.fit(X_train, Y_train, validation_split=0.2, epochs=500, batch_size=1, verbose=2)

This time, the test-set readings are:

  • Mean Directional Accuracy: 80%
  • Root Mean Squared Error: 96.78
  • Mean Forecast Error: 9.40

We see better performance on MFE and slightly worse performance for RMSE. That said, note the difference between the two graphs:

[Figure: Predictions without ReLU]

[Figure: Predictions with ReLU]

We can see that the predictions with ReLU flatten out the volatility in the time series. While this might result in better performance on some metrics (in this case MFE), it also means that the network is not picking up the right volatility trends in the data, as the activation function is not appropriate for the type of data under analysis. Therefore, the superior performance on MFE becomes irrelevant under these circumstances.
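
The two graphs cannot be reproduced from the excerpt alone, but a comparison plot of that kind can be generated with matplotlib, assuming arrays of test values and model predictions (variable names here are illustrative):

import matplotlib.pyplot as plt

def plot_forecast(actual, predicted, title):
    # Overlay predictions on the test series to inspect how much volatility is retained
    plt.plot(actual, label="actual")
    plt.plot(predicted, label="predicted")
    plt.title(title)
    plt.xlabel("Week")
    plt.ylabel("Cancellations")
    plt.legend()
    plt.show()

# e.g. plot_forecast(Y_test, model.predict(X_test).ravel(), "Predictions with ReLU")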

In this regard, one should not use ReLU (or any activation function for that matter) blindly – it may not be appropriate for the data (or model) in question.