Solved – Why are the weights of RNN/LSTM networks shared across time

lstm, machine learning, recurrent neural network

I've recently become interested in LSTMs and I was surprised to learn that the weights are shared across time.

  • I know that if you share the weights across time, then your input time sequences can be of variable length.

  • With shared weights you have many fewer parameters to train.

From my understanding, the reason one would turn to an LSTM over some other learning method is that you believe there is some sort of temporal/sequential structure or dependence in your data that you would like to learn. If you sacrifice the variable-length 'luxury' and accept a longer computation time, wouldn't an RNN/LSTM without shared weights (i.e. different weights for every time step) perform much better, or is there something I'm missing?
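
To make the trade-off in the question concrete, here is a minimal NumPy sketch (all sizes and names are illustrative, not taken from any particular library) contrasting a vanilla RNN with one shared set of weights against a hypothetical "untied" variant that keeps separate weights for each of T time steps. The shared cell has a fixed parameter count and runs on inputs of any length; the untied variant has T times as many parameters and only works up to the length it was built for.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_h, T = 8, 16, 5        # illustrative input size, hidden size, fixed length T

# Shared weights: one (W_x, W_h, b) triple reused at every time step.
shared = {
    "W_x": rng.standard_normal((d_h, d_in)),
    "W_h": rng.standard_normal((d_h, d_h)),
    "b":   np.zeros(d_h),
}

# Untied weights (hypothetical): a separate (W_x, W_h, b) triple for each of the T steps.
untied = [{"W_x": rng.standard_normal((d_h, d_in)),
           "W_h": rng.standard_normal((d_h, d_h)),
           "b":   np.zeros(d_h)} for _ in range(T)]

def rnn_step(p, x_t, h):
    """One vanilla RNN step: h' = tanh(W_x x_t + W_h h + b)."""
    return np.tanh(p["W_x"] @ x_t + p["W_h"] @ h + p["b"])

def run_shared(xs):
    h = np.zeros(d_h)
    for x_t in xs:                  # the same weights work for any sequence length
        h = rnn_step(shared, x_t, h)
    return h

def run_untied(xs):
    h = np.zeros(d_h)
    for t, x_t in enumerate(xs):    # raises IndexError once len(xs) > T
        h = rnn_step(untied[t], x_t, h)
    return h

count = lambda p: sum(v.size for v in p.values())
print("shared parameters:", count(shared))                   # one cell's worth
print("untied parameters:", sum(count(p) for p in untied))   # T times as many

run_shared([rng.standard_normal(d_in) for _ in range(12)])   # fine: length 12
# run_untied(...) on a length-12 input would fail, because only T=5 weight
# sets exist; the untied model is locked to the length it was built for.
```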

Best Answer

The accepted answer focuses on the practical side of the question: not sharing the parameters would require a lot of resources. However, the decision to share parameters in RNNs was made back when any serious computation was a problem (the 1980s, according to Wikipedia), so I believe it wasn't the main argument (though it is still valid).

There are purely theoretical reasons for parameter sharing:

  • It helps in applying the model to examples of different lengths. If an RNN used different parameters for each time step while reading a sequence during training, it would not generalize to unseen sequences of other lengths.

  • Oftentimes, a sequence operates according to the same rules at every position. For instance, in NLP:

                                                     "On Monday it was snowing"

                                                     "It was snowing on Monday"

...these two sentences mean the same thing, though the details appear in different parts of the sequence. Parameter sharing reflects the fact that we are performing the same task at each step; as a result, we don't have to relearn the rules at each position in the sentence.
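
One way to see the "same task at each step" point in training terms: with shared weights, backpropagation through time sums the gradient contributions from every position into the same matrices, so whatever the model learns about a word applies wherever that word appears. Below is a minimal autograd sketch of that accumulation (assuming PyTorch is available; sizes and names are again illustrative), not a full training loop.

```python
import torch

torch.manual_seed(0)
d_in, d_h = 4, 6                       # illustrative sizes

# One shared set of parameters, reused at every time step.
W_x = torch.randn(d_h, d_in, requires_grad=True)
W_h = torch.randn(d_h, d_h, requires_grad=True)

def step(x_t, h):
    """Vanilla RNN step with the shared weights: h' = tanh(W_x x_t + W_h h)."""
    return torch.tanh(W_x @ x_t + W_h @ h)

xs = [torch.randn(d_in) for _ in range(5)]   # a length-5 input sequence
h = torch.zeros(d_h)
for x_t in xs:
    h = step(x_t, h)

loss = h.sum()                         # toy loss on the final hidden state
loss.backward()

# Backpropagation through time: W_x.grad and W_h.grad each hold the *sum* of
# the gradient contributions from all 5 steps. A single update to the shared
# weights therefore applies what was learned at any position to every position.
print(W_x.grad.shape, W_h.grad.shape)  # torch.Size([6, 4]) torch.Size([6, 6])
```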

An LSTM is no different in this sense; hence it uses shared parameters as well.