Solved – Update of the cell state functions in LSTM RNNs – Interplay of sigmoid and tanh

artificial intelligence, lstm, machine learning, neural networks, recurrent neural network

In trying to build some intuition for what goes on inside an LSTM, there is one step that, once understood, has the potential to make everything fall into place nicely.

Here is a diagram from this tutorial illustrating the step in question with nomenclature notes added (feel free to correct mistakes):

[Diagram: the cell-state update step of an LSTM, showing the input gate $i_t$ and the candidate cell state $\tilde C_t$.]
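For reference, the formulas in that step of the tutorial, as I read them, are

$$i_t = \sigma\left(W_i \cdot [h_{t-1}, x_t] + b_i\right)$$
$$\tilde C_t = \tanh\left(W_C \cdot [h_{t-1}, x_t] + b_C\right)$$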

I understand $[h_{t-1}, x_t]$ as the vector (or matrix?) concatenating the previous hidden state or output $h_{t-1}$ and the new input $x_t$, while $\sigma$ stands for the sigmoid activation. $\tilde C_t$ is the proposed (candidate) new cell state, and $i_t$ the input/update gate (?).

I presume that $\otimes$ in the diagram (minor design differences aside) is the Hadamard product.

Both the $\sigma$ in the $i_t$ equation and the $\tanh$ in the $\tilde C_t$ formula take in the concatenated vector $[h_{t-1},x_t]$, each with its own learned weights and biases.
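To check that I'm reading this correctly, here is a minimal numpy sketch of the step (the sizes and the weights $W_i$, $b_i$, $W_C$, $b_C$ are made up for illustration; in a real LSTM they are learned):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)

hidden_size, input_size = 4, 3
h_prev = rng.standard_normal(hidden_size)   # h_{t-1}, previous hidden state / output
x_t = rng.standard_normal(input_size)       # x_t, new input
concat = np.concatenate([h_prev, x_t])      # [h_{t-1}, x_t]

# Made-up (untrained) parameters, one set per activation.
W_i = rng.standard_normal((hidden_size, hidden_size + input_size))
b_i = np.zeros(hidden_size)
W_C = rng.standard_normal((hidden_size, hidden_size + input_size))
b_C = np.zeros(hidden_size)

i_t = sigmoid(W_i @ concat + b_i)        # gate: each entry in (0, 1)
C_tilde = np.tanh(W_C @ concat + b_C)    # candidate update: each entry in (-1, 1)

update = i_t * C_tilde                   # Hadamard product: the gated update
print(i_t, C_tilde, update, sep="\n")
```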

The question is:

What is the intuition behind, and the purpose of, obtaining the sigmoid activation and the tanh activation of the same vector, and then combining them through element-wise (Hadamard) multiplication?

I see that the sigmoid provides a weight for each element of the vector, but what is the purpose / convenience / usefulness of using $\tanh$? Part of the answer lies in mitigating the vanishing gradient problem, but there seem to be more patterns and ideas behind the $\sigma$ / $\tanh$ interplay.

For instance, with this Quora answer:

When the portion of signal arrives, the gate regulates which parts of the signal should be allowed into the unit and how much of those parts should be allowed. That’s why the gates use sigmoid activation function which takes values from 0 to 1 and not tanh (-1 to 1), because it functions like a filter.

… these concepts seem to fall into place.
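As a toy numerical illustration of that filter idea (all numbers invented): if for one time step the gate and the candidate come out as

$$i_t = (0.99,\ 0.50,\ 0.01), \qquad \tilde C_t = (0.80,\ -0.60,\ 0.30),$$

then the element-wise product is

$$i_t \otimes \tilde C_t = (0.792,\ -0.300,\ 0.003),$$

so the first (signed) change passes through almost untouched, the second is halved, and the third is almost completely blocked.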

Best Answer

  • sigmoid squashes its input into the range 0 .. 1, so the output of a sigmoid is like a binary on/off switch. But a hard binary on/off is not differentiable and cannot learn through back-propagation, so we 'soften' the binary 0/1 switch into a sigmoid, which is differentiable.
  • the tanh produces the actual update that we want to apply wherever the sigmoid is outputting 1, or close to 1. tanh squashes its input into the range -1 .. 1, so the update can be negative as well as positive, since we might wish to apply negative changes to the cell state (see the sketch below).
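A small numpy sketch of both points (all numbers are hand-picked for illustration, not taken from a trained network): a hard 0/1 gate has zero gradient almost everywhere, while the sigmoid gate has a usable gradient, and the tanh candidate carries the sign of the change.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Pre-activations for one time step (invented values).
z = np.array([-4.0, -1.0, 0.0, 1.0, 4.0])

hard_gate = (z > 0).astype(float)           # binary on/off: gradient is 0 almost everywhere
soft_gate = sigmoid(z)                      # 'softened' on/off, each entry in (0, 1)
soft_grad = soft_gate * (1.0 - soft_gate)   # d(sigmoid)/dz: non-zero, so backprop can adjust the gate

candidate = np.tanh(z)                      # signed update, each entry in (-1, 1)
gated_update = soft_gate * candidate        # Hadamard product: the update that actually gets applied

print(hard_gate)     # [0. 0. 0. 1. 1.]
print(soft_grad)     # non-zero everywhere, largest near z = 0
print(gated_update)  # negative entries are suppressed or passed through, never flipped in sign
```

Feeding the same $z$ to both the gate and the candidate here is only to keep the example short; in the LSTM they come from separate weight matrices applied to $[h_{t-1}, x_t]$, as in the equations above.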