Solved – the output of tf.nn.dynamic_rnn()

Tags: deep-learning, gru, lstm, recurrent-neural-network, tensorflow

I am not sure I understand the official documentation, which says:

Returns:
A pair (outputs, state) where:

outputs: The RNN output Tensor.

If time_major == False (default), this will be a Tensor shaped:
[batch_size, max_time, cell.output_size].

If time_major == True, this will be a Tensor shaped: [max_time,
batch_size, cell.output_size]
.

Note, if cell.output_size is a (possibly nested) tuple of integers or
TensorShape objects, then outputs will be a tuple having the same
structure as cell.output_size, containing Tensors having shapes
corresponding to the shape data in cell.output_size.

state: The final state. If cell.state_size is an int, this will be
shaped [batch_size, cell.state_size]. If it is a TensorShape, this
will be shaped [batch_size] + cell.state_size. If it is a (possibly
nested) tuple of ints or TensorShape, this will be a tuple having the
corresponding shapes. If cells are LSTMCells state will be a tuple
containing a LSTMStateTuple for each cell.

Is output[-1] always (for all three cell types, i.e. RNN, GRU, LSTM) equal to state (the second element of the returned tuple)? I suspect the literature everywhere is too liberal in its use of the term hidden state. Is the hidden state in all three cells the value coming out (why it is called hidden is beyond me; it would seem the cell state in LSTM should be called the hidden state, as it is the one that is not exposed)?

Best Answer

Yes, the cell output equals the hidden state. In the case of LSTM, it's the short-term part of the state (the second element of LSTMStateTuple), as can be seen in this picture:

[figure: LSTM cell diagram]
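To see why "output equals hidden state" holds for a vanilla RNN, here is a minimal NumPy sketch of a single BasicRNNCell-style step (the weights and shapes are illustrative, not TensorFlow's actual initialization): the cell computes one tanh-activated vector, and that same vector is returned both as the step's output and as the new state.

```python
import numpy as np

# Illustrative shapes, matching the TF example below: 3 inputs, 5 units.
rng = np.random.default_rng(0)
n_inputs, n_neurons = 3, 5
Wx = rng.standard_normal((n_inputs, n_neurons))   # input-to-hidden weights
Wh = rng.standard_normal((n_neurons, n_neurons))  # hidden-to-hidden weights
b = np.zeros(n_neurons)                           # bias

def basic_rnn_step(x, h_prev):
    """One vanilla RNN step: the new hidden state IS the output."""
    h_new = np.tanh(x @ Wx + h_prev @ Wh + b)
    return h_new, h_new  # (output, state) -- the same vector twice

x = rng.standard_normal(n_inputs)
h0 = np.zeros(n_neurons)
out, state = basic_rnn_step(x, h0)
assert np.array_equal(out, state)
```

GRU is analogous (its output is its single state vector); LSTM differs only in that it additionally carries the cell state c, which is not emitted as output.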

But for tf.nn.dynamic_rnn, the returned state may be different when the sequence is shorter (sequence_length argument). Take a look at this example:

import numpy as np
import tensorflow as tf

n_steps = 2
n_inputs = 3
n_neurons = 5

X = tf.placeholder(dtype=tf.float32, shape=[None, n_steps, n_inputs])
seq_length = tf.placeholder(tf.int32, [None])

basic_cell = tf.nn.rnn_cell.BasicRNNCell(num_units=n_neurons)
outputs, states = tf.nn.dynamic_rnn(basic_cell, X, sequence_length=seq_length, dtype=tf.float32)

X_batch = np.array([
  # t = 0      t = 1
  [[0, 1, 2], [9, 8, 7]], # instance 0
  [[3, 4, 5], [0, 0, 0]], # instance 1
  [[6, 7, 8], [6, 5, 4]], # instance 2
  [[9, 0, 1], [3, 2, 1]], # instance 3
])
seq_length_batch = np.array([2, 1, 2, 2])

with tf.Session() as sess:
  sess.run(tf.global_variables_initializer())
  outputs_val, states_val = sess.run([outputs, states], 
                                     feed_dict={X: X_batch, seq_length: seq_length_batch})

  print(outputs_val)
  print()
  print(states_val)

Here the input batch contains 4 sequences, one of which is short and padded with zeros. Upon running you should see something like this:

[[[ 0.2315362  -0.37939444 -0.625332   -0.80235624  0.2288385 ]
  [ 0.9999524   0.99987394  0.33580178 -0.9981791   0.99975705]]

 [[ 0.97374666  0.8373545  -0.7455188  -0.98751736  0.9658986 ]
  [ 0.          0.          0.          0.          0.        ]]

 [[ 0.9994331   0.9929737  -0.8311569  -0.99928087  0.9990415 ]
  [ 0.9984355   0.9936006   0.3662448  -0.87244385  0.993848  ]]

 [[ 0.9962312   0.99659646  0.98880637  0.99548346  0.9997809 ]
  [ 0.9915743   0.9936939   0.4348318   0.8798458   0.95265496]]]

[[ 0.9999524   0.99987394  0.33580178 -0.9981791   0.99975705]
 [ 0.97374666  0.8373545  -0.7455188  -0.98751736  0.9658986 ]
 [ 0.9984355   0.9936006   0.3662448  -0.87244385  0.993848  ]
 [ 0.9915743   0.9936939   0.4348318   0.8798458   0.95265496]]

... which indeed shows that the state equals output[1] (the last time step) for the full sequences and output[0] for the short one. Note that output[1] is a zero vector for that short sequence. The same holds for LSTM and GRU cells.

So state is a convenient tensor that holds the last actual RNN state, ignoring the padding zeros. The outputs tensor holds the outputs of all time steps, so it does not ignore the zeros. That's the reason for returning both of them.
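If you only had the padded outputs tensor and the sequence lengths, you could recover the same thing yourself. A small NumPy sketch (with a made-up toy batch, not the values from the run above): index each sequence at its last valid time step.

```python
import numpy as np

# Toy padded outputs: batch of 3 sequences, max_time=2, 2 units each.
outputs = np.array([
    [[0.1, 0.2], [0.3, 0.4]],   # length 2
    [[0.5, 0.6], [0.0, 0.0]],   # length 1 (second step is padding)
    [[0.7, 0.8], [0.9, 1.0]],   # length 2
])
seq_length = np.array([2, 1, 2])

# For each sequence i, the final state is outputs[i, seq_length[i] - 1].
batch = np.arange(outputs.shape[0])
last_states = outputs[batch, seq_length - 1]
```

This is exactly the selection that dynamic_rnn performs internally when it builds the returned state, which is why it is convenient to get it for free.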