Solved – What does alignment between input and output mean for a recurrent neural network?

machine learning, neural networks, recurrent neural network

I'm working with RNNs, and during my research I have found many mentions of alignment between input and output.

For example (Sutskever et al., 2014):

The RNN can easily map sequences to sequences whenever the alignment between the inputs and the outputs is known ahead of time. However, it is not clear how to apply an RNN to problems whose input and output sequences have different lengths with complicated and non-monotonic relationships.

There are also implementations of the encoder-decoder architecture that talk about soft alignment.

Now, I think the alignment issue means that one time step in the input does not necessarily map to the output at the same time step, but I am not sure whether this explanation is correct. I cannot find any good resources explaining what alignment between input and output actually means. Any help or explanation is much appreciated.

Best Answer

What the paragraph from Sutskever et al. means is that, in a single RNN, the RNN receives a sequence of inputs and gives a sequence of outputs, typically the same number of outputs as inputs, like:

Inputs:  i1 i2 i3 i4 i5 i6 i7
Outputs:    o1 o2 o3 o4 o5 o6 o7

So, for each input, we get an output, which could be, for example, a prediction for the input at the next timestep, though that's not obligatory, just a typical use case.

Now, this works for tasks such as predicting the next word of a sentence: i1 is the first word, i2 is the second, and so on; o1 is a prediction for i2, and o2 is a prediction for i3. There is a one-to-one mapping between inputs and outputs.
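To make that concrete, here is a minimal sketch of the one-output-per-input setup. The question doesn't mention a framework, so this uses PyTorch purely for illustration, and all of the names and sizes are made up:

import torch
import torch.nn as nn

vocab_size, embed_dim, hidden_dim = 1000, 32, 64     # illustrative sizes

embed = nn.Embedding(vocab_size, embed_dim)
rnn = nn.RNN(embed_dim, hidden_dim, batch_first=True)
to_vocab = nn.Linear(hidden_dim, vocab_size)          # turns each hidden state into word scores

tokens = torch.randint(0, vocab_size, (1, 7))         # i1 ... i7: a batch of 1 sequence, 7 timesteps
states, _ = rnn(embed(tokens))                        # shape (1, 7, hidden_dim): one state per input
logits = to_vocab(states)                             # shape (1, 7, vocab_size): o1 ... o7, one per input

The output has exactly as many timesteps as the input, which is the "known alignment" case from the quote.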

However, to translate from French to English, for example, the number of input words and output words might not match:

Il pleut
It is    raining

2 words => 3 words

Sequence to sequence solves this by having two RNNs, back to back. The first one takes an arbitrary-length sequence and maps it to a single embedding vector, which is simply the hidden state of the RNN after it has received all the input words.

i1 i2 i3 i4 i5 i6 i7 ... => embedding-vector

The second RNN is then initialized with this embedding-vector and predicts words freely, with no further inputs, until it outputs a termination token.

embedding-vector => o1 o2 o3 ... termination-token
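To make the two halves concrete, here is a rough sketch of such an encoder-decoder pair, again in PyTorch and again with made-up names, sizes, and token ids (START, END, and the 50-step cap are assumptions for illustration, not details from the paper):

import torch
import torch.nn as nn

src_vocab, tgt_vocab, embed_dim, hidden_dim = 1000, 1200, 32, 64
START, END = 0, 1                                      # assumed special token ids in the target vocabulary

src_embed = nn.Embedding(src_vocab, embed_dim)
encoder = nn.RNN(embed_dim, hidden_dim, batch_first=True)

tgt_embed = nn.Embedding(tgt_vocab, embed_dim)
decoder = nn.RNN(embed_dim, hidden_dim, batch_first=True)
to_tgt_vocab = nn.Linear(hidden_dim, tgt_vocab)

# First RNN: compress the whole input sentence into its final hidden state.
src_tokens = torch.randint(0, src_vocab, (1, 2))       # e.g. "Il pleut" as 2 token ids
_, embedding_vector = encoder(src_embed(src_tokens))   # shape (1, 1, hidden_dim)

# Second RNN: start from that vector and emit words until the termination token.
# Only the decoder's own previous guess is fed back in; nothing new comes from outside.
hidden = embedding_vector
token = torch.tensor([[START]])
outputs = []
for _ in range(50):                                    # safety cap in case END is never produced
    out, hidden = decoder(tgt_embed(token), hidden)
    token = to_tgt_vocab(out).argmax(dim=-1)           # greedy choice of the next word
    if token.item() == END:
        break
    outputs.append(token.item())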

Putting these together, we pump in a sequence of one length, and the output can be a different length, e.g.:

i1 i2 i3 => embedding-vector => o1 o2 o3 o4 o5 termination-token
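Note that in the sketch above, nothing ties the number of generated words to the number of input words: the input length only determines how many steps the first RNN runs, while the output length is determined by when the second RNN happens to emit the termination token. That is the sense in which sequence to sequence sidesteps the need for any explicit alignment between input and output timesteps.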