Solved – What loss function should I use to score a seq2seq RNN model

deep-learning, loss-functions, recurrent-neural-network

I'm working through the Cho et al. (2014) paper, which introduced the encoder-decoder architecture for seq2seq modeling.

In the paper, they seem to use the probability of the output given the input (or its negative log-likelihood) as the loss function for an input $x$ of length $M$ and output $y$ of length $N$:

$P(y_1, \ldots, y_N \mid x_1, \ldots, x_M) = P(y_1 \mid x_1, \ldots, x_M)\, P(y_2 \mid y_1, x_1, \ldots, x_M) \cdots P(y_N \mid y_1, \ldots, y_{N-1}, x_1, \ldots, x_M)$
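For concreteness, here is my understanding of how that negative log-likelihood is computed for one sequence (a minimal PyTorch-style sketch; the assumption that the decoder exposes per-step logits over the vocabulary is mine, not from the paper):

```python
import torch
import torch.nn.functional as F

def sequence_nll(step_logits: torch.Tensor, target_ids: torch.Tensor) -> torch.Tensor:
    """Negative log-likelihood of one target sequence.

    step_logits: shape (N, vocab_size), the decoder's output distribution
                 at each of the N target positions (conditioned on x and
                 the previous known tokens).
    target_ids:  shape (N,), the known output tokens y_1 .. y_N.
    """
    # cross_entropy with reduction="sum" gives -sum_i log P(y_i | y_<i, x),
    # i.e. -log P(y_1, ..., y_N | x).
    return F.cross_entropy(step_logits, target_ids, reduction="sum")
```

The sum of per-step cross-entropies corresponds term by term to the factors in the product above.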

However, I think I see several problems with using this as a loss function:

  1. It seems to assume teacher forcing during training (i.e., instead of using the decoder's guess for a position as the input to the next step, it uses the known token).
  2. It wouldn't penalize long sequences. Since the product only runs over positions $1$ to $N$ of the target output, anything the decoder generated after the first $N$ tokens would not factor into the loss.
  3. If the model predicts an early End-of-String token, the loss function still demands $N$ steps, which means we are generating outputs based on an untrained "manifold" of the model. That seems sloppy.

Are any of these concerns valid? If so, has there been any progress toward a more advanced loss function?

Best Answer

It seems to assume teacher forcing during training (i.e., instead of using the decoder's guess for a position as the input to the next step, it uses the known token).

The term "teacher forcing" bothers me a bit, because it kind of misses the idea: There's nothing wrong or weird with feeding the next known token to the RNN model -- it's literally the only way to compute $\log P(y_1, \ldots, y_N)$. If you define a distribution over sequences autoregressively as $P(y) = \prod_i P(y_i | y_{<i})$ as is commonly done, where each conditional term is modeled with an RNN, then "teacher forcing" is the one true procedure which correctly maximizes log likelihood. (I omit writing the conditioning sequence $x$ above because it doesn't change anything.)

Given the ubiquity of MLE and the lack of good alternatives, I don't think assuming "teacher forcing" is objectionable.

Nonetheless there are admittedly issues with it -- namely, the model is trained to assign high likelihood to the training data, but samples from the model are not necessarily likely under the true data distribution (which results in "low quality" samples). You may be interested in "Professor Forcing" (Lamb et al., 2016), which mitigates this via an adversarial training procedure without giving up MLE.

It wouldn't penalize long sequences. Since the product only runs over positions 1 to N of the target output, anything the decoder generated after the first N tokens would not factor into the loss.

and

If the model predicts an early End-of-String token, the loss function still demands N steps, which means we are generating outputs based on an untrained "manifold" of the model. That seems sloppy.

Neither of these is a problem that occurs during training. Instead of thinking of an autoregressive sequence model as a procedure that outputs a prediction, think of it as a way to compute how probable a given sequence is. The model never predicts anything -- you can sample a sequence or a token from its distribution, or you can ask it what the most likely next token is -- but these are crucially different from a prediction (and you don't sample during training either).
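To illustrate the distinction, here is a rough sketch of the two uses side by side, assuming a hypothetical `model.step(prev_token, state)` interface that returns a log-probability vector over the vocabulary and the next state (not any specific framework's API):

```python
import torch

def score(model, state, target_ids, bos_id):
    """Training-time use: how probable is this *given* sequence under the model?"""
    prev, total = bos_id, 0.0
    for y in target_ids:
        log_probs, state = model.step(prev, state)  # distribution over the next token
        total += log_probs[y].item()
        prev = y                                    # the known token; nothing is sampled
    return total                                    # log P(y_1, ..., y_N | x)

def sample(model, state, bos_id, eos_id, max_len=50):
    """Generation-time use: draw a sequence; the model decides when to stop."""
    prev, out = bos_id, []
    for _ in range(max_len):
        log_probs, state = model.step(prev, state)
        prev = torch.multinomial(log_probs.exp(), 1).item()  # sample the next token
        if prev == eos_id:
            break
        out.append(prev)
    return out
```

Training only ever does something like `score`; sequence length and End-of-String handling only become a question in `sample`, i.e. at generation time.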

If so, has there been any progress toward a more advanced loss function?

There may well be objectives specifically designed on a case-by-case basis for different modeling tasks. However, I would say MLE is still dominant -- the recent GPT-2 model, which achieved state-of-the-art performance on a broad spectrum of natural language modeling and understanding tasks, was trained with it.
