Solved – Calculating test-time perplexity for seq2seq (RNN) language models

deep learning, language models, recurrent neural network

To compute the perplexity of a language model (LM) on a test sentence $s=w_1,\dots,w_n$ we need to compute all next-word predictions $P(w_1), P(w_2|w_1),\dots,P(w_n|w_1,\dots,w_{n-1})$.

My question is: How are these terms computed for a seq2seq language model (say using LSTMs)?

At training time, we encode $s$ as a vector $\mathbf x_s$ (usually the last hidden state $\mathbf h_n$), initialize the decoder with $\mathbf x_s$, provide it with the ground truth sequence $w_1,\dots,w_n$ as input, and obtain predictions $\hat w_1,\dots,\hat w_n$ that go into a loss function, e.g. cross-entropy.
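For concreteness, here is a minimal PyTorch-style sketch of this training step (teacher forcing with a cross-entropy loss). The layer names and sizes are just placeholders, and inputs/targets are shifted by one position (decoder input $w_1,\dots,w_{n-1}$, targets $w_2,\dots,w_n$), which is one common convention and differs slightly from the indexing above:

```python
import torch
import torch.nn as nn

vocab_size, emb_dim, hidden_dim = 10_000, 128, 256   # illustrative sizes

embed    = nn.Embedding(vocab_size, emb_dim)
encoder  = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
decoder  = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
to_vocab = nn.Linear(hidden_dim, vocab_size)

def training_loss(tokens):
    """tokens: (batch, n) tensor of ground-truth word ids w_1..w_n."""
    emb = embed(tokens)                        # (batch, n, emb_dim)
    _, (h_n, c_n) = encoder(emb)               # x_s: last encoder state
    # Teacher forcing: the decoder sees the ground truth w_1..w_{n-1}
    # and is trained to predict the next word w_2..w_n.
    out, _ = decoder(emb[:, :-1, :], (h_n, c_n))
    logits = to_vocab(out)                     # (batch, n-1, vocab_size)
    return nn.functional.cross_entropy(
        logits.reshape(-1, vocab_size),
        tokens[:, 1:].reshape(-1))
```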

At test time, I assume:

  1. Initializing the decoder with $\mathbf x_s$ would be cheating as it encodes information about upcoming words.
  2. We cannot take an encoder hidden state $\mathbf h_i$ and use it as the decoder hidden state $\hat{\mathbf h}_{i}$ at some time $i>1$ since encoder and decoder weight matrices have learned different dynamics.

So my question boils down to: How can we get a good but fair decoder hidden state $\hat{\mathbf{h}}_i$ to predict word $\hat w_i$? How can it encode as much as possible about the history? One option that I see is:

  1. Run the encoder to obtain $\mathbf h_1,\dots,\mathbf h_n$.
  2. To obtain $\hat{\mathbf h}_{i-1}$, feed $\mathbf h_{i-1}$ as the initial hidden state to the decoder and run it until time step $i-1$, always providing the ground truth as input.
  3. Compute $\hat{\mathbf h}_i$ using $\hat{\mathbf h}_{i-1}$ and $w_{i-1}$.
  4. Obtain $P(w_i|w_1,\dots,w_{i-1})$ by feeding $\hat{\mathbf h}_i$ into the softmax.

Can anybody confirm that this is the way to do it?

Alternatively, one can of course feed an uninformative, constant/random initial hidden state to the decoder and just provide the ground truth sequence. However, this feels much weaker than the approach described above.

PS: I know that in machine translation the situation is very different as knowing the encoder state when decoding in another language is not cheating. However, there one would not be allowed to get ground truth input at test time.

Best Answer

How can we get a good but fair decoder hidden state $\hat h_i$ to predict word $\hat w_i$? How can it encode as much as possible about the history?

Once the language model has been trained, $\hat h_i$ can be obtained from the previous words $w_0, w_1, ..., w_{i-1}$, where $w_0$ is the start-of-sentence token. The history is encoded in the hidden (and cell) state, which is updated by all the previous words. The predicted $\hat w_i$ is never fed back into the model, so it does not affect the result: the input at each time step is the ground-truth word.

For example, when calculating the probability of $w_i$ (for $i \ge 1$), the inputs are $h_{i-1}$ (which depends only on $w_0$ to $w_{i-2}$) and $w_{i-1}$. This yields a probability distribution over the vocabulary for position $i$; we simply read off the entry for the ground-truth word, i.e. $p(w_i|w_0, ..., w_{i-1})$, rather than taking the word with maximal probability $\hat w_i$, and we feed the ground-truth $w_i$ as the input at the next time step.
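As a sketch of this evaluation loop, assuming a single-layer LSTM language model with an `<sos>` token prepended (the names `embed`, `lstm`, `to_vocab`, `sos_id` and the sizes are illustrative placeholders, not taken from the answer):

```python
import math
import torch
import torch.nn as nn

vocab_size, emb_dim, hidden_dim, sos_id = 10_000, 128, 256, 1   # illustrative

embed    = nn.Embedding(vocab_size, emb_dim)
lstm     = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
to_vocab = nn.Linear(hidden_dim, vocab_size)

@torch.no_grad()
def token_log_probs(tokens):
    """tokens: (1, n) tensor of ground-truth word ids w_1..w_n.
    Returns log2 p(w_i | w_0, ..., w_{i-1}) for i = 1..n."""
    # Inputs are w_0..w_{n-1} (always the ground truth), targets are w_1..w_n.
    sos = torch.full((1, 1), sos_id, dtype=torch.long)
    inputs = torch.cat([sos, tokens[:, :-1]], dim=1)
    out, _ = lstm(embed(inputs))                      # states after w_0..w_{n-1}
    log_probs = torch.log_softmax(to_vocab(out), dim=-1)
    # Index the distribution at each ground-truth word instead of taking argmax.
    picked = log_probs.gather(-1, tokens.unsqueeze(-1)).squeeze(-1)
    return picked / math.log(2.0)                     # convert nats to log base 2
```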

The probability of each word is read from the softmax output at the index of the ground-truth word, and the perplexity is then calculated as:

\begin{align} \text{Perplexity}(w_1, ..., w_n) &= \sqrt[n]{\frac{1}{\prod_{i=1}^{n} p(w_i|w_0, ..., w_{i-1})}} \\ &= 2^{\log_{2}{\left[\prod_{i=1}^{n} p(w_i|w_0, ..., w_{i-1})\right]}^{-1/n}} \\ &= 2^{-\frac{1}{n}\log_{2}{\left[\prod_{i=1}^{n} p(w_i|w_0, ..., w_{i-1})\right]}} \\ &= 2^{-\frac{1}{n}\sum_{i=1}^{n}\log_{2}{p(w_i|w_0, ..., w_{i-1})}} \end{align}
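In code, the last line of the derivation is just the average of the per-word log-probabilities, negated and exponentiated. A minimal sketch, taking a plain list of probabilities $p(w_i|w_0, ..., w_{i-1})$:

```python
import math

def perplexity(probs):
    """probs: list of p(w_i | w_0, ..., w_{i-1}), one entry per word."""
    log2_sum = sum(math.log2(p) for p in probs)
    return 2.0 ** (-log2_sum / len(probs))

# Example: perplexity([0.25, 0.5, 0.125]) == 2 ** (-(-2 - 1 - 3) / 3) == 4.0
```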

For the detailed derivation please refer to this article: N-gram Language Models.
