Machine Learning – Is Teacher Forcing More Accurate Than Using Actual Model Output or Just Faster?

Tags: machine-learning, neural-networks, recurrent-neural-network

In recurrent neural networks with connections from the output units back to the hidden units, we can use teacher forcing to speed up training by parallelising learning across time steps. In teacher forcing, we use the ground-truth output at the current time step (available in the training data) to compute the system state at the next time step. This is obviously faster than using the actual model output during training. But is it also more accurate?
Maybe, if we are not worried about training time, it is better to use the actual model output instead of the ground-truth outputs, since, when the model is deployed, its own output is ultimately what produces the system state at the next time step.
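To make the trade-off concrete, here is a minimal NumPy sketch of the setting in the question (an RNN whose only recurrence is output-to-hidden). The weights and sequence values are arbitrary illustrative choices, not from any real model: teacher forcing conditions every step on the ground-truth previous output, so all steps can be computed in one vectorised (parallel) call, while the free-running (deployed) mode must feed back its own predictions sequentially.

```python
import numpy as np

# Toy RNN whose only recurrence is output -> hidden, as in the question.
# W_in, W_out and the target sequence are arbitrary illustrative values.
W_in, W_out = 1.2, 0.9

def hidden(y_prev):
    # Hidden state depends only on the previous output.
    return np.tanh(W_in * y_prev)

def output(h):
    return W_out * h

targets = np.array([0.1, 0.3, 0.5, 0.7])  # ground-truth outputs y_1..y_4
y0 = 0.2                                  # initial output value

# Teacher forcing: condition each step on the ground-truth previous output.
# Because the inputs are known in advance, every step is independent and
# the whole sequence is computed in one vectorised (parallel) call.
prev_true = np.concatenate(([y0], targets[:-1]))
tf_preds = output(hidden(prev_true))

# Free running (closed loop, as at deployment): condition each step on the
# model's own previous prediction -- inherently sequential.
fr_preds, y_prev = [], y0
for _ in targets:
    y_prev = output(hidden(y_prev))
    fr_preds.append(y_prev)
fr_preds = np.array(fr_preds)

print(tf_preds)   # identical to fr_preds at step 0, then they diverge
print(fr_preds)
```

After the first step the two trajectories diverge, because free running conditions on the model's (imperfect) predictions rather than the ground truth; this mismatch between training-time and deployment-time inputs is exactly what the answer below discusses.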

Best Answer

I'll begin by saying I'm no expert but was thinking about this same question. A little googling led me to this page:

https://machinelearningmastery.com/teacher-forcing-for-recurrent-neural-networks/

and, in turn, this paper:

https://arxiv.org/pdf/1610.09038.pdf

which has a paragraph addressing this to some degree in the introduction:

Unfortunately, this procedure [teacher forcing] can result in problems in generation as small prediction errors compound in the conditioning context. This can lead to poor prediction performance as the RNN's conditioning context (the sequence of previously generated samples) diverges from sequences seen during training.

In addition, from the Deep Learning book (http://www.deeplearningbook.org/contents/rnn.html), p. 378:

The disadvantage of strict teacher forcing arises if the network is going to be later used in a closed-loop mode, with the network outputs (or samples from the output distribution) fed back as input. In this case, the fed-back inputs that the network sees during training could be quite different from the kind of inputs that it will see at test time.

I would imagine (again, not an expert) that it is fairly problem-dependent, but that the main gains of teacher forcing are computational speed and a simpler loss landscape: since the whole sequence contributes to the gradient of the parameters, backpropagation through time over long sequences may make it difficult for the optimiser to converge, even with a lot of computational time.

Hope that helps!