Evaluating on training data gives different loss

Tags: deep learning, loss-functions, machine learning, recurrent neural network, tensorflow

I'm training a simple neural network using Keras with the TensorFlow backend. As I don't know much about what I'm doing, I'm exploring a bit to get the hang of it and figure out what is going on. My model is a GRU-based RNN. I'm training it on multivariate time-series data (4 input time series), and the goal is to predict future values of one of the inputs.

After the 20th epoch on the training set, Keras reported this:

Epoch 20/20
394311/394311 [==============================] - 50s - loss: 18.9257

Ok, so my loss is right around 19. I wanted to see how the model would generalize to my test dataset:

model.evaluate(x_test, y_test)  # 11.979977535933825

This felt too good to be true (maybe?). I'm actually not quite sure how to interpret a lower loss on the test data than on the training data … To try to figure that out, I decided to look at the loss computed on the training set:

model.evaluate(x_train, y_train)  # 16.165901696732373

This is clearly not the value that was reported at the end of my last training epoch. I'm evidently missing something about how the loss is calculated and how to interpret it… Any insight into why the loss is so much lower on the test data than on the training data would be great. Insight into why the loss reported at the end of the last epoch differs so much from the loss I get when I evaluate the model on the training data would also be welcome.

Best Answer

Well ... It looks like this is actually in the Keras FAQ:

A Keras model has two modes: training and testing. Regularization mechanisms, such as Dropout and L1/L2 weight regularization, are turned off at testing time.
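To make the "two modes" point concrete, here is a minimal pure-Python sketch of inverted dropout. It is only an illustration of the idea, not Keras's actual implementation; the function name `dropout` and its `training` flag are hypothetical names chosen for this example.

```python
import random


def dropout(values, rate, training, rng):
    """Inverted dropout on a list of floats (illustrative, not Keras's code)."""
    if not training:
        # Inference mode: the layer passes its input through unchanged.
        return list(values)
    keep = 1.0 - rate
    # Training mode: zero each unit with probability `rate` and rescale
    # the survivors so the expected activation stays the same.
    return [v / keep if rng.random() < keep else 0.0 for v in values]


rng = random.Random(0)
activations = [1.0] * 10
print("training :", dropout(activations, 0.5, True, rng))
print("inference:", dropout(activations, 0.5, False, rng))
```

The same weights therefore produce noisier outputs, and typically a higher loss, in training mode than in the mode used for evaluation.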

Besides, the training loss is the average of the losses over each batch of training data. Because your model is changing over time, the loss over the first batches of an epoch is generally higher than over the last batches. On the other hand, the testing loss for an epoch is computed using the model as it is at the end of the epoch, resulting in a lower loss.
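The averaging effect can be reproduced without Keras. The toy below is a sketch under stated assumptions: a one-parameter linear model trained by mini-batch SGD on synthetic data, with hypothetical helper names (`mse`, `run_epoch`, `make_data`). It records each batch loss before that batch's weight update, averages them over the epoch, and then re-evaluates the whole training set with the final weight.

```python
import random


def mse(w, xs, ys):
    """Mean squared error of the model y = w * x."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def run_epoch(w, xs, ys, batch_size=10, lr=0.1):
    """One epoch of mini-batch SGD; returns (new w, mean of the batch losses)."""
    batch_losses = []
    for i in range(0, len(xs), batch_size):
        bx, by = xs[i:i + batch_size], ys[i:i + batch_size]
        # Loss is measured with the weights as they are before this update.
        batch_losses.append(mse(w, bx, by))
        grad = sum(2 * (w * x - y) * x for x, y in zip(bx, by)) / len(bx)
        w -= lr * grad
    return w, sum(batch_losses) / len(batch_losses)


def make_data(n=1000, seed=0):
    """Synthetic data: y = 3x plus a little Gaussian noise (an assumption)."""
    rng = random.Random(seed)
    xs = [rng.uniform(-1, 1) for _ in range(n)]
    ys = [3.0 * x + rng.gauss(0, 0.1) for x in xs]
    return xs, ys


xs, ys = make_data()
w, reported = run_epoch(0.0, xs, ys)
evaluated = mse(w, xs, ys)
print(f"loss averaged over the epoch's batches: {reported:.4f}")
print(f"loss re-evaluated with final weights:   {evaluated:.4f}")
```

The first number includes the poor early batches, so it comes out higher than the second, even though both are computed on the same training data. This mirrors the gap between the progress-bar loss and `model.evaluate(x_train, y_train)` in the question.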