Neural Network Forecasting – Can It Predict Higher Values Than Seen During Training?

forecasting · neural networks · normalization · regression · time series

I'm experimenting with time series forecasting using a simple transformer network (following this paper). The problem I'm facing seems to lie in the dataset: splitting it into train/validation/test sets with respectively 60%, 10% and 30% of the total samples results in the training set having maximum values far lower than those found in the test set. The training process itself behaves normally, with the training and validation losses slowly decreasing to a minimum (I'm trying both MSE and mean absolute error).

However, the highest values in the training set are around 10^3, while in the test set it is common to see values on the order of 10^4 (even approaching 10^5). Unsurprisingly, test performance is unsatisfactory.

Given that difference in magnitude between my splits, I stopped using min-max scaling and attempted to normalize every batch separately. Unfortunately, I cannot use this technique during inference, since de-normalizing the predicted output is unfeasible without knowing the original values a priori.
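To make the min-max problem concrete, here is a minimal NumPy sketch (hypothetical data, with magnitudes chosen to mimic the splits described above) showing that a scaler fit on the training split maps the larger test values far outside the [0, 1] range the network was trained on:

```python
import numpy as np

rng = np.random.default_rng(0)
train = rng.uniform(0, 1e3, size=1000)   # training magnitudes ~10^3
test = rng.uniform(0, 1e4, size=1000)    # test magnitudes ~10^4

# Min-max statistics must come from the training data only,
# otherwise they leak information from the test set.
lo, hi = train.min(), train.max()
train_scaled = (train - lo) / (hi - lo)  # lies in [0, 1] by construction
test_scaled = (test - lo) / (hi - lo)    # spills roughly an order of magnitude above 1

print(train_scaled.max())  # <= 1.0
print(test_scaled.max())   # ~10: far outside the range seen in training
```

Any bounded output activation (sigmoid, tanh) then makes those out-of-range targets literally unreachable, which is one concrete way the split mismatch hurts.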

This paper from Google DeepMind seems interesting: it suggests scaling the weights of the output layer before predicting the values during training. However, that is a way to help the training process when the network is fed inputs several orders of magnitude higher than usual. That's not quite my situation, since I have no high values in my training split, only in the test set.

Currently I am out of ideas and I'm wondering whether this is a well-posed problem at all: can a neural network predict a value higher than any value seen during training? Is there any kind of normalization that can help in this situation?

This post gives a negative answer to my first question, but it is about a random forest; I hope a deep neural network would be able to overcome this scaling issue in the data.
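For what it's worth, whether a model can exceed its training targets depends on the output head, not on depth. A toy NumPy sketch (a closed-form linear fit standing in for a network's linear output layer; the numbers are illustrative, not from my data) shows that a linear head can extrapolate above any training target when the input itself is out of range, unlike the tree-based model in the linked post:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=(200, 1))
y = 1000.0 * x[:, 0]                     # training targets stay below ~10^3

# Least-squares fit of a linear map, as a stand-in for a linear output layer.
X = np.hstack([x, np.ones_like(x)])
w, *_ = np.linalg.lstsq(X, y, rcond=None)

x_new = np.array([[10.0, 1.0]])          # input far beyond the training range
pred = x_new @ w
print(pred)  # ~10^4: above every training target, because the head is linear
```

A sigmoid or min-max-bounded head, by contrast, is capped at the training range by construction.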

Best Answer

We actually don't know enough to be helpful. A few pointers:

Your data may simply have a lot of noise, possibly skewed. Remember that your network (just like any model) is trying to disentangle the signal from the noise, and will predict only the signal (in general, predictions will vary less than observations, see here). Try generating IID lognormally distributed data with high log-variance: you will get very high peaks, but if you feed this to your NN, the predictions will be far lower (what the optimal prediction is depends on your evaluation function).
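The experiment above can be run without any network at all, since the loss-optimal constant predictions are known in closed form (the mean for MSE, the median for MAE); a small NumPy sketch with assumed lognormal parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.lognormal(mean=0.0, sigma=2.0, size=100_000)  # IID, high log-variance

optimal_mse_pred = y.mean()    # what an MSE-trained model converges toward
optimal_mae_pred = np.median(y)  # what an MAE-trained model converges toward

print(y.max())           # peaks in the thousands
print(optimal_mse_pred)  # ~7
print(optimal_mae_pred)  # ~1
```

The observed maximum sits orders of magnitude above both optimal predictions, so "predictions far below the peaks" is exactly what a correctly trained model should produce here.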

Alternatively, your high values may be predictable after all. Then you need to figure out which predictors are useful and feed them into your NN. How to know that your machine learning problem is hopeless?