Neural Network Forecasting – Can It Predict Higher Values Than Seen During Training?

Tags: forecasting, neural networks, normalization, regression, time series

I'm experimenting with time series forecasting using a simple transformer network (following this paper). The problem I'm facing concerns the dataset, or so I suppose: splitting it into train/validation/test sets with 60%, 10% and 30% of the total samples respectively leaves the training set with maximum values far lower than those found in the test set. The training process behaves pretty much normally, with the training loss and validation loss slowly decreasing to a minimum (I'm trying both MSE loss and Mean Absolute Difference).

However, the highest values in the training set are around 10^3, while in the test set it is common to see values on the order of 10^4 (some even approaching 10^5). Unsurprisingly, test performance is unsatisfactory.

Given that difference in magnitude between my splits, I stopped using min-max scaling and tried normalizing every batch separately. Unfortunately, I cannot use this technique during inference, since de-normalizing the predicted output is infeasible without knowing the original values a priori.

This paper from Google's DeepMind seems interesting: it suggests scaling the weights of the output layer before predicting the values during training. However, that is a way to help the training process when the network is fed inputs several orders of magnitude larger than usual. That's not really my situation, since the high values appear only in my test set and not in my training split.

Currently I am out of ideas and I'm wondering if this is a well-posed problem or not: can a neural network predict a value higher than any value seen during training? Is there any kind of normalization that can be helpful in this situation?

This post gives a negative answer to my first question, but it is about a random forest algorithm; I hope a deep neural network would be able to overcome the scaling issue with the data.
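To make the random forest point concrete, here is a minimal sketch on made-up data (the toy relation y = 2x and all variable names are assumptions for illustration, not taken from the linked post). A forest averages training targets stored in its leaves, so its prediction cannot exceed the largest training target, while a plain linear model extrapolates. This does not by itself show that a deep network will extrapolate well on real data.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

# Toy data (assumed for illustration): y = 2x on x in [0, 10], so max(y) = 20
x_train = np.linspace(0.0, 10.0, 200).reshape(-1, 1)
y_train = 2.0 * x_train.ravel()

forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(x_train, y_train)
linear = LinearRegression().fit(x_train, y_train)

# Query a point well outside the training range
x_new = np.array([[20.0]])
forest_pred = forest.predict(x_new)[0]
linear_pred = linear.predict(x_new)[0]

print(f"largest training target : {y_train.max():.1f}")
print(f"random forest prediction: {forest_pred:.1f}")
print(f"linear model prediction : {linear_pred:.1f}")
```

The forest stays at or below the largest training target, whereas the linear fit continues the trend beyond it.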

Best Answer

We actually don't know enough to be helpful. A few pointers:

Your data may simply have a lot of noise, possibly skewed. Remember that your network (just like any model) is trying to disentangle the signal from the noise, and will predict only the signal (in general, predictions will vary less than observations, see here). Try generating IID lognormally distributed data with high log-variance: you will get very high peaks, but if you feed this to your NN, the predictions will be far lower (what the optimal predictions are depends on your evaluation function).
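The lognormal experiment suggested above can be sketched without any network at all. With IID data no predictor carries information, so the best any model can do is output a constant: the mean under MSE, the median under MAE. The seed, sample size and `sigma=2.0` below are assumed values picked only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen arbitrarily

# IID lognormal samples; sigma=2.0 is an assumed "high" log-variance
y = rng.lognormal(mean=0.0, sigma=2.0, size=100_000)

# With no informative predictors, the optimal prediction is a constant:
# the mean minimizes squared error, the median minimizes absolute error.
best_mse_prediction = y.mean()
best_mae_prediction = np.median(y)

print(f"largest observation : {y.max():.1f}")
print(f"MSE-optimal constant: {best_mse_prediction:.2f}")
print(f"MAE-optimal constant: {best_mae_prediction:.2f}")
```

Both constants sit far below the peaks in the data, which is the sense in which predictions vary less than observations.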

Alternatively, your high values may be predictable after all. Then you need to figure out which predictors are useful and feed these into your NN. See also: How to know that your machine learning problem is hopeless?

Related Solutions

Solved – L2-norms of gradients increasing during training of deep neural network

I think I understand the problem with the gradient norm. The negative gradient points in the direction of a local minimum, but it doesn't say how far away that minimum is. That is why you configure a step size. When your weights get close to the minimum, a constant step can be larger than necessary, so an update sometimes overshoots, and in the next epoch the network tries to correct for it. The momentum algorithm uses a modified approach: after each iteration it increases the weight update if the sign of the gradient stays the same (through an additional term added to $\Delta w$). In vector terms, this addition can increase the magnitude of the update vector and change its direction as well, so you can miss the ideal step by even more. To correct this, the network sometimes needs a bigger vector, because the minimum is a little further away than in the previous epoch.
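As a rough illustration of the overshoot described above, here is a minimal sketch of the textbook momentum update on a one-dimensional quadratic. It is not neupy's implementation; the function name, the objective f(w) = 0.5 * w**2 and the step and momentum values are all assumptions chosen to make the effect visible.

```python
import numpy as np

def momentum_descent(grad_fn, w0, step, momentum, n_iter):
    """Textbook momentum: v <- momentum * v - step * grad; w <- w + v."""
    w = float(w0)
    velocity = 0.0
    grad_norms = []
    for _ in range(n_iter):
        grad = grad_fn(w)
        grad_norms.append(abs(grad))  # gradient norm before this update
        velocity = momentum * velocity - step * grad
        w += velocity
    return w, np.array(grad_norms)

# Toy objective f(w) = 0.5 * w**2, whose gradient is simply w
w_final, grad_norms = momentum_descent(
    lambda w: w, w0=1.0, step=0.5, momentum=0.9, n_iter=200
)

# Count the iterations where the gradient norm grew instead of shrinking
n_increases = int(np.sum(np.diff(grad_norms) > 0))
print(f"iterations with a growing gradient norm: {n_increases}")
print(f"final weight: {w_final:.2e}")
```

The weight still converges towards the minimum at 0, but the gradient norm rises on some iterations because the accumulated velocity carries the weight past the minimum.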

To test this theory I built a small experiment. First, I reproduced the same behaviour with a simpler network architecture and fewer iterations.

import numpy as np
from numpy.linalg import norm
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression
from sklearn import preprocessing
from sklearn.pipeline import Pipeline
from neupy import algorithms

plt.style.use('ggplot')

grad_norm = []
def train_epoch_end_signal(network):
    global grad_norm
    # Get gradient for the last layer
    grad_norm.append(norm(network.gradients[-1]))

data, target = make_regression(n_samples=10000, n_features=50, n_targets=1)

target_scaler = preprocessing.MinMaxScaler()
# make_regression returns a 1-D target; MinMaxScaler expects a 2-D array
target = target_scaler.fit_transform(target.reshape(-1, 1))

mnet = Pipeline([
    ('scaler', preprocessing.MinMaxScaler()),
    ('momentum', algorithms.Momentum(
        (50, 30, 1),
        step=1e-10,
        show_epoch=1,
        shuffle_data=True,
        verbose=False,
        train_epoch_end_signal=train_epoch_end_signal,
    )),
])

mnet.fit(data, target, momentum__epochs=100)

After training I plotted the gradient norms. Below you can see behaviour similar to yours.

plt.figure(figsize=(12, 8))
plt.plot(grad_norm)
plt.title("Momentum algorithm final layer gradient 2-Norm")
plt.ylabel("Gradient 2-Norm")
plt.xlabel("Epoch")
plt.show()

Also, if you look closer at the training results after each epoch, you will find that the errors vary as well.

plt.figure(figsize=(12, 8))
network = mnet.steps[-1][1]
network.plot_errors()
plt.show()

Next, I create another network with almost the same settings, but this time I use the golden search algorithm to select the step on each epoch.

grad_norm = []
def train_epoch_end_signal(network):
    global grad_norm
    # Get gradient for the last layer
    grad_norm.append(norm(network.gradients[-1]))
    if network.epoch % 20 == 0:
        print("Epoch #{}: step = {}".format(network.epoch, network.step))

mnet = Pipeline([
    ('scaler', preprocessing.MinMaxScaler()),
    ('momentum', algorithms.Momentum(
        (50, 30, 1),
        step=1e-10,
        show_epoch=1,
        shuffle_data=True,
        verbose=False,
        train_epoch_end_signal=train_epoch_end_signal,
        optimizations=[algorithms.LinearSearch]
    )),
])

mnet.fit(data, target, momentum__epochs=100)

The output below shows how the step varies every 20 epochs.

Epoch #0: step = 0.5278640466583575
Epoch #20: step = 1.103484809236065e-13
Epoch #40: step = 0.01315561773591515
Epoch #60: step = 0.018180616551587894
Epoch #80: step = 0.00547810271094794

And if you look closer at the results after that training, you will find that the variation in the 2-norm is much smaller.

plt.figure(figsize=(12, 8))
plt.plot(grad_norm)
plt.title("Momentum algorithm final layer gradient 2-Norm")
plt.ylabel("Gradient 2-Norm")
plt.xlabel("Epoch")
plt.show()

This optimization reduces the variation of the errors as well.

plt.figure(figsize=(12, 8))
network = mnet.steps[-1][1]
network.plot_errors()
plt.show()

As you can see, the main problem with the gradient is the step length.

It's important to note that even with high variation your network can still improve its prediction accuracy after each iteration.

Solved – Simple Neural Network for time series prediction

I'm going to take a stab at this and say it could be a problem with normalization boundaries.

I'm not familiar with the AForge.net NN library, but at some point your data should be normalized to fit between 0 and 1.

At some point, the normalization process detected 1 as the minimum value and 20 as the maximum, and from those bounds every value is converted to fit between 0 and 1. For example (dividing by the maximum for simplicity),

1  -> 1/20 = 0.05
...
19 -> 19/20 = 0.95
20 -> 20/20 = 1

When you exceed these bounds later, your normalization no longer produces values between 0 and 1, and this really wreaks havoc on the network.

25 -> 25/20 = 1.25

What you could do is make sure your normalization takes the true max and min bounds of your data into account.
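The arithmetic above can be sketched in a few lines of Python. This is only an illustration of the idea, not what AForge.net does internally; the helper `scale_by_max` and the "true" bound of 25 are hypothetical.

```python
import numpy as np

def scale_by_max(values, max_bound):
    """Divide by an upper bound, mirroring the 1/20, 19/20, 20/20 arithmetic."""
    return np.asarray(values, dtype=float) / max_bound

train_values = np.arange(1, 21)      # 1..20, the range seen when the bounds were fixed
observed_max = train_values.max()    # 20

in_range = scale_by_max([1, 19, 20], observed_max)   # 0.05, 0.95, 1.0
out_of_range = scale_by_max([25], observed_max)      # 1.25, outside [0, 1]

# Hypothetical true bound known from domain knowledge (an assumption here)
true_max = 25
rescaled = scale_by_max([1, 20, 25], true_max)       # 0.04, 0.8, 1.0

print(in_range, out_of_range, rescaled)
```

With the wider bound, the value 25 maps to 1.0 instead of 1.25, at the cost of compressing the values that were seen earlier.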