Solved – Regression model fails at predicting high and low values

neural networks, normalization, regression, svm, time series

A regression model fails at predicting high and low values.

Context:

I'm training a regression model where the input and output data have the same dimension (energy). The input consists of previous values of the energy at given times, and the output is the value of the energy at a future time.

I'm scaling both the input and the output using a standard scaler (I also tried min-max and robust scalers).

I'm using around 1.5k samples for training and predicting the value on 40k samples (I saw that training with more data does not increase accuracy much). I'm using MAE to score the model, and I get around 300.

For models, I'm using a support vector machine and a neural network (with 2 layers, 100 neurons each), and both exhibit the same behaviour.

This is the plot I obtain when I compare the real and predicted values.
As you can see, the model fails at predicting high and low values, and I don't know what to try.

I made sure the input contains samples with low and high values.

Values sorted by value

Thank you

Best Answer

Your data are modeled with error. If you are using a standard regression, then this error is assumed to be symmetrically distributed around the expected value.

The key thing to observe is that regression aims at modeling and predicting the expected value. (Actually, this is not only the case for regression, but for almost all statistical or machine learning models, except for quantile or density prediction methods.)

So, suppose you observe a value that turned out to be very high, i.e., one with a high $y$ coordinate in your plot. This value almost certainly had a high expectation, but, given that the observation is high, it most likely also had a large positive error or noise term. The prediction was the expectation, this is the red dotted line. It is still high, but because of the large positive error, the prediction is systematically lower than the observation. The same holds the other way around for the very low observations.
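To make this argument concrete, here is a sketch under simplifying assumptions that are not part of the question: write each actual as $y = \mu + \varepsilon$, where $\mu$ is the true expectation given the inputs (the best prediction any model could make), and assume $\mu \sim N(m, \sigma_\mu^2)$ and $\varepsilon \sim N(0, \sigma_\varepsilon^2)$ are independent. The symbols $m$, $\sigma_\mu$ and $\sigma_\varepsilon$ are introduced here for illustration only. Since $\operatorname{Cov}(\mu, y) = \sigma_\mu^2$ and $\operatorname{Var}(y) = \sigma_\mu^2 + \sigma_\varepsilon^2$, the usual formula for conditional expectations of jointly normal variables gives

$$
E[\mu \mid y] = m + \frac{\sigma_\mu^2}{\sigma_\mu^2 + \sigma_\varepsilon^2}\,(y - m),
\qquad
E[y - \mu \mid y] = \frac{\sigma_\varepsilon^2}{\sigma_\mu^2 + \sigma_\varepsilon^2}\,(y - m).
$$

Under these assumptions, the expected gap between actual and prediction grows linearly with the distance of the actual from the overall mean $m$: it is positive for high actuals, negative for low actuals, and vanishes only if there is no noise ($\sigma_\varepsilon = 0$).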

Thus, what you are seeing is not a shortcoming of your algorithm. It is simply a consequence of modeling expectations but observing with errors. Any prediction will underforecast very large and overforecast very small actuals. Put differently: conditional on observing a high actual, your prediction will be biased low, and vice versa. There is nothing you can do about this. Of course, there may be influences you have not modeled yet, but even if you do model these, the same problem will remain.
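The effect can be reproduced with a small simulation. The sketch below is not the asker's model: it assumes a hypothetical "perfect" forecaster that always predicts the true expectation, and the mean and standard deviations are made-up values chosen only for illustration. Even so, the predictions come out too low for the highest actuals and too high for the lowest ones.

```python
import random
import statistics

random.seed(0)

# Hypothetical values, chosen only for illustration.
N = 50_000
MEAN, SD_MU, SD_EPS = 1000.0, 300.0, 300.0

# mu is the true expectation given the inputs. A perfect model predicts mu.
predictions = [random.gauss(MEAN, SD_MU) for _ in range(N)]
# The actual is the expectation plus zero-mean noise.
actuals = [mu + random.gauss(0.0, SD_EPS) for mu in predictions]

pairs = list(zip(actuals, predictions))
k = N // 10  # size of one decile


def mean_residual(subset):
    """Mean of (actual - prediction); positive means predictions are too low."""
    return statistics.fmean(a - p for a, p in subset)


# Condition on the ACTUAL: sort by actual and take the extreme deciles.
by_actual = sorted(pairs, key=lambda ap: ap[0])
bias_low = mean_residual(by_actual[:k])    # lowest 10% of actuals
bias_high = mean_residual(by_actual[-k:])  # highest 10% of actuals

# Condition on the PREDICTION instead: no systematic bias is expected here.
by_prediction = sorted(pairs, key=lambda ap: ap[1])
bias_high_pred = mean_residual(by_prediction[-k:])

bias_all = mean_residual(pairs)

print(f"all samples:             {bias_all:8.1f}")
print(f"lowest 10% of actuals:   {bias_low:8.1f}")
print(f"highest 10% of actuals:  {bias_high:8.1f}")
print(f"highest 10% of forecasts:{bias_high_pred:8.1f}")
```

The mean residual is close to zero overall and also within the highest forecasts, so the forecaster is unbiased in the usual sense. The apparent bias only shows up when the samples are selected or sorted by the actual value, which is what a plot of values sorted by value does.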

The concept of regression toward the mean is related.
