Solved – Look-ahead bias induced by standardization of a time series

feature-scaling, machine learning, standardization, time series

Let's say I'm using some machine learning model to predict future values of a time series (e.g. stock price, air temperature, etc.). In my model, I'm using autoregressive features such as the lagged target variable, the rolling mean of the target variable, and some other time series data (e.g. a macroeconomic index price, cloud coverage, etc.).

To standardize the model features, I would usually split my dataset into a training and a validation set and use, for example, the StandardScaler from the scikit-learn library in the following way: I would apply its fit method on the training set and then apply its transform method on both the training and the validation set.
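For concreteness, this is a minimal sketch of the procedure I mean; X_train and X_val here are just placeholder feature matrices standing in for my actual data:

    import numpy as np
    from sklearn.preprocessing import StandardScaler

    # Hypothetical feature matrices for illustration.
    X_train = np.random.randn(100, 3)
    X_val = np.random.randn(20, 3)

    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)  # fit computes mean/std over the whole training set
    X_val_scaled = scaler.transform(X_val)          # validation set is scaled with the training statistics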

However, I have the following concern about using the above procedure on time series data:

The fit method of the StandardScaler computes the mean and standard deviation of the whole training set and applies this information to each data point in the training set. In my understanding, this means that during training the model has insight from the future (i.e. the mean and standard deviation computed over the whole training set) incorporated into its features and is then using this future information to predict the past/present. Is this considered a look-ahead bias? If so, can we conclude that applying any kind of standardization/normalization technique that operates on the whole training set is itself problematic?

I searched the internet for an answer but couldn't find any discussion of this topic. I found a similar question here where @Wayne offered a hint in that direction but didn't elaborate on it. I also think that this question might be related to mine, but it doesn't have an answer.

Best Answer

Yes, this is considered a look-ahead bias. If a normalized value is below the mean (i.e. below 0 after standardization), the model can infer that the series is likely to increase in the future, because the full-sample mean already incorporates those later values. Your predictions going forward will not have this information. You have to find another way to normalize; some suggestions:

  1. Normalize using an expanding window where the value at each time point x(t) is normalized by taking the mean of values from x(0) to x(t-1).

  2. Similar to 1, but using a rolling window of the past k observations.

  3. If your data is multivariate, there may be an opportunity to normalize a feature based on the value of a concurrently occurring feature.

Since you mentioned scikit-learn: in Python you can do 1. with a pandas DataFrame via df.expanding().mean() and 2. via df.rolling(k).mean().
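A minimal sketch of both approaches, assuming df is a pandas DataFrame of numeric features. Note that expanding().mean() and rolling(k).mean() include the current observation, so shift(1) is applied to restrict the statistics to x(0)..x(t-1) as in suggestion 1:

    import numpy as np
    import pandas as pd

    # Hypothetical example data: one feature observed over time.
    df = pd.DataFrame({"price": np.random.randn(100).cumsum()})

    # 1. Expanding-window standardization: statistics use only past values.
    exp_mean = df.expanding().mean().shift(1)
    exp_std = df.expanding().std().shift(1)
    df_expanding = (df - exp_mean) / exp_std

    # 2. Rolling-window standardization over the past k observations.
    k = 20
    roll_mean = df.rolling(k).mean().shift(1)
    roll_std = df.rolling(k).std().shift(1)
    df_rolling = (df - roll_mean) / roll_std

The first few rows will be NaN (not enough history yet), so you would drop or otherwise handle them before training.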