Solved – Panel Regression vs. XGBoost Time Series Features

boosting, cross-section, machine learning, panel data, time series

Panel regression is a technique for combining longitudinal and cross-sectional data in a single linear model. A plain linear model doesn't work well here: bringing time series features into the model can introduce problems such as heteroskedasticity, which panel regression is designed to handle.

My question is whether XGBoost can solve all of these problems without any linearity assumptions and without ensuring the data is homoskedastic.

Say the input data is as follows:

User A -> time series of y_A, demographic features X_A, and a time series of weather temperature at location A, W_A
User B -> time series of y_B, demographic features X_B, and a time series of weather temperature at location B, W_B
... all other users

I can then transform this cross-sectional and longitudinal data into the following training feature matrix:

User_A , y_A(t-1),X_A, W_A(t-1)   LABEL =  y_A(t)
User_A , y_A(t-2),X_A, W_A(t-2)   LABEL =  y_A(t-1)
User_A , y_A(t-3),X_A, W_A(t-3)   LABEL =  y_A(t-2)
... all other time series of user A
User_B , y_B(t-1),X_B, W_B(t-1)   LABEL =  y_B(t)
User_B , y_B(t-2),X_B, W_B(t-2)   LABEL =  y_B(t-1)
User_B , y_B(t-3),X_B, W_B(t-3)   LABEL =  y_B(t-2)
... all other time series of user B
... all other users
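The lag transformation described above can be sketched with pandas on a small hypothetical panel (the column names and toy values here are illustrative, not from the question):

```python
import pandas as pd

# Hypothetical long-format panel: one row per (user, time) observation.
panel = pd.DataFrame({
    "user": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "t":    [1, 2, 3, 4, 1, 2, 3, 4],
    "y":    [10.0, 11.0, 12.5, 13.0, 5.0, 5.5, 6.0, 6.2],
    "W":    [20.1, 21.0, 19.8, 22.3, 15.0, 14.2, 16.1, 15.5],
    "X":    [1, 1, 1, 1, 0, 0, 0, 0],  # static demographic feature
})

# Lag y and W by one step *within each user*, so each row pairs
# (y(t-1), X, W(t-1)) with the label y(t).
panel = panel.sort_values(["user", "t"])
panel["y_lag1"] = panel.groupby("user")["y"].shift(1)
panel["W_lag1"] = panel.groupby("user")["W"].shift(1)

# The first time point of each user has no lag, so it is dropped.
train = panel.dropna(subset=["y_lag1"])
features = train[["y_lag1", "X", "W_lag1"]]
labels = train["y"]
```

The resulting `features`/`labels` pair could then be fed to any tabular regressor, e.g. `xgboost.XGBRegressor`.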

Now that we have transformed the data into a standard regression feature matrix, we can train a user-level XGBoost model and then use it to forecast the future with all of these features. Does this make sense to do? Are there any limitations to this approach? I don't need to worry about stationarity since it's a non-linear model.

Best Answer

When you transform the data as you describe, the problem is that the rows in your data matrix no longer represent independent samples. While users may plausibly be assumed to be independent samples, time points for a given user are very likely to be dependent on previous time points. So this would violate the assumption that samples in your training and test set (as well as new data in production/deployment) are independent and identically distributed, meaning that you couldn't trust your performance estimates.
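One practical consequence of this dependence: a random train/test split over rows would leak future observations of a user into training. A minimal sketch of a temporal split that avoids this (with a hypothetical lagged panel) is:

```python
import pandas as pd

# Hypothetical lagged panel as in the question: one row per (user, t).
data = pd.DataFrame({
    "user": ["A"] * 5 + ["B"] * 5,
    "t": list(range(1, 6)) * 2,
    "y_lag1": range(10),
    "y": range(1, 11),
})

# A random row split would leak future information into training.
# Splitting on time keeps every test row strictly after every training row.
cutoff = 3
train = data[data["t"] <= cutoff]
test = data[data["t"] > cutoff]
```

Evaluating on the temporally later rows gives performance estimates that are closer to what the model will see in deployment.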

Instead, if you want to use machine learning algorithms for panel forecasting, a typical approach to this kind of prediction task is the following:

Regarding your input data (X), treating users as i.i.d. samples, you can

  • bin the time series and treat each bin as a separate column, ignoring any temporal ordering, with equal bins for all users; the bin size could simply be a single observed measurement, or you could aggregate (downsample) into larger bins,
  • or use specialised time series regression/classification algorithms.
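The binning option in the first bullet can be sketched as follows, on a hypothetical long-format panel (toy values for illustration):

```python
import pandas as pd

# Hypothetical long-format series: 6 observations per user.
long = pd.DataFrame({
    "user": ["A"] * 6 + ["B"] * 6,
    "t": list(range(6)) * 2,
    "y": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0,
          2.0, 2.0, 4.0, 4.0, 6.0, 6.0],
})

# Simplest binning: one column per time point (bin size = one measurement).
wide = long.pivot(index="user", columns="t", values="y")

# Coarser bins: aggregate pairs of consecutive time points by their mean.
long["bin"] = long["t"] // 2
coarse = long.groupby(["user", "bin"])["y"].mean().unstack("bin")
```

In both cases each user becomes a single row, so the rows can more plausibly be treated as i.i.d. samples.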

Regarding your output data (y), if you want to forecast multiple time points in the future, you can

  • fit an estimator for each step ahead that you want to forecast, always using the same input data,
  • or fit a single estimator for the first step ahead and, at prediction time, roll the input window forward: append the first-step prediction to the observed inputs to make the second-step prediction, and so on.
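The second (recursive) strategy can be sketched as follows. For simplicity the one-step model here is a least-squares fit of y(t) on y(t-1) via NumPy; in practice it would be the trained XGBoost model, and the toy series is hypothetical:

```python
import numpy as np

# Hypothetical one-step model: y(t) = a * y(t-1) + b, fit by least squares.
y = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
a, b = np.polyfit(y[:-1], y[1:], deg=1)

# Recursive strategy: feed each prediction back in as the next input.
horizon = 3
last = y[-1]
forecasts = []
for _ in range(horizon):
    last = a * last + b
    forecasts.append(last)
# forecasts now holds the predictions for t+1 .. t+horizon
```

The first (direct) strategy would instead fit one such model per horizon step, each trained with the label shifted by that many steps.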

Another typical approach is to extract features from the time series of each user, and use each extracted feature as a separate column.
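A minimal sketch of this feature-extraction approach, using simple summary statistics on a hypothetical panel (libraries such as tsfresh automate a much richer version of this):

```python
import pandas as pd

# Hypothetical panel: one time series per user.
long = pd.DataFrame({
    "user": ["A"] * 4 + ["B"] * 4,
    "y": [1.0, 2.0, 3.0, 4.0, 8.0, 8.0, 8.0, 8.0],
})

# Summarise each user's series into a fixed set of feature columns.
feats = long.groupby("user")["y"].agg(["mean", "std", "min", "max"])
# Each row is now one sample per user; static demographic features
# can simply be joined on afterwards.
```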

All of the approaches above essentially reduce the panel forecasting problem to a time series regression problem. Once your data is in that format, you can append any non-time-dependent user features as extra columns.

Of course there are other options for solving the panel forecasting problem, for example classical forecasting methods like ARIMA adapted to panel data, or deep learning methods that make sequence-to-sequence predictions directly.
