What you recognised is a common issue, and it occasionally manifests in situations where people are perplexed as to "why am I predicting flat values?".
CV.SE has some very enlightening threads on this matter: Why I get the same predict value in Arima model? and Flat Forecast from ARIMA and SARIMA.
Let's take as an example a simple time-series model, like a first-order autoregressive model AR(1), where $y_t = \beta_0 + \beta_1 y_{t-1} + \epsilon_t$ and $\epsilon_t \sim N(0, \sigma_\epsilon^2)$. In this case our estimates $\hat{y}_t$ are simply $\hat{y}_t = \hat{\beta}_0 + \hat{\beta}_1 y_{t-1}$ because $\epsilon_t$ is expected to be zero. Nevertheless, as we extrapolate further, $y_{t-1}$ has to be estimated itself because it is unavailable. This leads to a situation where, after some point, we actually use our own predictions as input data.
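To make that recursion concrete, here is a minimal sketch in plain NumPy (the coefficient values are made up, not fitted from real data) showing how every forecast beyond the first step is built on the previous forecast:

```python
import numpy as np

# Hypothetical fitted AR(1) coefficients: y_t = b0 + b1 * y_{t-1} + eps_t
b0_hat, b1_hat = 0.5, 0.8

def forecast_ar1(last_observation, horizon):
    """Recursive multi-step forecast: each step reuses the previous prediction."""
    preds = []
    y_prev = last_observation
    for _ in range(horizon):
        y_hat = b0_hat + b1_hat * y_prev   # E[eps_t] = 0, so the noise term drops out
        preds.append(y_hat)
        y_prev = y_hat                     # beyond step one, the input is our own forecast
    return np.array(preds)

print(forecast_ar1(last_observation=2.0, horizon=5))
```

Note how, if $|\hat{\beta}_1| < 1$, the chain of predictions converges towards the process mean, which is exactly the "flat forecast" behaviour described above.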
The fact that "we use our own predictions as inputs" is epitomised by seeing that certain time-series algorithms are presented under a filtering approach, the Kalman filter and the Holt-Winters filter being prime and widely used examples.
So, to return to what was originally mentioned: if we want to create our own forecasting routine that does not simply offer one-step-ahead forecasts, we need to be able to populate our "lagged features" with their predicted values. That's why most forecasting routines (e.g. forecast::forecast, smooth::forecast, prophet::make_future_dataframe, bsts::predict, KFKSDS::predict, etc.) have an explicit horizon, periods, n.ahead, etc. argument. We need to know how far we look into the future so we can appropriately update/populate our beliefs to get there!
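For instance, in Python's statsmodels (used here purely as an illustration; the R routines listed above work analogously, and the series below is simulated), the horizon is passed explicitly and the routine handles the internal roll-forward for you:

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA  # illustrative choice of library

# Hypothetical univariate series
rng = np.random.default_rng(0)
y = np.cumsum(rng.normal(size=200))

fit = ARIMA(y, order=(1, 0, 0)).fit()  # AR(1) with a constant
print(fit.forecast(steps=10))          # explicit horizon: 10 steps ahead
```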
When you transform the data as you describe, the problem is that the rows in your data matrix no longer represent independent samples. While users may plausibly be assumed to be independent samples, time points for a given user are very likely to be dependent on previous time points. So this would violate the assumption that samples in your training and test set (as well as new data in production/deployment) are independent and identically distributed, meaning that you couldn't trust your performance estimates.
Instead, if you want to use machine learning algorithms for panel forecasting, a typical approach to this kind of prediction task is the following:
Regarding your input data (X), treating users as i.i.d. samples, you can
- bin the time series and treat each bin as a separate column, ignoring any temporal ordering, with the same bins for all users; the bin size could of course simply match the observed measurement interval, or you could downsample and aggregate into larger bins (see the sketch after this list),
- or use specialised time series regression/classification algorithms.
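As a rough sketch of the binning idea (with hypothetical pandas column names user_id, time and value), each user becomes one row and each time bin becomes one column:

```python
import pandas as pd

# Hypothetical long-format panel data: one measurement per user per time point
df = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2, 2],
    "time":    [0, 1, 2, 0, 1, 2],
    "value":   [0.1, 0.4, 0.3, 1.2, 1.1, 0.9],
})

# One row per user, one column per time bin; the learner ignores the temporal ordering
X = df.pivot(index="user_id", columns="time", values="value")
print(X)
```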
Regarding your output data (y), if you want to forecast multiple time points in the future, you can
- fit an estimator for each step ahead that you want to forecast, always using the same input data,
- or fit a single estimator for the first step ahead and, at prediction time, roll the input data forward in time, appending the first-step predictions to the observed input data in order to make the second-step predictions, and so on (both strategies are sketched below).
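A minimal sketch of both strategies, assuming a hypothetical array X of lagged features (one row per user) and a matrix Y whose column h holds the value h steps ahead; the data and the choice of LinearRegression are purely illustrative:

```python
import numpy as np
from sklearn.linear_model import LinearRegression  # any regressor would do

rng = np.random.default_rng(0)
n_users, n_lags, horizon = 50, 5, 3
X = rng.normal(size=(n_users, n_lags))   # hypothetical lagged features per user
Y = rng.normal(size=(n_users, horizon))  # hypothetical targets, column h = h steps ahead

# Strategy 1: one estimator per step ahead, always trained on the same inputs
direct_models = [LinearRegression().fit(X, Y[:, h]) for h in range(horizon)]
direct_preds = np.column_stack([m.predict(X) for m in direct_models])

# Strategy 2: a single one-step estimator rolled forward, where each new
# prediction is appended to the lag window used for the next step
one_step = LinearRegression().fit(X, Y[:, 0])
X_roll = X.copy()
recursive_preds = []
for _ in range(horizon):
    y_hat = one_step.predict(X_roll)
    recursive_preds.append(y_hat)
    X_roll = np.column_stack([X_roll[:, 1:], y_hat])  # drop oldest lag, append prediction
recursive_preds = np.column_stack(recursive_preds)
```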
Another typical approach is to extract features from the time series for each user, and use each extracted feature as a separate column (sketched below).
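As an illustration of the feature-extraction route, here are a few summary statistics computed by hand (libraries such as tsfresh automate this on a much larger scale), again with hypothetical column names:

```python
import pandas as pd

# Same hypothetical long-format panel data as above
df = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2, 2],
    "time":    [0, 1, 2, 0, 1, 2],
    "value":   [0.1, 0.4, 0.3, 1.2, 1.1, 0.9],
})

# Simple per-user summary features; one row per user, one column per feature
features = df.groupby("user_id")["value"].agg(["mean", "std", "min", "max", "last"])
print(features)
```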
All of the approaches above basically reduce the panel forecasting problem to a time series regression problem. Once your data is in the time series regression format, you can append any non-time-dependent user features as additional columns.
Of course there are other options for solving the panel forecasting problem, for example classical forecasting methods like ARIMA adapted to panel data, or deep learning methods that allow you to make sequence-to-sequence predictions directly.
Best Answer
I would say that yes, using actual observations during training and predicted observations during real use is valid.
This is a common approach in natural language generation.