Using lagged dependent variables in machine learning regression

ardl, machine learning, random forest, regression, time series

I'm building a machine learning (random forest) regression model to predict river flow from rainfall, relative humidity, air temperature and certain other climatic variables. Since flow on a particular day (flow_t) is highly correlated with flow on the previous day (flow_t_1), I want to include lagged flow as a predictor in the model.

Suppose I build the model this way:

library(randomForest)
flow.rf <- randomForest(flow_t ~ flow_t_1 + temp + humidity ..........)

How can I use the above model for predictions?
Since the input dataset for prediction will not have the flow variable, I cannot include its lagged version in the prediction call. I know that the dynlm package can be used to perform 'autoregressive distributed lag modeling' to include lagged dependent variables, but how can this be done for machine learning models? Or even for other statistical modeling techniques, like GLMs and GAMs?

Best Answer

  1. If you do not have the flow variable, how can you know that it is highly correlated?
  2. Related to the first point, how can you predict the flow if you do not have the variable in the first place? You need observed flow to fit your regression.
  3. In the macroeconomic and finance literature, there are ways to include latent (non-observable) series using vector autoregression (VAR) and maximum likelihood estimation. I suggest you have a look at Ang & Piazzesi (2003), who explain how to "recreate" the latent series, which in a VAR is also a dependent variable.
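  4. Once you obtain the series for flow, you can add flow_t_1 as an explanatory variable. To obtain the forecast one period ahead (at $t+1$), you simply fit the model using the data you have at time $t$ and predict with the observed flow at $t$ as the lagged input. For forecasts more than one period ahead, the lagged flow is no longer observed, so you need slightly more complicated dynamics: each predicted flow has to serve as the lagged input for the next step.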
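As a rough sketch of point 4, here is one way the lagged column and the step-by-step forecast could be set up with randomForest. Everything specific in it is an assumption made for illustration and is not from the question: the data are synthetic, the predictors `rain` and `temp` stand in for the real climatic variables, and `forecast_recursive` is a hypothetical helper. It also assumes the future climate inputs are known (in practice they would themselves be forecasts).

```r
library(randomForest)

set.seed(42)

# Synthetic daily series (hypothetical): flow depends on yesterday's flow and on rain
n <- 300
rain <- rgamma(n, shape = 2, rate = 0.5)
temp <- rnorm(n, mean = 20, sd = 5)
flow <- numeric(n)
flow[1] <- 10
for (t in 2:n) {
  flow[t] <- 0.7 * flow[t - 1] + 1.5 * rain[t] + rnorm(1, sd = 0.5)
}

# Build the lagged predictor: each row pairs flow on day t with flow on day t - 1.
# The first day is dropped because it has no previous flow.
dat <- data.frame(
  flow_t   = flow[-1],
  flow_t_1 = flow[-n],
  rain     = rain[-1],
  temp     = temp[-1]
)

# Hold out the last h days to play the role of the "future"
h <- 7
n_obs <- nrow(dat)
train <- dat[seq_len(n_obs - h), ]
test  <- dat[(n_obs - h + 1):n_obs, ]

flow.rf <- randomForest(flow_t ~ flow_t_1 + rain + temp, data = train, ntree = 200)

# Hypothetical helper: recursive multi-step forecast.
# Step 1 uses the last observed flow as the lag; every later step
# uses the previous prediction instead.
forecast_recursive <- function(model, last_flow, future_climate) {
  preds <- numeric(nrow(future_climate))
  prev <- last_flow
  for (i in seq_len(nrow(future_climate))) {
    newdata <- data.frame(flow_t_1 = prev, future_climate[i, , drop = FALSE])
    preds[i] <- predict(model, newdata = newdata)
    prev <- preds[i]
  }
  preds
}

last_flow <- train$flow_t[nrow(train)]          # last observed flow
future_climate <- test[, c("rain", "temp")]     # assumed known for the horizon
preds <- forecast_recursive(flow.rf, last_flow, future_climate)
```

The first element of `preds` is an ordinary one-step-ahead forecast. The later ones compound the model's own errors, so accuracy would be expected to degrade as the horizon grows. The same loop should work for any model with a `predict` method that takes `newdata`, such as a GLM or GAM.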