Solved – Random Forest regression for time series prediction

autoregressive, cross-validation, forecasting, random forest, time series

I'm attempting to use random forest (RF) regression to make predictions about the performance of a paper mill.

I have minute-by-minute data for the inputs (rate and amount of wood pulp going in, etc.) as well as for the performance of the machine (paper produced, power drawn by the machine), and I'm looking to make predictions 10 minutes ahead on the performance variables.

I've got 12 months of data, so I have split it into 11 months for the training set and the final month for testing.

So far I have created 10 new features, which are the performance variables lagged by 1-10 minutes, and used these as well as the inputs to make predictions. The performance on the test set has been quite good (the system is quite predictable), but I'm worried that I'm missing something in my approach.
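For reference, the lagged-feature construction described above can be sketched with pandas `shift`. The column names and the synthetic data here are illustrative assumptions, not the actual mill data:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the minute-by-minute data
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "paper_produced": rng.normal(100, 5, size=200),  # performance variable
    "pulp_rate": rng.normal(50, 2, size=200),        # input variable
})

# Target: the performance variable 10 minutes ahead
df["target"] = df["paper_produced"].shift(-10)

# Lagged features: the performance variable 1-10 minutes back
for lag in range(1, 11):
    df[f"paper_produced_lag{lag}"] = df["paper_produced"].shift(lag)

# Drop the rows made incomplete by shifting (first 10 and last 10 minutes)
df = df.dropna()
```

Shifting the target backwards (rather than the features forwards an extra step) keeps every row aligned to a single "now", which makes leakage easier to spot.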

For example, in this paper, the authors state their approach in testing the predictive ability of their random forest model:

The simulation proceeds by iteratively adding a new week of data, training a new model based on the updated data, and predicting the number of outbreaks for the following week

How is this different from utilizing 'later' data in the time series as testing? Should I be validating my RF regression model with this approach as well as on the testing data set? Furthermore, is this sort of 'autoregressive' approach to random forest regression valid for time series, and do I even need to create this many lagged variables if I'm interested in a prediction 10 minutes in the future?

Best Answer

How is this different from utilizing 'later' data in the time series as testing?

The approach you quote is called "rolling origin" forecasting: the origin from which we forecast out is "rolled forward", and the training data is updated with the newly available information. The simpler approach is "single origin forecasting", where we pick a single origin.

The advantage of rolling origin forecasting is that it simulates a forecasting system over time. In single origin forecasting, we might by chance pick an origin where our system works very well (or very badly), which might give us an incorrect idea of our system's performance.

One disadvantage of rolling origin forecasting is its higher data requirement. If we want to forecast out 10 steps with at least 50 historical observations, then we can do this single-origin with 60 data points overall. But if we want to do 10 overlapping rolling origins, then we need 70 data points: origins after observations 50 through 59, with the last forecast extending to observation 69.

The other disadvantage is of course its higher complexity.

Needless to say, you should not use "later" data in rolling origin forecasting, either, but only data prior to the current origin in each iteration.

Should I be validating my RF regression model with this approach as well as on the testing data set?

If you have enough data, a rolling origin evaluation will always inspire more confidence in me than a single origin evaluation, because it will hopefully average out the impact of the origin.

Furthermore, is this sort of 'autoregressive' approach to random forest regression valid for time series, and do I even need to create this many lagged variables if I'm interested in a prediction 10 minutes in the future?

Yes, rolling vs. single origin forecasting is valid for any predictive exercise. It doesn't depend on whether you use random forests or ARIMA or anything else.

Whether you need your lagged variables is something we can't counsel you on. It might be best to talk to a subject matter expert, who might also suggest other inputs. Simply try your RF with the lagged inputs vs. without. Also compare against standard benchmarks like ARIMA or ETS, or even simpler methods, which can be surprisingly hard to beat.
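As a sketch of the "compare against a simple benchmark" advice, here is an RF with lags 1-10 pitted against a naive persistence forecast on a synthetic random-walk series (all names and data are illustrative assumptions):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

# Synthetic persistent series standing in for a performance variable
rng = np.random.default_rng(1)
y = np.cumsum(rng.normal(size=300))

# Features: lags 1-10; target: the value 10 steps ahead
lags = np.column_stack([np.roll(y, k) for k in range(1, 11)])
X, target = lags[10:-10], y[20:]

split = 250  # time-ordered train/test split, no shuffling
rf = RandomForestRegressor(n_estimators=100, random_state=0)
rf.fit(X[:split], target[:split])
rf_mae = mean_absolute_error(target[split:], rf.predict(X[split:]))

# Naive persistence benchmark: forecast equals the most recent lagged value
naive_mae = mean_absolute_error(target[split:], X[split:, 0])
```

If the naive benchmark's MAE is close to (or below) the RF's, the lagged-feature model is adding little, which is exactly the kind of sanity check the benchmarks provide.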
