Question #1:
The problem is that in the MLE case, both the Python (statsmodels) and R procedures use state-space models to estimate the likelihood. In the SARIMAX class, the state dimension grows linearly (or worse) with the seasonal period, because the state-space form incorporates all intermediate lags too: if you have a lag at 3600, the state-space form also carries all 3599 intermediate lags.
So you now have a couple of issues. First, you're multiplying 3600 x 3600 matrices by each other at every time step, which is slow. Even worse, state space models need to be initialized, and by default they are often initialized with a stationary initialization that requires solving a linear system of order 3600. When I tested a seasonal order of 3600, estimation wasn't even getting past this step.
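To make the growth concrete, here is a minimal sketch (assuming a reasonably recent statsmodels; the small seasonal periods are stand-ins for 3600 so it runs quickly) that prints the state dimension as the period increases:

```python
import numpy as np
import statsmodels.api as sm

y = np.random.randn(200)  # placeholder data
for s in (4, 12, 60):
    mod = sm.tsa.SARIMAX(y, order=(1, 0, 0), seasonal_order=(1, 0, 0, s))
    print(s, mod.k_states)  # state dimension grows linearly with the period
```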
The R arima function accepts method='CSS' which uses least-squares (conditional MLE instead of full MLE) to solve the problem. Depending on how the arima function works, it could be much better in your case.
In Python, there aren't many good options. The SARIMAX class accepts a conserve_memory option, but if you use it, you can't forecast. To sidestep the initialization problem, you can call the initialize_approximate_diffuse method, which avoids solving the 3600-dimensional linear system. Even then, however, you'll be multiplying 3600 x 3600 matrices together, which will be quite slow. I would like to update the SARIMAX class to work with sparse matrices (which would solve this problem), but that's probably quite a ways in the future. I don't know of any non-commercial program that implements state space models using sparse matrices.
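Here is a hedged sketch of that initialization workaround (the simulated data is a placeholder, and the period 60 stands in for a much larger one like 3600):

```python
import numpy as np
import statsmodels.api as sm

np.random.seed(0)
y = np.random.randn(500)  # placeholder data

# Period 60 stands in for a much larger seasonal period like 3600
mod = sm.tsa.SARIMAX(y, order=(1, 0, 0), seasonal_order=(1, 0, 0, 60))

# Swap the default stationary initialization (which requires solving a
# large linear system) for an approximate diffuse one before fitting:
mod.initialize_approximate_diffuse()

res = mod.fit(disp=False)
print(res.params)
```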
Question #5:
This was a bug in the statsmodels code. It has been fixed in the repository (see https://github.com/ChadFulton/statsmodels/issues/2)
How is this different from utilizing 'later' data in the time series as testing?
The approach you quote is called "rolling origin" forecasting: the origin from which we forecast out is "rolled forward", and the training data is updated with the newly available information. The simpler approach is "single origin forecasting", where we pick a single origin.
The advantage of rolling origin forecasting is that it simulates a forecasting system over time. In single origin forecasting, we might by chance pick an origin where our system works very well (or very badly), which might give us an incorrect idea of our system's performance.
One disadvantage of rolling origin forecasting is its higher data requirement. If we want to forecast out 10 steps with at least 50 historical observations, then we can do this single-origin with 60 data points overall. But if we want to do 10 overlapping rolling origins, then we need 70 data points.
The other disadvantage is of course its higher complexity.
Needless to say, you should not use "later" data in rolling origin forecasting, either, but only use data prior to the origin you are using in each iteration.
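For concreteness, here is a minimal sketch of rolling origin evaluation; the naive last-value forecaster and the MAE metric are stand-ins for whatever model and metric you actually use:

```python
import numpy as np

def rolling_origin_errors(y, fit_and_forecast, min_train=50, horizon=10):
    """Mean absolute error at each origin; each origin only sees earlier data."""
    errors = []
    for origin in range(min_train, len(y) - horizon + 1):
        train = y[:origin]                      # data strictly before the origin
        forecast = fit_and_forecast(train, horizon)
        actual = y[origin:origin + horizon]
        errors.append(np.mean(np.abs(actual - forecast)))
    return np.array(errors)

# Stand-in forecaster: repeat the last observed value
np.random.seed(1)
y = np.cumsum(np.random.randn(80))
naive = lambda train, h: np.repeat(train[-1], h)
print(rolling_origin_errors(y, naive).mean())
```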
Should I be validating my RF regression model with this approach as well as on the testing data set?
If you have enough data, a rolling origin evaluation will always inspire more confidence in me than a single origin evaluation, because it will hopefully average out the impact of the origin.
Furthermore, is this sort of 'autoregressive' approach to random forest regression valid for time series, and do I even need to create this many lagged variables if I'm interested in a prediction 10 minutes in the future?
Yes, rolling vs. single origin forecasting is valid for any predictive exercise. It doesn't depend on whether you use random forests or ARIMA or anything else.
Whether you need your lagged variables is something we can't counsel you on. It might be best to talk to a subject matter expert, who might also suggest other inputs. Just try your RF with the lagged inputs vs. without. And also compare to standard benchmarks like ARIMA or ETS or even simpler methods, which can be surprisingly hard to beat.
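If it helps, here is a minimal sketch of the lagged-feature setup for a random forest; the lag count, column names, and 10-step-ahead target are illustrative assumptions rather than your exact setup:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Placeholder minute-level series
rng = np.random.default_rng(0)
y = pd.Series(np.cumsum(rng.normal(size=300)), name="y")

# Build lagged inputs and a 10-step-ahead target
n_lags, horizon = 5, 10
df = pd.DataFrame({f"lag_{k}": y.shift(k) for k in range(1, n_lags + 1)})
df["target"] = y.shift(-horizon)
df = df.dropna()

# Honest split: training data strictly precedes test data in time
split = int(len(df) * 0.8)
train, test = df.iloc[:split], df.iloc[split:]

rf = RandomForestRegressor(n_estimators=200, random_state=0)
rf.fit(train.drop(columns="target"), train["target"])
mae = np.mean(np.abs(rf.predict(test.drop(columns="target")) - test["target"]))
print(mae)
```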
Best Answer
This is the expected behavior. AR and ARMA models use only past information, so if the series fluctuates, the forecasts are always behind. Without additional information that can be used in forecasting, we cannot do much better.
For example, suppose we are forecasting the daily weather. Based on today's data, the best forecast for tomorrow is that the weather will be close to today's. If the temperature is increasing over several sunny days, then we will underestimate each of those days. If the weather turns bad over several days, then we will consistently overestimate how good the weather will be.
Now suppose the plots show the hourly temperature over a day. In this case we can use the time of day to model and forecast that the temperature is on average lower at night, higher during the day, increasing after sunrise and decreasing after sunset. By modeling this seasonal pattern, either with deterministic terms like hourly dummies or seasonal polynomials, or with a seasonal ARIMA, we can capture the daily pattern and avoid, for example, lagging behind in our forecast as the temperature rises between 8 am and 10 am.
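As a hedged illustration of both routes, deterministic dummies and a seasonal ARIMA (the simulated hourly series is a placeholder):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Placeholder hourly series with a daily (period-24) pattern plus noise
rng = np.random.default_rng(0)
hours = np.arange(24 * 14)  # two weeks of hourly data
y = 10 + 5 * np.sin(2 * np.pi * hours / 24) + rng.normal(size=hours.size)

# Deterministic route: regress on hour-of-day dummies
X = pd.get_dummies(pd.Series(hours % 24, name="hour"), drop_first=True)
ols = sm.OLS(y, sm.add_constant(X.astype(float))).fit()

# Stochastic route: a seasonal ARIMA with period 24
sarima = sm.tsa.SARIMAX(y, order=(1, 0, 0),
                        seasonal_order=(1, 0, 0, 24)).fit(disp=False)
print(ols.params.head(), sarima.params, sep="\n")
```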