Solved – very high frequency time series analysis (seconds) and Forecasting (Python/R)

arima · forecasting · python · seasonality · time-series

I have high-frequency data (observations one second apart) which I'd like to analyse and eventually use to produce short-term forecasts (1/5/10/15/60 minutes ahead) with ARIMA models.
The full data set is very large (15 million observations). My goal is to draw conclusions about how well ARIMA models can forecast this data compared to simple benchmark methods, naive and SES, which I have already computed.
Since the data set is so large, I started by looking at plots of individual days and at the ACF and PACF of the data (to search for patterns).

a. The data is stationary – the ADF test strongly confirms this as well (so no differencing is needed)

b. The ACF plot shows strong hourly seasonality, but also structure within the first 500 lags (minute-level seasonality?). I also suspect daily seasonality (lags at 86,400)

Questions issue #1:

Suppose that, to start, I take a one-day sample (00:00–23:59) and forecast the first hour of the next day (00:00–00:59 of day 2), i.e. an out-of-sample forecast.

1.1 How much data should I fit the ARIMA model on? Is there any rule for picking the training set (especially for short-term forecasting)?

In my case this becomes quite complicated, since the non-seasonal ARIMA order found for the whole 24-hour sample is already quite high (12,0,3). The second and more critical problem is that if I also include the seasonal component, (3,0,0) with s = 3600, the computer runs out of memory. I tried this in both Python and R.

1.2 Any idea how to deal with such a case?

Questions issue #2:

What I've tried so far is averaging the data over every 60 seconds, giving a resampled series of 1,440 observations per day (one per minute). After doing that I could handle the seasonal period more easily, since now s = 60, and both R and Python could cope (although by resampling I may have lost valuable information).
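As a sketch, this 60-second averaging can be done with pandas `resample`, assuming the series carries a DatetimeIndex (simulated data here):

```python
import numpy as np
import pandas as pd

# One simulated day of second-level data (86,400 observations)
idx = pd.date_range("2024-01-01", periods=86_400, freq=pd.Timedelta(seconds=1))
rng = np.random.default_rng(1)
seconds = pd.Series(rng.normal(size=len(idx)), index=idx)

# Average every 60 seconds -> 1,440 minute observations per day
minutes = seconds.resample("1min").mean()
print(len(minutes))  # 1440
```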
Each fitting length gives a different ARIMA order, e.g. for 7 h: ARIMA(2,0,0)(1,0,0)[60], and so on.
For even shorter fitting periods I had to difference the data, since the pattern becomes non-stationary, e.g. for 2 h: ARIMA(3,1,1)(0,0,1)[60].
So I tested several approaches and checked how well each forecasts out of sample (1 hour ahead). It turned out that the shortest fitting window (2 h) produced slightly better results than the longer ones, but this could just be coincidence.

2.1: So the first question remains: how much data should the ARIMA be fitted on in the training set?
2.2: And does it make sense to use a short training set for a short forecasting horizon?

Questions issue #3:

Assume that the longer fits (say 24 h) overfit for such a short forecasting horizon, and that I should therefore use the 2-hour ARIMA fit. Remember that I'm not interested in forecasting one particular time interval; I want to use the results of many tests to draw conclusions about how well ARIMA models forecast my data (compared to other methods, using MAPE). The problem is that I would then have to fit an ARIMA model on, say, every 2-hour window again and again and accumulate the results.

3.1 But given that I have 6 months of data, that wouldn't be very practical, would it? Does anybody have a suggestion for how to approach this? Should I perhaps take a sample from each day?

Questions issue #4:

None of the dozens of ARIMA fits I have tried produced white-noise residuals in the ACF (I also tested other orders and other seasonal periods). Instead, 2 lags were significantly outside the interval (lags 11 and 19).
4.1 Is it possible to be in a situation where you simply can't find a model whose residual ACF looks like white noise? Or does it imply that something is wrong with my analysis?

Questions issue #5 (related to Python in particular and to trend='c' in general):

In the examples I've seen so far in the literature, minimal attention was paid to stationary data (most examples were non-stationary, probably because most were financial data).

5.1 Therefore I didn't understand why it wasn't possible to forecast stationary data when I set the trend parameter to a constant (trend='c'), while it worked when I set it to none (trend='n')?

I used this in Python (statsmodels), so I don't know whether it is a problem with the function or whether it is simply unwise to add a constant.
For those of you who work with statsmodels:

import statsmodels.api as sm

arima_fit = sm.tsa.SARIMAX(data_set, order=(2, 0, 0), seasonal_order=(1, 0, 0, 60), trend='c').fit()
data_forecast = arima_fit.predict('period start', 'period end', dynamic=True)

I got the following error:
"ValueError: Forecasting for models with time-varying state_intercept matrix
requires an updated time-varying matrix for the period to be forecasted.
"

I'm relatively new to time series, so please forgive the simplicity of my questions.
As for literature and examples: so far I couldn't find any ARIMA examples dealing with data with characteristics like mine (very high frequency), so if anyone can recommend something, that would be awesome.

Best Answer

Question #1:

The problem is that in the MLE case, both the Python (statsmodels) and R procedures use state-space models to evaluate the likelihood. For the SARIMAX class, the state space grows linearly (or worse) with the seasonal period, because the state-space form incorporates all intermediate lags too: if you have a lag at 3600, the state-space form also carries all 3,599 intermediate lags.

So you now have a couple of issues. First, you're multiplying 3600 × 3600 matrices by each other, which is slow. Even worse, state-space models need to be initialized, and by default they are often initialized with a stationary initialization that requires solving a linear system of order 3600. When I tested a seasonal order of 3600, it wasn't even getting past this step.

The R arima function accepts method='CSS', which uses least squares (conditional MLE instead of full MLE) to fit the model. Depending on how the arima function is implemented, it could work much better in your case.

In Python, there aren't many good options. The SARIMAX class accepts a conserve_memory option, but if you use it, you can't forecast. To solve the initialization problem, you can call the initialize_approximate_diffuse method to avoid solving the 3600-order linear system. However, even then you'll be multiplying 3600 × 3600 matrices together, which will be quite slow. I would like to update the SARIMAX class to work with sparse matrices (which would solve this problem), but that's probably quite a way off in the future. I don't know of any non-commercial program that implements state-space models using sparse matrices.

Question #5:

This was a bug in the statsmodels code. It has been fixed in the repository (see https://github.com/ChadFulton/statsmodels/issues/2)