Time Series – SARIMAX Statsmodels Forecast Predicts NaN Values After 52

arima, forecasting, seasonality, statsmodels, time series

I have a dataset of bike rentals from 2014 to 2020. I want to train on the data from 2014 through 2018 to predict future usage, and test the predictions on 2019.

However, when I call forecast, I get NaN values after 52 days.

The dataset looks like this:

[Figure: bike usage between 2014 and 2020]

Getting the training and testing sets:

from datetime import datetime

train_end = datetime(2018, 11, 15)
train_data = lim_station_bike_demand[:train_end]

test_begin = datetime(2019, 4, 14)
test_end = datetime(2019, 10, 31)
test_data = lim_station_bike_demand[test_begin:test_end]

Fitting the model:

from statsmodels.tsa.statespace.sarimax import SARIMAX

my_order = (0, 1, 0)
my_seasonal_order = (1, 0, 1, 365)
model = SARIMAX(train_data, order=my_order, seasonal_order=my_seasonal_order)
model_fit = model.fit()
print(model_fit.summary())

Results:

                                       SARIMAX Results                                       
=============================================================================================
Dep. Variable:                            trip_count   No. Observations:                 1676
Model:             SARIMAX(0, 1, 0)x(1, 0, [1], 365)   Log Likelihood               -8626.412
Date:                               Fri, 26 Nov 2021   AIC                          17258.824
Time:                                       01:36:39   BIC                          17275.094
Sample:                                   04-15-2014   HQIC                         17264.852
                                        - 11-15-2018                                         
Covariance Type:                                 opg                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
ar.S.L365     -0.6759      0.437     -1.546      0.122      -1.533       0.181
ma.S.L365      0.7154      0.430      1.662      0.097      -0.128       1.559
sigma2      1739.1146     34.307     50.692      0.000    1671.874    1806.355
===================================================================================
Ljung-Box (L1) (Q):                 163.41   Jarque-Bera (JB):              1262.44
Prob(Q):                              0.00   Prob(JB):                         0.00
Heteroskedasticity (H):               0.90   Skew:                            -0.49
Prob(H) (two-sided):                  0.24   Kurtosis:                         7.14
===================================================================================

Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).

Getting predictions and residuals:

predictions = model_fit.forecast(len(test_data))
print("len(test_data)): "+str(len(test_data)))

predictions = pd.Series(predictions, index=test_data.index)
pd.set_option('display.max_rows', None)
display(predictions)

And after 52 days I get NaN values:

len(test_data)): 201
start_date
2019-04-14    65.950757
2019-04-15    63.391462
2019-04-16    64.917660
2019-04-17    66.620774
2019-04-18    64.737917
2019-04-19    68.664719
(...)
2019-06-03    72.713157
2019-06-04    64.878058
2019-06-05          NaN
2019-06-21          NaN
2019-06-22          NaN
(...)
2019-10-29          NaN
2019-10-30          NaN
2019-10-31          NaN
Freq: D, Name: predicted_mean, dtype: float64

Do you know what could be causing this?

Best Answer

Your train_data ends on 2018-11-15, but your test_data begins on 2019-04-14.
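
The size of that gap can be checked with plain date arithmetic. This is a minimal sketch that uses only the three dates from the question; the names gap_days and test_days are introduced here for illustration.

```python
from datetime import datetime

# Dates copied from the question.
train_end = datetime(2018, 11, 15)
test_begin = datetime(2019, 4, 14)
test_end = datetime(2019, 10, 31)

# Days strictly between the last training day and the first test day.
gap_days = (test_begin - train_end).days - 1

# Days in the test window, counting both endpoints.
test_days = (test_end - test_begin).days + 1

print(gap_days, test_days)  # 149 201
```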

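You are producing len(test_data) = 201 forecasts, and forecast always starts at the first date after the end of the training data. The forecasts therefore cover 2018-11-16 through 2019-06-04. The first 149 of them are for the dates 2018-11-16 through 2019-04-13, and they are discarded when you create the new pd.Series with index set to test_data.index, because that index covers 2019-04-14 through 2019-10-31 and pandas aligns the values by date. Only 201 - 149 = 52 forecasts (2019-04-14 through 2019-06-04) fall inside that index; every later date has no matching forecast and is filled with NaN.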
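
The alignment behaviour can be reproduced with pandas alone. The sketch below does not fit any model: the forecast values are placeholders standing in for the output of model_fit.forecast, and the names forecast, realigned, long_forecast and test_forecast are assumptions made for this illustration. It first shows the NaN pattern from the question, then one possible way around it, which is to forecast far enough to reach test_end and slice by date.

```python
import numpy as np
import pandas as pd

# Dates copied from the question.
train_end = pd.Timestamp("2018-11-15")
test_begin = pd.Timestamp("2019-04-14")
test_end = pd.Timestamp("2019-10-31")
test_index = pd.date_range(test_begin, test_end, freq="D")  # 201 days

# Stand-in for model_fit.forecast(len(test_data)): 201 daily values that
# start the day after training ends. Values are placeholders, not model output.
forecast_index = pd.date_range(train_end + pd.Timedelta(days=1),
                               periods=len(test_index), freq="D")
forecast = pd.Series(np.arange(len(test_index), dtype=float),
                     index=forecast_index, name="predicted_mean")

# What the question does: building a Series from a Series with a new index
# aligns by label, so dates missing from `forecast` become NaN.
realigned = pd.Series(forecast, index=test_index)
print(realigned.notna().sum())       # 52
print(realigned.last_valid_index())  # 2019-06-04 00:00:00

# One possible fix: produce enough steps to reach test_end, then slice by date.
steps_needed = (test_end - train_end).days  # 350
long_index = pd.date_range(train_end + pd.Timedelta(days=1),
                           periods=steps_needed, freq="D")
long_forecast = pd.Series(np.zeros(steps_needed), index=long_index)
test_forecast = long_forecast.loc[test_begin:test_end]
print(len(test_forecast), test_forecast.isna().sum())  # 201 0
```

With the fitted model from the question, the corresponding call would presumably be model_fit.predict(start=test_begin, end=test_end), or model_fit.forecast(350) followed by .loc[test_begin:test_end]; neither is run here, so treat them as a starting point rather than a verified fix.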
