I have a dataset of bike rentals between 2014 and 2020. I want to use the data from 2014 to 2018 to predict future usage, and test the predictions on 2019.
However, when forecasting, I get NaN values after 52 days.
The dataset looks like this:
Getting training and testing sets:
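from datetime import datetime
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX
train_end = datetime(2018, 11, 15)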
train_data = lim_station_bike_demand[:train_end]
test_begin = datetime(2019, 4, 14)
test_end = datetime(2019, 10, 31)
test_data = lim_station_bike_demand[test_begin:test_end]
Fitting the model:
my_order = (0, 1, 0)
my_seasonal_order = (1, 0, 1, 365)
model = SARIMAX(train_data, order = my_order, seasonal_order = my_seasonal_order)
model_fit = model.fit()
print(model_fit.summary())
Results:
SARIMAX Results
=============================================================================================
Dep. Variable: trip_count No. Observations: 1676
Model: SARIMAX(0, 1, 0)x(1, 0, [1], 365) Log Likelihood -8626.412
Date: Fri, 26 Nov 2021 AIC 17258.824
Time: 01:36:39 BIC 17275.094
Sample: 04-15-2014 HQIC 17264.852
- 11-15-2018
Covariance Type: opg
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
ar.S.L365 -0.6759 0.437 -1.546 0.122 -1.533 0.181
ma.S.L365 0.7154 0.430 1.662 0.097 -0.128 1.559
sigma2 1739.1146 34.307 50.692 0.000 1671.874 1806.355
===================================================================================
Ljung-Box (L1) (Q): 163.41 Jarque-Bera (JB): 1262.44
Prob(Q): 0.00 Prob(JB): 0.00
Heteroskedasticity (H): 0.90 Skew: -0.49
Prob(H) (two-sided): 0.24 Kurtosis: 7.14
===================================================================================
Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
Getting predictions and residuals:
predictions = model_fit.forecast(len(test_data))
print("len(test_data)): "+str(len(test_data)))
predictions = pd.Series(predictions, index=test_data.index)
pd.set_option('display.max_rows', None)
display(predictions)
And after 52 days I get NaN values:
len(test_data)): 201
start_date
2019-04-14 65.950757
2019-04-15 63.391462
2019-04-16 64.917660
2019-04-17 66.620774
2019-04-18 64.737917
2019-04-19 68.664719
(...)
2019-06-03 72.713157
2019-06-04 64.878058
2019-06-05 NaN
2019-06-21 NaN
2019-06-22 NaN
(...)
2019-10-29 NaN
2019-10-30 NaN
2019-10-31 NaN
Freq: D, Name: predicted_mean, dtype: float64
Do you know what could be causing this?
Best Answer
Your train_data ends on 2018-11-15, but your test_data begins on 2019-04-14. You are producing len(test_data) = 201 forecasts. However, the first 149 forecasts are for the dates from 2018-11-16 through 2019-04-13, and you discard these forecasts when you create the new pd.Series with index set equal to test_data.index, because this index is for the dates 2019-04-14 through 2019-10-31. Thus there are only 201 - 149 = 52 forecasts in your final predictions Series.
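The date arithmetic above can be checked with a short sketch that uses only the standard library. It assumes daily data with no missing days, and the names n_test, gap, surviving and steps_needed are introduced here purely for illustration; they do not appear in the question.

```python
from datetime import date, timedelta

# Dates taken from the question.
train_end = date(2018, 11, 15)
test_begin = date(2019, 4, 14)
test_end = date(2019, 10, 31)

# Number of days in the test window (both endpoints included).
n_test = (test_end - test_begin).days + 1

# Days strictly between the end of training and the start of testing.
gap = (test_begin - train_end).days - 1

# forecast(n_test) starts on the day after train_end, not on test_begin.
forecast_dates = [train_end + timedelta(days=i) for i in range(1, n_test + 1)]

# Only forecasts whose dates fall inside the test window survive the
# re-indexing to test_data.index; the remaining test dates become NaN.
surviving = [d for d in forecast_dates if test_begin <= d <= test_end]

# Forecast horizon needed to reach the end of the test window.
steps_needed = (test_end - train_end).days

print(n_test, gap, len(surviving), surviving[-1], steps_needed)
```

The last surviving date is 2019-06-04, which matches the last non-NaN row in the output shown in the question. One possible fix, assuming model_fit is the fitted results object from the question, is to forecast far enough to cover the gap as well (gap + n_test steps) and then slice out the test window, for example with .loc[test_begin:test_end], or to request the window directly with model_fit.predict(start=test_begin, end=test_end).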