Solved – Forecasting with ARIMA (Training and Test Data split)

arima, forecasting, fourier transform, r, time series

I have an hourly time series of the average parking occupancy with data available from September 2017 up until June 2018. I would like to use the ARIMA model with external regressors to produce a forecast for the next 24 hours. The data is available here.

The external regressors that I am using are: week days (1=Monday to 7=Sunday), average traffic flow and the Fourier terms.

This is what I have done up until now:

1) Checked the dominant frequency/frequencies in my data using the periodogram. The dominant period was 24 hours (as expected).

> library(forecast)
> out=TSA::periodogram(Parking$AvgOccupied)  # periodogram() is not exported by forecast; assuming the TSA package version here
> wmax=which.max(out$spec)
> freq=1/out$freq[wmax]
> 1/out$freq[wmax]
[1] 24.02402402
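
For readers who do not have a package providing `periodogram()`, roughly the same check can be sketched with base R's `spec.pgram()`. This is only an illustration: the series is synthetic (a noisy 24-hour sine, not the parking data) and `dominant_period` is a hypothetical helper name.

```r
# Hypothetical helper: dominant period (in observations) from the raw periodogram.
dominant_period <- function(x) {
  # taper = 0 with detrend = TRUE gives a plain periodogram of the detrended series
  pg <- spec.pgram(as.numeric(x), taper = 0, detrend = TRUE, plot = FALSE)
  1 / pg$freq[which.max(pg$spec)]
}

# Synthetic stand-in for the hourly series: 30 days with a daily cycle plus noise.
set.seed(1)
hours <- 1:(24 * 30)
y <- 100 + 20 * sin(2 * pi * hours / 24) + rnorm(length(hours), sd = 5)

period <- dominant_period(y)
print(period)
```

On real data the peak rarely falls exactly on 24, because the periodogram is only evaluated on a grid of frequencies that depends on the series length.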

2) Split my data into training and test data. Even though I already have the data for the average parking occupancy for June 2018, I am using it as test data, since I would like to check the accuracy of my model against it.

> Parking.Train=Parking[1:6552,] # From 01 Sep 2017 to 31 May 2018
> Parking.Test=Parking[6553:7272,] # From 01 Jun 2018 to 30 Jun 2018
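
The hard-coded row numbers can be cross-checked from the calendar. The sketch below assumes exactly 24 rows per day with no missing or duplicated hours, which the question does not state explicitly; the variable names are illustrative.

```r
# Hourly data: 24 rows per calendar day (assumes no missing or duplicated hours).
train_days <- as.integer(as.Date("2018-06-01") - as.Date("2017-09-01"))  # 01 Sep 2017 to 31 May 2018
test_days  <- as.integer(as.Date("2018-07-01") - as.Date("2018-06-01"))  # 01 Jun 2018 to 30 Jun 2018

train_rows <- train_days * 24
test_rows  <- test_days * 24

train_idx <- seq_len(train_rows)
test_idx  <- train_rows + seq_len(test_rows)

print(range(train_idx))  # 1 6552
print(range(test_idx))   # 6553 7272
```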

3) Convert the training and test data to ts objects.

ParkingTS=ts(Parking.Train$AvgOccupied,
             frequency=24,
             start=c(as.Date("2017-09-01"))) 
ParkingTS1=ts(Parking.Test$AvgOccupied,
             frequency=24,
             start=c(as.Date("2018-06-01")))
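
Passing a `Date` as `start` means the time axis is measured in days since 1970-01-01, with each hourly step equal to 1/24 of a day. The sketch below makes that coercion explicit with `as.numeric()` and uses placeholder zeros instead of the parking data; it only checks that the test series begins one step after the training series ends.

```r
# Placeholder series with the same lengths as the training and test sets.
train_ts <- ts(rep(0, 6552), frequency = 24, start = as.numeric(as.Date("2017-09-01")))
test_ts  <- ts(rep(0, 720),  frequency = 24, start = as.numeric(as.Date("2018-06-01")))

# tsp() returns start time, end time and frequency; one time unit is one day here.
print(tsp(train_ts))
print(tsp(test_ts))

# The test series starts exactly one hourly step after the training series ends.
gap <- tsp(test_ts)[1] - tsp(train_ts)[2]
print(all.equal(gap, 1 / 24))
```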

4) Fit the model with the external regressors (this code is courtesy of Dr. Rob Hyndman, https://robjhyndman.com/hyndsight/forecasting-weekly-data/).

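> bestfit=list(aicc=Inf)

> for(i in 1:11) {
   # Regressors: weekday terms, traffic flow and i pairs of Fourier terms
   ParkingARIMA=auto.arima(ParkingTS,
                           xreg=cbind(model.matrix(~Parking.Train$WeekDay)[,-1],
                                      Parking.Train$AvgTrafficFlow,
                                      forecast::fourier(ParkingTS, K=i)),
                           seasonal=FALSE)
   # Keep the fit while AICc keeps improving; stop at the first K that does not improve it
   if(ParkingARIMA$aicc < bestfit$aicc) {
     bestfit = ParkingARIMA
   } else break
 }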

The resulting model is ARIMA(0,1,5) with 4 Fourier Terms.

5) I would now like to forecast the average parking occupancy for the next 24 hours using the regressors in the test data. I take the model obtained in Step 4, combine the regressors in the test data (week days and traffic flow) with Fourier terms computed from the test data, and pass them to the forecast() function with h=24. I then compute the accuracy of the forecast against the average parking occupancy in the test data.

> ParkingForecast=forecast(bestfit,xreg=cbind(model.matrix(~Parking.Test$WeekDay)[,-1],
                                             Parking.Test$AvgTrafficFlow,
                                             forecast::fourier(ParkingTS1, K=4)))
> acc=accuracy(ParkingForecast,Parking.Test$AvgOccupied)
> acc
                          ME        RMSE         MAE          MPE         MAPE         MASE          ACF1
Training set -0.005673853141 48.64258868 31.94747327 -1.531875066  8.176109728 0.5851921293 0.02495856147
Test set     -6.410339968260 95.59476132 66.83084303 -5.812664624 17.743429782 1.2241620176            NA

QUESTIONS:

i) Is this forecasting strategy correct? Or have I missed the mark completely?

ii) Is it correct to re-estimate the Fourier terms for the test data?

NB: I am doing the above just as an experiment. I have already modelled my data using the auto.arima() function with week days and traffic flow as external regressors (without the Fourier terms) and obtained a seasonal ARIMA model, ARIMA(3,0,3)(2,1,0)[24], with the accuracy measures below:

> acc1
                         ME        RMSE         MAE          MPE         MAPE        MASE             ACF1
Training set  0.01681395761 52.63164320 32.35382066 -1.284216761  8.012784474 0.592635325 -0.0009199141052
Test set     -2.47801257238 98.98536617 61.30672355 -3.091655364 15.528942136 1.122974947               NA

Best Answer

If I understand correctly, you derive your Fourier terms from data that are only available after the test period. If you assume you can use these data, you might just as well observe the actual parking data and forecast those.

Or, in other words: no, you can only use predicted future information. For instance, you are able to predict tomorrow's weekday perfectly, so there is no problem in including the weekday. As to your parking data that you want to Fourier transform: in order to get an idea of how your algorithm performs, you will need to forecast it and Fourier transform that forecast.
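
One point that may help when applying this rule: Fourier regressors of the sine/cosine kind are, like the weekday, functions of the time index only, so they can be generated for future hours without observing any future values. A minimal base R sketch under that assumption follows; `fourier_terms` is a hypothetical helper, not the forecast package's function. In the forecast package itself, `fourier(x, K, h)` is documented to return the terms for the `h` periods following the series `x`.

```r
# Hypothetical helper: sine/cosine pairs for harmonics 1..K of a seasonal period.
# The values depend only on the time index t, never on the observed series.
fourier_terms <- function(t, period, K) {
  out <- do.call(cbind, lapply(seq_len(K), function(k) {
    cbind(sin(2 * pi * k * t / period), cos(2 * pi * k * t / period))
  }))
  colnames(out) <- as.vector(rbind(paste0("S", seq_len(K)), paste0("C", seq_len(K))))
  out
}

n_train <- 6552  # training length from the question
h <- 24          # forecast horizon
K <- 4           # number of harmonics selected in the question

train_terms  <- fourier_terms(seq_len(n_train), period = 24, K = K)
future_terms <- fourier_terms(n_train + seq_len(h), period = 24, K = K)

print(dim(train_terms))   # 6552 8
print(dim(future_terms))  # 24 8
```

Because the training length is a multiple of 24, the future terms simply repeat the first day's pattern; with a different training length the phase would shift, which is why the time index has to continue from the end of the training data.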

Finally, you might also want to look at models that capture multiple seasonalities directly, like bats() or tbats().

Related Solutions

Solved – Time Series Forecasting with Daily Data: ARIMA with regressor

You should evaluate models and forecasts from different origins across different horizons, not on a single number, in order to gauge an approach.
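
A minimal sketch of such a rolling-origin evaluation in base R is shown below. It is an illustration under stated assumptions: the series is synthetic, the forecaster is a simple seasonal-naive rule rather than the model discussed here, and both helper names are hypothetical.

```r
# Hypothetical helper: seasonal-naive forecast (repeat the last full season).
snaive_forecast <- function(y, h, period) {
  last_season <- tail(y, period)
  rep(last_season, length.out = h)
}

# Rolling-origin evaluation: forecast from several origins, then
# summarise the absolute error separately for each horizon.
rolling_origin_mae <- function(y, h, period, origins) {
  errs <- sapply(origins, function(o) {
    fc <- snaive_forecast(y[seq_len(o)], h, period)
    abs(y[o + seq_len(h)] - fc)
  })
  rowMeans(errs)  # one MAE per horizon 1..h
}

# Synthetic hourly series standing in for real data.
set.seed(42)
t <- 1:(24 * 60)
y <- 100 + 20 * sin(2 * pi * t / 24) + rnorm(length(t), sd = 5)

origins <- seq(24 * 30, length(y) - 24, by = 24)  # one origin per day
mae_by_horizon <- rolling_origin_mae(y, h = 24, period = 24, origins = origins)
print(round(mae_by_horizon, 2))
```

Swapping `snaive_forecast` for a function that refits the model of interest at each origin gives a per-horizon error profile instead of a single accuracy figure.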

I assume that your data are from the US. I prefer 3+ years of daily data, because with less you can have two holidays landing on a weekend and get no weekday read on them. It looks like your Thanksgiving impact is a day off in 2012, or there was a recording error of some kind, which caused the model to miss the Thanksgiving Day effect.

Januarys are typically low in the dataset if you look at them as a % of the year. Weekends are high. The dummies reflect this behavior: MONTH_EFF01, FIXED_EFF_N10507, FIXED_EFF_N10607.

I have found that using an AR component with daily data assumes that the day-of-the-week pattern of the last two weeks is how the pattern is in general, which is a big assumption. We started with 11 monthly dummies and 6 daily dummies. Some dropped out of the model. B**1 means that there is a lag impact the day after a holiday. There were 6 special days of the month (days 2, 3, 5, 21, 29, 30; 21 might be spurious?), 3 time trends, 2 seasonal pulses (where a day of the week started deviating from the typical: a 0 before this date and a 1 every 7th day after) and 2 outliers (note the Thanksgiving!). This took just under 7 minutes to run. Download all results here: www.autobox.com/se/dd/daily.zip

It includes a quick and dirty XLS sheet to check to see if the model makes sense. Of course, the XLS % are in fact bad as they are crude benchmarks.

Try estimating this model:

Y(T) =  .53169E+06                                                                                        
       +[X1(T)][(+  .13482E+06B** 1)]                                       M_HALLOWEEN
       +[X2(T)][(+  .17378E+06B**-3)]                                       M_JULY4TH
       +[X3(T)][(-  .11556E+06)]                                            M_MEMORIALDAY
       +[X4(T)][(-  .16706E+06B**-4+  .13960E+06B**-3-  .15636E+06B**-2                                                 
       -  .19886E+06B**-1)]                                                 M_NEWYEARS
       +[X5(T)][(+  .17023E+06B**-2-  .26854E+06B**-1-  .14257E+06B** 1)]   M_THANKSGIVI
       +[X6(T)][(-  71726.    )]                                            MONTH_EFF01
       +[X7(T)][(+  55617.    )]                                            MONTH_EFF02
       +[X8(T)][(+  27827.    )]                                            MONTH_EFF03
       +[X9(T)][(-  37945.    )]                                            MONTH_EFF09
       +[X10(T)[(-  23652.    )]                                            MONTH_EFF10
       +[X11(T)[(-  33488.    )]                                            MONTH_EFF11
       +[X12(T)[(+  39389.    )]                                            FIXED_EFF_N10107
       +[X13(T)[(+  63399.    )]                                            FIXED_EFF_N10207
       +[X14(T)[(+  .13727E+06)]                                            FIXED_EFF_N10307
       +[X15(T)[(+  .25144E+06)]                                            FIXED_EFF_N10407
       +[X16(T)[(+  .32004E+06)]                                            FIXED_EFF_N10507
       +[X17(T)[(+  .29156E+06)]                                            FIXED_EFF_N10607
       +[X18(T)[(+  74960.    )]                                            FIXED_DAY02
       +[X19(T)[(+  39299.    )]                                            FIXED_DAY03
       +[X20(T)[(+  27660.    )]                                            FIXED_DAY05
       +[X21(T)[(-  33451.    )]                                            FIXED_DAY21
       +[X22(T)[(+  43602.    )]                                            FIXED_DAY29
       +[X23(T)[(+  68016.    )]                                            FIXED_DAY30
       +[X24(T)[(+  226.98    )]                                            :TIME TREND        1                   1/  1   1/ 3/2011   I~T00001__010311stack
       +[X25(T)[(-  133.25    )]                                            :TIME TREND      423                  61/  3   2/29/2012   I~T00423__010311stack
       +[X26(T)[(+  164.56    )]                                            :TIME TREND      631                  91/  1   9/24/2012   I~T00631__010311stack
       +[X27(T)[(-  .42528E+06)]                                            :SEASONAL PULSE  733                 105/  5   1/ 4/2013   I~S00733__010311stack
       +[X28(T)[(-  .33108E+06)]                                            :SEASONAL PULSE  370                  53/  6   1/ 7/2012   I~S00370__010311stack
       +[X29(T)[(-  .82083E+06)]                                            :PULSE           326                  47/  4  11/24/2011   I~P00326__010311stack
       +[X30(T)[(+  .17502E+06)]                                            :PULSE           394                  57/  2   1/31/2012   I~P00394__010311stack
      +                    +   [A(T)]

Solved – time series forecasting using auto.arima and exponential smoothing

1) Seasonality is probably not very strong. Different algorithms will give different results, unless seasonality is glaringly obvious.

2) The best measure is always to compare forecast accuracy on a holdout set: hold back the last $n$ observations, fit your models to all other observations, forecast into the last $n$ time periods with both models, then compare forecast accuracy using your error measure of choice (see 5 below).

3) Yes, this is a common complaint. I don't think there is an easy way to get the in-sample fit. But you can get the residuals: auto.arima(WWWusage)$residuals. Best to look into the code of auto.arima() to see whether you need to add or subtract them from the original series to get the fit. I'd say you have to subtract ("actuals=model+residuals"), but better check.

4) I recommend a good forecasting textbook. This is a very good start. Otherwise, read through the help pages.

5) The appropriate error measure will depend on your personal loss function. Is your pain symmetric, and will it increase more strongly with larger errors? Then use MSE. Is your pain proportional to absolute errors? Then use MAE. Best to look at multiple error measures.
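
The holdout comparison, the "actuals = fit + residuals" relation and the error measures above can be sketched together in base R. The models here are stand-ins chosen so that no extra package is needed: `arima()` with an illustrative, untuned order and `HoltWinters()` without a seasonal component, not `auto.arima()` or `ets()`. The helper names `mae` and `rmse` are hypothetical.

```r
# Hold back the last n observations of a built-in series.
y <- WWWusage
n <- 10
train <- window(y, end = time(y)[length(y) - n])
test  <- as.numeric(window(y, start = time(y)[length(y) - n + 1]))

# Two candidate models from base R (illustrative settings, not tuned).
fit_arima <- arima(train, order = c(1, 1, 1))
fit_hw    <- HoltWinters(train, gamma = FALSE)  # exponential smoothing with trend, no seasonality

fc_arima <- as.numeric(predict(fit_arima, n.ahead = n)$pred)
fc_hw    <- as.numeric(predict(fit_hw, n.ahead = n))

# In-sample fit of the ARIMA model: actuals = fit + residuals.
fitted_arima <- train - residuals(fit_arima)

mae  <- function(actual, fc) mean(abs(actual - fc))
rmse <- function(actual, fc) sqrt(mean((actual - fc)^2))

scores <- rbind(
  arima = c(MAE = mae(test, fc_arima), RMSE = rmse(test, fc_arima)),
  hw    = c(MAE = mae(test, fc_hw),    RMSE = rmse(test, fc_hw))
)
print(scores)
```

A single holdout window of 10 points is a small sample, so the ranking it produces should be read with caution.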

One tip: averaging forecasts will usually improve accuracy. Consider taking the average of your two models' forecasts per future time bucket.
auto.arima() apparently fits no drift, even if you allow it.
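
The averaging tip can be sketched in a few lines. The numbers below are made up for illustration and equal weights are an assumption; in practice the two vectors would be the point forecasts of the two models for the same future time buckets.

```r
# Two sets of forecasts for the same future time buckets (synthetic numbers).
fc_model_a <- c(210, 215, 221, 226)
fc_model_b <- c(204, 212, 225, 230)

# Equal-weight combination, computed per time bucket.
fc_combined <- rowMeans(cbind(fc_model_a, fc_model_b))
print(fc_combined)  # 207.0 213.5 223.0 228.0
```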
