Solved – Time Series Forecasting in R

accuracy, arima, forecasting, regression, time-series

I have a dataset with three columns: Date, Cash, and NumberOfAccountsperMonth. The frequency of the data is monthly.

I'd like to forecast Cash for the next 6 months in R, and so far I don't know which method is the best to go with.

On one hand, I just create a time series for Cash with the ts() function, then call auto.arima(), which selects an ARIMA(0,1,1)(1,0,0)[12] model, since the data has seasonality and a trend.

On the other hand, I know that the number of accounts influences Cash, so I've built a linear regression model for my time series with the tslm() function and then produced a forecast from it.

The problem is that I'm getting very different results. Could anyone tell me which way to go?

Here's my code and the results:

tsCash <- ts(Cash, start = c(2008, 1), end = c(2017, 10), frequency = 12)
fit.arima <- auto.arima(tsCash)
fcast.arima <- forecast(fit.arima, h = 6)
summary(fit.arima)
Series: tsCash
ARIMA(0,1,1)(1,0,0)[12] with drift 

Coefficients:
          ma1    sar1     drift
      -0.7296  0.3983  7505.999
s.e.   0.0540  0.0910  2092.337

sigma^2 estimated as 2.804e+09:  log likelihood=-1438.54
AIC=2885.08   AICc=2885.44   BIC=2896.13

Training set error measures:
                    ME     RMSE     MAE       MPE     MAPE      MASE
Training set -237.2423 52052.09 36481.3 -55.44956 66.78608 0.4086746
                    ACF1
Training set -0.06202615

Plot

TS Regression code:

fit.tsreg <- tslm(tsCash ~ NumberAccounts + trend + season)
fcast.tsreg <- forecast(fit.tsreg, newdata = data.frame(NumberAccounts=NumberAccounts))
summary(fit.tsreg)

Call:
tslm(formula = tsCash ~ NumberAccounts + trend + season)

Residuals:
    Min      1Q  Median      3Q     Max 
-135019  -47778   -1129   41334  220754 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -3.376e+05  3.752e+04  -8.998 1.16e-14 ***
NumberAccounts  4.482e-01  6.249e-02   7.171 1.12e-10 ***
trend        7.786e+03  2.682e+02  29.026  < 2e-16 ***
season2      3.626e+04  3.260e+04   1.112   0.2686    
season3      4.329e+04  3.241e+04   1.336   0.1845    
season4      3.826e+04  3.264e+04   1.172   0.2438    
season5      1.062e+04  3.243e+04   0.327   0.7440    
season6      4.519e+04  3.265e+04   1.384   0.1693    
season7      1.757e+04  3.242e+04   0.542   0.5889    
season8      1.634e+03  3.264e+04   0.050   0.9602    
season9      8.869e+03  3.243e+04   0.273   0.7850    
season10     5.904e+04  3.268e+04   1.806   0.0737 .  
season11    -4.469e+03  3.330e+04  -0.134   0.8935    
season12     8.474e+04  3.362e+04   2.520   0.0132 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 72430 on 104 degrees of freedom
Multiple R-squared:  0.9174,    Adjusted R-squared:  0.9071 
F-statistic: 88.84 on 13 and 104 DF,  p-value: < 2.2e-16

Plot

Here you can see the accuracy() results for both procedures:

accuracy(fcast.arima)
                   ME     RMSE      MAE       MPE     MAPE      MASE      ACF1
Training set 1886.528 48855.13 32553.28 -36.81754 55.17642 0.4766057 0.1147092

accuracy(fcast.tsreg)
                       ME     RMSE      MAE      MPE     MAPE      MASE       ACF1
Training set 3.183231e-12 48007.12 37345.96 10.32393 130.0177 0.5467744 0.07845181

If the ARIMA forecast seems to be more accurate, why is that? Since I'm taking into consideration an independent variable that I know for sure influences my dependent variable, shouldn't the regression forecast be more accurate?

Best Answer

accuracy() in your example only looks at in-sample accuracy. In-sample fit is a notoriously poor guide to out-of-sample accuracy.

Use a holdout sample instead: hold back the last $N$ observations, fit your models to the observations before that, forecast into the holdout period, and evaluate those forecasts.
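A minimal sketch of that procedure with the forecast package (the synthetic series below is only a stand-in for your tsCash, which isn't reproduced here; the dates and series length are assumptions):

```r
library(forecast)

# Synthetic stand-in for tsCash -- substitute your real series
set.seed(1)
tsCash <- ts(cumsum(rnorm(118, mean = 5)) + 10 * sin(2 * pi * (1:118) / 12),
             start = c(2008, 1), frequency = 12)

# Hold back the last 12 months as a test set
train <- window(tsCash, end = c(2016, 10))
test  <- window(tsCash, start = c(2016, 11))

# Fit to the training data only, then forecast over the holdout period
fit <- auto.arima(train)
fc  <- forecast(fit, h = length(test))

# The "Test set" row now gives genuine out-of-sample error measures
accuracy(fc, test)
```

Run the same split for your tslm() model and compare the two models on the Test-set RMSE/MAE rather than the Training-set rows you quoted.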

In addition, you can feed NumberAccounts into the xreg argument of auto.arima(). This fits a regression with ARIMA errors, combining both of your approaches in one model.
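A sketch of that approach, again with placeholder data standing in for your real series. The key points are that xreg takes a numeric matrix and that forecasting requires *future* values of the regressor, which you must supply or forecast separately (the carried-forward values below are just an illustration):

```r
library(forecast)

# Placeholder data standing in for the real Cash / NumberAccounts series
set.seed(1)
n <- 118
NumberAccounts <- 1000 + cumsum(rpois(n, 20))
tsCash <- ts(0.45 * NumberAccounts + rnorm(n, sd = 50),
             start = c(2008, 1), frequency = 12)

# Regression on NumberAccounts with ARIMA errors for the residuals
fit.xreg <- auto.arima(tsCash, xreg = cbind(accounts = NumberAccounts))

# To forecast 6 months ahead you need 6 future regressor values;
# here we naively carry the last observed value forward
future.accounts <- rep(tail(NumberAccounts, 1), 6)
fc <- forecast(fit.xreg, xreg = cbind(accounts = future.accounts))
```

Note that the column names passed to xreg at forecast time must match the ones used when fitting.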