Solved – Time Series Regression using dummy variables and fpp package

r, regression, time series

I want to solve the first exercise of the Multiple Regression chapter of R. Hyndman's online book on time series forecasting (see https://www.otexts.org/fpp/5/8). I use R with the fpp package, as the exercise requires.

I am stuck on the following question:
c. Use R to fit a regression model to the logarithms of these sales data with a linear trend, seasonal dummies and a “surfing festival” dummy variable.

Specifically, I don't know how to make the tslm function work with my dummy vector for the surfing festival. Here is my code:

library(fpp)
log_fancy = log(fancy)
dummy_fest_mat = matrix(0, nrow=84, ncol=1)
for(h in 1:84)
    if(h%%12 == 3)   #this loop builds a vector of length 84 with
        dummy_fest_mat[h,1] = 1   #1 corresponding to each month March
dummy_fest_mat[3,1] = 0 #festival started one year later

dummy_fest = ts(dummy_fest_mat, freq = 12, start=c(1987,1))
fit = tslm(log_fancy ~ trend + season + dummy_fest)

When I run summary(fit), the regression coefficients are estimated as expected, but when I continue with forecast(fit), I get the following error:

Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) : 
  variables have not equal length (found for 'factor(dummy_fest)')
In addition: Warning message:
'newdata' had 50 rows but variables found have 84 rows 

Even stranger, forecast(fit, h=84) works!
I don't understand what is happening here. Can someone explain?

Best Answer

First of all, you should be in the habit of keeping your datasets in data.frames:

library(fpp)
log_fancy = log(fancy)
dummy_fest = rep(0, length(fancy))
dummy_fest[seq_along(dummy_fest)%%12 == 3] <- 1
dummy_fest[3] <- 0 #festival started one year later
dummy_fest = ts(dummy_fest, freq = 12, start=c(1987,1))
my_data <- data.frame(
  log_fancy,
  dummy_fest
)
fit = tslm(log_fancy ~ trend + season + dummy_fest, data=my_data)

This is cleaner than leaving all your data lying around in the global environment, and will help prevent bugs in your analysis.
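
As an aside, the same dummy can be derived from the time index of the series rather than from position arithmetic. This is only a sketch using base R's cycle() and time(). Here idx is a hypothetical stand-in series with the same span as fancy (January 1987 to December 1993), so the snippet runs without the fpp package:

```r
# Hypothetical stand-in for `fancy`: 84 monthly observations from Jan 1987
idx <- ts(seq_len(84), frequency = 12, start = c(1987, 1))

# cycle() gives the month number (1-12); time() gives the decimal year.
# The dummy is 1 for every March from 1988 onwards and 0 elsewhere.
is_festival <- cycle(idx) == 3 & floor(time(idx)) >= 1988
dummy_fest <- ts(as.numeric(is_festival), frequency = 12, start = c(1987, 1))
```

With the real data, idx would simply be replaced by fancy. The result should match the vector built above with %% 12, but it does not rely on the series starting in January.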

However, when we try to forecast from this model, we still get an error:

forecast(fit)
Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) : 
  variable lengths differ (found for 'dummy_fest')
In addition: Warning message:
'newdata' had 10 rows but variables found have 84 rows 

Interestingly, if we omit the dummy_fest variable, the forecast works fine:

fit2 = tslm(log_fancy ~ trend + season, data=my_data)
forecast(fit2)
         Point Forecast     Lo 80     Hi 80     Lo 95     Hi 95
Jan 1994       9.509264  9.246995  9.771532  9.105002  9.913525
Feb 1994       9.782700  9.520432 10.044969  9.378439 10.186962
Mar 1994      10.249256  9.986988 10.511525  9.844995 10.653518
Apr 1994       9.959377  9.697108 10.221645  9.555115 10.363638
May 1994      10.006830  9.744562 10.269098  9.602569 10.411091
Jun 1994      10.068191  9.805923 10.330459  9.663930 10.472452
Jul 1994      10.251837  9.989569 10.514105  9.847576 10.656099
Aug 1994      10.251367  9.989099 10.513635  9.847106 10.655628
Sep 1994      10.354752 10.092484 10.617020  9.950491 10.759014
Oct 1994      10.454834 10.192566 10.717102 10.050573 10.859095

What's going on here?

The answer, of course, is that while the forecast function is very smart and knows how to extrapolate your trend and season variables, it unfortunately knows nothing about surfing festivals in eastern Australia.

You need to tell the forecast function when the surfing festival will occur in the future!

For example, here's a forecast assuming the surfing festival is cancelled, and never happens again:

future_data <- data.frame(
  dummy_fest = rep(0, 12)
)
forecast(fit, newdata=future_data)
         Point Forecast     Lo 80     Hi 80     Lo 95    Hi 95
Jan 1994       9.491352  9.238522  9.744183  9.101594  9.88111
Feb 1994       9.764789  9.511959 10.017620  9.375031 10.15455
Mar 1994       9.801475  9.461879 10.141071  9.277961 10.32499
Apr 1994       9.941465  9.688635 10.194296  9.551707 10.33122
May 1994       9.988919  9.736088 10.241749  9.599161 10.37868
Jun 1994      10.050280  9.797449 10.303110  9.660522 10.44004
Jul 1994      10.233926  9.981095 10.486756  9.844168 10.62368
Aug 1994      10.233456  9.980625 10.486286  9.843698 10.62321
Sep 1994      10.336841 10.084010 10.589671  9.947083 10.72660
Oct 1994      10.436923 10.184092 10.689753 10.047165 10.82668
Nov 1994      10.918299 10.665468 11.171129 10.528541 11.30806
Dec 1994      11.695812 11.442981 11.948642 11.306054 12.08557

You'll probably want to edit future_data to include 1's when you think the surfing festival will occur in the future.
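
For instance, here is a minimal sketch of a 12-month future_data in which the festival takes place in March as usual. It assumes the forecast starts in January 1994, the month after the last observation in fancy. The names h and future_months are introduced here only for illustration:

```r
# Forecast horizon in months; assumed to start in January 1994
h <- 12

# Month number (1-12) of each forecast step, counting from January
future_months <- ((seq_len(h) - 1) %% 12) + 1

# Festival dummy: 1 in March, 0 in every other month
future_data <- data.frame(
  dummy_fest = as.numeric(future_months == 3)
)

# With `fit` from the model above, the forecast would then be:
# forecast(fit, newdata = future_data)
```

The number of rows in future_data determines how many months are forecast, so a longer horizon only requires a larger h.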
