Solved – Including seasonality in ARIMA model using xreg

r, time series

I have a very small data set (12 monthly observations) from which I want to generate forecasts for the next 12 months:

    Date      Paid       Christmas   MonthNum   Month
    Jan-15    11990085   0           1          1
    Feb-15    11061740   0           2          2
    Mar-15    12076397   0           3          3
    Apr-15    11702514   0           4          4
    May-15    11395657   0           5          5
    Jun-15    11817594   0           6          6
    Jul-15    11643682   0           7          7
    Aug-15    10243241   0           8          8
    Sep-15    12233001   0           9          9
    Oct-15    11769231   0           10         10
    Nov-15    12652418   0           11         11
    Dec-15    9774333    1           12         12

I want to run auto.arima. In order to incorporate seasonal variables I used the following code:

    library(forecast)  # provides auto.arima

    xreg <- cbind(Month = model.matrix(~ as.factor(mydata$MonthNum)),
                  Holiday = mydata$Month,
                  Christmas = mydata$Christmas)

    # Remove intercept
    xreg <- xreg[, -1]

    # Rename columns
    colnames(xreg) <- c("Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug",
                        "Sep", "Oct", "Nov", "Dec", "Holiday", "Christmas")

    # Variable to be modelled
    paid <- ts(mydata$Paid, frequency = 12)

    # Find ARIMAX model
    modArima <- auto.arima(paid, xreg = xreg)

But I end up getting the following error message:

  Error in auto.arima(paid, xreg = xreg) : xreg is rank deficient

Is this because of the size of the data set?

Would be great if someone can help out.

Best Answer

Your regressor matrix is rank deficient, as the error message says. This means that there is some redundancy among your regressors. For one, your "Christmas" regressor is identical to the last column of the matrix that dummy-codes the "MonthNum" column, model.matrix(~as.factor(mydata$MonthNum)), which (intercept column first, then one indicator column each for February through December) is:

$$\left[\begin{array}{cccccccccccc}
1&0&0&0&0&0&0&0&0&0&0&0\\
1&1&0&0&0&0&0&0&0&0&0&0\\
1&0&1&0&0&0&0&0&0&0&0&0\\
1&0&0&1&0&0&0&0&0&0&0&0\\
1&0&0&0&1&0&0&0&0&0&0&0\\
1&0&0&0&0&1&0&0&0&0&0&0\\
1&0&0&0&0&0&1&0&0&0&0&0\\
1&0&0&0&0&0&0&1&0&0&0&0\\
1&0&0&0&0&0&0&0&1&0&0&0\\
1&0&0&0&0&0&0&0&0&1&0&0\\
1&0&0&0&0&0&0&0&0&0&1&0\\
1&0&0&0&0&0&0&0&0&0&0&1
\end{array}\right]$$

You don't need that "Christmas" regressor at all, and intuitively you won't be able to disentangle the effect of the holiday from anything else that is December-specific because you don't have any Decembers that don't also have a Christmas in them.
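The redundancy can be checked directly in base R. The following is a minimal sketch, assuming the regressors are rebuilt from the table in the question; the names `month_dummies`, `christmas_is_dec` and `xreg_rank` are made up for illustration.

```r
# Rebuild the regressors from the question (values copied from its table).
mydata <- data.frame(
  MonthNum  = 1:12,
  Month     = 1:12,
  Christmas = c(rep(0, 11), 1)
)

# Month dummies with the intercept column dropped, as in the question.
month_dummies <- model.matrix(~ as.factor(mydata$MonthNum))[, -1]
colnames(month_dummies) <- month.abb[-1]  # "Feb" ... "Dec"

xreg <- cbind(month_dummies,
              Holiday   = mydata$Month,
              Christmas = mydata$Christmas)

# The Christmas column is an exact copy of the December dummy.
christmas_is_dec <- all(xreg[, "Christmas"] == xreg[, "Dec"])

# 13 columns but only 12 rows, so the rank must be below the column count.
xreg_rank <- qr(xreg)$rank
print(c(columns = ncol(xreg), rank = xreg_rank))
```

Any matrix whose rank is smaller than its number of columns is rank deficient, which is the condition the error message reports.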

After that, you can't include a constant as well, because the constant and the 11 dummy columns for "MonthNum" combine linearly to reproduce the "Holiday" column (which is really a "trend" regressor, since it just counts from 1 to 12): Holiday = constant + 1·Feb + 2·Mar + … + 11·Dec. At that point it doesn't really matter which you drop (trend or constant), because either way you have 12 coefficients for 12 observations, so your model will fit your data exactly and auto.arima won't be able to find anything more useful than white noise. Your model will be grossly overfit and it will not generalize to the next year very well at all (forecasts will be bad).
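Both points in this paragraph can be verified with base R. The sketch below assumes the data from the question and uses an ordinary `lm` fit as a stand-in for the regression part of the model; `weights`, `trend_rebuilt` and `max_abs_resid` are illustrative names.

```r
month_num <- 1:12
paid <- c(11990085, 11061740, 12076397, 11702514, 11395657, 11817594,
          11643682, 10243241, 12233001, 11769231, 12652418, 9774333)

# Constant plus 11 month dummies (the matrix shown in the answer).
X <- model.matrix(~ as.factor(month_num))

# Trend = 1 * constant + 1 * Feb + 2 * Mar + ... + 11 * Dec
weights <- c(1, 1:11)
trend_rebuilt <- as.vector(X %*% weights)
trend_is_redundant <- all(trend_rebuilt == month_num)
print(trend_is_redundant)

# Twelve coefficients for twelve observations: no residual degrees of
# freedom remain, so the fitted values reproduce the data.
fit <- lm(paid ~ as.factor(month_num))
max_abs_resid <- max(abs(residuals(fit)))
print(df.residual(fit))
```

With zero residual degrees of freedom there is nothing left for an ARMA error term to explain, which is why only white noise can be found.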

This is somewhat to be expected: you are trying to learn about each month separately but you have only one example of each month, so you can't learn much. You can improve this by simplifying the model. If you don't have enough data to estimate seasonality, maybe you'd be better off dropping it from the model entirely. If you have some outside knowledge that December is different (you probably do if you included a Christmas regressor), then maybe you just want a separate mean for December, not one for each month.
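One way the "separate mean for December" idea might look is sketched below with base R's `arima`. The ARIMA order is fixed at (0,0,0), i.e. white noise around a mean, purely as an assumption for illustration; with the forecast package, the same one-column matrix could instead be passed to `auto.arima` through `xreg`. The name `dec` is hypothetical.

```r
paid <- ts(c(11990085, 11061740, 12076397, 11702514, 11395657, 11817594,
             11643682, 10243241, 12233001, 11769231, 12652418, 9774333),
           frequency = 12)

# Single regressor: 1 in December, 0 otherwise.
dec <- matrix(c(rep(0, 11), 1), ncol = 1, dimnames = list(NULL, "Dec"))

# Overall mean plus a December shift; the order (0,0,0) is an assumption.
fit <- arima(paid, order = c(0, 0, 0), xreg = dec)
print(coef(fit))

# The next 12 months follow the same indicator pattern.
fc <- predict(fit, n.ahead = 12, newxreg = dec)
print(round(fc$pred))
```

This estimates two mean parameters instead of twelve, leaving some residual variation to judge the fit by. With a single December in the data the December shift is still fitted to one observation, so the forecasts should be treated with caution.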