Solved – Using the R forecast package with missing values and/or irregular time series

forecasting, missing data, r, time series, unevenly-spaced-time-series

I am impressed by the R forecast package, as well as e.g. the zoo package for irregular time series and interpolation of missing values.

My application is call center traffic forecasting, so data on weekends is (nearly) always missing, which zoo handles nicely. Some individual points may also be missing; I just use R's NA for those.

The thing is: all the nice magic of the forecast package, such as ets(), auto.arima() etc., seems to expect plain ts objects, i.e. equispaced time series not containing any missing data. I think real-world applications for equispaced-only time series certainly exist, but in my opinion they are very limited.

The problem of a few isolated NA values can easily be solved by using any of the interpolation functions offered in zoo, or by forecast::na.interp. After that, I run the forecast.
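A minimal sketch of that workflow, assuming a short made-up series of call counts (the data and variable names are illustrative only, not real call center figures). It fills the gaps once with zoo::na.approx and once with forecast::na.interp, then fits a model on the filled series:

```r
library(zoo)       # na.approx()
library(forecast)  # na.interp(), ets(), forecast()

# Made-up call counts with two isolated missing observations
calls <- ts(c(120, 135, NA, 150, 142, 138, NA, 160, 155, 149, 170, 165))

# Linear interpolation between the neighbouring observations
filled_zoo <- zoo::na.approx(calls)

# na.interp() also interpolates linearly for a non-seasonal series like this one
filled_fc <- forecast::na.interp(calls)

# Fit an exponential smoothing model on the filled series and forecast 5 steps
fit <- forecast::ets(filled_fc)
fc  <- forecast::forecast(fit, h = 5)
print(fc)
```

Either function removes the NA values before the model sees the data; which interpolation is appropriate depends on the series, so this is only one possible choice.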

My questions:

  1. Can anyone suggest a better solution?
  2. (my main question) At least in my application domain, call center traffic forecasting (and, as far as I can imagine, in most other problem domains), time series are not equispaced. At the very least there is a recurring "business days" pattern. What's the best way to handle that and still use all the cool magic of the forecast package?

    Should I just "compress" the time series to fill the weekends, do the forecast, and then "inflate" the data again to re-insert NA values in the weekends? (That would be a shame, I think?)

    Are there any plans to make the forecast package fully compatible with irregular time series packages like zoo or its? If yes, when? If no, why not?

I'm quite new to forecasting (and statistics in general), so I might overlook something important.

Best Answer

You should be very careful when you apply interpolation before further statistical treatment. The interpolation method you choose introduces a bias into your data. This is something you definitely want to avoid, as it could degrade the quality of your predictions.

In my opinion, for missing values such as those you mentioned, which are regularly spaced in time and correspond to a stop in activity, it might be more correct to leave these days out of your model. In the little world of your call center (the model you are building of it), it might be better to consider that time simply stops while the center is closed, instead of inventing measurements of a non-existent activity.

On the other hand, the ARIMA model is built on the assumption that the data are equally spaced. As far as I know, there is no adaptation of ARIMA to your case. If you are only missing a few measurements on actual working days, you might be forced to use interpolation.
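As a rough sketch of the "time simply stops" idea, the weekends can be dropped so that the business days form a regular series with a period of five observations, and the forecasts can then be mapped back onto calendar dates. The dates, the simulated call volumes, and all variable names below are assumptions made for illustration; auto.arima() is just one possible model choice.

```r
library(forecast)  # auto.arima(), forecast()

set.seed(1)

# Made-up calendar: 8 full weeks of daily dates starting on a Monday
dates  <- seq(as.Date("2024-01-01"), by = "day", length.out = 56)
# POSIXlt wday: 0 = Sunday, 6 = Saturday (independent of locale)
wday   <- as.POSIXlt(dates)$wday
is_biz <- wday %in% 1:5

# Made-up call volumes; weekends are NA because the call center is closed
calls <- ifelse(is_biz,
                round(200 + 20 * sin(2 * pi * wday / 5) + rnorm(56, sd = 5)),
                NA)

# Keep only business days and treat them as a regular series
# with 5 observations per period (one working week)
biz_calls <- ts(calls[is_biz], frequency = 5)

fit <- forecast::auto.arima(biz_calls)
fc  <- forecast::forecast(fit, h = 10)   # next 10 business days

# Map the forecasts back onto calendar dates, skipping weekends
future     <- seq(max(dates) + 1, by = "day", length.out = 14)
future_biz <- future[as.POSIXlt(future)$wday %in% 1:5]
result     <- data.frame(date = future_biz, forecast = as.numeric(fc$mean))
print(result)
```

This sketch only handles weekends. A public holiday on a weekday would still show up as a gap in the business-day series and would have to be interpolated or otherwise dealt with separately.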