Fitting ARIMA to time series with missing values

I have a stationary time series object (e.g., xts) consisting of weekly continuous data. Values are missing for several weeks, sometimes at random but often in chunks of 4-5 weeks. I want to fit a time series model to the data for forecasting using the "arima" function.

Does the function "arima" take into account the missing weeks? Looks like it does not use the time index at all! Here's my code:

library(xts)

o.time.pos <- seq_len(52*5)
z.idx <- seq.Date(as.Date("2010/1/1"), by="week", length.out = 52*5)
sigma <- 1.15
phi <- 0.8
# Simulate a stationary AR(1) series and attach the weekly dates
y_ts <- arima.sim(n = length(o.time.pos), list(ar = c(phi)), sd = sigma)
y.ts <- xts(as.numeric(y_ts), order.by=z.idx)
y.ts.na <- y.ts
# Approach 1: missing values as NAs
y.ts.na[c(40:45, 72:82)] <- NA
ar1 <- arima(y.ts.na, order=c(2,1,2), method="ML")
y.ts.na1 <- y.ts
# Approach 2: missing values are deleted from the time series. However, the
# time index shows that there are weeks missing
y.ts.na1 <- y.ts.na1[-c(40:45, 72:82)]
y.ts.na1
# Excerpt of the printed output around the first gap:
# 2010-09-10 -0.341071731
# 2010-09-17 -2.141615586
# 2010-09-24 -1.538637593
# 2010-11-12 -2.801102613
# 2010-11-19 -2.447482778
# 2010-11-26 -3.176720246
# 2010-12-03 -2.532530896
ar2 <- arima(y.ts.na1, order=c(2,1,2), method="ML")

I expect ar1 and ar2 to be the same, but they are not.

summary(ar1)

Call:
arima(x = y.ts.na, order = c(2, 1, 2), method = "ML")

Coefficients:
         ar1     ar2      ma1      ma2
      0.4367  0.1801  -0.6410  -0.3319
summary(ar2)

Call:
arima(x = y.ts.na1, order = c(2, 1, 2), method = "ML")

Coefficients:
         ar1     ar2      ma1      ma2
      0.5302  0.1231  -0.7473  -0.2298

In the second method, even though the xts object carries the information about the missing weeks, "arima" does not seem to use it and instead treats the observations as contiguous. In the above example, it seems to treat the data for 2010-11-12 as the data for the week after 2010-09-24, and so on. This is clearly wrong. My understanding is that the likelihood cannot be put together when data are missing. What are my options for fitting a time series model to data with missing values?

I know one method is to impute the data (see, for example, "How to use auto.arima to impute missing values" or "How do I handle nonexistent or missing data?") and then fit, but is it possible to fit without imputation?

Best Answer

The results given by stats::arima in the first approach (ar1) are correct: they take the missing values into account. The results of the second approach (ar2) do not.

You can fit ARIMA models with missing values easily because all ARIMA models are state space models, and the Kalman filter, which is used to fit state space models, handles missing values exactly: at a time point with no observation it simply skips the update step and carries the prediction forward. So "putting together the likelihood with missing data" is absolutely possible, and it is exactly what the Kalman filter does. Any other state space model will allow you to do the same.
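To make the "skip the update" idea concrete, here is a minimal sketch of the Kalman filter likelihood for the simplest case, a zero-mean AR(1) observed without measurement noise. The function name ar1_loglik is made up for illustration and is not part of any package; stats::arima uses a more general state space form, so treat this as an illustration of the principle only.

```r
# Gaussian log-likelihood of a zero-mean AR(1) via the Kalman filter.
# ar1_loglik is a hypothetical helper written for this illustration.
ar1_loglik <- function(y, phi, sigma2) {
  a <- 0                       # predicted state mean (stationary mean)
  P <- sigma2 / (1 - phi^2)    # predicted state variance (stationary variance)
  ll <- 0
  for (t in seq_along(y)) {
    if (!is.na(y[t])) {
      # Observation available: add its one-step-ahead density, then update.
      ll <- ll + dnorm(y[t], mean = a, sd = sqrt(P), log = TRUE)
      a <- y[t]                # the state is observed exactly
      P <- 0
    }
    # Prediction step runs every period. When y[t] is NA the update above
    # is skipped, so the uncertainty P keeps growing across the gap.
    a <- phi * a
    P <- phi^2 * P + sigma2
  }
  ll
}

# Compare with stats::arima on a simulated series with gaps.
set.seed(42)                   # arbitrary seed, for reproducibility only
y <- as.numeric(arima.sim(n = 260, list(ar = 0.8), sd = 1.15))
y[c(40:45, 72:82)] <- NA
fit <- arima(y, order = c(1, 0, 0), include.mean = FALSE, method = "ML")
c(kalman = ar1_loglik(y, coef(fit)[["ar1"]], fit$sigma2),
  arima  = fit$loglik)
```

Since method = "ML" evaluates the exact Gaussian likelihood, the two numbers should agree up to numerical tolerance. The point is that no imputed value ever enters the calculation: the first observation after a gap is simply scored against a prediction with a larger variance.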

Unless you are specifically interested in an estimate of those missing values, you do not need to impute them. If you do so incorrectly, you could distort the dynamics, which would cause problems when trying to fit your model afterwards. If you only want to forecast the series, you should probably not impute them.
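As a rough sketch of forecasting without imputation, the series can be passed to stats::arima with the NAs left in place and then handed to predict. The AR(1) order and the seed below are assumptions chosen to match the simulated data in the question, not a recommendation for real data.

```r
set.seed(1)                                  # arbitrary seed, for reproducibility only
y <- arima.sim(n = 260, list(ar = 0.8), sd = 1.15)
y[c(40:45, 72:82)] <- NA                     # gaps are marked as NA, not deleted

# Fit directly on the series with NAs; order c(1, 0, 0) is an assumed
# choice that matches the simulated AR(1) process.
fit <- arima(y, order = c(1, 0, 0), method = "ML")

fc <- predict(fit, n.ahead = 8)              # forecasts for the next 8 weeks
fc$pred                                      # point forecasts
fc$se                                        # forecast standard errors
fit$nobs                                     # number of observations actually used
```

Here fit$nobs counts only the non-missing observations, which is a quick way to check that the NAs were skipped rather than filled in.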

The question of why ar1 is correct but ar2 is not is not exactly on topic here, but for the record: stats::arima expects your data as an object of class ts, not xts. If your data isn't ts, it will be converted using as.ts, which discards the date information. This means that the explicit NAs in the first approach are retained, while the implicit ones in the second do not appear at all, so the function does indeed just glue the pieces of the series together. The reason stats::arima expects an object of class ts is that this class enforces regularly sampled data (at a certain frequency), whereas xts can carry arbitrarily sampled data, and classical ARIMA models are defined for regularly sampled data only.
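If your data arrive in the second form (rows deleted, gaps visible only in the dates), one way to get back to the first form is to rebuild the full weekly grid and leave NA where no observation exists. The sketch below uses base R only; the variable names (dates_obs, values_obs, full_grid, y_reg) are made up for illustration, and the input is simulated. With xts, a similar effect is commonly obtained by merging against a zero-width xts built on the full index.

```r
# Hypothetical irregular input: weekly dates with some weeks absent.
set.seed(1)                                   # arbitrary seed, for reproducibility only
all_weeks  <- seq(as.Date("2010-01-01"), by = "week", length.out = 260)
values     <- as.numeric(arima.sim(n = 260, list(ar = 0.8), sd = 1.15))
dropped    <- c(40:45, 72:82)
dates_obs  <- all_weeks[-dropped]             # dates that actually have data
values_obs <- values[-dropped]

# Rebuild the regular weekly grid and slot the observations into it,
# leaving NA for every week that has no observation.
full_grid <- seq(min(dates_obs), max(dates_obs), by = "week")
y_reg <- rep(NA_real_, length(full_grid))
y_reg[match(dates_obs, full_grid)] <- values_obs
y_reg <- ts(y_reg)                            # regular series with explicit NAs

sum(is.na(y_reg))                             # number of missing weeks restored
fit <- arima(y_reg, order = c(1, 0, 0), method = "ML")  # assumed AR(1) order
```

After this step the gaps are explicit, so as.ts has nothing left to throw away and the fit corresponds to the ar1 approach rather than ar2.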