Solved – Multiple imputation for missing count data in a time series from a panel study

data-imputation, panel data, r, time series

I am trying to impute missing data from a panel data study (I am not sure I am using 'panel data study' correctly, as I only learned the term today). I have total death counts for every month of the years 2003 to 2009, broken down by gender (male and female), 8 districts and 4 age groups.

The dataframe looks something like this:

         District  Gender Year Month    AgeGroup TotalDeaths
         Northern    Male 2006    11        01-4           0
         Northern    Male 2006    11       05-14           1
         Northern    Male 2006    11         15+          83
         Northern    Male 2006    12           0           3
         Northern    Male 2006    12        01-4           0
         Northern    Male 2006    12       05-14           0
         Northern    Male 2006    12         15+         106
         Southern  Female 2003     1           0           6
         Southern  Female 2003     1        01-4           0
         Southern  Female 2003     1       05-14           3
         Southern  Female 2003     1         15+         136
         Southern  Female 2003     2           0           6
         Southern  Female 2003     2        01-4           0
         Southern  Female 2003     2       05-14           1
         Southern  Female 2003     2         15+         111
         Southern  Female 2003     3           0           2
         Southern  Female 2003     3        01-4           0
         Southern  Female 2003     3       05-14           1
         Southern  Female 2003     3         15+         141
         Southern  Female 2003     4           0           4

For 10 months spread over 2007 and 2008, some of the total deaths, across all districts, were not recorded. I am trying to estimate these missing values with a multiple imputation method, using either generalized linear models or SARIMA models.

My biggest issue is the software and the coding. I asked a question on Stack Overflow about extracting the data into smaller groups such as this:

         District  Gender Year Month    AgeGroup TotalDeaths
         Northern    Male 2003     1        01-4           0
         Northern    Male 2003     2        01-4           1
         Northern    Male 2003     3        01-4           0
         Northern    Male 2003     4        01-4           3
         Northern    Male 2003     5        01-4           4
         Northern    Male 2003     6        01-4           6
         Northern    Male 2003     7        01-4           5
         Northern    Male 2003     8        01-4           0
         Northern    Male 2003     9        01-4           1
         Northern    Male 2003    10        01-4           2
         Northern    Male 2003    11        01-4           0
         Northern    Male 2003    12        01-4           1
         Northern    Male 2004     1        01-4           1
         Northern    Male 2004     2        01-4           0

and so on, up to

         Northern    Male 2006    11        01-4           0
         Northern    Male 2006    12        01-4           0

But someone suggested I should bring my question here instead, perhaps to ask for direction. Currently I am unable to get this data into R as a proper time series/panel data set. My eventual aim is to use the Amelia II package and its functions to impute the missing TotalDeaths for certain months in 2007 and 2008.

Any help with how to do this, and any suggestions on how to tackle the problem, would be greatly appreciated.

If this helps, I am trying to follow a similar approach to what Clint Roberts did in his PhD Thesis.

EDIT:

After creating the 'time' and 'group' variable as suggested by @Matt:

> head(dat)
     District Gender Year Month AgeGroup Unnatural Natural Total time                    group
1 Khayelitsha Female 2001     1        0         0       6     6    1     Khayelitsha.Female.0
2 Khayelitsha Female 2001     1     01-4         1       3     4    1  Khayelitsha.Female.01-4
3 Khayelitsha Female 2001     1    05-14         0       0     0    1 Khayelitsha.Female.05-14
4 Khayelitsha Female 2001     1     15up         8      73    81    1  Khayelitsha.Female.15up
5 Khayelitsha Female 2001     2        0         2       9    11    2     Khayelitsha.Female.0
6 Khayelitsha Female 2001     2     01-4         0       2     2    2  Khayelitsha.Female.01-4

As you can see, there is actually further detail: 'Natural' and 'Unnatural'.

Best Answer

You can use the Amelia package to impute the data (full disclosure: I am one of the authors of Amelia). The package vignette has an extended example of how to use it to impute missing data.

It seems as though your units are district-gender-ageGroup combinations observed at the monthly level. First, create a factor variable with one level for each unit (that is, one level for each district-gender-ageGroup combination). Let's call this group. Then you need a time variable, which is probably a month counter that starts at 1 in January 2003, so that it equals 13 in January 2004. Call this variable time. Amelia will let you impute based on the time trends with the following commands:

library(Amelia)
a.out <- amelia(my.data, ts = "time", cs = "group", splinetime = 2, intercs = TRUE)

The ts and cs arguments simply denote the time and unit variables. The splinetime argument sets how flexibly time is used to impute the missing data. Here, a 2 means that the imputation will use a quadratic function of time; higher values are more flexible. The intercs argument tells Amelia to use a separate time trend for each district-gender-ageGroup. This adds many parameters to the model, so if you run into trouble, you can set it to FALSE while you debug.

In any event, this will get you imputations using the time information in your data. Since death counts cannot fall below zero, you can use the bounds argument to keep the imputations within those logical bounds.
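As a rough sketch of what the bounds argument expects: it takes a matrix with one row per bounded variable, of the form c(column number, lower bound, upper bound). The toy data frame and the upper bound below are made up for illustration; only the layout of the matrix is the point. The amelia() call is left as a comment because it simply repeats the call above with bounds added.

```r
# Toy stand-in for the real data (values are made up; only the layout matters).
my.data <- data.frame(
  group       = factor(rep(c("A", "B"), each = 6)),
  time        = rep(1:6, times = 2),
  TotalDeaths = c(3, 0, NA, 4, 6, 5, 83, 106, NA, NA, 111, 141)
)

# One row per bounded variable: c(column number, lower bound, upper bound).
# The upper bound is an arbitrary loose ceiling chosen for this sketch.
death.col <- which(names(my.data) == "TotalDeaths")
upper     <- 10 * max(my.data$TotalDeaths, na.rm = TRUE)
bds       <- matrix(c(death.col, 0, upper), nrow = 1, ncol = 3)
bds

# The matrix is then passed along with the other arguments, e.g.:
# a.out <- amelia(my.data, ts = "time", cs = "group", splinetime = 2,
#                 intercs = TRUE, bounds = bds)
```

If the upper bound is not meaningful for your data, any value comfortably above the largest plausible count should leave it inactive.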

EDIT: How to create group/time variables

The time variable might be the easiest to create, because you just need to count from 2002 (assuming that is the lowest year in your data):

my.data$time <- my.data$Month + 12 * (my.data$Year - 2002)

The group variable is slightly harder but a quick way to do it is using the paste command:

my.data$group <- with(my.data, 
                      as.factor(paste(District, Gender, AgeGroup, sep = ".")))
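To see what these two lines produce, here is a small sketch on a made-up data frame with the same column names as in the question (the rows themselves are invented). It also checks that each group-time pair occurs only once, which is what you would expect if every unit is observed at most once per month.

```r
# Toy rows with the same columns as the question's data (values are made up).
toy <- data.frame(
  District = c("Northern", "Northern", "Southern"),
  Gender   = c("Male", "Male", "Female"),
  Year     = c(2002, 2003, 2004),
  Month    = c(1, 1, 12),
  AgeGroup = c("01-4", "01-4", "15+")
)

# Month counter: January 2002 is 1, January 2003 is 13, December 2004 is 36.
toy$time <- toy$Month + 12 * (toy$Year - 2002)

# One factor level per district-gender-ageGroup combination.
toy$group <- with(toy, as.factor(paste(District, Gender, AgeGroup, sep = ".")))

toy$time            # 1 13 36
levels(toy$group)   # "Northern.Male.01-4" "Southern.Female.15+"

# Each unit should appear at most once per time point.
any(duplicated(toy[c("group", "time")]))   # FALSE
```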

With these variables created, you want to remove the original variables from the imputation. To do that you can use the idvars argument:

a.out <- amelia(my.data, ts = "time", cs = "group", splinetime = 2, intercs = TRUE,
                idvars = c("District", "Gender", "Month", "Year", "AgeGroup"))

Related Solutions

Solved – How to handle nonexistent or missing data

My suggestion is similar to what you propose, except that I would use a time series model instead of moving averages. The ARIMA framework is also suitable for obtaining forecasts that include not only the series MSCI as a regressor but also lags of the GCC series, which may capture the dynamics of the data.

First, you may fit an ARIMA model for the series MSCI and interpolate the missing observations in this series. Then, you may fit an ARIMA model for the series GCC using MSCI as an exogenous regressor and obtain the forecasts for GCC based on this model. In doing this, you must be careful with the breaks that are visible in the plots of the series, as they may distort the selection and fit of the ARIMA model.

Here is what I get doing this analysis in R. I use the function forecast::auto.arima to make the selection of the ARIMA model and tsoutliers::tso to detect possible level shifts (LS), temporary changes (TC) or additive outliers (AO).

These are the data once loaded:

gcc <- structure(c(117.709, 120.176, 117.983, 120.913, 134.036, 145.829, 143.108, 149.712, 156.997, 162.158, 158.526, 166.42, 180.306, 185.367, 185.604, 200.433, 218.923, 226.493, 230.492, 249.953, 262.295, 275.088, 295.005, 328.197, 336.817, 346.721, 363.919, 423.232, 492.508, 519.074, 605.804, 581.975, 676.021, 692.077, 761.837, 863.65, 844.865, 947.402, 993.004, 909.894, 732.646, 598.877, 686.258, 634.835, 658.295, 672.233, 677.234, 491.163, 488.911, 440.237, 486.828, 456.164, 452.141, 495.19, 473.926, 
492.782, 525.295, 519.081, 575.744, 599.984, 668.192, 626.203, 681.292, 616.841, 676.242, 657.467, 654.66, 635.478, 603.639, 527.326, 396.904, 338.696, 308.085, 279.706, 252.054, 272.082, 314.367, 340.354, 325.99, 326.46, 327.053, 354.192, 339.035, 329.668, 318.267, 309.847, 321.98, 345.594, 335.045, 311.363, 
299.555, 310.802, 306.523, 315.496, 324.153, 323.256, 334.802, 331.133, 311.292, 323.08, 327.105, 320.258, 312.749, 305.073, 297.087, 298.671), .Tsp = c(2002.91666666667, 2011.66666666667, 12), class = "ts")
msci <- structure(c(1000, 958.645, 1016.085, 1049.468, 1033.775, 1118.854, 1142.347, 1298.223, 1197.656, 1282.557, 1164.874, 1248.42, 1227.061, 1221.049, 1161.246, 1112.582, 929.379, 680.086, 516.511, 521.127, 487.562, 450.331, 478.255, 560.667, 605.143, 598.611, 609.559, 615.73, 662.891, 655.639, 628.404, 602.14, 601.1, 622.624, 661.875, 644.751, 588.526, 587.4, 615.008, 606.133, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 609.51, 598.428, 595.622, 582.905, 599.447, 627.561, 619.581, 636.284, 632.099, 651.995, 651.39, 687.194, 676.76, 694.575, 704.806, 727.625, 739.842, 759.036, 787.057, 817.067, 824.313, 857.055, 805.31, 873.619), .Tsp = c(2007.33333333333, 2014.5, 12), class = "ts")

Step 1: Fit an ARIMA model to the series MSCI

Although the plot reveals some breaks, tso detected no outliers. This may be because there are several missing observations in the middle of the sample. We can deal with this in two steps. First, fit an ARIMA model and use it to interpolate the missing observations; second, fit an ARIMA model to the interpolated series, checking for possible LS, TC and AO, and refine the interpolated values if changes are found.

Choose ARIMA model for the series MSCI:

require("forecast")
fit1 <- auto.arima(msci)
fit1
# ARIMA(1,1,2) with drift         
# Coefficients:
#           ar1     ma1     ma2    drift
#       -0.6935  1.1286  0.7906  -1.4606
# s.e.   0.1204  0.1040  0.1059   9.2071
# sigma^2 estimated as 2482:  log likelihood=-328.05
# AIC=666.11   AICc=666.86   BIC=678.38

Fill missing observations following the approach discussed in my answer to this post:

kr <- KalmanSmooth(msci, fit1$model)
tmp <- which(fit1$model$Z == 1)
id <- ifelse (length(tmp) == 1, tmp[1], tmp[2])
id.na <- which(is.na(msci))
msci.filled <- msci
msci.filled[id.na] <- kr$smooth[id.na,id]

Fit an ARIMA model to the filled series msci.filled. Now some outliers are found. However, different options led to different outliers being detected. I will keep the one that was found in most cases, a level shift at October 2008 (observation 18). You can try, for example, these and other options.

require("tsoutliers")
tso(msci.filled, remove.method = "bottom-up", tsmethod = "arima", 
  args.tsmethod = list(order = c(1,1,1)))
tso(msci.filled, remove.method = "bottom-up", args.tsmethod = list(ic = "bic"))

The chosen model is now:

mo <- outliers("LS", 18)
ls <- outliers.effects(mo, length(msci))
fit2 <- auto.arima(msci, xreg = ls)
fit2
# ARIMA(2,1,0)                    
# Coefficients:
#           ar1     ar2       LS18
#       -0.1006  0.4857  -246.5287
# s.e.   0.1139  0.1093    45.3951
# sigma^2 estimated as 2127:  log likelihood=-321.78
# AIC=651.57   AICc=652.06   BIC=661.39

Use the previous model to refine the interpolation of missing observations:

kr <- KalmanSmooth(msci, fit2$model)
tmp <- which(fit2$model$Z == 1)
id <- ifelse (length(tmp) == 1, tmp[1], tmp[2])
msci.filled2 <- msci
msci.filled2[id.na] <- kr$smooth[id.na,id]

The initial and the final interpolations can be compared in a plot (not shown here to save space):

plot(msci.filled, col = "gray")
lines(msci.filled2)

Step 2: Fit an ARIMA model to GCC using msci.filled2 as exogenous regressor

I ignore the missing observations at the beginning of msci.filled2. At this point I had some difficulty using auto.arima together with tso, so I tried several ARIMA models by hand in tso and finally chose the ARIMA(1,1,0).

xreg <- window(cbind(gcc, msci.filled2)[,2], end = end(gcc))
fit3 <- tso(gcc, remove.method = "bottom-up", tsmethod = "arima",  
  args.tsmethod = list(order = c(1,1,0), xreg = data.frame(msci=xreg)))
fit3
# ARIMA(1,1,0)                    
# Coefficients:
#           ar1    msci     AO72
#       -0.1701  0.5131  30.2092
# s.e.   0.1377  0.0173   6.7387
# sigma^2 estimated as 71.1:  log likelihood=-180.62
# AIC=369.24   AICc=369.64   BIC=379.85
# Outliers:
#   type ind    time coefhat tstat
# 1   AO  72 2008:11   30.21 4.483

The plot of GCC shows a shift at the beginning of 2008. However, it seems that it was already captured by the regressor MSCI, and no additional regressors were included except an additive outlier at November 2008.

The plot of the residuals did not suggest any autocorrelation structure, but it did suggest a level shift at November 2008 and an additive outlier at February 2011. However, when the corresponding interventions were added, the model diagnostics got worse. Further analysis may be needed at this point. Here, I will go on to obtain the forecasts from the last model, fit3.

The forecasts can be easily obtained. The plot below displays the original series, the interpolated values for MSCI and the forecasts along with the $95\%$ confidence intervals for GCC. The confidence intervals do not account for the uncertainty in the values that were interpolated in MSCI.

newxreg <- data.frame(msci=window(msci.filled2, start = c(2011, 10)), AO72=rep(0, 34))
p <- predict(fit3$fit, n.ahead = 34, newxreg = newxreg)
head(p$pred)
# [1] 298.3544 298.2753 298.0958 298.0641 297.6829 297.7412
par(mar = c(3,3.5,2.5,2), las = 1)
plot(cbind(gcc, msci), xaxt = "n", xlab = "", ylab = "", plot.type = "single", type = "n")
grid()
lines(gcc, col = "blue", lwd = 2)
lines(msci, col = "green3", lwd = 2)
lines(window(msci.filled2, start = c(2010, 9), end = c(2012, 7)), col = "green", lwd = 2)
lines(p$pred, col = "red", lwd = 2)
lines(p$pred + 1.96 * p$se, col = "red", lty = 2)
lines(p$pred - 1.96 * p$se, col = "red", lty = 2)
xaxis1 <- seq(2003, 2014)
axis(side = 1, at = xaxis1, labels = xaxis1)
legend("topleft", col = c("blue", "green3", "green", "red", "red"), lwd = 2, bty = "n", lty = c(1,1,1,1,2), legend = c("GCC", "MSCI", "Interpolated values", "Forecasts", "95% confidence interval"))

Solved – Imputation methods for time series data

Your approach sounds very theoretical. Did you analyze the imputations of the packages you mentioned?

Imputation packages often have requirements (e.g. MCAR data), but will still do a reasonably good job on data that does not fulfil these conditions.

Only an actual test and comparison of the algorithms will show you which one is best suited to your data.

The testing procedure can look like this:

1. Find an interval with no (or very few) missing data.
2. Artificially add missing data in this interval. (These should resemble the NA patterns in the rest of the data.)
3. Apply different imputation methods to this dataset (e.g. methods from imputeTS, mtsdi, Amelia).
4. Since you have the real values for your artificially deleted NA values, you can now compare how well all the algorithms did on your data.
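A minimal sketch of this procedure in base R might look as follows. The series is simulated, and the two candidates (mean fill and linear interpolation) are only placeholders chosen to keep the example self-contained; in practice you would swap in the imputation functions of the packages you want to compare.

```r
set.seed(1)

# Step 1: a complete toy series standing in for a gap-free stretch of real data.
truth <- 50 + 10 * sin(2 * pi * (1:48) / 12) + rnorm(48, sd = 2)

# Step 2: artificially delete a block of values, mimicking a run of missing months.
holdout  <- 20:25
observed <- truth
observed[holdout] <- NA

# Step 3a: placeholder method 1, fill with the mean of the observed values.
imp.mean <- observed
imp.mean[holdout] <- mean(observed, na.rm = TRUE)

# Step 3b: placeholder method 2, linear interpolation between observed points.
ok      <- !is.na(observed)
imp.lin <- approx(which(ok), observed[ok], xout = seq_along(observed))$y

# Step 4: score each method against the values that were deleted.
rmse   <- function(imputed) sqrt(mean((imputed[holdout] - truth[holdout])^2))
scores <- c(mean = rmse(imp.mean), linear = rmse(imp.lin))
scores
```

Repeating the deletion step with several different holdout blocks would give a more reliable comparison than a single run.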

Additional info:

The Amelia package also has some options to support the imputation of multivariate time series (see section 4.6 of the manual).
Other packages, such as mice, could also be tried.

In general, a multivariate time series has correlations between the different variables as well as correlation within each variable along the time axis. (Here is a talk from the useR! 2017 conference which, among other things, explains this.)

In theory, it would make the most sense to use both kinds of correlation. But if the correlation in time is very strong, for example, univariate time series imputation methods from imputeTS might even work best.

On the other hand, if the correlation between your variables is very strong, non-time-series imputation packages (like mice, VIM, missMDA and others) could work best.
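One rough way to get a feel for which kind of correlation dominates is to compare the lag-1 autocorrelation of a variable with its correlation to the other variables. The sketch below does this on two simulated series; the data and the choice of lag 1 are assumptions made for illustration, not a formal decision rule.

```r
set.seed(42)

# Two toy series: x is strongly autocorrelated, y is x plus independent noise.
n <- 200
x <- as.numeric(arima.sim(model = list(ar = 0.9), n = n))
y <- x + rnorm(n, sd = 0.5)

# Correlation along the time axis: lag-1 autocorrelation of x.
# acf() returns lag 0 first, so the second element is lag 1.
lag1.acf <- acf(x, lag.max = 1, plot = FALSE)$acf[2]

# Correlation between the variables at the same time point.
cross.cor <- cor(x, y)

c(lag1 = lag1.acf, cross = cross.cor)
```

In this toy case both values are high, which is the situation where using both sources of information at once is most appealing.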
