Suppose I have a dataframe consisting of six time series. In this dataframe, some observations are completely missing, meaning that at some timepoints all six time series contain an NA value. In R, one package that can be used to impute time series data is Amelia. However, this package does not work for observations that are completely missing. Are there other ways to impute my data? For what it's worth, the number of missing observations is less than 20% of all observations.
Solved – Imputing missing observation in multivariate time series
data-imputation, multivariate analysis, r, time series
Related Solutions
You can use the Amelia package to impute the data (full disclosure: I am one of the authors of Amelia). The package vignette has an extended example of how to use it to impute missing data.
It seems as though your units are district-gender-ageGroup combinations observed at the monthly level. First, create a factor variable with one level for each unit (that is, one level for each district-gender-ageGroup combination). Let's call this group. Then you need a variable for time, which is probably a month count starting at 1 in January 2003, so that it equals 13 in January 2004. Call this variable time. Amelia will allow you to impute based on the time trends with the following commands:
library(Amelia)
a.out <- amelia(my.data, ts = "time", cs = "group", splinetime = 2, intercs = TRUE)
The ts and cs arguments simply denote the time and unit variables. The splinetime argument sets how flexibly time is used to impute the missing data. Here, a 2 means that the imputation will use a quadratic function of time; higher values are more flexible. The intercs argument tells Amelia to use a separate time trend for each district-gender-ageGroup. This adds many parameters to the model, so if you run into trouble, you can set it to FALSE to try to debug.
In any event, this will get you imputations that use the time information in your data. Since the missing data are bounded below by zero, you can use the bounds argument to force the imputations to respect those logical bounds.
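As a rough sketch of what the bounds argument expects: Amelia's documentation describes it as a three-column matrix with one row per bounded variable, in the form (column number, lower bound, upper bound). The toy data, the column name count and the upper ceiling below are assumptions made up for illustration; they are not from the original question.

```r
# Toy data; the column names and values are hypothetical.
my.data <- data.frame(
  time  = rep(1:6, times = 2),
  group = factor(rep(c("A.F.young", "B.M.old"), each = 6)),
  count = c(4, NA, 7, 0, NA, 3, 12, 9, NA, 15, 11, NA)
)

# One row per bounded variable: column index, lower bound, upper bound.
count.col <- which(names(my.data) == "count")
upper <- 10 * max(my.data$count, na.rm = TRUE)  # assumed ceiling, for illustration only
bds <- matrix(c(count.col, 0, upper), nrow = 1, ncol = 3)

# The matrix would then be passed to amelia(), for example:
# a.out <- amelia(my.data, ts = "time", cs = "group", bounds = bds)
```

The amelia() call is left as a comment so that the snippet runs without the package installed.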
EDIT: How to create group/time variables
The time variable might be the easiest to create, because you just need to count from 2002 (assuming that is the lowest year in your data):
my.data$time <- my.data$Month + 12 * (my.data$Year - 2002)
The group variable is slightly harder but a quick way to do it is using the paste command:
my.data$group <- with(my.data,
as.factor(paste(District, Gender, AgeGroup, sep = ".")))
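To check what these two commands produce, here is a small sketch on made-up data. The column names follow the commands above, while the values are purely hypothetical.

```r
# Hypothetical rows; only the column names come from the answer above.
my.data <- data.frame(
  Year     = c(2002, 2002, 2003, 2004),
  Month    = c(1, 12, 1, 1),
  District = c("North", "North", "South", "South"),
  Gender   = c("F", "F", "M", "M"),
  AgeGroup = c("0-14", "0-14", "15-64", "15-64")
)

# Month counter that starts at 1 in January 2002.
my.data$time <- my.data$Month + 12 * (my.data$Year - 2002)

# One factor level per District-Gender-AgeGroup combination.
my.data$group <- with(my.data,
                      as.factor(paste(District, Gender, AgeGroup, sep = ".")))

print(my.data[, c("Year", "Month", "time", "group")])
# time is 1, 12, 13, 25; group has the levels "North.F.0-14" and "South.M.15-64"
```

If the earliest year in your data is not 2002, replace 2002 with that year so the counter still starts at 1 in its first January.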
With these variables created, you want to remove the original variables from the imputation. To do that, you can use the idvars argument:
a.out <- amelia(my.data, ts = "time", cs = "group", splinetime = 2, intercs = TRUE,
idvars = c("District", "Gender", "Month", "Year", "AgeGroup"))
These data have statistical characteristics similar to those in "Which model should I prefer for time series forecasting?", differing only in the placement of the anomalies: non-time-series data riddled with a number of pulses and missing values. As Michael Chernick has suggested, identify and estimate a parsimonious model, which provides the actual, fit and forecast values. A review of the identified interventions then reflects the adjustments made for both the unusual data and the NA data, which are treated as 0.0 to begin with.
The estimates of the three missing values are simply 1.123, per the equation. The missing value (NA) at the end of the series is simply the one-period-ahead forecast, which in this case is also 1.123. This question and my answer point out that intervention detection is, in effect, imputation for bad or missing values. Another trick, suggested to me about 30 years ago by T.W. Anderson, is to reverse the time series and, as Michael noted, predict backwards.
In my opinion, for Amelia to deliver anything useful, it would have had to suggest the robust estimate of the mean (1.123). To do this, it would have had to detect the anomalies and identify an appropriate underlying ARIMA model of (0,0,0)(0,0,0), unless that model is simply assumed. If the series had seasonal pulses, level shifts and/or local time trends, procedures like Amelia may come up short. It looks like Amelia doesn't handle univariate time series [ Amelia Error Code: 42 There is only 1 column of data. Cannot impute ], or maybe even multivariate time series, unless it makes a ton of assumptions about dependencies.
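The point that, under an ARIMA (0,0,0) model with a constant, every missing value and the forecast share one estimate can be illustrated with base R's arima(). This is only a sketch under that white-noise assumption, using made-up numbers; it does not perform the pulse detection or robust adjustment described above, so its mean is the plain mean of the observed values.

```r
# Made-up series with interior and trailing NAs (not the data discussed above).
x <- c(1.0, 1.3, NA, 1.1, 1.2, NA, 0.9, 1.25, NA)

# White noise around a constant; method = "ML" handles missing values exactly.
fit <- arima(x, order = c(0, 0, 0), include.mean = TRUE, method = "ML")
mu.hat <- unname(coef(fit)["intercept"])

# Forecast for the period after the end of the series.
one.step <- as.numeric(predict(fit, n.ahead = 1)$pred)

# Under this model each missing point is filled with the same estimated mean.
x.filled <- ifelse(is.na(x), mu.hat, x)
```

Here one.step equals mu.hat, because a model with no autoregressive or moving-average terms carries no information forward beyond its mean.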
Best Answer
A good reference to solve your problem is the book "Time Series Analysis and Its Applications: With R Examples" by Robert H. Shumway and David S. Stoffer. A chapter is dedicated to the imputation of missing observations in multiple time-series analysis. Applications with code in R are also provided.