Solved – Missing value imputation using Amelia when variable count is greater than number of observations

data-imputationmissing datartime series

My dataset contains 5000+ variables measured over hourly for just over a week (201 observations).

These measurements are of the same metric, noise levels in a wireless network measured in dBm. These measurements were recorded at various locations across my country, some variables are measured at the same location and some could be hundreds of kilometers away. Please also note that the missingness is MCAR since either the wireless sites were off for maintenance or some sort of outage/physical upgrade.

Of the 5000 or so variables, only ~1% of variables are missing any values over time, please see the missingness map below for details. Time is represented on the Y axis and the 5000+ variables are on the X axis. These 1% of variables also have missingness for less than 5% of their total recordings e.g less than 10 samples over time of the total 201 samples for that variable are missing.

enter image description here

Rather than just cut these variables from the dataset I wish to impute these values using Amelia in R. When I tried to run the imputation I got the following error message:

Amelia Error Code:  34  
 The number of observations in too low to estimate the number of  
 parameters.  You can either remove some variables, reduce  
 the order of the time polynomial, or increase the empirical prior.

On the very bottom row you can see that there are a small number of missing values for a lot of variables. Rather than making the dataset smaller by deleting these values is there a smarter approach to imputing such values.

Best Answer

The general idea of multiple imputation is to fit a quite general Bayesian model with very vague priors and to predict the missing data from that. The problem here is that you have not so many records relative to how much you measure. As a result, you need to put in more information in terms of the model structure (just using a subset of the variables would be an example, because then you would be saying that some of the variables don't matter for predicting the others) or in terms of more informative priors.

An alternative (as stated in the comment by Matt Barstead) is to use a model that is valid under a missing at random assumption. If you are willing to specify such a model, you should be willing to specify a more constrained model that would allow you to impute the data (there is to some extent no difference between the two approaches, except that you could impute with a more general model and analyze with a more restrictive model).

Related Question