Time Series – How to Fill in Missing Data in Time Series

data-imputation, missing data, time series

I have a large set of pollution data that has been recorded every 10 minutes over the course of 2 years. However, there are a number of gaps in the data (including some that last a few weeks at a time).

The data seem to be quite seasonal, and there is a large difference between day and night: daytime values vary widely, while night-time values are lower and vary little.

I have considered fitting a loess model to the day time and night time subsets separately (as there is an obvious difference between them) and then predicting the values of the missing data and filling these points in.
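As a rough illustration of this idea, here is a minimal Python sketch using only the standard library. It is a simplified local linear smoother with tricube weights (loess without the robustness iterations), not a reference implementation. The 06:00 to 18:00 day/night split, the `frac` span, the helper names, and the synthetic series are all assumptions made for the example.

```python
import math
import random

STEP_MIN = 10                      # sampling interval in minutes
PER_DAY = 24 * 60 // STEP_MIN      # 144 readings per day


def is_day(i):
    """Hypothetical regime split: 06:00-18:00 counts as daytime."""
    hour = (i % PER_DAY) * STEP_MIN / 60.0
    return 6.0 <= hour < 18.0


def tricube(u):
    """Tricube kernel weight for a scaled distance u >= 0."""
    return (1.0 - u ** 3) ** 3 if u < 1.0 else 0.0


def loess_predict(x_obs, y_obs, x_new, frac=0.3):
    """Weighted local linear fit at x_new using the nearest frac of points."""
    n = len(x_obs)
    if n == 0:
        raise ValueError("no observed points to fit")
    k = min(n, max(3, math.ceil(frac * n)))
    dists = sorted(abs(x - x_new) for x in x_obs)
    h = dists[k - 1] * (1.0 + 1e-9) + 1e-12   # bandwidth: k-th nearest distance
    sw = swx = swy = swxx = swxy = 0.0
    for x, y in zip(x_obs, y_obs):
        w = tricube(abs(x - x_new) / h)
        dx = x - x_new                        # centre at x_new for stability
        sw += w
        swx += w * dx
        swy += w * y
        swxx += w * dx * dx
        swxy += w * dx * y
    denom = sw * swxx - swx * swx
    if denom <= 1e-12:
        return swy / sw                       # fall back to a weighted mean
    # Intercept of the weighted line, which is the prediction at dx = 0.
    return (swxx * swy - swx * swxy) / denom


def impute_by_regime(values, frac=0.3):
    """Fill None entries with a separate fit for the day and night subsets."""
    filled = list(values)
    for regime in (True, False):
        idx = [i for i, v in enumerate(values)
               if v is not None and is_day(i) == regime]
        xs = [float(i) for i in idx]
        ys = [values[i] for i in idx]
        for i, v in enumerate(values):
            if v is None and is_day(i) == regime:
                filled[i] = loess_predict(xs, ys, float(i), frac)
    return filled


# Synthetic six-day series (made up for illustration): a daytime hump
# over a flat night-time level, plus noise, with a gap of about 1.5 days.
random.seed(0)
series = []
for i in range(6 * PER_DAY):
    hour = (i % PER_DAY) * STEP_MIN / 60.0
    if is_day(i):
        level = 20.0 + 10.0 * math.sin(math.pi * (hour - 6.0) / 12.0)
    else:
        level = 5.0
    series.append(level + random.gauss(0.0, 0.5))
gap = range(2 * PER_DAY + 30, 3 * PER_DAY + 100)
observed = [None if i in gap else v for i, v in enumerate(series)]
filled = impute_by_regime(observed)
```

Note that a smoother fitted against the time index alone flattens the within-day shape inside a long gap, so a real analysis would probably also need time of day (and season) as predictors.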

I was wondering if this is a suitable way of approaching this problem, and also if there is a need to add local variation into the predicted points.
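On the local variation point, one simple option is to add residuals resampled from the observed data to the smooth predictions, since smoothed fills on their own have less spread than the real readings. The sketch below assumes the residuals (observed minus fitted) are exchangeable within a regime and ignores autocorrelation, which may well not hold for 10-minute pollution data. The function name and the input values are hypothetical.

```python
import random
import statistics


def add_local_variation(smooth_fill, residuals, seed=None):
    """Add residuals drawn with replacement to smoothed imputations."""
    rng = random.Random(seed)
    return [m + rng.choice(residuals) for m in smooth_fill]


# Hypothetical inputs: night-time residuals from the fitted model and
# smooth predictions for a night-time gap of 200 readings.
night_residuals = [-0.6, -0.2, 0.1, 0.3, -0.1, 0.5, 0.0, -0.4, 0.4]
smooth_fill = [5.0] * 200
noisy_fill = add_local_variation(smooth_fill, night_residuals, seed=1)
spread = statistics.pstdev(noisy_fill)   # now non-zero, unlike smooth_fill
```

Day and night residuals would be resampled separately, because the two regimes have very different spread.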

Best Answer

The answer will depend on your study design (e.g., cross-sectional time series, cohort time series, or serial cohorts time series). Honaker and King (2010) have developed an approach that is useful for cross-sectional time series (and possibly for serial cohorts time series, depending on your assumptions), and it is implemented in the R package Amelia II for imputing such data. Meanwhile, Spratt et al. (2010) have described a different approach that can be used in some cohort time series designs, although software implementations of it are sparse.

A cross-sectional time series design (aka panel study design) is one in which one or more populations are repeatedly sampled (e.g., every year), using the same study protocol (e.g., same variables, instruments, etc.). If the sampling strategy is representative, these kinds of data produce an annual picture (one measurement per participant or subject) of the distributions of those variables for each population in the study.

A cohort time series design (aka repeated cohorts study design, longitudinal study design, also sometimes called a panel study design) is one in which individual units of analysis are sampled once and followed over a long period of time. The individuals may be sampled in a representative fashion from one or more populations. However, a cohort time series sample that starts out representative will become increasingly unrepresentative of the target population (at least in human populations) as time passes, because people are born or age into the target population and die or age out of it, and because of immigration and emigration.

A serial cohorts time series design (aka repeated, multi-, and multiple cohorts, or panel study design) is one in which one or more populations are repeatedly sampled (e.g., every year), using the same study protocol (e.g., same variables, instruments, etc.), and in which individual units of analysis within a population are measured at two points in time during the period (e.g., during the year) in order to create measures of rate of change. If the sampling strategy is representative, these kinds of data produce an annual picture of the rates of change in those variables for each population in the study.

References
Honaker, J. and King, G. (2010). What to do about missing values in time-series cross-section data. American Journal of Political Science, 54(2):561–581.

Spratt, M., Carpenter, J., Sterne, J. A. C., Carlin, J. B., Heron, J., Henderson, J., and Tilling, K. (2010). Strategies for multiple imputation in longitudinal studies. American Journal of Epidemiology, 172(4):478–487.
